Abstract
We introduce Ravel (Robots with Audiovisual Abilities), a publicly available dataset covering examples of Human-Robot Interaction (HRI) scenarios. The scenarios were recorded with the audio-visual robot head POPEYE, equipped with two cameras and four microphones, two of which are plugged into the ears of a dummy head. All recordings were performed in a standard room with no special equipment, providing a challenging indoor scenario. The dataset provides a basis for testing and benchmarking methods and algorithms for audio-visual scene analysis, with the ultimate goal of enabling robots to interact with people as naturally as possible. The data acquisition setup, sensor calibration, data annotation, and data content are fully detailed. Moreover, three examples of using the recorded data are provided, illustrating its suitability for a large variety of HRI experiments. The Ravel data are publicly available at http://ravel.humavips.eu/.
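To make the benchmarking use concrete, below is a minimal sketch of how one recording from such a corpus might be loaded and coarsely aligned for audio-visual analysis. The file names, the assumption of a shared start time, and the 25 fps fallback are illustrative assumptions for this sketch, not part of the RAVEL distribution format.

import cv2              # video decoding (OpenCV)
import soundfile as sf  # multichannel WAV reading

# Hypothetical per-sequence files: one camera video and a 4-channel WAV.
video = cv2.VideoCapture("ravel_sequence_left.avi")
audio, sr = sf.read("ravel_sequence_mics.wav")  # audio shape: (n_samples, 4)

fps = video.get(cv2.CAP_PROP_FPS) or 25.0  # fall back to an assumed 25 fps

frame_idx = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    # Audio samples spanning this video frame, assuming both streams
    # start at the same instant.
    start = int(frame_idx * sr / fps)
    stop = int((frame_idx + 1) * sr / fps)
    chunk = audio[start:stop, :]  # (samples_per_frame, 4 microphones)
    # ... run audio-visual scene analysis on (frame, chunk) here ...
    frame_idx += 1

video.release()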
Acknowledgments
This work was supported by the EU project HUMAVIPS FP7-ICT-2009-247525.
Cite this article
Alameda-Pineda, X., Sanchez-Riera, J., Wienke, J. et al. RAVEL: an annotated corpus for training robots with audiovisual abilities. J Multimodal User Interfaces 7, 79–91 (2013). https://doi.org/10.1007/s12193-012-0111-y