Abstract
In our research, we recorded 298 min (1049 sentences) of speech audio data and the motion capture data of the accompanying gestures from two 25-year-old male participants, aiming at future use in deep learning concerning gesture and speech. The data was recorded in the form of an interview, in which the participant explained a topic prepared in advance, using a headset microphone and the motion capture software Motive:Tracker. The speech audio was stored in mp3 and the motion data in bvh, paired as data from the same sentence. We mainly aimed to acquire metaphoric and iconic gestures, as categorized by McNeill. Among the recorded gestures, metaphoric gestures appeared the most, at 68.41% of all gestures, followed by beat gestures (23.73%), iconic gestures (4.76%), and deictic gestures (3.11%).
1 Introduction
In recent years, virtual characters with a body structure similar to that of humans, often referred to as virtual humans, have gained much interest. Implementing such virtual humans allows a system to make use of non-verbal information, which is frequently used in face-to-face communication to clarify one's intentions or the context of the words spoken [2], in system-human interaction. In particular, gestures play an important role in aiding comprehension of the content presented, and many studies have examined the extent and actual effects of gestures in interaction [3].
At present, there are two common ways of producing the gestures to be implemented in a virtual human: capturing the motion of actual humans with a motion capture system, and manually creating animations for the virtual human's model. Both are costly, the former requiring the purchase of a motion capture system and the latter requiring professional knowledge and experience. At the research level, the Behavior Markup Language (BML) [5] has also been a popular method, although it too requires expertise to apply. Many attempts have been made to automatically generate gestures from text or speech data, but few have utilized deep learning in doing so. Moreover, there are no datasets pairing speech and gesture that could be used for such learning. Therefore, in this paper, we aim to create a dataset of pairs of speech data and motion data of the accompanying gestures that can be used for future deep learning (Fig. 1).
2 Method
In this section, we describe the details of the data and how it was acquired. A total of 298 min (1049 sentences) of speech audio data with the motion data of the accompanying gestures was recorded. Additionally, video data of the participants was recorded so that the consistency of the two other recordings could be checked afterward.
In our recordings, we mainly aimed to acquire metaphoric and iconic gestures, as categorized by McNeill [3]. Metaphoric gestures visually express an abstract meaning as if it had a physical form, such as showing an empty palm as if 'presenting an idea'. Iconic gestures illustrate physical, concrete items or acts, such as expressing how large an object is or rapidly moving one's hand up and down to indicate chopping. These gestures aid listeners in comprehending the structure and the events or objects depicted in the speech, and have many potential uses in explanation, learning, and teaching [4]. Deictic gestures, which indicate real or imaginary objects, people, directions, etc. around the speaker, were considered inappropriate for deep learning that aims to learn the association between speech and gesture, since they depend heavily on the speaker's surrounding environment rather than on the content of the speech. Likewise, beat gestures, which are used for emphasis and for expressing the rhythm of the conversation, have little relation to the content of the speech and were not considered viable for the learning.
2.1 Devices
Motion data was acquired using the software Motive:Tracker by SPICE Inc., along with a motion capture suit with 49 markers and eight OptiTrack Prime 17W cameras placed around an \(850 \times 850\) cm capture area. The recorded motion data was exported to the bvh format, which describes the hierarchy and initial pose of the skeleton together with time-series data of each joint's rotation angles.
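For illustration, the following minimal Python sketch (not part of the recording pipeline described here; the function and file names are hypothetical) shows how the time-series portion of such a bvh file can be read into a per-frame array of channel values, one column per channel (root position followed by the joints' rotation angles).

    import numpy as np

    def read_bvh_motion(path):
        """Return (frame_time, frames); frames has shape (n_frames, n_channels)."""
        with open(path) as f:
            lines = [line.strip() for line in f]
        start = lines.index("MOTION")                     # end of the HIERARCHY block
        n_frames = int(lines[start + 1].split()[-1])      # "Frames: <N>"
        frame_time = float(lines[start + 2].split()[-1])  # "Frame Time: <seconds>"
        data = lines[start + 3 : start + 3 + n_frames]    # one line of channel values per frame
        frames = np.array([[float(v) for v in line.split()] for line in data])
        return frame_time, frames

    # Hypothetical file name: one bvh file per recorded sentence.
    frame_time, frames = read_bvh_motion("sentence_0001.bvh")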
Speech data was acquired using a headset, so as not to hinder the subject's movement, and was stored in mp3 format. Video data was acquired using a stationary video camera and stored in mp4 format.
2.2 Participants and Procedure
The participants were two male undergraduate students, both aged 25.
The data was recorded in the form of an interview, in which the participant explained a topic prepared and thought about beforehand. Several other methods were attempted first, but were found unsuited for the recording.
First, when having the participant read a transcript out loud, useful gestures did not appear. This is likely because the speaker must have a sufficiently concrete image of what they are talking about for gestures to appear naturally while speaking.
Second, when having the participant give a presentation using a slide show, deictic gestures appeared too frequently, since the speaker tended to point at the slides while explaining.
Third, when having the participant read a transcript with simple content, such as fairy tales, and instructing them to concentrate on using plausible gestures while speaking, the participant often gestured too frequently and too exaggeratedly. Putting too much emphasis on gesturing made the gesture usage unnatural, and including such gestures in the dataset would have a negative effect on the learning.
Recording took place in a comfortably large, quiet room, where only the subject of the motion/speech data and the person operating the recording devices were allowed to enter, so that the recorded speech contained as little noise as possible. The participant put on the headset and the motion capture suit, and the positions and number of markers were checked. The participant then took a T-pose so that the recorder could confirm that the motion tracking was calibrated correctly. After this check, the recorder started recording. Before beginning to speak, the participant clapped his hands once so that this portion could later be used to synchronize the speech, motion, and video data. When finished, the participant took a T-pose once again.
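As an illustration of how the clap can serve as a synchronization marker, the sketch below locates it in the speech audio as the first sample whose amplitude exceeds a threshold; the use of librosa, the threshold value, and the file name are assumptions made for this example, not part of our actual procedure.

    import numpy as np
    import librosa

    def clap_onset_seconds(audio_path, threshold=0.5):
        """Time (s) of the first sample whose absolute amplitude exceeds `threshold`."""
        y, sr = librosa.load(audio_path, sr=None, mono=True)
        above = np.abs(y) > threshold
        if not above.any():
            raise ValueError("no clap-like peak found; lower the threshold")
        return int(np.argmax(above)) / sr

    # Subtracting this offset (and the corresponding spike in the motion and
    # video streams) places the three recordings on a common timeline.
    offset = clap_onset_seconds("sentence_0001.mp3")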
2.3 Creating the Dataset
The recorded motion, speech, and video data were split per sentence and saved to a server through a simple Ruby on Rails web program. The paths to the bvh motion data, mp3 speech data, and mp4 video data, along with meta-information such as the actor, the recording date, the topic the sentence belongs to, and tags indicating which types of gesture appeared in the sentence and how often, were stored in a MySQL database on the server. The motion, speech, and video data could be previewed when uploading and played back after saving, as shown in Fig. 2. The data could also be played back simultaneously in any combination to verify that they were synchronized properly.
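To make the structure of these records concrete, the following sketch expresses the kind of per-sentence entry described above as a plain Python data class; the field names are illustrative and do not reproduce the actual MySQL schema or the Rails application.

    from dataclasses import dataclass, field

    @dataclass
    class SentenceRecord:
        bvh_path: str       # motion data of the sentence
        mp3_path: str       # speech audio of the sentence
        mp4_path: str       # reference video of the sentence
        actor: str          # which participant was recorded
        recorded_on: str    # recording date
        topic: str          # topic the sentence belongs to
        gesture_tags: dict = field(default_factory=dict)  # e.g. {"metaphoric": 2, "beat": 1}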
3 Results
The categories of the recorded gestures were as shown in Table 1: metaphoric gestures appeared the most, at 68.41% of all gestures. Beat gestures were the next most common, at 23.73%. Iconic and deictic gestures appeared very scarcely, each accounting for less than 5% of the gestures recorded.
4 Discussion and Conclusions
We aimed to create a dataset of speech data and motion data of the accompanying gestures that could be used to learn the relation between speech and gesture with deep learning methods. We were able to keep the number of deictic gestures very small, because our recording method did not require the speaker to rely on any physical objects when speaking about the presented topic. As long as the topic itself does not strongly call for indicating objects or places, this approach should suppress the number of deictic gestures to a fair extent. It is not surprising that beat gestures appeared frequently, as they are known to be a common type of gesture in communication, appearing unconsciously even in situations where the speaker cannot see the listener [1]. Although a beat gesture itself carries no semantic meaning, studies report that beat gestures can have a positive effect on the semantic processing of the accompanying words [6]. Nevertheless, we believe that focusing mainly on learning beat gestures would not be productive, because it would be hard to judge whether generated results are appropriate: apart from timing, there is little basis for deciding whether short, baton-like movements fit the accompanying content. If beat gestures are to be included in the learning, the prosodic features and/or pitch of the speech must be taken into account, because points of emphasis and the rhythm of the speech usually cannot be determined from the content of the speech alone. Reducing the number of beat gestures remains an issue.
5 Future Work
As future work, we aim to increase the amount of data in the dataset if possible and use it to train a recurrent neural network (RNN) to learn the association between speech features and gestures, as both are sequential data, so that the result can be used to automatically generate appropriate gestures for input speech. We plan to use mel-frequency cepstral coefficients (MFCCs), often used in speech recognition, to vectorize the speech data as input, and to use the time-sequential joint rotation data of the bvh motion data as labels.
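As a sketch of the planned input representation (assuming librosa for the MFCC computation; the file name and parameter values are placeholders), the speech of one sentence would be turned into a time-major MFCC sequence, while the label sequence would come from the per-frame joint rotations of the matching bvh file (as in the sketch in Sect. 2.1); the two sequences would still need to be resampled to a common frame rate before training.

    import librosa

    def mfcc_sequence(audio_path, n_mfcc=13, hop_length=512):
        """Speech audio -> (n_frames, n_mfcc) MFCC sequence usable as RNN input."""
        y, sr = librosa.load(audio_path, sr=None, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
        return mfcc.T                 # time-major: one feature vector per audio frame

    # Input for one training pair: the MFCC sequence of a sentence's mp3 file;
    # label: the joint rotation sequence read from the matching bvh file.
    x = mfcc_sequence("sentence_0001.mp3")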
References
Alibali, M.W., Heath, D.C., Myers, H.J.: Effects of visibility between speaker and listener on gesture production: some gestures are meant to be seen. J. Mem. Lang. 44(2), 169–188 (2001)
Knapp, M.L., Hall, J.A., Horgan, T.G.: Nonverbal Communication in Human Interaction. Cengage Learning, Boston (2013)
McNeill, D.: Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago (1992)
Roth, W.M.: Gestures: their role in teaching and learning. Rev. Educ. Res. 71(3), 365–392 (2001)
Vilhjálmsson, H., et al.: The behavior markup language: recent developments and challenges. In: Pelachaud, C., Martin, J.-C., André, E., Chollet, G., Karpouzis, K., Pelé, D. (eds.) IVA 2007. LNCS (LNAI), vol. 4722, pp. 99–111. Springer, Heidelberg (2007). doi:10.1007/978-3-540-74997-4_10
Wang, L., Chu, M.: The role of beat gesture and pitch accent in semantic processing: an ERP study. Neuropsychologia 51(13), 2847–2855 (2013)