1 Introduction
The next wave of social robots will have increased emotional capabilities and social interaction skills for establishing long-term relationships with users [32]. However, the expression capabilities of current humanoid robots are still limited. One limitation is the low number of available degrees of freedom (DoF) of robots compared to humans. There are doubts that robots will ever be able to imitate humans perfectly in terms of their appearance and movements [1]. Slight discrepancies in the appearance and movements of a robot can be perceived as unnatural and evoke the uncanny valley problem [35]. In light of this limitation, there is a need to experiment with and develop new ways and processes of human-robot interaction to optimize the social robot's added value.
Research on sonic interactions that allow robots to express intentions and emotions through sounds is an emerging field in Human-Robot Interaction (HRI). On the one side, sound has been used paralinguistically, expressing intentions and emotions using short bursts of sound with intonation and rhythms informed by verbal communication. Such uses have been grouped under the umbrella term Semantic-Free Utterances (SFUs) and have been reviewed extensively by Yilmazyildiz et al. [52]. On the other side, research has also focused on the mechanical sounds inherent to a robot (i.e., consequential sounds [29]). The sounds of the servo motors that realize robots' movements have been considered noise that negatively interferes with interactions [18, 23, 34, 47]. To resolve this issue, several masking or alteration strategies have been explored to make the sound intentional and thus improve the interaction. In [53], the amplitude and pitch of the recorded noises of a UR5e robot arm are modified (i.e., made louder, quieter, higher-pitched, and lower-pitched); in [47], recorded noises of an ARMAR-IIIb robot are masked with a music clip; in [54], videos depicting various robots (Cozmo, NAO, TurtleBot 2, Baxter, and UR5e) are overlaid with "transformative sound"; in [40], video recordings of a Fetch robot are overlaid with additional "artificial sound"; and in [17], recorded sounds of a NAO robot are used to inform the sound models in a method called blended sonification (defined in [48]).
Our aim with the studies presented in this article is to further contribute to the strategies for compensating for limitations in robot communicative channels. Here, we probe aesthetic strategies for robot sound produced by movement sonification, i.e., by mapping the robot's movements to sound models in real time. More precisely, we wanted to answer the question "What are the preferences in terms of the complexity and the materiality of robot sound in movement sonification?" For complexity, we compared the use of the sawtooth waveform (one of the basic waveforms in sound synthesis) and a more complex sound synthesis based on a feedback chain. For materiality, we compared the use of sound synthesis informed by materials associated with a robot in motion: the internal mechanisms (i.e., motor/engine sound), the outer appearance (i.e., metallic sound), and the motion itself (i.e., whoosh/wind vortex sound). A detailed explanation of the sound models is presented in Section 3.2, followed by an explanation of the sonification mapping strategy in Section 3.3.
Three studies were conducted. The first study (presented in Section 3.4) explored how the first set of sound models could influence the perception of expressive gestures of a Pepper robot. Through an online survey, participants were asked to rate a set of video stimuli in terms of perceived affective states. In the second study (Section 3.5), an experiment was carried out in a museum installation with a Pepper robot presented in two scenarios: (1) while welcoming patrons to a restaurant and (2) while providing information to visitors in a shopping center. The two sets of sound models were used to complement the robot's gesture, and museum visitors were asked to choose their preferred sound for the different scenarios. Finally, in the third study (Section 3.6), we conducted an online survey with stimuli similar to those used in the second study. The study aimed to compare participants' preferences for sound aesthetics when watching online video stimuli with those made by visitors at the museum installation.
3 Studies
In this section, we first describe the sound models developed for the studies (Section 3.2), followed by a description of the robot's movement and how it informed the sound models (Section 3.3). Afterward, the three studies are presented (Sections 3.4–3.6).
3.1 Compliance with Ethics Standards
The studies were conducted in accordance with the Declaration of Helsinki. At the time the experiments were conducted, no ethics approval was required from our institution for perceptual studies such as the ones reported in this article. The collected data are processed in compliance with the European Union's General Data Protection Regulation (EU GDPR). For the management of participants' personal data, we followed the guidance of the KTH Royal Institute of Technology's Ethics Officer.
Participation in the studies was voluntary, and no compensation was provided to the participants. Participants in all of the studies signed a consent form and filled in a demographic questionnaire with information about their age, nationality, previous encounters with robots, and musical experience. For children and teenagers, the consent forms were signed by their guardians. No names or other personal identifiers were collected.
3.2 Audio Stimuli
Two sets of audio stimuli were developed for the three experiments presented in this article: one focusing on sound complexity and the other on sound materials. A brief overview of all audio stimuli used in the experiments is presented in Table 1, while spectral visualizations of the audio stimuli are presented in Figure 1.
3.2.1 Sound Complexity.
The sound models focusing on sound complexity were implemented in SuperCollider (note that these two sound models are referred to as Sound-S2 and Sound-S3, while the original recorded robot movement sound is labeled Sound-S1). The sound models were originally realized for the sonification of robot movements in the video stimuli of Study 1; they were then adapted to handle the real-time sonification in Study 2 and Study 3. For details on how the robot movement data are integrated into the sound models, see Section 3.3.
The sound model used for Sound-S2 is straightforward and based on a filtered sawtooth waveform. It can be shaped with a parametric envelope and can be granularized, and the output is then sent into a reverberation module. This sound model was previously used to depict the category of pitched sound in [38]. Prior studies have already demonstrated the effectiveness of such basic sounds in sonic HRI, as shown in the BEST (Bremen Emotional Sound Toolkit) database [26].
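To make the Sound-S2 description concrete, the following is a minimal offline sketch in Python/NumPy of a filtered sawtooth shaped by an amplitude envelope. The actual model was implemented in SuperCollider with granulation and reverberation, which are omitted here, and all constants are illustrative assumptions.

import numpy as np
from scipy.signal import sawtooth, butter, sosfilt

SR = 48_000                              # sample rate (assumed)

def sound_s2_sketch(freq=400.0, cutoff=1050.0, dur=0.6):
    """Filtered sawtooth with a simple attack/decay envelope (illustrative)."""
    t = np.arange(int(dur * SR)) / SR
    saw = sawtooth(2 * np.pi * freq * t)                      # raw sawtooth
    sos = butter(2, cutoff, btype="lowpass", fs=SR, output="sos")
    filtered = sosfilt(sos, saw)                              # low-pass filter
    env = np.minimum(t / 0.05, 1.0) * np.exp(-3.0 * t)        # parametric envelope
    return 0.5 * env * filtered

tone = sound_s2_sketch()   # e.g., write to a WAV file or granularize further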
The sound model used for Sound-S3 is more complex and refined, and is based on feedback chains. This synthesis was chosen because it had previously been developed with clear aesthetic aims rather than to fulfill a specific functionality, meaning that its intrinsic sound quality and timbral exploration were in the foreground. In this sound model, a feedback chain is excited by an impulse from a non-band-limited pulse oscillator. The impulse is written to an internal bus and filtered by a bandpass filter whose center frequency can move dynamically between two extremes. Then, spectral processes are applied to the signal: shifting of the phases and partial randomization of the bins' order. The window size chosen for the fast Fourier transform (FFT) analysis affects the rhythmicity of the result. After the spectral manipulations, the result is converted back into time-domain audio data using an inverse FFT (IFFT). At this point, the obtained signal is not only sent to the output but also fed back into the chain by multiplying it by a feedback factor and re-injecting it into the bandpass filter. Finally, an envelope follower on the resulting sound is used as a negative feedback control signal to prevent saturation. A simplified diagram of the feedback sound model's signal flow is shown in Figure 2: the dashed arrows going into the sound model indicate the parameters, namely the range for the bandpass center frequency, the FFT window size, the scrambling coefficient for the FFT bins (a value between 0 and 1.0), and the final sound amplitude. These parameters can be set and controlled in real time (i.e., mapped to the robot arms' movements) to obtain variations in the output of this sound model. The overall resulting sound is characterized by smoother attacks and, importantly, by richer spectra with partials distributed in a more complex way than those generated by the sawtooth-based sound model, possibly resulting in more pleasant timbres thanks to spectral fusion.
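As an illustration of the signal flow in Figure 2, the following offline NumPy sketch processes audio block by block through a bandpass filter, FFT-domain phase shifting and partial bin scrambling, an IFFT, and a feedback path limited by an envelope follower. The actual model runs in real time in SuperCollider; all constants and the saturation control below are illustrative assumptions.

import numpy as np
from scipy.signal import butter, sosfilt, sosfilt_zi

SR = 48_000
N_FFT = 1024          # FFT window size (controls the rhythmicity of the result)
SCRAMBLE = 0.1        # fraction of FFT bins whose order is randomized
FB_GAIN = 0.7         # feedback factor re-injected into the bandpass filter

rng = np.random.default_rng(0)
sos = butter(2, [300.0, 2000.0], btype="bandpass", fs=SR, output="sos")
zi = sosfilt_zi(sos) * 0.0            # filter state carried across blocks
feedback = np.zeros(N_FFT)            # signal re-injected into the chain
out = []

for block_idx in range(200):          # roughly 4 seconds of audio
    # excitation: a single impulse at the start, then only the feedback signal
    excitation = np.zeros(N_FFT)
    if block_idx == 0:
        excitation[0] = 1.0
    x = excitation + FB_GAIN * feedback

    # bandpass filter on the internal bus
    x, zi = sosfilt(sos, x, zi=zi)

    # spectral processing: shift phases and partially scramble bin order
    spec = np.fft.rfft(x)
    spec *= np.exp(1j * rng.uniform(0, 2 * np.pi, spec.size))   # phase shift
    n_scramble = int(SCRAMBLE * spec.size)
    idx = rng.choice(spec.size, n_scramble, replace=False)
    spec[idx] = spec[rng.permutation(idx)]                      # bin scrambling

    y = np.fft.irfft(spec, n=N_FFT)

    # envelope follower as negative feedback to prevent saturation (crude)
    env = np.sqrt(np.mean(y ** 2))
    y = y / (1.0 + env)

    feedback = y                       # re-enters the chain in the next block
    out.append(y)

signal = np.concatenate(out)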
3.2.2 Sound Material.
The sound models focusing on the sound material were developed using the Sound Design Toolkit (SDT) in MaxMSP. SDT is a "collection of physically informed sound synthesis models" developed by Baldan et al. [3], chosen for its realism in synthesizing physical sounds and for its responsiveness to physical parameters such as the velocity and acceleration of movements. The sound models were developed for the purpose of real-time sonification in Study 2 and Study 3, driven by real-time data from the Pepper robot.
The choice of materials was based on a robot in motion seen from three perspectives: internal mechanisms, physical appearance, and arm displacement. In the first sound model (Sound-M1), the presence of a robot's internal mechanisms (e.g., servomotors) was highlighted using the SDT synthesis models [sdt.motor] and [sdt.dcmotor].
For the second sound model (Sound-M2), with an emphasis on the robot's physical appearance, the basic solid interaction from SDT was used. As described in [3, p. 258], this interaction is "a nonlinear impact between one [sdt.inertial] resonator and one [sdt.modal] resonator, according to the characteristics of the collision set in the [sdt.impact] interactor." Since the Pepper robot is designed with a curved shape and cute appearance, a softer metallic tone was chosen to convey the feel of a light impact on a hollow metal object. The third sound model (Sound-M3) aimed to highlight the arms' displacements with whoosh sounds. From SDT, the object used was [sdt.windkarman], which simulates the vortices (i.e., whoosh sound) caused by airflow across thin obstacles, such as a tree branch or a suspended wire.
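As a rough illustration of the whoosh idea behind Sound-M3 (not the SDT model itself), vortex shedding around a thin obstacle produces a tone at approximately f = St * v / d (the Strouhal relation), which can be crudely approximated with band-filtered noise whose center frequency and level follow the movement velocity. The obstacle diameter and scaling constants below are assumptions.

import numpy as np
from scipy.signal import butter, sosfilt

SR = 48_000
STROUHAL = 0.2        # dimensionless constant for vortex shedding
DIAMETER = 0.002      # assumed obstacle diameter in meters (a thin edge)

def whoosh(velocity, duration=0.5, sr=SR):
    """Band-filtered noise whose center frequency follows f = St * v / d
    and whose level grows with velocity (crude approximation)."""
    f_center = max(20.0, STROUHAL * velocity / DIAMETER)
    sos = butter(2, [0.7 * f_center, 1.3 * f_center], btype="bandpass",
                 fs=sr, output="sos")
    noise = np.random.default_rng(0).standard_normal(int(duration * sr))
    return velocity * 0.1 * sosfilt(sos, noise)

# faster arm movement -> higher, louder whoosh
slow, fast = whoosh(0.5), whoosh(2.0)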
3.3 Robot Movement Data to Sonification
The robot employed in these studies is Pepper, a humanoid social robot. Pepper is a 120-cm-tall robot with 20 degrees of freedom, designed with aesthetic considerations to be deployed as a social robot in real-world situations, e.g., in education and businesses.
The sound models were not developed with emotional content. Instead, the emotional content is found in the robot’s movement (e.g., showing excitement by raising both hands in a fast movement). The purpose of movement sonification is to communicate the movement and the intention behind the movement.
In Study 1, the sound models described in Section 3.2.1 were used to create the movement sonification of the video-recorded stimuli. In order to create appropriate sounds for the gestures, temporal marks were inserted into the video recordings in correspondence with the robot arms' movements. The timestamps were then used to create a score on SuperCollider's client side using a Task, which is a pauseable process. An array was initialized with the timestamps and used to schedule all the sound events, specifying the parameter variations in the sound models in terms of pitch, envelope, and filtering. The choice of parameters was informed by the characteristic values of musical variables used in the communication of emotions in music performance, as reported in previous research [9]. The resulting sonifications were then recorded as audio tracks and finally superimposed on the videos.
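In Python terms (the original score was realized with a Task on SuperCollider's client side), the scheduling idea can be sketched as follows; the timestamps and parameter values are invented placeholders.

import sched
import time

# (timestamp in seconds, pitch in Hz, amplitude) marks extracted from the video
score = [(0.0, 300, 0.4), (1.2, 380, 0.5), (2.5, 450, 0.3)]

def play_event(pitch, amp):
    # in the real system this would trigger a sound event in the synthesis engine
    print("trigger sound event: pitch=%s Hz, amp=%s" % (pitch, amp))

scheduler = sched.scheduler(time.monotonic, time.sleep)
for t, pitch, amp in score:
    scheduler.enter(t, 1, play_event, argument=(pitch, amp))
scheduler.run()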
In Studies 2 and 3, the sound models were driven by Pepper's movement data in real time. A Python script was developed using the NAOqi framework. Specifically, the ALMemory module was used to access the current state of all the actuators (i.e., servo motors) of the robot. For our sound models, we tracked the state of the robot's arms (ShoulderRoll, ShoulderPitch, and ElbowRoll). More precisely, the tracked data were the rotational angles of these actuators (in radians), which were then used to calculate the position of the robot's hands in space. The information was then streamed through the OpenSoundControl (OSC) communication channel to the sound models in SuperCollider and MaxMSP.
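A minimal sketch of such a streaming script is shown below. The joint names follow the NAOqi convention, but the exact ALMemory keys, OSC address, ports, and update rate are assumptions rather than the deployed code, and the sketch assumes an environment where both the NAOqi Python SDK and python-osc are available.

import time
from naoqi import ALProxy                      # NAOqi Python SDK
from pythonosc.udp_client import SimpleUDPClient

PEPPER_IP, NAOQI_PORT = "192.168.1.10", 9559   # placeholder robot address
memory = ALProxy("ALMemory", PEPPER_IP, NAOQI_PORT)
osc = SimpleUDPClient("127.0.0.1", 57120)      # SuperCollider's default port

JOINTS = ["RShoulderRoll", "RShoulderPitch", "RElbowRoll"]
KEY = "Device/SubDeviceList/{}/Position/Sensor/Value"   # assumed key pattern

while True:
    # rotational angle of each actuator, in radians
    angles = [memory.getData(KEY.format(j)) for j in JOINTS]
    # forward kinematics (not shown) would convert the angles to a hand
    # position; here the raw angles are streamed and mapped on the synthesis side
    osc.send_message("/pepper/right_arm", angles)
    time.sleep(0.02)                           # ~50 Hz update rate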
In the case of the models developed for sound complexity (Section 3.2.1), we focused on the position in space of the robot's right hand. The hand position was used to control the sound models' parameters: for the sawtooth model, it was mapped to the fundamental sawtooth frequency (from 300 to 500 Hz) and to the cutoff frequency of the filter (from 1,000 to 1,100 Hz). In this way, a rising hand is associated with a sound rising in pitch and therefore becoming more brilliant. The first derivative of the hand position, scaled to the range \([0, 0.5]\), was instead used to control the overall sound amplitude: when a gesture slows down, the amplitude decreases accordingly, and when it stops, the amplitude becomes zero, corresponding to silence. For the feedback sound model, the hand position was mapped to the lower and upper bounds of the bandpass center frequency (from 300 to 2,000 Hz and from 500 to 2,500 Hz, respectively, in order to also take into account the non-linearity of pitch perception), so as to preserve the same perceptual association between a rising hand and rising pitch. As with the sawtooth sound model, the first derivative of the hand position was also used to control the overall amplitude of the feedback sound model. The FFT window size and scrambling coefficient were instead kept fixed at \(N = 1,024\) and 0.1, respectively.
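The mappings described above can be summarized in a short sketch. The frequency ranges are the ones reported in the text, while the normalization of the hand height and of its first derivative to [0, 1] is an illustrative assumption.

import numpy as np

def lin_map(x, in_lo, in_hi, out_lo, out_hi):
    """Linearly map x from [in_lo, in_hi] to [out_lo, out_hi], clipped."""
    x = np.clip((x - in_lo) / (in_hi - in_lo), 0.0, 1.0)
    return out_lo + x * (out_hi - out_lo)

def sawtooth_params(hand_height, hand_speed):
    # hand_height and hand_speed assumed normalized to [0, 1]
    return {
        "freq":   lin_map(hand_height, 0.0, 1.0, 300.0, 500.0),     # Hz
        "cutoff": lin_map(hand_height, 0.0, 1.0, 1000.0, 1100.0),   # Hz
        "amp":    lin_map(hand_speed, 0.0, 1.0, 0.0, 0.5),          # silent when still
    }

def feedback_params(hand_height, hand_speed):
    return {
        "band_lo":  lin_map(hand_height, 0.0, 1.0, 300.0, 2000.0),  # Hz
        "band_hi":  lin_map(hand_height, 0.0, 1.0, 500.0, 2500.0),  # Hz
        "amp":      lin_map(hand_speed, 0.0, 1.0, 0.0, 0.5),
        "fft_size": 1024,
        "scramble": 0.1,
    }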
In the case of the models developed for sound material (Section 3.2.2), the sound models' parameters were mapped to the velocity of the arms' movements: for the internal-mechanism sound model, the arm velocity was mapped to the rpm of the motors; for the physical-appearance sound model, the impact between the inertial and the modal resonator was repeated with an amplitude varying with the arm velocity; and for the whoosh sound model, the arm velocity was mapped to the speed of the airflow. In this way, faster movement is associated with a sound rising in intensity, similar to the behavior of physical objects (e.g., how a higher motor rpm is associated with a more intense engine sound).
3.4 Study 1. Perceptual Evaluation of Sound Complexity
In Study 1 (a pilot study, previously presented in [31]), we explored how different sets of sounds designed for the expressive movements of a Pepper robot could influence the perception of emotional intentions. A perceptual rating experiment was carried out with the two sound models presented in Section 3.2.1: one based on sawtooth waveforms and a more complex one based on feedback chains.
Stimuli for Study 1
The stimuli for Study 1 were prepared using video recordings of a Pepper robot, in which the sounds inherently produced by the moving parts of the robot were preserved and mixed with the new sound synthesis. Four gestures were recorded showing the robot's expression of emotions, adapted from the emotional postures previously defined in [16]. The four emotions (anger, excitement, relaxation, and sadness; see Figure 3) were chosen to cover each of the four quadrants of the circumplex space of emotions [41].
Three audio strategies were implemented for each gesture: Sound-S1, Sound-S2, and Sound-S3, as described in Table 1. The stimuli were further expanded into video-only (V), audio-only (S1, S2, S3), and audio-video (VS1, VS2, VS3) versions, for a total of 28 stimuli.
Procedure for Study 1
The study was conducted through an online survey with participants recruited among students and colleagues at KTH Royal Institute of Technology. Participants were asked to rate each of the 28 stimuli on four semantic differential scales corresponding to the four emotions, and on five semantic opposites describing other qualities of the stimuli (pleasantness, trustworthiness, typicality, efficiency, and likeability). The five qualities were inspired by a previous study in which researchers identified properties relevant for describing vacuum cleaner sounds [28] and by studies on trustworthiness in HRI [20, 43].
While the emotional ratings and two of the quality ratings (pleasantness and trustworthiness) targeted the robot (i.e., to measure the effect of sound on its perception), we were also interested in assessing the sounds themselves in terms of their typicality, efficiency, and likeability. To mitigate confounding variables, the questions were worded explicitly (e.g., "How pleasant is the robot?" and "How typical is the sound?") and grouped accordingly. Participants started with the video-only (V) stimuli, for which the questions on typicality, efficiency, and likeability were omitted. Participants then continued with the other stimuli groups (S1, S2, S3, VS1, VS2, and VS3) in randomized order. The rating scales ranged from "not at all" (0) to "very much" (5.0) with a step size of 0.1 (e.g., from "not at all sad" to "very much sad" and from "unpleasant" to its opposite "pleasant").
Results for Study 1
A total of 17 participants (6 F, 11 M), aged between 23 and 43 years old (median age 29), completed the experiment. Concerning the participants’ musical background, two participants had no experience, two had little experience, four had some experience, three defined themselves as semi-professional, and six were experts. Statistical analysis shows that the participants’ backgrounds do not affect the results.
Perceptual ratings of emotions for each emotion stimulus category (i.e., how well each emotion is perceived) are shown in Figure 4. The results show that sounds can clarify the communication of a visual gesture. In the perception of the anger stimuli shown in Figure 4(a), while the video-only anger stimulus (V) is mistakenly perceived as excitement, all three audio-video stimuli (VS1, VS2, and VS3) are correctly perceived as anger.
In some conditions, our sound models improve the perception of emotion when the robot is not visible (i.e., audio-only stimuli). This can be observed in Figure 4(b), where S2 performs better than S1 in communicating excitement, and in Figure 4(d), where both S2 and S3 perform better than S1 in communicating sadness. It is also possible that Sound-S2 is more suitable for communicating the robot's sadness, and it can help in communicating excitement as well. Figure 4(d) shows that for the sadness stimuli, S2 and VS2 were correctly rated as sad; an ANOVA shows that the effect of the sound * video interaction is significant, F(3, 48) = 13.48, p < 0.001. Meanwhile, for the relaxation stimuli (Figure 4(c)), both S2 and VS2 were also rated as sad; the sound * video interaction is again significant, F(3, 48) = 14.29, p < 0.001. Figure 4(b) shows that for the excitement stimuli, S2 was clearly identified as excited even without the video; here too the sound * video interaction is significant, F(3, 48) = 24.74, p < 0.001.
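For reference, a repeated-measures ANOVA of this kind (testing the sound * video interaction) could be computed as sketched below; the data file, column names, and long-format layout are hypothetical placeholders, not the study data.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# long format: one rating per participant x sound condition x video condition
df = pd.read_csv("study1_ratings_long.csv")   # hypothetical file
res = AnovaRM(df, depvar="rating", subject="participant",
              within=["sound", "video"]).fit()
print(res)   # reports F and p for sound, video, and the sound:video interaction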
The perceptual ratings concerning the qualities of the stimuli are shown in Figure 5. Figure 5(a) shows the perceptual ratings of the robot in terms of the pleasantness of the gesture and trustworthiness, and Figure 5(b) displays the perceptual ratings of the sound in terms of typicality, efficiency, and likeability. The results show that Sound-S3 improves the pleasantness of the robot's gesture, as both S3 and VS3 are rated higher than their counterparts. Stimulus S3 is also rated as more typical and likable than S1 and S2, while VS3 is rated as more typical and likable than VS2.
To summarize, Study 1 shows that while a simple sound design (e.g., using sawtooth waveforms) can be more effective in communicating some emotions, a more complex sound improves the pleasantness of the gesture. The complex sound was also rated as more typical and likable. Study 1 also suggests that, in general, sounds can improve the communication of a robot's emotional gesture, both when the robot is not visible and when the visual communication is ambiguous.
3.5 Study 2. Museum Installation Experiment
This study aims to evaluate robot sounds with casual participants in an everyday context, verifying how the synthesized robot movement sounds are perceived in a noisy environment. We therefore chose the Swedish National Museum of Science and Technology (Tekniska) in Stockholm as the location for Study 2.
The experiments ran for 4 days during a school holiday week in the fall of 2021, during which the museum was visited daily by thousands of people. Museum visitors typically consisted of families (children with their parents, guardians, or grandparents) or groups of teenagers and young adults. The experiment concerning sound complexity was conducted on days 1 and 2, while the experiment concerning the sound materials was conducted on days 3 and 4. The setup for both experiments was the same, except for the sound models used.
Setup for Study 2
The experiments were presented as an exhibition in a large, dimly lit room, with a Pepper robot positioned in front of a large projection screen wall (see Figure 6(b)). Museum visitors could enter the room at any time through the doors on both sides of the projection wall. The projection wall showed photos of the location where the interaction scenarios occur, complemented by the ambient sound of the location coming from four speakers on the ceiling. Two additional speakers (ESI aktiv 05) for the synthesized robot sound were placed behind the robot. A touchscreen station was available in front of the robot, where museum visitors could initiate Pepper's action in the experiment and answer some questions afterward. A diagram of the technical setup is pictured in Figure 6(a).
While information about the experiment was provided through signs at the museum entrance, many participants encountered it by chance as they explored the museum. Even though the experimental setup was designed to run independently without an operator, an assistant was always present near the robot to provide information and invite visitors to participate. Many visitors interacted with the exhibition, but the analysis presented in this study includes only those who agreed to participate by signing a consent form.
A Voltcraft SL-400 sound level meter was used throughout the day to measure the background noise level. The background noise level when the museum was empty (i.e., in the morning and at the end of the day) was 45 dB SPL(A), while during peak hours (i.e., around 13:00 to 15:00) it was between 60 and 70 dB SPL(A). For reference, 60 dB SPL(A) is the average sound level of normal conversation and 70 dB SPL(A) is the average sound level of a washing machine or a dishwasher. One particularly loud distraction was the museum's PA (Public Address) system; when announcements were broadcast, participants were asked to pause and wait for them to finish.
Stimuli for Study 2
Two scenarios were selected as the background set at the museum to simulate an in-the-wild encounter with a robot in a public space: (1) the Pepper robot standing next to an information desk in a shopping mall and (2) the Pepper robot acting as a receptionist in a restaurant. The scenarios were complemented by corresponding background photos and ambient sounds. For the shopping mall scenario, the ambient sounds were distant music, footsteps, and conversation in a vast reverberant space; for the restaurant scenario, the ambient sounds were the conversation of patrons overlaid with the occasional clanging of utensils and dishes. Each scenario consisted of four gestures, with each gesture mapped to the sound models explained in Section 3.2. For details of the gestures, see Table 2. In total, 48 stimuli were prepared (two experiments \(\times\) two scenarios \(\times\) four gestures \(\times\) three sound models) and grouped into sets of stimuli. One set of stimuli consisted of one gesture in one scenario repeated three times with three different sound models. All the sounds were real-time sonifications of the robot gestures, realized using the sound models described in Sections 3.2.1 and 3.2.2.
The experiment duration per participant was designed to be flexible: participants could choose to complete only one set of stimuli (randomized for each participant) or more. This flexibility accommodated the spontaneous nature of participation in this study and, at the same time, reinforced the framing of a chance encounter with a robot in a public space.
Procedure for Study 2
When museum visitors approached the exhibition, an assistant informed them of the experiment. If the visitors agreed to participate, they signed a printed consent form and filled in a demographic survey (see Section 3.1 concerning compliance with ethics standards).
The experiment started by displaying three buttons on a touchscreen (Figure 7(a)) and projecting one of the two scenarios on the large wall behind the robot. The three buttons triggered the same gesture of the Pepper robot but with different real-time sonifications. The sound models used in the sonification were randomly assigned to the three buttons, and the presented gesture was also randomly selected. After experiencing the three stimuli triggered by the three buttons, participants were presented with questions on the screen (Figure 7(b)). Listening to one set of stimuli and answering the related questions took 1–2 minutes. After responding to the questions, participants could choose to terminate the experiment or continue with the next set of stimuli. Most participants responded to only one or two sets of stimuli, with only a few responding to more.
Results for Study 2
The results concerning the sound complexity are presented first, followed by the results concerning sound material.
The experiment concerning sound complexity garnered responses from 46 participants (21 F, 24 M, 1 did not specify) aged between 6 and 72 years old (average age 22.6, median 13, indicating a prevalence of families with children). Participants were of several nationalities, with a substantial prevalence of people from Sweden (39) and one each from the Netherlands, Germany, Bosnia, Italy, Syria, India, and Ukraine. Some participants stated that they had interacted with robots in the past: two had already interacted with a Pepper robot, two with a telepresence robot, three with vacuum cleaning robots, and three mentioned toy robots. Concerning their musical background, 23 participants had no experience, 21 had some experience, and two defined themselves as semi-professional. Statistical analysis shows that the participants' backgrounds do not affect the results.
During the experiments, we noticed that several groups of visitors answered the questions as a group (i.e., discussing with each other) even though only one member was registered as the participant.
The results concerning sound complexity are shown in Table 3. All responses were analyzed using the chi-square goodness-of-fit test (\(\chi ^2\)) to determine whether the response distribution deviates significantly from an equal distribution. For each test, the null hypothesis assumes no significant difference between the observed values and the expected values of an equal distribution (\(\alpha = 0.05\)).
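A goodness-of-fit test of this kind can be computed as in the following sketch; the vote counts are invented placeholders, not the data in Table 3.

from scipy.stats import chisquare

votes = [24, 12, 10]          # e.g., preference counts for Sound-S1, S2, S3
stat, p = chisquare(votes)    # expected frequencies default to an equal split
print(f"chi2({len(votes) - 1}) = {stat:.2f}, p = {p:.3f}")
if p < 0.05:
    print("reject the null hypothesis of an equal distribution")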
Overall, the results show that Sound-S1 (i.e., no additional synthesized sound) was preferred the most. However, regarding suitability, Sound-S1 and Sound-S2 received the same number of responses. The analysis also shows no significant difference between the mall and restaurant scenarios.
The second experiment, concerning the sound material, garnered responses from 53 participants (26 F, 26 M, 1 non-binary) aged between 7 and 75 years old (average age 27.2, median 25). There was a prevalence of participants from Sweden (36), with six from Indonesia, three from Turkey, two from Italy, and one each from Russia, Serbia, Panama, Chile, Greece, and Iran. Fourteen participants stated that they had previously interacted with robots: three mentioned many different robots, three a NAO or Pepper robot, three Furhat, two Lego Mindstorms robots, two vacuum cleaning robots, and one Asimo. Concerning their musical background, 14 participants had no experience, 30 had some experience, and 9 defined themselves as semi-professional. Statistical analysis shows that the participants' backgrounds do not affect the results.
The results concerning the sound material are shown in Table 4. The results show that Sound-M3 was preferred the most and was deemed the most suitable in the restaurant scenario. However, in the shopping mall scenario, Sound-M2 and Sound-M3 received the same number of votes, suggesting an effect of the environment on the suitability of the sound material. A more detailed analysis also shows that when the robot's voice was not present (i.e., no speech, only the waving movement of the robot), Sound-M1 was preferred the most, as shown in Figure 8. The difference in the results associated with the presence of robot speech suggests a difference in preferences for the sound material. Sound-M3 was designed to highlight the arm displacements with whoosh sounds, while Sound-M1 focuses on the robot's internal mechanisms with motor sounds. A possible interpretation is that the subtler whoosh sounds are preferred when the robot is speaking, while the more pronounced motor sounds are preferred without speech. This finding also suggests the need to properly time synthesized gesture sounds so that they do not mask the robot's speech.
To summarize, Study 2 shows that, concerning sound complexity, there are differences between subjective preferences and perceived suitability, and that these differences are not affected by the interaction scenario. Study 2 also shows that a particular sound material can be preferred and deemed more suitable in some scenarios, highlighting the importance of choosing a suitable sound material for a specific purpose.
During the exhibition in the museum, randomly selected participants were asked for comments by the authors immediately after finishing the experiment. Some participants stated that they had not noticed the presence of the movement sonification before being informed by the experiment assistant. A possible cause is that the museum crowd can at times mask the added sound, or that there are too many distracting noises. It may also be that participants simply ignore the sounds made by robot gestures, similar to how people typically do not notice the Foley sounds associated with actors' movements in films. In order to investigate these hypotheses, we designed Study 3.
3.6 Study 3. Perceptual Evaluation of Sound Material
This study aims to investigate preferences toward the stimuli of Study 2 when they are presented in a video format (i.e., through an online survey), without the environmental sounds present at the museum during busy visiting hours.
Stimuli and Procedure for Study 3
Video stimuli for Study 3 were created by recording Pepper's gestures from Study 2 together with the sound models described in Table 1. The recording took place in a \(4\times 4\) meter room in our lab. Specifically, the gestures used were the robot (i) waving and bowing, (ii) welcoming patrons in a restaurant, and (iii) greeting visitors in a shopping mall (see Table 2), with the ambient sound included. In total, six sets of stimuli were prepared, each containing three videos of the same gesture with different sound models. A screen capture from the video stimuli can be seen in Figure 9(a).
The study was conducted through an online survey. Each survey page contained three videos from one set of stimuli, which the participants could watch in any order. After observing the stimuli, participants were asked to select the sound they liked the most (see Figure 9(b)). At the end of the survey, participants were given the opportunity to leave comments regarding their choices.
Participants in this study were recruited from the authors' colleagues and personal networks. Several participants had also participated in Study 1; however, none of them had participated in Study 2.
Results for Study 3
A total of 21 participants (12 F, 9 M) aged between 23 and 49 years old (average age 31.6, median 28) completed the survey. The nationality distribution was nine from Indonesia, five from Sweden, two from Turkey, and one each from India, Italy, France, Germany, and China. Concerning their musical background, 5 participants had no experience, 12 had little or some experience, 3 defined themselves as semi-professional, and 2 defined themselves as experts or full professionals. Statistical analysis shows that the participants' backgrounds do not affect the results.
Similar to Study 2, responses to each question were analyzed using the chi-square goodness-of-fit test (\(\chi ^2\)) where the null hypothesis assumes no significant difference between the observed value and the expected value of the equal distribution (\(\alpha = 0.05\)).
The results are shown in Table 5. Regarding sound complexity, Sound-S1 (i.e., no additional sound) was liked the most regardless of the presence of speech. Concerning the sound material, Sound-M2 was liked significantly more than the others when Pepper was speaking (in both the restaurant and the mall scenarios). However, when there was no speech (i.e., only the sound generated by the robot's movement), Sound-M1 and Sound-M2 were almost equally preferred. Comments from participants reveal that they preferred subtler or more natural sounds and that they felt distracted by the additional sounds played while the robot was speaking.
To summarize, Study 3 shows contrasting results compared to the previous studies (i.e., Sound-S3 was rated as the most typical and likable in Study 1, and Sound-M3 was preferred the most in Study 2). Further discussion of these results is presented in the next section.
4 Discussion
Our interaction with the world is naturally multimodal. Concerning the perception of motion, it is well established that the presence of sound can improve or even change the perception of the same visual event. In [45], Sekuler has shown that when two identical visual targets moving toward each other can be visually perceived either to bounce off or to pass through each other, adding a brief click at the right moment ensures that the motion is strongly perceived as bouncing. Similarly, our Study 1 has shown that sound can clarify the communication of a gesture. This finding further strengthens the importance of good sound design in multimodal interaction with a robot, all the more so when the robot's movements are hindered by its limited movement capabilities. In this article, we aim to contribute to the discussion of good sound design practices by investigating the perception of, and preference for, the complexity and materiality of robot sounds produced through movement sonification.
Concerning sound complexity, the results from Study 1 show that participants preferred sounds developed with a focus on their intrinsic sound quality: Sound-S3 was rated as more typical and likable, and it improved the pleasantness of the robot's gesture. At the same time, the outcomes of Study 1 also show that sounds produced with a sawtooth waveform (Sound-S2) are effective in communicating the robot's sadness. These results suggest prominent differences between subjective preferences and how effective the sounds are in communicating emotion. These differences are also evident in Study 2 and Study 3, where the original sound of the Pepper robot (Sound-S1) was preferred the most, while Sound-S1 and Sound-S2 were rated as equally suitable in Study 2.
A possible explanation is the nature of the interaction. In Study 2 and Study 3, the Pepper robot is displayed in a formal setting, employed (i.e., "working") at a restaurant or a shopping mall. In contrast, in Study 1, Pepper shows expressive gestures casually. This interpretation suggests that subtler sounds might be preferred in a formal setting to avoid distraction, while a casual setting opens up for more expression of emotion. There is no doubt that simple sound designs such as sawtooth waveforms or beeps can be used in our interactions with robots, as these sounds can be very functional. However, as the appearance of robots becomes more sophisticated and robots become further integrated into our daily lives, employing a more complex design focused on sound quality would ensure a more pleasant interaction.
Concerning materiality, there is a clear difference between Study 2 and Study 3: Sound-M3 (i.e., the whoosh sound) was preferred the most in Study 2, while Sound-M2 (i.e., the metallic sound) was preferred the most in Study 3. A plausible explanation for this difference is the effect of ambient noise, i.e., how the museum crowd noise masked the movement sonification, as can be observed in the spectrograms plotted in Figure 10. The spectrogram on the right shows a stimulus used in Study 3, recorded in our lab; the speech ("Hello, my name is Pepper. How can I help you today?") stands out in a lower frequency band, while the robot sound blended with the movement sonification is visible at higher frequencies with an observable rhythm (see the short pauses toward the end of the spectrogram). The spectrogram on the left, a recording of the same stimulus in the museum (i.e., what participants would have heard, with the ambient sound included), shows noise in the lower frequency range muddling the robot's speech, while at higher frequencies the movement sonification is much more reverberant. This observation is supported by participants' comments that they preferred subtler sounds that blend well with the ambient sound (i.e., are less distracting).
Aside from the preference for subtle sounds, another descriptor mentioned by the participants regarding their preferences was "natural sound." Unfortunately, the sound naturally generated by a robot's mechanical actuation has been regarded as noise and has been found to negatively interfere with interactions [18, 23, 34, 47]. Regardless, any masking or alteration strategy should incorporate sounds that can communicate the characteristics of the robot. Our previous study demonstrated that a robot's sonic presence in sci-fi films communicates the visual characteristics of the robot, related to both its physical appearance and its movement [30]. As many studies of sound in HRI have mentioned influences from sci-fi films (e.g., [6, 21, 24, 39, 44]), it is plausible that these films have also shaped general expectations of what a natural robot sound is.
Another observation made in Study 1 is that some sounds improve the perception of emotion when the robot is not visible. This observation supports previous findings in [17] suggesting that sonification can improve communication through robot sounds in auditory-only conditions. In real-world situations, this will be beneficial where robots and humans share a space (e.g., in offices) and interactions can happen without facing the robot.
5 Conclusions
In this article, we have presented three studies focusing on aesthetic strategies for robot sound, specifically on the complexity and materiality of sound in movement sonification. Two sets of sound models were developed: one to investigate the perception of synthesized robot sounds depending on their complexity, and another to investigate the "materiality" of sound.
The first study explored how the first set of sound models can influence the perception of the expressive gestures of a Pepper robot. Through an online survey, 17 participants rated a set of video stimuli of a Pepper robot in terms of perceived affective states. In the second study, an experiment was carried out in a museum installation with a Pepper robot presented in two scenarios: (1) welcoming patrons to a restaurant and (2) providing information to visitors in a shopping center. The two sets of sound models were used to complement the robot's gestures. A total of 99 museum visitors participated by choosing their preferred sound for the different scenarios: 46 participants for the first set of sound models and 53 for the second. In the final study, 21 participants responded to an online survey with stimuli similar to those used in the second study. The study aimed to compare participants' preferences for sound aesthetics when watching online video stimuli with those made by visitors at the museum installation.
Results from our studies suggest that while there were differences between subjective preferences and how effective the sounds are in communicating emotion, it was possible to identify a preference for more refined sound models in the sonification of robot movements. Regarding materiality, participants preferred subtle sounds that blend well with ambient sounds (i.e., are less distracting) and natural sounds whose source matches the visual characteristics of the robot, related to both its physical appearance and its movement.
Regarding the feasibility of a real-time sound synthesis strategy for a robot in a real-world acoustic environment, the museum exhibition in Study 2 offers some insights. First, the system we developed was robust enough to be deployed for 4 days in the museum. Second, there are limitations: (a) the system did not time the synthesized gesture sounds so as to avoid masking the robot's speech, and (b) the ambient noise level was an issue that could be compensated for by dynamically adjusting the level of the synthesized sounds. Thus, to further utilize movement sonification as a strategy to compensate for limitations in robot communicative channels, the robot's intention and environmental awareness have to be taken into consideration.