Abstract
Line-of-sight behaviors such as gaze and eye contact play an important role in enhancing embodied interaction and communication through avatars, and many gaze models and avatar-mediated communication systems that express line-of-sight have been proposed and developed. However, the gaze behaviors generated by these models are not designed to enhance embodied interaction such as activated communication, because the models generate eyeball movements stochastically from observed human gaze behavior. We therefore analyzed the interaction between human gaze behavior and activated communication using line-of-sight measurement devices, and proposed an eye gaze model based on this analysis. In this study, we develop an advanced avatar-mediated communication system in which the proposed eye gaze model is applied to speech-driven embodied entrainment characters called “InterActor.” The system generates the avatar’s eyeball movements, such as gazing and looking away, according to the degree of activated communication, and provides a communication environment in which embodied interaction is promoted. The effectiveness of the system is demonstrated by means of sensory evaluations of 12 pairs of subjects engaged in avatar-mediated communication.
1 Introduction
With the advancements in information technology, it is now possible for humans to communicate in a 3D virtual space over a network using CG characters called avatars. Many studies that support remote communication using CG characters such as avatars and agents have been conducted [1]. However, because current systems express nonverbal behavior through key commands, they do not reproduce the embodied sharing that arises from the synchrony of embodied rhythms, such as nodding and body movements, in human face-to-face communication. In human face-to-face communication, not only verbal messages but also nonverbal behaviors such as nodding, body movement, line-of-sight and facial expression are rhythmically related and mutually synchronized between talkers [2]. This synchrony of embodied rhythms is called entrainment; it unconsciously enhances the sharing of embodiment and empathy in human interaction and accelerates activated communication, in which nonverbal behaviors such as body movements and speech activity increase and the embodied interaction is activated [3].
In our previous work, we analyzed the entrainment between a speaker’s speech and a listener’s nodding motion in face-to-face communication, and developed iRT (InterRobot Technology), which generates a variety of communicative actions and movements, such as nodding, blinking, and movements of the head, arms, and waist, that are coherently related to voice input [4]. In addition, we developed an interactive CG character called “InterActor,” which has the functions of both speaker and listener, and demonstrated that InterActor can effectively support human interaction and communication [4]. Moreover, we developed an estimation model of interaction-activated communication based on the heat conduction equation and demonstrated its effectiveness through an evaluation experiment [5].
On the other hand, line-of-sight behaviors such as eye contact and gaze duration, as well as body movements, play an important role in smooth human face-to-face communication [6]. Moreover, it has been reported that expressing an avatar’s gaze enables smooth communication via avatars. For example, Ishii et al. developed a communication system that controls an avatar’s gaze based on an estimated line-of-sight model, and demonstrated that the model facilitates utterances between talkers in avatar-mediated communication [7]. We also analyzed human eyeball movement through avatars using an embodied virtual communication system with a line-of-sight measurement device, and proposed an eyeball movement model consisting of an eyeball delay movement model and a gaze withdrawal model [8]. In addition, we developed an advanced avatar-mediated communication system by applying the proposed eyeball movement model to InterActors, and demonstrated that it effectively supports embodied interaction and communication. These systems generate the avatar’s eyeball movement with a statistical model based on the characteristics of face-to-face communication. However, such systems can hardly enhance line-of-sight interaction, because the dynamic characteristics of human line-of-sight during activated communication were not taken into account in their design. Therefore, in our previous research, we analyzed the interaction between activated communication and human gaze behavior using a line-of-sight measurement device [8]. On the basis of this analysis, we proposed an eye gaze model consisting of an eyeball delay movement model and a look away model.
In this paper, we develop an advanced avatar-mediated communication system by applying the proposed eye gaze model to InterActors. This system generates the avatar’s eyeball movements, such as gazing and looking away, based on the proposed model using only speech input, and provides a communication environment in which embodied interaction is promoted. The effectiveness of the proposed model and the communication system is demonstrated by means of sensory evaluations in avatar-mediated communication.
2 A Speech-Driven Embodied Communication System Based on an Eye Gaze Model
2.1 InterActor
In order to support human interaction and communication, we developed a speech-driven embodied entrainment character called InterActor, which has the functions of both speaker and listener [4]. The configuration of InterActor is shown in Fig. 1. InterActor has a virtual skeleton structure consisting of the head, eyes, mouth, neck, shoulders, elbows, and hands (Fig. 1(a)). A texture is mapped onto the 3D surface model that contains the virtual skeleton structure (Fig. 1(b)). In addition, various facial expressions are realized by applying the smile model developed in our previous research (Fig. 1(c)) [9, 10].
The listener’s interaction model includes a nodding reaction model, which estimates the nodding timing from the ON–OFF pattern of speech, and a body reaction model linked to the nodding reaction model [4]. The timing of nodding is predicted using a hierarchical model consisting of two stages, macro and micro (Fig. 2). The macro stage estimates whether a nodding response exists in a duration unit, which consists of a talkspurt episode T(i) and the following silence episode S(i) with a hangover value of 4/30 s. The estimator M_u(i) is a moving-average (MA) model, expressed as the weighted sum of the unit speech activity R(i) in Eqs. (1) and (2). A nod is triggered when M_u(i) exceeds a threshold value. In the micro stage, the nodding motion M(i) is estimated by another MA model, the weighted sum of the binary speech signal V(i) in Eq. (3).
\( M_{u}(i) = \sum_{j=1}^{J} a(j)R(i-j) + u(i) \)  (1)

\( R(i) = \dfrac{T(i)}{T(i) + S(i)} \)  (2)

\( M(i) = \sum_{j=1}^{K} b(j)V(i-j) + w(i) \)  (3)

where

- a(j): linear prediction coefficient
- T(i): talkspurt duration in the i-th duration unit
- S(i): silence duration in the i-th duration unit
- u(i): noise
- i: frame number
- b(j): linear prediction coefficient
- V(i): voice (binary speech signal)
- w(i): noise
Body movements are also driven by the speech input: the neck and one of the wrists, elbows, arms, or waist is operated when a body-movement threshold is exceeded. This threshold is set lower than the nodding threshold of the MA model, which is expressed as the weighted sum of the binary speech signal. In other words, when InterActor functions as a listener, the relationship between nodding and the other body movements is governed by the threshold values of the nodding estimation.
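The macro-stage estimation and the two-threshold triggering described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the coefficients a(j) and the two threshold values are hypothetical placeholders, and the noise term u(i) of Eq. (1) is omitted for clarity.

```python
def estimate_nodding(T, S, a, nod_threshold, body_threshold):
    """Macro-stage sketch of the MA nodding estimator (Eqs. (1)-(2)).

    T, S : talkspurt / silence durations per duration unit
    a    : linear prediction coefficients a(j) (hypothetical values)
    The noise term u(i) is omitted for clarity.
    """
    R = [t / (t + s) for t, s in zip(T, S)]      # unit speech activity, Eq. (2)
    J = len(a)
    nods, body_moves = [], []
    for i in range(len(R)):
        # M_u(i) = sum_j a(j) R(i-j), Eq. (1) without the noise term
        window = R[max(0, i - J):i][::-1]        # R(i-1), R(i-2), ...
        Mu = sum(aj * rj for aj, rj in zip(a, window))
        nods.append(Mu > nod_threshold)
        # body movements use a threshold lower than the nodding threshold,
        # so body motions occur at least as often as nods
        body_moves.append(Mu > body_threshold)
    return nods, body_moves
```

Because the body-movement threshold is lower, every unit that triggers a nod also triggers a body movement, which reproduces the dependency between nodding and the other movements described above.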
2.2 Eye Gaze Model
We propose an eye gaze model that generates gaze and looking away movements to enhance embodied communication, based on the characteristics obtained from the analysis of human eyeball movement. The proposed model consists of the eyeball delay movement model from our previous work [8] and a look away model. The outline of the proposed model is as follows:
(1) Eyeball Delay Movement Model
The eyeball delay movement model delays the eyeball movement by 0.13 s with respect to the avatar’s head movement. First, the rotation angle of the avatar’s gaze direction toward the viewpoint in virtual space is calculated using Eq. (4) (Fig. 3(a)). Then, the avatar’s gaze is generated by adding the angle of the avatar’s head movement to the gaze-direction angle of the fourth previous frame, at a frame rate of 30 fps (Eq. (5)). Figure 3(b) shows an example of the eyeball delay movement model in an avatar. When the avatar’s head moves, the eyeball moves in the opposite direction with a delay of 0.13 s with respect to the head movement.
\( \theta_{AG} = \tan^{-1} \dfrac{P_{y} - A_{Ey}}{P_{x} - A_{Ex}} \)  (4)

\( \theta_{G}(i) = \theta_{AG}(i-4) + \theta_{AH}(i) \)  (5)

where

- θ_AG: rotation angle of the gaze direction
- A_Ex, A_Ey: eyeball position of InterActor
- P_x, P_y: position of the viewpoint in virtual space
- θ_G(i): rotation angle of the eyeball movement
- θ_AH(i): rotation angle of InterActor’s head movement
- i: frame number
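A minimal sketch of the delay mechanism is shown below. It assumes a 4-frame ring buffer for the 0.13 s delay at 30 fps, and uses `math.atan2` as a robust stand-in for the arctangent of Eq. (4); the class and function names are our own, not from the paper.

```python
import math
from collections import deque

DELAY_FRAMES = 4  # 0.13 s at 30 fps

def gaze_direction(eye_x, eye_y, view_x, view_y):
    """Eq. (4) sketch: rotation angle from the eyeball to the viewpoint."""
    return math.atan2(view_y - eye_y, view_x - eye_x)

class DelayedEyeball:
    """Eq. (5) sketch: eyeball angle = gaze direction of the 4th previous
    frame plus the current head rotation angle."""
    def __init__(self):
        # holds theta_AG of the last DELAY_FRAMES frames, oldest first
        self.history = deque([0.0] * DELAY_FRAMES, maxlen=DELAY_FRAMES)

    def update(self, theta_ag, theta_ah):
        delayed = self.history[0]        # theta_AG(i - 4)
        self.history.append(theta_ag)    # store theta_AG(i) for later frames
        return delayed + theta_ah        # theta_G(i)
```

A step in the gaze direction only appears in the output four frames later, so while the head turns immediately, the eyeball appears to lag and move relative to the head in the opposite direction, as described above.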
(2) Look Away Model
Our previous analysis of human eyeball movement indicates that direct gaze occupies only about 80% of the total conversation time [8]. Therefore, the look away model in this study generates other eyeball movements, such as gaze withdrawal and blinking, based on that analysis. In a looking away movement, the avatar’s eyeballs are moved widely in the horizontal direction (Fig. 4); the effectiveness of this movement was confirmed in a preliminary experiment. When the estimated degree of interaction-activated communication falls below a threshold value, the model generates the looking away movement (Fig. 5). The avatar’s gaze is thereby modulated so that staring is prevented and impressions of the conversation, such as unification and vividness, are enhanced.
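The switching rule of the look away model can be summarized in a few lines. The threshold value below is a hypothetical placeholder; in the actual system, the activation estimate would come from the heat-conduction-based model of interaction-activated communication [5].

```python
def select_eye_movement(activation_estimate, threshold=0.3):
    """Sketch of the look away model's switching rule.

    activation_estimate: estimated degree of interaction-activated
    communication; the threshold value here is a hypothetical placeholder.
    Returns the eye movement the avatar should generate in this frame.
    """
    if activation_estimate < threshold:
        # communication is less activated: avert the gaze horizontally
        return "look_away"
    # otherwise keep direct gaze (about 80% of conversation time)
    return "direct_gaze"
```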
2.3 Developed System
We developed an advanced communication system in which the proposed model is used with InterActors (Fig. 6). The virtual space was generated using the Microsoft DirectX 9.0 SDK (June 2010) on a Windows 7 workstation (CPU: Core i7, 2.93 GHz; memory: 8 GB; graphics: NVIDIA GeForce GTS 250). The voice was sampled at 11 kHz with 16-bit resolution via a headset (Logicool H330). InterActors were rendered at a frame rate of 30 fps.
When Talker1 speaks to Talker2, InterActor2 responds to Talker1’s utterance with appropriate timing through body movements, including nodding, blinking, and other actions, in a manner similar to the body motions of a listener. A nodding movement is defined as a falling-rising motion in the front–back direction at a speed of 0.15 rad/frame. In addition, InterActor2 generates eyeball movements based on the proposed model. Here, a looking away movement is defined as a left–right motion of the eyeballs at a speed of 0.15 rad/frame, based on the preliminary experiment. InterActor1 likewise generates communicative actions, movements, and eyeball movements as a speaker, using the MA model and the eye gaze model. In this manner, two remote talkers can enjoy a conversation via InterActors within a communication environment in which the sense of unity is shared through embodied entrainment.
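The falling-rising nodding motion can be generated as a simple angle trajectory. The 0.15 rad/frame speed is the paper's value; the nod amplitude is a hypothetical parameter that the paper does not specify.

```python
NOD_SPEED = 0.15  # rad/frame, front-back direction (value from the paper)

def nod_trajectory(amplitude, speed=NOD_SPEED):
    """Head pitch angles for one falling-rising nodding movement (a sketch;
    the amplitude is a hypothetical parameter, not given in the paper)."""
    angles, theta = [], 0.0
    while theta < amplitude - 1e-9:          # falling phase
        theta = min(amplitude, theta + speed)
        angles.append(theta)
    while theta > 1e-9:                      # rising phase back to neutral
        theta = max(0.0, theta - speed)
        angles.append(theta)
    return angles
```

At 30 fps, an amplitude of three speed steps yields a nod lasting six frames (0.2 s), which gives a feel for how quickly the entrained motions play out.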
3 Communication Experiment
A communication experiment was carried out to evaluate the developed system.
3.1 Experimental Method
The experiment involved talkers engaged in free conversation. The following three modes were compared: mode (A), with neither eyeball movement nor facial expression; mode (B), with the smile model only; and mode (C), with the combined smile model and eye gaze model. We recorded the experiment scene using two video cameras and screens, as shown in Fig. 7. The subjects were 12 pairs of talkers (12 males and 12 females).
The experimental procedure was as follows. First, the subjects used the system for around 3 min. Next, they were instructed to perform a paired comparison of the modes, selecting the better mode based on their preferences. Finally, they talked in a free conversation for 3 min in each mode. The questionnaire used a seven-point bipolar rating scale from −3 (not at all) to 3 (extremely), where a score of 0 denotes “moderately.” The conversational topics were not specified. The modes were presented to each pair of talkers in a random order.
3.2 Result
The results of the paired comparison are summarized in Table 1, which shows the number of wins for each mode. For example, mode (A) was preferred six times over mode (B), and nine times in total. Figure 8 shows the results calculated from Table 1 based on the Bradley-Terry model given in Eqs. (6) and (7) [11].
\( p_{ij} = \dfrac{\pi_{i}}{\pi_{i} + \pi_{j}} \)  (6)

\( \sum_{i} \pi_{i} = 100 \; (\mathrm{const.}) \)  (7)

where

- π_i: intensity of mode i
- p_ij: probability that mode i is judged better than mode j
The consistency of the mode matching was confirmed by a goodness-of-fit test \( (\chi^{2}(1,0.05) = 3.84 > \chi_{0}^{2} = 0.28) \) and a likelihood ratio test \( (\chi^{2}(1,0.05) = 3.84 > \chi_{0}^{2} = 0.27) \). The proposed mode (C), with both the smile model and the eye gaze model, was evaluated as the best, followed by mode (B), with the smile model only, and mode (A), with no movement.
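The Bradley-Terry intensities π_i of Eqs. (6) and (7) can be estimated from a win table with the standard minorization-maximization (MM) update. The win counts below are illustrative placeholders, not the paper's Table 1.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry intensities pi_i from a win matrix.

    wins[i][j]: number of times mode i was preferred over mode j.
    Uses the standard MM update; intensities are normalized so that
    their sum is 100, matching Eq. (7).
    """
    n = len(wins)
    pi = [1.0] * n
    for _ in range(iters):
        new_pi = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of mode i
            # sum over opponents: comparisons n_ij / (pi_i + pi_j)
            denom = sum((wins[i][j] + wins[j][i]) / (pi[i] + pi[j])
                        for j in range(n) if j != i)
            new_pi.append(w_i / denom if denom else pi[i])
        total = sum(new_pi)
        pi = [100.0 * p / total for p in new_pi]  # Eq. (7)
    return pi
```

Each iteration divides a mode's total wins by its expected win rate under the current intensities, so a mode that wins more pairwise comparisons receives a proportionally larger π_i.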
The questionnaire results are shown in Fig. 9. According to the Friedman test and the Wilcoxon signed-rank test, all categories showed a significance level of 1% among modes (A), (B), and (C). In addition, “Enjoyment,” “Interaction-activated communication,” “Vividness,” and “Natural line-of-sight” showed a significance level of 5% between modes (B) and (C).
In both evaluations, mode (C), with the proposed eye gaze model, was rated the best for avatar-mediated communication. These results demonstrate the effectiveness of the proposed eye gaze model in combination with the smile model.
4 Conclusion
In this paper, we developed an advanced avatar-mediated communication system in which our proposed eye gaze model is used by speech-driven embodied entrainment characters called InterActors. The proposed model consists of an eyeball delay movement model and a look away model. The communication system generates eyeball movements based on this model, together with the entrained head and body motions of the InterActors, using only speech input. Sensory evaluations in avatar-mediated communication showed the effectiveness of the proposed eye gaze model and the communication system.
References
Ishii, K., Taniguchi, Y., Osawa, H., Nakadai, K., Imai, M.: Merging viewpoints of user and avatar in telecommunication using image and sound projector. Trans. Inf. Process. Soc. Jpn. 54(4), 1413–1421 (2013)
Condon, W.S., Sander, L.W.: Neonate movement is synchronized with adult speech. Science 183, 99–101 (1974)
Watanabe, T.: Human-entrained embodied interaction and communication technology. In: Fukuda, S. (ed.) Emotional Engineering, pp. 161–177. Springer, Heidelberg (2011)
Watanabe, T., Okubo, M., Nakashige, M., Danbara, R.: InterActor: speech-driven embodied interactive actor. Int. J. Hum.-Comput. Interact. 17(1), 43–60 (2004)
Sejima, Y., Watanabe, T., Jindai, M.: Development of an interaction-activated communication model based on a heat conduction equation in voice communication. In: Proceedings of the 23rd IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN 2014), pp. 832–837 (2014)
Argyle, M., Dean, J.: Eye-contact, distance and affiliation. Sociometry 28(3), 289–304 (1965)
Ishii, R., Miyajima, T., Fujita, K.: Avatar’s gaze control to facilitate conversation in virtual-space multi-user voice chat system. Trans. Hum. Interface Soc. 10(3), 87–94 (2007)
Sejima, Y., Watanabe, T., Jindai, M.: An embodied communication system using speech-driven embodied entrainment characters with an eyeball movement model. Trans. Jpn. Soc. Mech. Eng. Ser. C 76(762), 340–350 (2010)
Sejima, Y., Ono, K., Yamamoto, M., Ishii, Y., Watanabe, T.: Development of an embodied communication system with line-of-sight model for speech-driven embodied entrainment character. In: Proceedings of the 25th JSME Design and Systems Conference, no. 1110, pp. 1–9 (2015)
Yamamoto, M., Takabayashi, N., Ono, K., Watanabe, T., Ishii, Y.: Development of a nursing communication education support system using nurse-patient embodied avatars with a smile and eyeball movement model. In: Proceedings of the 2014 IEEE/SICE International Symposium on System Integration (SII 2014), pp. 175–180 (2014)
Luce, R.D.: Individual Choice Behavior: A Theoretical Analysis. Wiley, New York (1959)
Acknowledgments
This work was supported by JSPS KAKENHI Grant Numbers JP16K01560, JP26280077.
© 2017 Springer International Publishing AG
Sejima, Y., Ono, K., Watanabe, T. (2017). A Speech-Driven Embodied Communication System Based on an Eye Gaze Model in Interaction-Activated Communication. In: Yamamoto, S. (eds) Human Interface and the Management of Information: Information, Knowledge and Interaction Design. HIMI 2017. Lecture Notes in Computer Science(), vol 10273. Springer, Cham. https://doi.org/10.1007/978-3-319-58521-5_48