
1 Introduction

Face-to-face conversation is an essential form of communication in our daily life. Analyzing human conversation scenes has been acknowledged as an emerging research field that interests both computer science and human science researchers [20]. Compared with dyadic conversation, multiparty conversation is more common and more complex, and research on it has been growing in recent years. Turn-taking is one of the most important aspects studied in conversation analysis [24]. It is a transitional mechanism in the organization of conversation that comes into play when speakers change. It is taken for granted that one person speaks at a time and that people have to take turns speaking. A speaking turn is naturally taken by a person who knows when and how to do so appropriately. Therefore, turn-taking should be viewed as a communication skill that people naturally acquire along with the acquisition of language competence and sociability.

On the other hand, in face-to-face conversations, people exchange information not only through verbal messages but also through nonverbal behaviors, such as eye gaze, head and body gestures, facial expressions, and prosody. Among these nonverbal behaviors, gaze is the most useful for predicting turn-taking [5, 13, 14, 24]. In spoken dialog systems and human-computer interaction systems, predicting where turn-taking (speaker change) occurs is especially important.

In this study, we focus on the relationship between gaze behaviors and turn-taking. We propose a model for predicting turn-taking based on analyzing gaze transition patterns in multiparty conversations. We treat each gaze transition using the concept of the n-gram model: the occurrence of turn-taking does not depend on the whole sequence of gaze movement events, but only on the most recent n gaze movement states of all participants. Thus, we define meaningful gaze labels for the speaker's and listeners' gaze movements and then model each gaze transition pattern as a two-label pattern. After that, we analyze these gaze transition patterns using a corpus of four-participant conversations. The corpus is a multimodal corpus that we collected; it includes gaze behaviors, and its total duration is 240 min. Finally, we build a probabilistic model for predicting turn-taking based on these meaningful gaze transition patterns in multiparty conversation. We compare our models with Ishii's models [12], which use many more gaze transition patterns. The evaluation results demonstrate that the predictions obtained by our models are better than those of Ishii's models, which are considered the state of the art in this field.

The rest of the paper is organized as follows. We begin by reviewing related work in Sect. 2. Then, we introduce our multiparty conversation corpus in detail in Sect. 3. Next, we discuss the prediction model of turn-taking based on analyzing gaze transition patterns and set up experiments to show the effectiveness of our model in Sect. 4. Finally, we conclude the paper and mention future work in the last section.

2 Related Work

Analysis of multiparty conversation has been studied in both human science and computer science for years.

Various facets of turn-taking have been studied in the psycholinguistics, sociolinguistics, and conversation analysis communities for years. Sacks, Schegloff, and Jefferson first proposed a basic model for the organization of turns in conversation [24]. They focused on the notion of turn constructional units, separated by transition relevance places (TRPs) where speaker changes may happen. Based on Sacks' model, subsequent research began to reveal the importance of nonverbal behaviors, including gaze, gesture, and other nonverbal communication channels, in regulating turn-taking in interaction. For instance, Duncan [5] focused on the role of nonverbal behaviors in turn-taking and proposed that turn-taking is mediated by both verbal and nonverbal cues. Wiemann and Knapp [19] surveyed a number of previous studies on turn-taking and performed a quantitative analysis of turn-taking cues in dyadic conversations. Goodwin [8] also discussed various aspects of the relationship between turn-taking and attention. Among the different nonverbal behaviors, gaze shows a strong relationship to turn-taking. Kendon [14] suggested that the speaker's gaze is a "turn-yielding cue", meaning that if a speaker gazes at a listener at the end of an utterance, the speaker is likely to yield the turn to that listener. He also noted that mutual gaze is important in turn-taking. Jokinen et al. [13] reported similar results in multiparty settings.

Related work in computer science mainly concentrates on conversation scene analysis [7, 9, 16, 21, 22] and spoken dialog systems [1, 3, 23, 26,27,28]. These studies aim to automatically describe conversation scenes from participants' multimodal nonverbal behaviors, captured with cameras and microphones, or to develop more comprehensive computational models and architectures for managing turn-taking in both dyadic and multiparty settings. Prediction of turn-taking is a key factor in this research. Methods for predicting turn-taking in multiparty conversations can be divided into two groups: those using speech processing [6, 17, 18, 25] and those using nonverbal behaviors such as gaze [2, 12, 15] and other behaviors [2, 4, 11, 15]. In this study, we focus on the prediction of turn-taking.

3 Multiparty Conversation Corpus

Because conversation analysis calls for turn-taking behaviors among persons in equal relationships, for example between classmates, friends, and colleagues, all conversation participants were college students aged 20-30. Our corpus contains 16 natural four-participant conversations by 8 groups (4 groups of Chinese college students and 4 groups of Japanese college students) with two different conversation topics: "What kind of class is a good class?" and "How did you spend your spring holidays?". Each conversation is 15 min long, and the total duration of the corpus is 240 min.

3.1 Recording Multiparty Conversation

We use camera arrays and microphones to record the multiparty conversations in our corpus. Figure 1 shows the camera layout and a sample scene from a recording. We use two video cameras to record the whole conversation scene in order to capture nonverbal behaviors including gaze, head movement, gestures, and other body movements. We also set up four web cameras on the table in the center, each facing one participant, to capture all participants' facial expressions for future research on other nonverbal behaviors. A voice recorder is also used to obtain the audio of all conversations. The frame rate of all videos in our corpus is 30 Hz.

Fig. 1. Camera layout and sample scene.

3.2 Annotation

We use the EUDICO Linguistic Annotator (ELAN)Footnote 1 to make annotations in our corpus. It is an annotation tool that allows users to create, edit, visualize, and search annotations for video and audio data. Inter-pausal units (IPUs) were automatically identified by referring to the time stamps in the word-segmented transcriptions. A stretch of speech followed by a pause longer than 200 ms was recognized as an IPU in this study. Then, skilled human annotators removed back-channel feedback from the detected IPUs to obtain the IPU candidates. In total, we obtained 2285 IPUs, comprising 549 turn-taking and 1736 turn-keeping candidates, for the whole corpus. Moreover, nine different gaze labels are defined below, and all labels were marked by human annotators in our corpus.
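The IPU segmentation step can be illustrated with a short sketch. The following Python fragment is a minimal illustration, assuming each speaker's word-segmented transcription is available as a list of (start, end) time stamps in seconds; the data format and function name are our own assumptions, not the actual corpus tooling.

```python
# Minimal sketch of IPU segmentation from word-level time stamps for one speaker,
# following the 200 ms pause threshold described above. The (start, end) list
# format is an assumption for illustration only.

PAUSE_THRESHOLD = 0.200  # seconds; a pause longer than this ends an IPU

def segment_ipus(word_times):
    """Group consecutive words into inter-pausal units (IPUs)."""
    ipus = []
    current = None  # [ipu_start, ipu_end]
    for start, end in sorted(word_times):
        if current is None:
            current = [start, end]
        elif start - current[1] > PAUSE_THRESHOLD:
            ipus.append(tuple(current))   # pause too long: close the current IPU
            current = [start, end]
        else:
            current[1] = end              # same IPU: extend it
    if current is not None:
        ipus.append(tuple(current))
    return ipus

# Example: three words with a 0.5 s pause before the last one -> two IPUs
print(segment_ipus([(0.0, 0.4), (0.45, 0.9), (1.4, 1.8)]))
# [(0.0, 0.9), (1.4, 1.8)]
```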

3.3 Coding Gaze Transition Pattern

Because gaze transitions are important for predicting turn-taking, we focus on gaze transition patterns, which are coded as a sequence of gaze direction shifts within an examining time interval. In defining the labels, we consider gaze directions toward other participants (speaker or listeners) and mutual gaze. For ease of understanding and comparison, we use the same symbols as in [12] and define nine gaze labels:

S::

The listener gazes at the current speaker without mutual gaze (the current speaker doesn’t gaze at the listener).

\(S_M\)::

The listener gazes at the current speaker with mutual gaze (the current speaker also gazes at the listener).

\(L(L_1,L_2,L_3)\)::

The participant gazes at a listener without mutual gaze. \(L_1,L_2,\) and \(L_3\) denote different listeners.

\(L_M(L_{1M}, L_{2M},L_{3M})\)::

The participant gazes at a listener with mutual gaze. \(L_{1M},L_{2M},\) and \(L_{3M}\) denote different listeners.

X::

The participant doesn’t gaze at other participants but gazes at other places.

  We found that which listener is being gazed at is not important for turn-taking prediction, so we merge \(L_1,L_2,L_3\) into L and \(L_{1M}, L_{2M},L_{3M}\) into \(L_M\) in this study. This merging reduces the number of labels from 9 to 5, and the number of gaze transition patterns is reduced even further. The experiments will show that the merging helps improve the performance of the prediction models.
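As a rough illustration of the merged label set, the following sketch assigns one of the five labels to a participant given the current speaker and a hypothetical gaze-target annotation (which participant each person is looking at, or None for other places). The input representation and function name are assumptions for illustration only, not the corpus annotation scheme.

```python
# Hypothetical sketch of assigning the merged gaze labels (L1/L2/L3 -> L,
# L1M/L2M/L3M -> LM). `gaze_target` maps each participant to the participant
# they look at (or None); this format is an assumption.

def gaze_label(participant, gaze_target, speaker):
    target = gaze_target.get(participant)          # whom this participant looks at
    if target is None:
        return "X"                                 # gazing at other places
    mutual = gaze_target.get(target) == participant
    if participant != speaker and target == speaker:
        return "S_M" if mutual else "S"            # listener looking at the speaker
    # looking at a listener (merged: L1/L2/L3 are not distinguished)
    return "L_M" if mutual else "L"

gaze = {"P1": None, "P2": "P4", "P3": "P1", "P4": "P2"}
print({p: gaze_label(p, gaze, speaker="P1") for p in gaze})
# {'P1': 'X', 'P2': 'L_M', 'P3': 'S', 'P4': 'L_M'}
```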

The examining time interval for gaze transitions is 1200 ms, covering 1000 ms before and 200 ms after the time when the current IPU ends. We analyze the gaze transitions in this interval and treat the contiguous gaze behavior sequence using the concept of the n-gram model. Note that, unlike in [12], we consider only two gaze labels when modelling the transition patterns. If more than two gaze labels are observed in the time interval, we take the last two gaze labels before and after the time when the current IPU ends. If only one gaze behavior occurs in the examining time interval, we treat it as two repeated gaze labels before and after the time when the current IPU ends. This means our model is based on the concept of a 2-gram model, and all coded transition patterns are two-label patterns.
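The coding rule can be sketched as follows, assuming the chronological sequence of gaze labels observed in the examining interval has already been extracted; the list-of-labels input format is an assumption.

```python
# Sketch of coding a two-label gaze transition pattern for one participant.
# `labels_in_interval` is the chronological sequence of gaze labels observed in
# the 1200 ms examining interval (1000 ms before to 200 ms after the IPU end).

def transition_pattern(labels_in_interval):
    if not labels_in_interval:
        return None                        # no gaze annotation in the interval
    if len(labels_in_interval) == 1:
        lab = labels_in_interval[0]
        return (lab, lab)                  # single label repeated, e.g. X-X
    return tuple(labels_in_interval[-2:])  # keep only the last two labels

print(transition_pattern(["X"]))              # ('X', 'X')
print(transition_pattern(["X", "L_M", "L"]))  # ('L_M', 'L'), cf. P4 in Fig. 2
```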

Figure 2 gives a labelling example showing how gaze transition patterns are constructed by our model and by Ishii's [12]. \(P1\)-\(P4\) denote the four participants in the conversation. As shown in the figure, P1 gazes at other places throughout the examining time interval. P2 gazes at P4 after making eye contact with P1. P3 doesn't gaze at any participant but looks at other places after looking at P1. P4 gazes at P2 and then P3 after gazing at other places. Table 1 shows the coding results of our model and Ishii's. The transition patterns for P3 are the same. For P1, our pattern is the repeated \(X-X\) because we use two-label patterns. For P2, because we merge the different listeners' gaze labels, our pattern is \(S_M-L_{M}\). Finally, our transition pattern for P4 is \(L_{M}-L\), whereas \(X-L_{1M}-L_2\) is obtained in Ishii's work, because only the last two gaze labels are taken into account in our model.

Fig. 2. Labelling process of our and Ishii's models.

Table 1. Coding transition patterns using gaze labels.

4 Prediction Model of Turn-Taking

We first analyze the speaker's and listeners' gaze transition patterns that occur at times of turn-taking and turn-keeping. Then, we introduce the proposed prediction model based on this analysis and finally evaluate its performance.

4.1 Gaze Transition Patterns and Turn-Taking

If the gaze transition patterns that occur frequently in turn-taking differ from those in turn-keeping, the patterns are useful for constructing a prediction model of turn-taking. For the prediction of turn-taking, we use the five merged gaze labels defined previously.

First, we analyze the relation between the speaker's gaze transition patterns and turn-taking. Because the labels S and \(S_M\) do not appear in the speaker's gaze transitions, we have nine patterns in total in this case. The occurrence numbers and frequencies of the nine speaker transition patterns are shown in Fig. 3. We then perform a chi-square test on the frequencies and use the p value, defined as the probability that the deviation of the observed values from the expected values is due to chance alone, to determine whether a gaze transition pattern differs significantly between turn-keeping and turn-taking [10] (a minimal sketch of this test is given after the list below). If \(p<0.01\), the gaze transition pattern is considered significantly different and is favoured for our prediction model. Here, the p values of three patterns, \(X-X\), \(L-L\) and \(L_M-L_M\), are smaller than 0.01. We can draw the following conclusions from the results:

  1.

    The frequency of turn-keeping is significantly higher than that of turn-taking when the \(X-X\) pattern occurs. That means if a speaker keeps looking at other places, the probability of turn-keeping is higher than that of turn-taking.

  2.

    The frequency of turn-taking is significantly higher than that of turn-keeping when the \(L-L\) or \(L_M-L_M\) pattern occurs. That means if a speaker continually looks at a listener, with or without mutual gaze, the probability of turn-taking is higher than that of turn-keeping.
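The significance test mentioned above can be sketched as follows. We assume a 2x2 contingency table of one pattern versus all other patterns, split by turn-taking and turn-keeping; the counts below are illustrative placeholders, not the corpus values reported in Fig. 3.

```python
# Sketch of the chi-square test for one gaze transition pattern, using
# illustrative placeholder counts rather than the corpus values.
from scipy.stats import chi2_contingency

# Rows: (this pattern, all other patterns); columns: (turn-taking, turn-keeping)
observed = [[40, 60],
            [509, 1676]]

chi2, p, dof, expected = chi2_contingency(observed)
if p < 0.01:
    print(f"significant difference (p = {p:.4f}); keep this pattern for the model")
else:
    print(f"not significant (p = {p:.4f})")
```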

Fig. 3. Occurrences and frequencies of speaker's gaze transition patterns.

Similarly, we analyze the relationship between the listeners' gaze transition patterns and turn-taking. There are 25 different gaze transition patterns, and we calculate their occurrence numbers and frequencies. We merge the 13 patterns whose frequency in turn-keeping or turn-taking is less than \(1\%\) into a new pattern class named Others. The occurrence numbers and frequencies of the resulting 13 listener transition patterns are shown in Fig. 4. As in the analysis of the speaker's patterns, we perform a chi-square test on the result and use the p value to determine whether a gaze transition pattern differs between turn-keeping and turn-taking. If \(p<0.01\), the gaze transition pattern is considered significantly different between turn-keeping and turn-taking. Here, the p values of \(S-S\), \(S_M-S_M\), \(X-S\), \(X-S_M\), and Others are smaller than 0.01. We can draw the following conclusions from the results:

  1.

    The frequency of turn-keeping is significantly higher than that of turn-taking when the \(S-S\) or \(S_M-S_M\) pattern occurs. That means if a listener continually gazes at the speaker, with or without mutual gaze, the probability of turn-keeping is higher than that of turn-taking.

  2.

    The frequency of turn-taking is significantly higher than that of turn-keeping when the \(X-S\), \(X-S_M\), or Others pattern occurs. That means if a listener looks at other places and then starts to look at the speaker, with or without mutual gaze, the probability of turn-taking is higher than that of turn-keeping. Because Others includes too many patterns, we do not discuss it further here.

Compared with the patterns obtained in [12], our patterns are fewer but more useful. The conclusions above are also more convincing because they agree with findings on turn-taking and gaze behavior in the human sciences mentioned previously. We will show that the prediction model based on our patterns performs better than those in [12], which are considered the state of the art in this field.

Fig. 4. Occurrences and frequencies of listener's gaze transition patterns.

4.2 Prediction Model

We construct the prediction model of turn-taking using the frequencies of the speaker's and listeners' gaze transition patterns obtained above. In particular, we assume that each participant's gaze transition pattern is independent, and a Naive Bayes classifier is defined as

$$\begin{aligned} P(y\mid x_S, x_{L_1},x_{L_2},x_{L_3} )\varpropto P(x_S\mid y)\cdot \prod _{i=1}^3P(x_{L_i}\mid y)\cdot P(y) \end{aligned}$$
(1)

for prediction. Here, \(y=1\), \(y=0\), and \(x_S\) denote turn-taking, turn-keeping, and the speaker's gaze transition pattern, respectively; \(x_{L_i}\) denotes listener \(L_i\)'s gaze transition pattern (\(i\in \{1,2,3\}\)). \(P(y=1)\) and \(P(y=0)\) are the occurrence probabilities of turn-taking and turn-keeping, respectively; \(P(x_S\mid y=1)\) and \(P(x_S\mid y=0)\) are the conditional probabilities of the speaker's gaze transition pattern in the turn-taking and turn-keeping cases, respectively; \(P(x_{L_i}\mid y=1)\) and \(P(x_{L_i}\mid y=0)\) are the conditional probabilities of listener \(L_i\)'s gaze transition pattern in the turn-taking and turn-keeping cases, respectively. For an IPU to be predicted, we calculate \(P(y=1\mid x_S,x_{L_1},x_{L_2},x_{L_3})\) and \(P(y=0\mid x_S,x_{L_1},x_{L_2},x_{L_3})\); if \(P(y=1\mid x_S,x_{L_1},x_{L_2},x_{L_3}) > P(y=0\mid x_S,x_{L_1},x_{L_2},x_{L_3})\), the prediction result is turn-taking; otherwise, it is turn-keeping.
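A minimal sketch of this decision rule is given below, assuming the prior and conditional probabilities have already been estimated as relative frequencies from training data. The dictionary-based probability tables and function name are illustrative assumptions, not the values learned from our corpus.

```python
# Minimal sketch of the Naive Bayes decision in Eq. (1). The probability tables
# below are illustrative placeholders, not corpus estimates.

def predict_turn_taking(x_s, x_listeners, prior, p_speaker, p_listener):
    """Return 1 (turn-taking) or 0 (turn-keeping) for one IPU."""
    scores = {}
    for y in (0, 1):
        score = prior[y] * p_speaker[y].get(x_s, 1e-6)  # small floor for unseen patterns
        for x_l in x_listeners:                          # three listeners, assumed independent
            score *= p_listener[y].get(x_l, 1e-6)
        scores[y] = score
    return 1 if scores[1] > scores[0] else 0

# Toy probability tables (illustrative only)
prior = {1: 0.24, 0: 0.76}
p_speaker = {1: {("L", "L"): 0.20, ("X", "X"): 0.05},
             0: {("L", "L"): 0.08, ("X", "X"): 0.20}}
p_listener = {1: {("X", "S"): 0.15, ("S", "S"): 0.10},
              0: {("X", "S"): 0.05, ("S", "S"): 0.30}}

print(predict_turn_taking(("L", "L"),
                          [("X", "S"), ("S", "S"), ("X", "S")],
                          prior, p_speaker, p_listener))  # -> 1 (turn-taking)
```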

4.3 Evaluation Experiment

We evaluate the performance of our prediction model and compare it with Ishii's model [12]. We divide the corpus data of the eight groups into a training set of seven groups and a validation set of the remaining group. We then carry out leave-one-group-out cross-validation, and the average prediction results are calculated as the final prediction results. We report precision, recall, and F-measure; the results are shown in Table 2. Our model obtains better performance than Ishii's model: the F-measure improves from 0.484 to 0.537 for turn-taking and from 0.764 to 0.815 for turn-keeping. Because we use the same Naive Bayes classifier as in Ishii's model, the improvement is attributed to our proposed gaze transition patterns and coding method. Moreover, because the numbers of our gaze transition patterns and coded patterns are smaller than Ishii's, our model saves considerable time in the annotation and coding process.
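The evaluation protocol can be sketched as follows, assuming `groups` is a list of eight lists of (features, label) pairs and that `train` and `predict` wrap the Naive Bayes model above; these names and the data layout are assumptions for illustration.

```python
# Sketch of leave-one-group-out cross-validation with precision/recall/F-measure.
# `train` and `predict` are placeholders for the model-fitting and decision steps.

def precision_recall_f(preds, golds, positive=1):
    tp = sum(1 for p, g in zip(preds, golds) if p == positive and g == positive)
    fp = sum(1 for p, g in zip(preds, golds) if p == positive and g != positive)
    fn = sum(1 for p, g in zip(preds, golds) if p != positive and g == positive)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

def cross_validate(groups, train, predict):
    scores = []
    for i, held_out in enumerate(groups):                # leave one group out
        train_data = [ipu for j, g in enumerate(groups) if j != i for ipu in g]
        model = train(train_data)
        preds = [predict(model, feats) for feats, _ in held_out]
        golds = [label for _, label in held_out]
        scores.append(precision_recall_f(preds, golds))
    # average precision, recall, and F-measure over the eight folds
    return [sum(s[k] for s in scores) / len(scores) for k in range(3)]
```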

Table 2. Prediction performances by different models.

5 Conclusions and Future Work

We presented a model for predicting turn-taking by analyzing gaze transition patterns in multiparty conversations. We treated gaze transitions using the concept of a 2-gram model and defined compact labels for coding gaze transition patterns. The prediction models based on these patterns are also convincing. Compared with the state of the art, our model shows its superiority in predicting turn-taking. Moreover, our models are compact and require less time for annotation and coding. Future work will aim at extending our model to predicting the next speaker and the time when that speaker begins to speak, integrating more nonverbal behaviors, and exploring real applications in human-machine interaction.