
1 Introduction

Face-to-face conversation is an essential form of communication in our daily life. Analyzing human conversation scenes has been acknowledged as an emerging research field that interests both computer science and human science researchers [20]. Compared with dyadic conversation, multiparty conversation is more common and more complex, and research on it has been growing in recent years. Turn-taking is one of the most important aspects studied in conversation analysis [24]. It is a transitional mechanism in the organization of conversation that comes into play when speakers change. It is taken for granted that one person speaks at a time and that people have to take turns speaking. A speaking turn is naturally taken by a person who knows when and how to do so appropriately. Therefore, turn-taking should be viewed as a communication skill that people naturally acquire along with the acquisition of language competence and sociability.

On the other hand, in face-to-face conversations, people exchange information not only through verbal messages but also through nonverbal behaviors, such as eye gaze, head and body gestures, facial expressions, and prosody. Among these nonverbal behaviors, gaze is the most useful for predicting turn-taking [5, 13, 14, 24]. In spoken dialog systems and human-computer interaction systems, predicting where turn-taking (speaker change) occurs is especially important.

In this study, we focus on the relationship between gaze behaviors and turn-taking. We propose a model for predicting turn-taking based on analyzing gaze transition patterns in multiparty conversations. We treat each gaze transition using the concept of the n-gram model: the occurrence of turn-taking does not depend on the whole sequence of gaze movement events, but only on the most recent n gaze movement states of all participants. Thus, we define meaningful gaze labels for the speaker's and listeners' gaze movements and then model each gaze transition pattern as a two-label pattern. After that, we analyze these gaze transition patterns using a corpus of four-participant conversations. The corpus is a multimodal corpus that we collected; it includes gaze behaviors, and its total duration is 240 min. Finally, we build a probabilistic model for predicting turn-taking based on these meaningful gaze transition patterns in multiparty conversation. We compare our models with Ishii's models [12], which use many more gaze transition patterns. The evaluation results demonstrate that the predictions obtained by our models are better than those of Ishii's models, which are considered the state of the art in this field.

The rest of the paper is organized as follows. We begin by reviewing related work in Sect. 2. Then, we introduce our multiparty conversation corpus in detail in Sect. 3. Next, we discuss the prediction model of turn-taking based on analyzing gaze transition patterns and set up experiments to show the effectiveness of our model in Sect. 4. Finally, we conclude the paper and mention future work in the last section.

2 Related Work

Analysis of multiparty conversation has been studied in both human science and computer science for years.

Various facets of turn-taking have been studied in the psycholinguistics, sociolinguistics, and conversation analysis communities for years. Sacks, Schegloff, and Jefferson first proposed a basic model for the organization of turns in conversation [24]. They focused on the notion of turn constructional units, separated by transition relevance places (TRPs) where speaker changes may happen. Based on Sacks' model, subsequent research began to reveal the importance of nonverbal behaviors, including gaze, gesture, and other nonverbal communication channels, in regulating turn-taking in interaction. For instance, Duncan [5] focused on the role of nonverbal behaviors in turn-taking and proposed that turn-taking is mediated by both verbal and nonverbal cues. Wiemann and Knapp [19] surveyed a number of previous studies on turn-taking and performed a quantitative analysis of turn-taking cues in dyadic conversations. Goodwin [8] also discussed various aspects of the relationship between turn-taking and attention. Among the different nonverbal behaviors, gaze shows a strong relationship to turn-taking. Kendon [14] suggested that the speaker's gaze is a "turn-yielding cue", meaning that if a speaker gazes at a listener at the end of an utterance, the speaker is likely to yield the turn to that listener. He also noted that mutual gaze is important in turn-taking. Jokinen et al. [13] reported similar results in multiparty settings.

Related work in computer science mainly concentrates on conversation scene analysis [7, 9, 16, 21, 22] and spoken dialog systems [1, 3, 23, 26,27,28]. These studies aim to automatically describe conversation scenes from participants' multimodal nonverbal behaviors, captured with cameras and microphones, or to develop more comprehensive computational models and architectures for managing turn-taking in both dyadic and multiparty settings. Prediction of turn-taking is a key factor in this research. Methods for predicting turn-taking in multiparty conversations can be divided into two groups: those using speech processing [6, 17, 18, 25] and those using nonverbal behaviors such as gaze [2, 12, 15] and other behaviors [2, 4, 11, 15]. In this study, we focus on the prediction of turn-taking.

3 Multiparty Conversation Corpus

Because conversation analysis calls for turn-taking behaviors among persons in equal relationships, for example between classmates, friends, and colleagues, all conversation participants were college students aged 20-30. Our corpus contains 16 natural four-participant conversations by 8 groups (4 groups of Chinese college students and 4 groups of Japanese college students) with two different conversation topics: "What kind of class is a good class?" and "How did you spend your spring holidays?". Each conversation is 15 min long, and the total duration of the corpus is 240 min.

3.1 Recording Multiparty Conversation

We use camera arrays and microphones to record the multiparty conversations in our corpus. Figure 1 shows the camera layout and a sample scene from a recording. We use two video cameras to record the whole conversation scene in order to capture nonverbal behaviors including gaze, head movement, gestures, and other body movements. We also set up four web cameras on the table in the center, each facing one participant, to capture all participants' facial expressions for future research on other nonverbal behaviors. A voice recorder is also used to obtain the audio of all conversations. The frame rate of all videos in our corpus is 30 Hz.

Fig. 1. Camera layout and sample scene.

3.2 Annotation

We use the EUDICO Linguistic Annotator (ELAN)Footnote 1 to make annotations in our corpus. It is an annotation tool that allows users to create, edit, visualize, and search annotations for video and audio data. Inter-pausal units (IPUs) were automatically identified by referring to the time stamps in the word-segmented transcriptions. A stretch of speech followed by a pause longer than 200 ms was recognized as an IPU in this study. Then, skilled human annotators removed back-channel feedback from the detected IPUs to obtain the IPU candidates. In total, we obtained 2285 IPUs, comprising 549 turn-taking and 1736 turn-keeping candidates, for the whole corpus. Moreover, nine different gaze labels are defined below, and all labels were marked by human annotators in our corpus.
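The IPU segmentation step can be illustrated with a short sketch. The following Python fragment is a minimal illustration, assuming each speaker's word-segmented transcription is available as a list of (start, end) time stamps in seconds; the data format and function name are our own assumptions, not the actual corpus tooling.

```python
# Minimal sketch of IPU segmentation from word-level time stamps for one speaker,
# following the 200 ms pause threshold described above. The (start, end) list
# format is an assumption for illustration only.

PAUSE_THRESHOLD = 0.200  # seconds; a pause longer than this ends an IPU

def segment_ipus(word_times):
    """Group consecutive words into inter-pausal units (IPUs)."""
    ipus = []
    current = None  # [ipu_start, ipu_end]
    for start, end in sorted(word_times):
        if current is None:
            current = [start, end]
        elif start - current[1] > PAUSE_THRESHOLD:
            ipus.append(tuple(current))   # pause too long: close the current IPU
            current = [start, end]
        else:
            current[1] = end              # same IPU: extend it
    if current is not None:
        ipus.append(tuple(current))
    return ipus

# Example: three words with a 0.5 s pause before the last one -> two IPUs
print(segment_ipus([(0.0, 0.4), (0.45, 0.9), (1.4, 1.8)]))
# [(0.0, 0.9), (1.4, 1.8)]
```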

3.3 Coding Gaze Transition Pattern

Because gaze transitions are important for predicting turn-taking, we focus on gaze transition patterns, which are coded as a sequence of gaze direction shifts within an examining time interval. In defining the labels, we consider gaze directions toward other participants (speaker or listeners) and mutual gaze. For ease of understanding and comparison, we use the same symbols as in [12] and define nine gaze labels:

S::

The listener gazes at the current speaker without mutual gaze (the current speaker doesn’t gaze at the listener).

\(S_M\)::

The listener gazes at the current speaker with mutual gaze (the current speaker also gazes at the listener).

\(L(L_1,L_2,L_3)\)::

The participant gazes at a listener without mutual gaze. \(L_1,L_2,\) and \(L_3\) denote different listeners.

\(L_M(L_{1M}, L_{2M},L_{3M})\)::

The participant gazes at a listener with mutual gaze. \(L_{1M},L_{2M},\) and \(L_{3M}\) denote different listeners.

X::

The participant doesn’t gaze at other participants but gazes at other places.

  We found that which listener is being gazed at is not important for turn-taking prediction, so we merge \(L_1,L_2,L_3\) into L and \(L_{1M}, L_{2M},L_{3M}\) into \(L_M\) in this study. This merging reduces the number of labels from 9 to 5, and the number of gaze transition patterns is reduced even further. The experiments will show that the merging helps improve the performance of the prediction models.
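As a rough illustration of the merged label set, the following sketch assigns one of the five labels to a participant given the current speaker and a hypothetical gaze-target annotation (which participant each person is looking at, or None for other places). The input representation and function name are assumptions for illustration only, not the corpus annotation scheme.

```python
# Hypothetical sketch of assigning the merged gaze labels (L1/L2/L3 -> L,
# L1M/L2M/L3M -> LM). `gaze_target` maps each participant to the participant
# they look at (or None); this format is an assumption.

def gaze_label(participant, gaze_target, speaker):
    target = gaze_target.get(participant)          # whom this participant looks at
    if target is None:
        return "X"                                 # gazing at other places
    mutual = gaze_target.get(target) == participant
    if participant != speaker and target == speaker:
        return "S_M" if mutual else "S"            # listener looking at the speaker
    # looking at a listener (merged: L1/L2/L3 are not distinguished)
    return "L_M" if mutual else "L"

gaze = {"P1": None, "P2": "P4", "P3": "P1", "P4": "P2"}
print({p: gaze_label(p, gaze, speaker="P1") for p in gaze})
# {'P1': 'X', 'P2': 'L_M', 'P3': 'S', 'P4': 'L_M'}
```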

The examining time interval for gaze transitions is 1200 ms, covering 1000 ms before and 200 ms after the time when the current IPU ends. We analyze the gaze transitions in this interval and treat the contiguous gaze behavior sequence using the concept of the n-gram model. Note that, unlike in [12], we consider only two gaze labels when modelling the transition patterns. If more than two gaze labels are observed in the time interval, we take the last two gaze labels before and after the time when the current IPU ends. If only one gaze behavior occurs in the examining time interval, we treat it as two repeated gaze labels before and after the time when the current IPU ends. This means our model is based on the concept of a 2-gram model, and all coded transition patterns are two-label patterns.
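The coding rule can be sketched as follows, assuming the chronological sequence of gaze labels observed in the examining interval has already been extracted; the list-of-labels input format is an assumption.

```python
# Sketch of coding a two-label gaze transition pattern for one participant.
# `labels_in_interval` is the chronological sequence of gaze labels observed in
# the 1200 ms examining interval (1000 ms before to 200 ms after the IPU end).

def transition_pattern(labels_in_interval):
    if not labels_in_interval:
        return None                        # no gaze annotation in the interval
    if len(labels_in_interval) == 1:
        lab = labels_in_interval[0]
        return (lab, lab)                  # single label repeated, e.g. X-X
    return tuple(labels_in_interval[-2:])  # keep only the last two labels

print(transition_pattern(["X"]))              # ('X', 'X')
print(transition_pattern(["X", "L_M", "L"]))  # ('L_M', 'L'), cf. P4 in Fig. 2
```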

Figure 2 gives a labelling example showing how gaze transition patterns are constructed by our model and by Ishii's [12]. \(P1\)-\(P4\) denote the four participants in the conversation. As shown in the figure, P1 gazes at other places throughout the examining time interval. P2 gazes at P4 after making eye contact with P1. P3 doesn't gaze at any participant but looks at other places after looking at P1. P4 gazes at P2 and then P3 after gazing at other places. Table 1 shows the coding results of our model and Ishii's. The transition patterns for P3 are the same. For P1, our pattern is the repeated \(X-X\) because we use two-label patterns. For P2, because we merge the different listeners' gaze labels, our pattern is \(S_M-L_{M}\). Finally, our transition pattern for P4 is \(L_{M}-L\), whereas \(X-L_{1M}-L_2\) is obtained in Ishii's work, because only the last two gaze labels are taken into account in our model.

Fig. 2. Labelling process of our and Ishii's models.

Table 1. Coding transition patterns using gaze labels.

4 Prediction Model of Turn-Taking

We first analyze the speaker's and listeners' gaze transition patterns that occur at times of turn-taking and turn-keeping. Then, we introduce the proposed prediction model based on this analysis and finally evaluate its performance.

4.1 Gaze Transition Patterns and Turn-Taking

If the gaze transition patterns that occur frequently in turn-taking differ from those in turn-keeping, the patterns are useful for constructing a prediction model of turn-taking. For the prediction of turn-taking, we use the five merged gaze labels defined previously.

First, we analyze the relation between the speaker's gaze transition patterns and turn-taking. Because the labels S and \(S_M\) do not appear in the speaker's gaze transitions, we have nine patterns in total in this case. The occurrence numbers and frequencies of the nine speaker transition patterns are shown in Fig. 3. We then perform a chi-square test on the frequencies and use the p value, defined as the probability that the deviation of the observed values from the expected values is due to chance alone, to determine whether a gaze transition pattern differs significantly between turn-keeping and turn-taking [10] (a minimal sketch of this test is given after the list below). If \(p<0.01\), the gaze transition pattern is considered significantly different and is favoured for our prediction model. Here, the p values of three patterns, \(X-X\), \(L-L\) and \(L_M-L_M\), are smaller than 0.01. We can draw the following conclusions from the results:

  1.

    The frequency of turn-keeping is significantly higher than that of turn-taking when the \(X-X\) pattern occurs. That means if a speaker keeps looking at other places, the probability of turn-keeping is higher than that of turn-taking.

  2.

    The frequency of turn-taking is significantly higher than that of turn-keeping when the \(L-L\) or \(L_M-L_M\) pattern occurs. That means if a speaker continually looks at a listener, with or without mutual gaze, the probability of turn-taking is higher than that of turn-keeping.
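The significance test mentioned above can be sketched as follows. We assume a 2x2 contingency table of one pattern versus all other patterns, split by turn-taking and turn-keeping; the counts below are illustrative placeholders, not the corpus values reported in Fig. 3.

```python
# Sketch of the chi-square test for one gaze transition pattern, using
# illustrative placeholder counts rather than the corpus values.
from scipy.stats import chi2_contingency

# Rows: (this pattern, all other patterns); columns: (turn-taking, turn-keeping)
observed = [[40, 60],
            [509, 1676]]

chi2, p, dof, expected = chi2_contingency(observed)
if p < 0.01:
    print(f"significant difference (p = {p:.4f}); keep this pattern for the model")
else:
    print(f"not significant (p = {p:.4f})")
```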

Fig. 3. Occurrences and frequencies of speaker's gaze transition patterns.

Similarly, we analyze the relationship between the listeners' gaze transition patterns and turn-taking. There are 25 different gaze transition patterns, and we calculate their occurrence numbers and frequencies. We merge the 13 patterns whose frequency in turn-keeping or turn-taking is less than \(1\%\) into a new pattern class named Others. The occurrence numbers and frequencies of the resulting 13 listener transition patterns are shown in Fig. 4. As in the analysis of the speaker's patterns, we perform a chi-square test on the result and use the p value to determine whether a gaze transition pattern differs between turn-keeping and turn-taking. If \(p<0.01\), the gaze transition pattern is considered significantly different between turn-keeping and turn-taking. Here, the p values of \(S-S\), \(S_M-S_M\), \(X-S\), \(X-S_M\), and Others are smaller than 0.01. We can draw the following conclusions from the results:

  1.

    The frequency of turn-keeping is significantly higher than that of turn-taking when the \(S-S\) or \(S_M-S_M\) pattern occurs. That means if a listener continually gazes at the speaker, with or without mutual gaze, the probability of turn-keeping is higher than that of turn-taking.

  2.

    The frequency of turn-taking is significantly higher than that of turn-keeping when the \(X-S\), \(X-S_M\), or Others pattern occurs. That means if a listener looks at other places and then starts to look at the speaker, with or without mutual gaze, the probability of turn-taking is higher than that of turn-keeping. Because Others includes too many patterns, we do not discuss it further here.

Compared with the patterns obtained in [12], our patterns are fewer but more useful. The conclusions above are also more convincing because they agree with findings on turn-taking and gaze behavior in the human sciences mentioned previously. We will show that the prediction model based on our patterns performs better than those in [12], which are considered the state of the art in this field.

Fig. 4. Occurrences and frequencies of listener's gaze transition patterns.

4.2 Prediction Model

We construct the prediction model of turn-taking using the frequencies of the speaker's and listeners' gaze transition patterns obtained above. In particular, we assume that each participant's gaze transition pattern is independent, and a Naive Bayes classifier is defined as

$$\begin{aligned} P(y\mid x_S, x_{L_1},x_{L_2},x_{L_3} )\varpropto P(x_S\mid y)\cdot \prod _{i=1}^3P(x_{L_i}\mid y)\cdot P(y) \end{aligned}$$
(1)

for prediction. Here, \(y=1\), \(y=0\), and \(x_S\) denote turn-taking, turn-keeping, and the speaker's gaze transition pattern, respectively; \(x_{L_i}\) denotes listener \(L_i\)'s gaze transition pattern (\(i\in \{1,2,3\}\)). \(P(y=1)\) and \(P(y=0)\) are the occurrence probabilities of turn-taking and turn-keeping, respectively; \(P(x_S\mid y=1)\) and \(P(x_S\mid y=0)\) are the conditional probabilities of the speaker's gaze transition pattern in the turn-taking and turn-keeping cases, respectively; \(P(x_{L_i}\mid y=1)\) and \(P(x_{L_i}\mid y=0)\) are the conditional probabilities of listener \(L_i\)'s gaze transition pattern in the turn-taking and turn-keeping cases, respectively. For an IPU to be predicted, we calculate \(P(y=1\mid x_S,x_{L_1},x_{L_2},x_{L_3})\) and \(P(y=0\mid x_S,x_{L_1},x_{L_2},x_{L_3})\); if \(P(y=1\mid x_S,x_{L_1},x_{L_2},x_{L_3}) > P(y=0\mid x_S,x_{L_1},x_{L_2},x_{L_3})\), the prediction result is turn-taking; otherwise, it is turn-keeping.
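A minimal sketch of this decision rule is given below, assuming the prior and conditional probabilities have already been estimated as relative frequencies from training data. The dictionary-based probability tables and function name are illustrative assumptions, not the values learned from our corpus.

```python
# Minimal sketch of the Naive Bayes decision in Eq. (1). The probability tables
# below are illustrative placeholders, not corpus estimates.

def predict_turn_taking(x_s, x_listeners, prior, p_speaker, p_listener):
    """Return 1 (turn-taking) or 0 (turn-keeping) for one IPU."""
    scores = {}
    for y in (0, 1):
        score = prior[y] * p_speaker[y].get(x_s, 1e-6)  # small floor for unseen patterns
        for x_l in x_listeners:                          # three listeners, assumed independent
            score *= p_listener[y].get(x_l, 1e-6)
        scores[y] = score
    return 1 if scores[1] > scores[0] else 0

# Toy probability tables (illustrative only)
prior = {1: 0.24, 0: 0.76}
p_speaker = {1: {("L", "L"): 0.20, ("X", "X"): 0.05},
             0: {("L", "L"): 0.08, ("X", "X"): 0.20}}
p_listener = {1: {("X", "S"): 0.15, ("S", "S"): 0.10},
              0: {("X", "S"): 0.05, ("S", "S"): 0.30}}

print(predict_turn_taking(("L", "L"),
                          [("X", "S"), ("S", "S"), ("X", "S")],
                          prior, p_speaker, p_listener))  # -> 1 (turn-taking)
```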

4.3 Evaluation Experiment

We evaluate the performance of our prediction model and compare it with Ishii's model [12]. We divide the corpus data of the eight groups into a training set of seven groups and a validation set of the remaining group. We then carry out leave-one-group-out cross-validation, and the average prediction results are calculated as the final prediction results. We report precision, recall, and F-measure; the results are shown in Table 2. Our model obtains better performance than Ishii's model: the F-measure improves from 0.484 to 0.537 for turn-taking and from 0.764 to 0.815 for turn-keeping. Because we use the same Naive Bayes classifier as in Ishii's model, the improvement is attributed to our proposed gaze transition patterns and coding method. Moreover, because the numbers of our gaze transition patterns and coded patterns are smaller than Ishii's, our model saves considerable time in the annotation and coding process.
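The evaluation protocol can be sketched as follows, assuming `groups` is a list of eight lists of (features, label) pairs and that `train` and `predict` wrap the Naive Bayes model above; these names and the data layout are assumptions for illustration.

```python
# Sketch of leave-one-group-out cross-validation with precision/recall/F-measure.
# `train` and `predict` are placeholders for the model-fitting and decision steps.

def precision_recall_f(preds, golds, positive=1):
    tp = sum(1 for p, g in zip(preds, golds) if p == positive and g == positive)
    fp = sum(1 for p, g in zip(preds, golds) if p == positive and g != positive)
    fn = sum(1 for p, g in zip(preds, golds) if p != positive and g == positive)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

def cross_validate(groups, train, predict):
    scores = []
    for i, held_out in enumerate(groups):                # leave one group out
        train_data = [ipu for j, g in enumerate(groups) if j != i for ipu in g]
        model = train(train_data)
        preds = [predict(model, feats) for feats, _ in held_out]
        golds = [label for _, label in held_out]
        scores.append(precision_recall_f(preds, golds))
    # average precision, recall, and F-measure over the eight folds
    return [sum(s[k] for s in scores) / len(scores) for k in range(3)]
```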

Table 2. Prediction performances by different models.

5 Conclusions and Future Work

We presented a model for predicting turn-taking by analyzing gaze transition patterns in multiparty conversations. We treated gaze transitions using the concept of a 2-gram model and defined compact labels for coding gaze transition patterns. The prediction models based on these patterns are also convincing. Compared with the state of the art, our model shows its superiority in predicting turn-taking. Moreover, our models are compact and require less time for annotation and coding. Future work will aim at extending our model to predicting the next speaker and the time when that speaker begins to speak, integrating more nonverbal behaviors, and exploring real applications in human-machine interaction.