JP2013219495A - Emotion-expressing animation face display system, method, and program

Info

Publication number: JP2013219495A
Application number: JP2012087503A
Authority: JP
Inventors: Akina Kato; 明菜加藤
Original assignee: NEC Infrontia Corp
Current assignee: NEC Platforms Ltd
Priority date: 2012-04-06
Filing date: 2012-04-06
Publication date: 2013-10-24

Abstract

PROBLEM TO BE SOLVED: To provide an emotion-expressing animation face display system, method, and program allowing for efficient monitoring of telephone correspondence by plural operators.
SOLUTION: The emotion-expressing animation face display system comprises: emotion estimation means which estimates the emotion of each of users who are speaking with each of operators at least with voice, on the basis of the voice of each user used for the speech; animation face synthesis means which synthesizes an animation face expressing an estimated emotion for each user on the basis of the emotion estimated by the emotion estimation means; and display means which arranges and displays the animation face of each user synthesized by the animation face synthesis means such that the animation faces can visually be checked together.

Description

The present invention relates to an emotion-expressing animation face display system, method, and program, and more particularly to such a system, method, and program for a call center.

Call centers are set up to handle telephone inquiries, orders, and the like from customers (also referred to as “clients” or “users”). In a call center, a call center system is usually built and used so that a plurality of operators can handle calls from clients efficiently. The call center system includes, for example, a telephone exchange, a plurality of operator terminals, and a terminal for a manager (also referred to as a “supervisor”). Many call center systems also provide a function that lets the supervisor monitor exchanges between clients and operators through call monitoring, check the state of a call, and send advice to the operator as necessary.

JP 2006-330958 A
JP 2003-248841 A

In the conventional call center system described above, the supervisor monitors an operator's handling of calls by listening to the audio. One supervisor can therefore monitor only one operator's call at a time, which is inefficient.

In addition, unless an operator requests help, the supervisor cannot tell whether a call that is not currently being monitored needs the supervisor's advice, and the response is delayed as a result.

Accordingly, an object of the present invention is to solve the problems described above and to provide an emotion-expressing animation face display system, method, and program that allow the telephone handling of a plurality of operators to be monitored efficiently.

To achieve the above object, the system of the present invention comprises: emotion estimation means for estimating the emotion of each of a plurality of users, each of whom is in a call with a respective one of a plurality of operators using at least voice, on the basis of the user's voice used in the call; animation face synthesis means for synthesizing, for each user, an animation face expressing the estimated emotion on the basis of the emotion estimated by the emotion estimation means; and display means for arranging and displaying the animation faces of the users synthesized by the animation face synthesis means so that they can be checked visually together.

The method of the present invention includes: an emotion estimation step of estimating the emotion of each of a plurality of users, each of whom is in a call with a respective one of a plurality of operators using at least voice, on the basis of the user's voice used in the call; an animation face synthesis step of synthesizing, for each user, an animation face expressing the estimated emotion on the basis of the emotion estimated in the emotion estimation step; and a display step of arranging and displaying the animation faces of the users synthesized in the animation face synthesis step so that they can be checked visually together.

The program of the present invention causes a computer to execute: an emotion estimation procedure of estimating the emotion of each of a plurality of users, each of whom is in a call with a respective one of a plurality of operators using at least voice, on the basis of the user's voice used in the call; an animation face synthesis procedure of synthesizing, for each user, an animation face expressing the estimated emotion on the basis of the emotion estimated in the emotion estimation procedure; and a display procedure of arranging and displaying the animation faces of the users synthesized in the animation face synthesis procedure so that they can be checked visually together.

With the above configuration of the present invention, the telephone handling of a plurality of operators in a call center can be monitored efficiently all at once.

FIG. 1 shows a configuration example of a system to which the present invention is applied.
FIG. 2 is a block diagram showing the configuration of the entire system according to the first embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of the call center system in the first embodiment of the present invention.
FIG. 4 illustrates animation faces generated by the animation face generation unit; the animation faces in (a), (b), and (c) express the emotions “joy”, “anger”, and “sorrow”, respectively.
FIG. 5 is a block diagram showing the configuration of the supervisor terminal of the first embodiment of the present invention.
FIG. 6 is a display example on the supervisor's display device.
FIG. 7 is a block diagram showing the configuration of the entire system according to the second embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of the operator terminal in the second embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the supervisor terminal of the second embodiment of the present invention.

DESCRIPTION OF EMBODIMENTS
Hereinafter, embodiments for carrying out the present invention will be described in detail with reference to the drawings.
FIG. 1 shows a configuration example of a system to which the present invention is applied. The system consists of an exchange (not shown), a call center system 101, a client terminal 105, an operator terminal 107, and a supervisor terminal 109. There may be a plurality of each of the client terminals 105, operator terminals 107, and supervisor terminals 109. The client terminal 105 and the supervisor terminal 109 each include a SIP (Session Initiation Protocol) control unit 121, an RTP (Real-time Transport Protocol) control unit 123, and an interface unit 125. The interface unit 125 provides an interface to, for example, a microphone, a speaker, function buttons, a mouse, and a keyboard; in the supervisor terminal 109 it additionally provides an interface to a display that shows video. The client terminal 105 and the supervisor terminal 109 may each be configured to include the microphone, speaker, function buttons, mouse, keyboard, display, and so on.

In the system of FIG. 1, the client terminal 105 exchanges voice RTP packets with the call center system 101 through the gateway 103. The call center system 101 is connected to the operator terminal 107 and the supervisor terminal 109 by SIP, and each of the terminals 107 and 109 can exchange SIP messages by means of its SIP control unit 121. The RTP control unit 123 of the operator terminal 107 encodes the voice input from the microphone via the interface unit 125 and transmits it as voice RTP packets to the destination address; it also decodes received packets and sends the result to the speaker via the interface unit 125. The RTP control unit 123 of the supervisor terminal 109 decodes the video RTP packets sent from the call center system 101 and sends the result to a video display (not shown) via the interface unit 125.

In the system of FIG. 1, the voice uttered by the client is input from the microphone to the client terminal 105 and encoded into voice RTP packets. These voice RTP packets are sent to the call center system 101 via the gateway 103. The call center system 101 forwards the voice RTP packets from the client terminal 105 to the operator terminal 107 of the operator in charge. The operator terminal 107 decodes the voice RTP packets received from the call center system 101, and the voice reaches the operator's ear from the speaker via the interface unit 125. The voice uttered by the operator follows the reverse route and reaches the client's ear from the speaker connected to the client terminal 105. In this way, the client and the operator hold a call.

As will be described in detail later, the call center system 101 analyzes the client's voice using the voice RTP packets from the client to estimate the client's emotion, synthesizes video of an animation face expressing the estimated emotion, encodes that video into RTP packets, and sends them to the supervisor terminal 109. In the supervisor terminal 109, the video RTP packets received from the call center system 101 are decoded by the RTP control unit 123 and shown on the display via the interface unit 125. From the displayed image the supervisor can grasp the state of the call between the client and the operator and, if necessary, give appropriate advice.

The supervisor terminal 109 may also request the call center system 101 to generate an animation face from the client's voice RTP packets, so that reception of video RTP packets from the call center system 101 starts in response to that request.

In this way, the client and the operator start a voice call through the call center system 101, and during the call the supervisor checks the animation face generated from the state of the client's voice to grasp the client's situation.

Next, a first embodiment of the present invention will be described in detail with reference to FIGS. 2, 3, 4, 5, and 6.

FIG. 2 shows the overall configuration of the first embodiment of the present invention. The first embodiment includes a call center system 101; first, second, ..., Nth client terminals (105-1, 105-2, ..., 105-N); first, second, ..., Nth operator terminals (107-1, 107-2, ..., 107-N); and a supervisor terminal 109. In FIG. 2, the first client terminal 105-1 and the first operator terminal 107-1, the second client terminal 105-2 and the second operator terminal 107-2, ..., and the Nth client terminal 105-N and the Nth operator terminal 107-N are configured so that each pair can hold a call via the call center system 101. The pairing of client terminals and operator terminals may be fixed or may be changeable as appropriate. The call center system 101 is provided with animation face generation units (229-1, 229-2, ..., 229-N) corresponding respectively to the first, second, ..., Nth client terminals (105-1, 105-2, ..., 105-N). Each animation face generation unit 229 estimates the client's emotion based on the client's voice data sent from the corresponding client terminal 105, generates an animation face expressing that emotion, and sends the generated animation face data to the supervisor terminal 109.

In FIG. 2, the client's voice data sent from each client terminal and the operator's voice data sent from the counterpart operator terminal are added together and then sent to the selection unit 245. Based on a selection signal sent from the supervisor terminal 109, the selection unit 245 selects the call audio of the client chosen by the supervisor and that client's operator, and sends it to the supervisor terminal 109.
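The adding and selecting described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `mix` and `select_call`, the 16-bit PCM sample representation, and the clipping behavior are all assumptions made for the example.

```python
def mix(client, operator):
    """Add two PCM sample sequences sample by sample (assumed 16-bit, same rate).

    The shorter sequence is padded with silence, and sums are clipped to the
    int16 range so that loud passages do not wrap around.
    """
    length = max(len(client), len(operator))
    mixed = []
    for i in range(length):
        c = client[i] if i < len(client) else 0
        o = operator[i] if i < len(operator) else 0
        mixed.append(max(-32768, min(32767, c + o)))
    return mixed


def select_call(mixed_calls, k):
    """Return the mixed audio of the k-th call (1-based), as a selection signal would."""
    return mixed_calls[k - 1]
```

Here `mixed_calls` stands for the list of already mixed client and operator streams, one per call, and `k` plays the role of the selection signal sent from the supervisor terminal 109.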

FIG. 3 shows the configuration of the call center system 101 in the first embodiment of the present invention. The call center system 101 includes a voice data transmission/reception unit 221, a decoder 223, an encoder 225, and a voice data reception/sending unit 227. The voice data transmission/reception unit 221 receives the voice packets sent from the client terminal 105. It also transmits to the client terminal 105 the voice data sent from the operator terminal 107, after that data has been encoded by the encoder 225. The decoder 223 decodes the voice packets received from the client terminal. The encoder 225 converts the decoded voice data into a predetermined protocol and format. The voice data reception/sending unit 227 sends out the voice packets generated by the encoder 225 and accepts voice data from the operator terminal 107.

The call center system 101 further includes an animation face generation unit 229, an encoder 237, and a video data sending unit 239. The animation face generation unit 229 includes an emotion estimation unit 231, an animation face synthesis database 233, and an animation face synthesis unit 235.

The emotion estimation unit 231 estimates the client's emotion from the client's voice data decoded by the decoder 223. For example, it estimates which tendency the client's voice is closest to, such as whether the client sounds angry or is laughing, according to a program or the like prepared in advance.

The animation face synthesis database 233 is a database for generating animation faces. Image files manually uploaded to the system by a user or system administrator, files prepared in the system in advance, files in which video of past video conferences has been stored as a database, and the like may be registered and used.

The animation face synthesis unit 235 synthesizes an animation face based on the estimation result from the emotion estimation unit 231 and the animation face synthesis data stored in the animation face synthesis database 233. For example, while the client is estimated to be “angry”, it generates an angry animation for that period; when the client is estimated to be “laughing”, it generates a laughing animation according to the intensity of the laughter.

The animation face synthesized by the animation face synthesis unit 235 is encoded into video RTP packets by the encoder 237 and sent from the video data sending unit 239 to the supervisor terminal 109.

As described above, the voice packets sent from the call center system 101 are delivered to the operator terminal 107 and the video packets are delivered to the supervisor terminal 109; each is decoded by the RTP control unit 123 of the respective terminal and played back in audible or visible form.

When the supervisor wants to grasp the status of a plurality of clients, the plurality of video data streams created for each client by the animation face generation processing (see FIGS. 2 and 3) are transmitted to the same IP address, which designates the supervisor.

Next, the emotion estimation unit 231 will be described. Various methods can be adopted for emotion estimation by the emotion estimation unit 231, such as a method based on sound intensity analysis of the client's voice and a method based on phoneme analysis.

First, the emotion estimation method based on sound intensity analysis will be described. In this method, a voice analysis unit (not shown) analyzes the client's voice data sent from the decoder 223, and the emotion is estimated from information relating in particular to sound intensity (or power). An emotion can be specified by, for example, its type and its intensity. The type of emotion may be, for example, “joy”, “anger”, or “sorrow”. The intensity of the emotion may be classified as, for example, “strong”, “medium”, or “weak”. In the voice analysis, for example, the voice data may be divided in time series into predetermined frames; the power deviation between these frames, the mean of the power differences, and/or the deviation of the power differences may be obtained; information relating to emotion, such as the degree of sound intensity and its pattern, may be extracted from the analysis result; and the type and intensity of the emotion may be estimated from this information.
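The frame-based power statistics mentioned above can be sketched as follows. The text does not define the statistics precisely, so this sketch assumes that frame power is the mean squared sample value, that "deviation" means the population standard deviation, and that "power difference" means the difference between consecutive frames. The function names are hypothetical.

```python
import statistics


def frame_powers(samples, frame_len):
    """Split samples into non-overlapping frames and return each frame's mean power."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len for i in starts]


def power_features(powers):
    """Compute the three statistics named in the text from per-frame powers."""
    if len(powers) < 2:
        raise ValueError("need at least two frames to form a power difference")
    # Differences between consecutive frames (an assumed reading of "power difference").
    diffs = [b - a for a, b in zip(powers, powers[1:])]
    return {
        "power_deviation": statistics.pstdev(powers),
        "mean_power_difference": statistics.fmean(diffs),
        "power_difference_deviation": statistics.pstdev(diffs),
    }
```

A later stage would map these features to an emotion type and intensity; the text leaves that mapping open, so it is not shown here.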

For example, the emotion estimation unit 231 may store the voice analysis results sent from the voice analysis unit (not shown) for a predetermined period and estimate the emotion using the stored analysis data. For example, templates are prepared in which the intensity of the voice data over three reference periods maps to “joy” if it is “strong, medium, strong”, to “anger” if it is “strong, strong, strong”, and to “sorrow” if it is “weak, medium, weak”. For the data covering three reference periods among the stored analysis data, the intensity pattern is compared with the templates, and the emotion at that point in time can be estimated by checking which template it matches.
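The template lookup described above is small enough to sketch directly. The three templates come from the text; the function name, the string labels, and the choice to return `None` when nothing matches are assumptions of this example.

```python
# Intensity patterns over three reference periods, as listed in the text.
TEMPLATES = {
    ("strong", "medium", "strong"): "joy",
    ("strong", "strong", "strong"): "anger",
    ("weak", "medium", "weak"): "sorrow",
}


def estimate_emotion(history):
    """Match the three most recent intensity labels against the templates.

    Returns the matching emotion, or None if fewer than three periods are
    stored or no template matches exactly.
    """
    if len(history) < 3:
        return None
    return TEMPLATES.get(tuple(history[-3:]))
```

For instance, `estimate_emotion(["weak", "strong", "strong", "strong"])` looks only at the last three labels and returns `"anger"`.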

Alternatively, for the stored voice data of three reference periods, the sum of the absolute values of the differences between intensity values (Hilbert distance) or the sum of the squares of the differences between intensity values (Euclidean distance) may be calculated against each template, and the emotion of the closest template may be taken as the emotion at that time.
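A nearest-template search using the two distances above might look like the sketch below. The sum of absolute differences is what is more commonly called the L1 (Manhattan) distance, and the sum of squared differences is the squared Euclidean distance. The numeric encoding of weak, medium, and strong as 0, 1, and 2 is an assumption made here so that distances can be computed; the function names are hypothetical.

```python
# Templates from the text, encoded numerically (assumed: weak=0, medium=1, strong=2).
TEMPLATES = {
    "joy": (2, 1, 2),
    "anger": (2, 2, 2),
    "sorrow": (0, 1, 0),
}


def abs_distance(a, b):
    """Sum of absolute differences between intensity values."""
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_distance(a, b):
    """Sum of squared differences between intensity values."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_emotion(observed, distance=abs_distance):
    """Return the emotion whose template is closest to the observed intensities."""
    return min(TEMPLATES, key=lambda emotion: distance(observed, TEMPLATES[emotion]))
```

Unlike exact template matching, this always returns some emotion, even for patterns that match no template exactly. Ties go to whichever template is listed first.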

Next, the emotion estimation method based on phoneme analysis will be described. In this method, keywords representing emotions are held as dictionary templates, and the emotion is estimated by matching the result of phoneme analysis against the dictionary templates. For example, for the emotion “anger”, words that express anger (for example, 「怒る」 (“get angry”) and 「殴る」 (“hit”)) are prepared as a dictionary template. Dictionary templates are prepared in the same way for emotions such as “joy” and “sorrow”. The emotion is then estimated by comparing the phoneme data obtained from the phoneme analysis with these dictionary templates and checking for matches.
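A minimal sketch of the dictionary matching follows. It assumes the phoneme analysis yields a romanized phoneme string, which the text does not specify. Only the "anger" keywords (`okoru` for 怒る, `naguru` for 殴る) come from the text; the "joy" and "sorrow" entries are hypothetical placeholders.

```python
# Dictionary templates: emotion -> keywords as romanized phoneme strings.
DICTIONARY = {
    "anger": ["okoru", "naguru"],   # from the text: 怒る, 殴る
    "joy": ["ureshii"],             # hypothetical example keyword
    "sorrow": ["kanashii"],         # hypothetical example keyword
}


def estimate_from_phonemes(phonemes):
    """Count keyword occurrences per emotion and return the most frequent one.

    Returns None when no keyword from any dictionary template occurs.
    """
    counts = {
        emotion: sum(phonemes.count(word) for word in words)
        for emotion, words in DICTIONARY.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Plain substring counting is used here for brevity; a real matcher would need to respect word boundaries and inflected forms.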

The method based on sound intensity analysis and the method based on phoneme analysis may also be combined. For example, if both methods estimate the same emotion, that emotion is selected; if they differ, one of the two emotions may be selected probabilistically using a random number.
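The combination rule can be sketched in a few lines. The function name and the equal 50/50 weighting are assumptions; the text says only that one result is chosen probabilistically when the two disagree.

```python
import random


def combine(intensity_result, phoneme_result, rng=random):
    """Combine two emotion estimates: agree -> that emotion, disagree -> random pick."""
    if intensity_result == phoneme_result:
        return intensity_result
    # Equal weighting is an assumption; the text does not give the probabilities.
    return rng.choice([intensity_result, phoneme_result])
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which is convenient when testing.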

As yet another emotion estimation technique, the speaker's emotion can be estimated from feature values relating to the frequency and amplitude of the voice signal. For example, the maximum fundamental frequency and the maximum amplitude of the voice during an utterance can be used as feature values. The emotion estimation unit 231 estimates the speaker's emotion by comparing reference feature data acquired in advance for each emotion with the feature data acquired by the voice analysis unit.
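The comparison against reference feature data could be done with a nearest-reference search, sketched below. All reference values, the scaling factors, and the function name are hypothetical; the text says only that acquired features are compared with per-emotion reference data.

```python
# Hypothetical reference data: emotion -> (max fundamental frequency in Hz, max amplitude 0..1).
REFERENCE = {
    "joy": (280.0, 0.70),
    "anger": (320.0, 0.95),
    "sorrow": (180.0, 0.30),
}

# Assumed scales so that frequency and amplitude contribute comparably.
SCALE = (100.0, 1.0)


def estimate_from_features(max_f0_hz, max_amplitude):
    """Return the emotion whose reference features are closest to the observed ones."""
    observed = (max_f0_hz, max_amplitude)

    def distance(reference):
        return sum(((o - r) / s) ** 2 for o, r, s in zip(observed, reference, SCALE))

    return min(REFERENCE, key=lambda emotion: distance(REFERENCE[emotion]))
```

Scaling matters here because the two features have very different numeric ranges; without it the frequency term would dominate the distance.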

The emotion estimation result (the type and intensity of the emotion, and so on) obtained in the emotion estimation unit 231 by any of the various methods above is sent to the animation face synthesis unit 235.

Next, the animation face synthesis database 233 will be described. The animation face synthesis database 233 stores, for example, face shape data based on CG characters and face shape data based on photographs. This face shape data is composed of parts such as the eyes, nose, mouth, eyebrows, ears, and hair.

In addition to face shape data, the animation face synthesis database 233 may store expression data, expression motion data, expression pattern data, and the like. Expression data is used in the synthesis method that swaps textures: the texture of a smiling expression, the texture of a crying expression, and the textures of the stages in between are expression data. Expression pattern data is data on the transition from one item of expression data to another, and includes information on which expression data can be reached from a given item of expression data and on the probability of each transition. Expression motion data is data for generating the facial expression of a CG character. Specifically, when the shape of the face is deformed, expression motion data is the time-series data of the displacement of the vertex coordinates in the face shape data that correspond to end points of the parts that form an expression, such as the eyebrows, eyes, and mouth.
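Expression pattern data as described above, meaning permitted transitions plus their probabilities, can be represented as a small transition table. The expression names and probability values below are hypothetical, as is the function name.

```python
import random

# Hypothetical expression pattern data: current expression -> {next expression: probability}.
TRANSITIONS = {
    "neutral": {"neutral": 0.6, "smile": 0.3, "frown": 0.1},
    "smile": {"smile": 0.7, "neutral": 0.3},
    "frown": {"frown": 0.7, "neutral": 0.3},
}


def next_expression(current, rng=random):
    """Draw the next expression from those reachable from the current one."""
    options = TRANSITIONS[current]
    return rng.choices(list(options), weights=list(options.values()), k=1)[0]
```

In this table a smile cannot jump directly to a frown, which illustrates how such data restricts which texture may follow which.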

Next, the animation face synthesis unit 235 will be described. Based on the emotion of the client in the call as estimated by the emotion estimation unit 231, the animation face synthesis unit 235 synthesizes an animation face expressing that emotion. The data stored in the animation face synthesis database 233 may be used for the synthesis. To generate animation faces expressing various emotions, so-called facial animation techniques can be used, for example. Specifically, an animation face expressing an emotion may be generated by deforming the shape of the face or by swapping the texture of the face. In the shape deformation technique, each emotion is expressed by deforming the shapes of the eyebrows, eyes, mouth, nose, ears, face, and so on, based on, for example, the expression motion data described above. In the texture swapping technique, the textures can be swapped using the expression data while taking the expression pattern data into account.

FIG. 4 illustrates animation faces generated by the animation face generation unit 229 of the present invention. The animation faces in (a), (b), and (c) of FIG. 4 express the emotions “joy”, “anger”, and “sorrow”, respectively. Based on the emotion estimated by the emotion estimation unit 231, the animation face synthesis unit 235 may select, for each user, a color that expresses the estimated emotion and add the selected color to the animation face as a background color. For example, to make the display easier for the supervisor to read, an effect may be added such that the background of the animation is blue when the client's voice is calm and red when the client's voice is angry.
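The background color selection reduces to a lookup. The blue and red entries follow the example in the text; the state labels, the default color, and the function name are assumptions.

```python
# Colors from the example in the text: calm -> blue, angry -> red.
BACKGROUND = {"calm": "blue", "anger": "red"}


def background_for(emotion, default="white"):
    """Return the background color for an estimated emotion (default is an assumption)."""
    return BACKGROUND.get(emotion, default)
```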

FIG. 5 shows the configuration of the supervisor terminal in the first embodiment of the present invention. The supervisor terminal 109 consists of a screen composition unit 301, a display unit 303, a selection input reception unit 305, a selection signal output unit 307, a voice processing unit 309, and a voice output unit 311. The screen composition unit 301 receives the video data of each client's animation face sent from the animation face generation units 229-1, ..., 229-N of the call center system 101 and composes a screen. The composed screen data is sent to the display unit 303, which shows the composed screen on the display device. FIG. 6 shows an example of animation faces shown on the display device. Animation faces expressing the emotions of twelve clients who are in calls are displayed in a grid, four down by three across. In the display example of FIG. 6, the angry animation faces are given a background color (red in this case) that differs from the background color of the faces with other emotions, making them easy for the supervisor to spot.
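The tiling done by the screen composition unit can be sketched as a position calculation. The function name, the row-major ordering, and the pixel tile sizes are assumptions; the text specifies only that the faces are arranged side by side in a grid.

```python
def grid_positions(n_faces, columns, tile_w, tile_h):
    """Return the top-left (x, y) pixel position of each face tile, filled row by row."""
    return [
        ((i % columns) * tile_w, (i // columns) * tile_h)
        for i in range(n_faces)
    ]
```

With twelve faces and three tiles per row, `grid_positions(12, 3, 100, 100)` yields four rows, matching the twelve-face display example.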

By displaying the faces side by side in a list like this, the supervisor can grasp the emotional state of a plurality of clients at once and in real time. By looking at the expressions in the animation, the supervisor can send advice to operators efficiently, for example by starting voice monitoring first for calls judged to be urgent because the client's anger is strong.

For example, the system may be configured so that, when the animation face of the k-th client (k is any number from 1 to N) shows an angry expression, the supervisor can monitor the call between the k-th client and the k-th operator. To monitor a call, the supervisor first selects the client whose call is to be monitored using a suitable input means such as a mouse, keyboard, touch panel, or button. The selection input is accepted by the selection input reception unit 305, and a selection signal indicating the selected target is sent from the selection signal output unit 307 to the selection unit 245 (FIG. 2) of the call center system 101. The selection unit 245 selects the call audio data of the client indicated by the selection signal and of that client's operator and sends it to the audio processing unit 309 of the supervisor terminal 109. In FIG. 5, the k-th client is selected, so audio data obtained by adding the k-th client's audio data and the k-th operator's audio data (see FIG. 2) is sent to the audio processing unit 309. The audio processing unit 309 performs processing such as decoding, and the call audio of the k-th client and the k-th operator is output from the speaker via the audio output unit 311. In this way, the supervisor can monitor the call between any client and that client's operator.
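The selection flow above can be sketched as follows. This is a minimal illustration, assuming the audio of each party is available as a sequence of 16-bit PCM samples at the same rate. The function `mix` and the class `SelectionUnit` are hypothetical names, and clipping the summed samples is an assumption. The source says only that the two audio streams are added together and that the selection signal picks one call.

```python
def mix(client_samples, operator_samples):
    """Add two PCM sample sequences, clipping each sum to the 16-bit range."""
    length = max(len(client_samples), len(operator_samples))
    mixed = []
    for i in range(length):
        c = client_samples[i] if i < len(client_samples) else 0
        o = operator_samples[i] if i < len(operator_samples) else 0
        mixed.append(max(-32768, min(32767, c + o)))
    return mixed


class SelectionUnit:
    """Keeps the latest mixed audio per call and returns the selected one."""

    def __init__(self):
        self._calls = {}  # call number k -> mixed samples

    def update(self, k, client_samples, operator_samples):
        """Store the mixed audio of the k-th client and the k-th operator."""
        self._calls[k] = mix(client_samples, operator_samples)

    def select(self, k):
        """Handle a selection signal for call k (raises KeyError if unknown)."""
        return self._calls[k]
```

Here `select(k)` plays the role of the selection signal arriving from the supervisor terminal; the returned samples correspond to what would be passed on for decoding and playback.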

As described above, in the first embodiment of the present invention, the call center system provides, for each client, an emotion estimation function that estimates emotion from the audio input from the client side and a function that generates animation video from the estimation result. By showing the generated animations on the monitor watched by the supervisor, the supervisor can grasp each client's state at a glance and instruct operators efficiently.

The emotion estimation unit 231 may, based on information set in the server in advance, analyze the voice of those clients for whom an animation needs to be generated and estimate the emotion from the analysis result. The source image used to generate an animation face matching the estimation result may be a photograph of a person, for a more realistic look, or a character-like image.

Next, a second embodiment of the present invention will be described in detail with reference to FIGS. 7, 8, and 9.

FIG. 7 shows the overall configuration of the second embodiment of the present invention. It differs from the first embodiment shown in FIG. 2 in two respects. First, the call center system 101 has no animation generation function; instead, each terminal used by an operator (107-1, 107-2, ..., 107-N) is provided with its own animation face generation unit (229-1, 229-2, ..., 229-N). Second, the selection unit 245 for client call audio, which was provided in the call center system 101, is instead provided in the supervisor terminal 109 as the selection unit 315 (FIG. 9). The rest of the configuration is the same.

FIG. 8 shows the configuration of the operator terminal in the second embodiment of the present invention. The operator terminal 107 of FIG. 8 differs from the configuration of the call center system 101 of FIG. 3 in that the data obtained by encoding the synthesized animation face into video packets with the encoder 237 and the data obtained by encoding the call audio of the client and that client's operator into audio packets with the encoder 225 are multiplexed by the multiplexing/sending unit 239A and sent to the supervisor terminal 109. The rest of the structure is the same.

Next, the operation of the operator terminal 107 in the second embodiment of the present invention will be described with reference to FIG. 8. Audio sent from the call center system 101 to the operator terminal 107 is received by the audio data transmission/reception unit 221 and decoded by the decoder 223. The audio data decoded by the decoder 223 is duplicated and sent to both the encoder 225 and the emotion estimation unit 231. The emotion estimation unit 231 estimates the emotion carried by the input voice, such as whether the speaker is currently angry or laughing, and sends the estimation result to the animation face synthesis unit 235. The animation face synthesis unit 235 generates an animation face from the estimation result sent by the emotion estimation unit 231, using the various data stored in the animation face synthesis DB 233. The generated animation face is encoded into video packet data by the encoder 237 and sent to the multiplexing/sending unit 239A. The call audio of the client and that client's operator is encoded into audio packet data by the encoder 225 and likewise sent to the multiplexing/sending unit 239A. The multiplexing/sending unit 239A multiplexes the video packet data and the audio packet data and sends the result to the supervisor terminal 109.
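The processing path through the operator terminal can be sketched roughly as below. Everything concrete here is an assumption made for illustration: the loudness threshold in `estimate_emotion` is a toy stand-in and not the estimation method of the emotion estimation unit 231, `FACE_DB` is a hypothetical stand-in for the animation face synthesis DB 233, JSON is used in place of real video and audio codecs, and the frame layout (a 4-byte length prefix before the video payload) is invented for this sketch.

```python
import json

# Hypothetical stand-in for the animation face synthesis DB 233.
FACE_DB = {"angry": "face_angry", "calm": "face_calm"}


def estimate_emotion(samples, threshold=10000):
    """Toy estimator: treat loud audio as 'angry'. Not the patent's method."""
    if not samples:
        return "calm"
    loudness = sum(abs(s) for s in samples) / len(samples)
    return "angry" if loudness >= threshold else "calm"


def synthesize_face(emotion):
    """Look up the face asset for an emotion (stand-in for unit 235)."""
    return FACE_DB[emotion]


def process_frame(client_samples, mixed_call_samples):
    """Run one chunk of audio through the path and return a multiplexed frame."""
    emotion = estimate_emotion(client_samples)   # emotion estimation unit 231
    face = synthesize_face(emotion)              # animation face synthesis unit 235
    video_packet = json.dumps({"face": face}).encode()               # encoder 237
    audio_packet = json.dumps({"pcm": mixed_call_samples}).encode()  # encoder 225
    # Multiplexing/sending unit 239A: length-prefixed video, then audio.
    return len(video_packet).to_bytes(4, "big") + video_packet + audio_packet
```

The length prefix is one simple way to let a receiver split the two payloads again; a real system would more likely rely on an established container or transport format.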

With the operator terminal 107 configured in this way, calls need not pass through a call center system that has an animation generation function. Even when the client and the operator communicate peer to peer, the supervisor can grasp the situation through the animation.

FIG. 9 shows the configuration of the supervisor terminal in the second embodiment of the present invention. The supervisor terminal 109 of FIG. 9 differs from the supervisor terminal of FIG. 5 in that it has no selection signal output unit 307 and has an added demultiplexing unit 313 and selection unit 315. The rest of the configuration is the same. Each operator terminal 107 sends a signal in which the client's audio data and animation face data are multiplexed to the demultiplexing unit 313. The demultiplexing unit 313 separates the signal from each operator terminal into animation face data and client audio data. Here, the client audio data means audio data in which the voice of that client's operator has been added to the client's voice. Each client's face data is sent to the screen composition unit 301, where the screen is composed, and the composed screen is shown on the display device via the display unit 303. The call audio of each client and operator separated by the demultiplexing unit 313 is sent to the selection unit 315. The selection unit 315 selects the call audio of the client corresponding to the selection input accepted by the selection input reception unit 305 (the k-th client in this case) and of that client's operator, and sends it to the audio processing unit 309. The audio processing unit 309 performs processing such as decoding, and the audio of the k-th client and the k-th operator is output from the speaker via the audio output unit 311.
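The demultiplexing and selection performed in the supervisor terminal can be sketched as follows. The frame layout is an assumption made for illustration (a 4-byte big-endian length of the face data, the face data, then the audio data); the source does not specify a multiplexing format. The names `demultiplex` and `handle_frames` are hypothetical.

```python
def demultiplex(frame):
    """Split one frame into (face_data, audio_data).

    Assumed layout: 4-byte big-endian length of the face data, the face data,
    then the audio data up to the end of the frame.
    """
    if len(frame) < 4:
        raise ValueError("frame too short")
    face_len = int.from_bytes(frame[:4], "big")
    if len(frame) < 4 + face_len:
        raise ValueError("truncated face data")
    return frame[4:4 + face_len], frame[4 + face_len:]


def handle_frames(frames_by_operator, selected):
    """Return (face data for every call, audio of the selected call only)."""
    faces = {}
    selected_audio = None
    for k, frame in frames_by_operator.items():
        face, audio = demultiplex(frame)
        faces[k] = face              # would go to the screen composition unit 301
        if k == selected:
            selected_audio = audio   # would go to the audio processing unit 309
    return faces, selected_audio
```

All face data is kept so that every call stays visible on the composed screen, while only the audio of the selected call is passed on, mirroring the roles of the demultiplexing unit 313 and the selection unit 315.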

The first effect obtained by the present invention is that the supervisor can grasp the emotions of clients, such as joy, anger, sorrow, and pleasure, in real time and for a plurality of clients simultaneously.

The reason is that a plurality of animation faces generated from each client's voice are displayed side by side on one screen, so the handling status of every operator can be grasped at a glance without monitoring each call to check it. In supervisor call monitoring at a conventional call center, each operator's handling is monitored by listening to the audio, and one supervisor cannot monitor a plurality of operators simultaneously, so the situation could not be assessed efficiently.

The second effect is that, even when a client is gradually becoming angry without the operator noticing, or suddenly becomes angry, the supervisor can grasp the situation and respond early.

The reason is that the supervisor can judge the client's situation from a moving animation. This allows the supervisor to grasp the situation without relying solely on the operator's judgment.

The emotion expression animation face display system described above can be realized by hardware, software, or a combination thereof. The emotion expression animation face display method performed by the emotion expression animation face display system can likewise be realized by hardware, software, or a combination thereof. Here, "realized by software" means realized by a computer reading and executing a program.

The program can be stored using various types of non-transitory computer readable media and supplied to a computer. Non-transitory computer readable media include various types of tangible storage media. Examples of non-transitory computer readable media include magnetic recording media (for example, flexible disks, magnetic tapes, and hard disk drives), magneto-optical recording media (for example, magneto-optical disks), CD-ROM (Read Only Memory), CD-R, CD-R/W, and semiconductor memory (for example, mask ROM, PROM (Programmable ROM), EPROM (Erasable PROM), flash ROM, and RAM (random access memory)). The program may also be supplied to a computer by various types of transitory computer readable media. Examples of transitory computer readable media include electrical signals, optical signals, and electromagnetic waves. A transitory computer readable medium can supply the program to a computer via a wired communication path, such as an electric wire or optical fiber, or via a wireless communication path.

Some or all of the above embodiments can also be described as in the following supplementary notes, but are not limited thereto.
(Supplementary Note 1)
An emotion expression animation face display system comprising:
emotion estimation means for estimating the emotion of each of a plurality of users, each of whom is on a call with a respective one of a plurality of operators using at least voice, based on the user's voice used in the call;
animation face synthesis means for synthesizing, for each user, an animation face expressing the estimated emotion, based on the emotion estimated by the emotion estimation means; and
display means for displaying the animation faces of the respective users synthesized by the animation face synthesis means side by side so that they can be viewed together.

(Supplementary Note 2)
The emotion expression animation face display system according to Supplementary Note 1,
wherein the animation face synthesis means synthesizes an animation face expressing the estimated emotion by combining data stored in an animation synthesis database based on the emotion estimated by the emotion estimation means.

(Supplementary Note 3)
The emotion expression animation face display system according to Supplementary Note 1 or 2,
wherein the animation face synthesis means selects, for each user, a color expressing the estimated emotion based on the emotion estimated by the emotion estimation means, and adds the selected color to the animation face as a background color.

(Supplementary Note 4)
The emotion expression animation face display system according to any one of Supplementary Notes 1 to 3,
wherein the emotion estimation means and the animation face synthesis means are included in a call center system, and
the display means is included in a terminal used by a supervisor.

(Supplementary Note 5)
The emotion expression animation face display system according to any one of Supplementary Notes 1 to 3,
wherein the emotion estimation means and the animation face synthesis means are included in a terminal used by the operator, and
the display means is included in a terminal used by a supervisor.

(Supplementary Note 6)
The emotion expression animation face display system according to any one of Supplementary Notes 1 to 5,
wherein the terminal used by the supervisor further comprises monitoring means for monitoring and outputting the call between an operator selected by the supervisor and the user on that call.

(Supplementary Note 7)
An emotion expression animation face display method comprising:
an emotion estimation step of estimating the emotion of each of a plurality of users, each of whom is on a call with a respective one of a plurality of operators using at least voice, based on the user's voice used in the call;
an animation face synthesis step of synthesizing, for each user, an animation face expressing the estimated emotion, based on the emotion estimated in the emotion estimation step; and
a display step of displaying the animation faces of the respective users synthesized in the animation face synthesis step side by side so that they can be viewed together.

(Supplementary Note 8)
The emotion expression animation face display method according to Supplementary Note 7,
wherein the animation face synthesis step synthesizes an animation face expressing the estimated emotion by combining data stored in an animation synthesis database based on the emotion estimated in the emotion estimation step.

(Supplementary Note 9)
The emotion expression animation face display method according to Supplementary Note 7 or 8,
wherein the animation face synthesis step selects, for each user, a color expressing the estimated emotion based on the emotion estimated in the emotion estimation step, and adds the selected color to the animation face as a background color.

(Supplementary Note 10)
The emotion expression animation face display method according to any one of Supplementary Notes 7 to 9,
wherein the emotion estimation step and the animation face synthesis step are executed by a call center system, and
the display step is executed by a terminal used by a supervisor.

(Supplementary Note 11)
The emotion expression animation face display method according to any one of Supplementary Notes 7 to 9,
wherein the emotion estimation step and the animation face synthesis step are executed by a terminal used by the operator, and
the display step is executed by a terminal used by a supervisor.

(Supplementary Note 12)
The emotion expression animation face display method according to any one of Supplementary Notes 7 to 11,
further comprising a monitoring step of monitoring and outputting, at the terminal used by the supervisor, the call between an operator selected by the supervisor and the user on that call.

(Supplementary Note 13)
An emotion expression animation face display program causing a computer to execute:
an emotion estimation procedure of estimating the emotion of each of a plurality of users, each of whom is on a call with a respective one of a plurality of operators using at least voice, based on the user's voice used in the call;
an animation face synthesis procedure of synthesizing, for each user, an animation face expressing the estimated emotion, based on the emotion estimated by the emotion estimation procedure; and
a display procedure of displaying the animation faces of the respective users synthesized by the animation face synthesis procedure side by side so that they can be viewed together.

(Supplementary Note 14)
The emotion expression animation face display program according to Supplementary Note 13,
wherein the animation face synthesis procedure synthesizes an animation face expressing the estimated emotion by combining data stored in an animation synthesis database based on the emotion estimated by the emotion estimation procedure.

(Supplementary Note 15)
The emotion expression animation face display program according to Supplementary Note 13 or 14,
wherein the animation face synthesis procedure selects, for each user, a color expressing the estimated emotion based on the emotion estimated by the emotion estimation procedure, and adds the selected color to the animation face as a background color.

(Supplementary Note 16)
The emotion expression animation face display program according to any one of Supplementary Notes 13 to 15,
wherein the emotion estimation procedure and the animation face synthesis procedure are executed in a call center system, and
the display procedure is executed in a terminal used by a supervisor.

(Supplementary Note 17)
The emotion expression animation face display program according to any one of Supplementary Notes 13 to 15,
wherein the emotion estimation procedure and the animation face synthesis procedure are executed in a terminal used by the operator, and
the display procedure is executed in a terminal used by a supervisor.

(Supplementary Note 18)
The emotion expression animation face display program according to any one of Supplementary Notes 13 to 17,
further causing the terminal used by the supervisor to execute a monitoring procedure of monitoring and outputting the call between an operator selected by the supervisor and the user on that call.

The present invention can be used in call centers and also in videophones, video conferencing systems, and the various applications used with them.

101 Call center system
103 Gateway
105 Client terminal
107 Operator terminal
109 Supervisor terminal
121 SIP control unit
123 RTP control unit
125 Interface unit
221 Audio data transmission/reception unit
223 Decoder
225 Encoder
227 Audio data reception/sending unit
229 Animation face generation unit
231 Emotion estimation unit
233 Animation face synthesis database
235 Animation face synthesis unit
237 Encoder
239 Video data sending unit
239A Multiplexing/sending unit
245 Selection unit
301 Screen composition unit
303 Display unit
305 Selection input reception unit
307 Selection signal output unit
309 Audio processing unit
311 Audio output unit
313 Demultiplexing unit

Claims

An emotion expression animation face display system comprising:
emotion estimation means for estimating the emotion of each of a plurality of users, each of whom is on a call with a respective one of a plurality of operators using at least voice, based on the user's voice used in the call;
animation face synthesis means for synthesizing, for each user, an animation face expressing the estimated emotion, based on the emotion estimated by the emotion estimation means; and
display means for displaying the animation faces of the respective users synthesized by the animation face synthesis means side by side so that they can be viewed together.

The emotion expression animation face display system according to claim 1,
wherein the animation face synthesis means synthesizes an animation face expressing the estimated emotion by combining data stored in an animation synthesis database based on the emotion estimated by the emotion estimation means.

The emotion expression animation face display system according to claim 1 or 2,
wherein the animation face synthesis means selects, for each user, a color expressing the estimated emotion based on the emotion estimated by the emotion estimation means, and adds the selected color to the animation face as a background color.

The emotion expression animation face display system according to any one of claims 1 to 3,
wherein the emotion estimation means and the animation face synthesis means are included in a call center system, and
the display means is included in a terminal used by a supervisor.

The emotion expression animation face display system according to any one of claims 1 to 3,
wherein the emotion estimation means and the animation face synthesis means are included in a terminal used by the operator, and
the display means is included in a terminal used by a supervisor.

The emotion expression animation face display system according to any one of claims 1 to 5,
wherein the terminal used by the supervisor further comprises monitoring means for monitoring and outputting the call between an operator selected by the supervisor and the user on that call.

An emotion expression animation face display method comprising:
an emotion estimation step of estimating the emotion of each of a plurality of users, each of whom is on a call with a respective one of a plurality of operators using at least voice, based on the user's voice used in the call;
an animation face synthesis step of synthesizing, for each user, an animation face expressing the estimated emotion, based on the emotion estimated in the emotion estimation step; and
a display step of displaying the animation faces of the respective users synthesized in the animation face synthesis step side by side so that they can be viewed together.

The emotion expression animation face display method according to claim 7,
wherein the animation face synthesis step synthesizes an animation face expressing the estimated emotion by combining data stored in an animation synthesis database based on the emotion estimated in the emotion estimation step.

The emotion expression animation face display method according to claim 7 or 8,
wherein the animation face synthesis step selects, for each user, a color expressing the estimated emotion based on the emotion estimated in the emotion estimation step, and adds the selected color to the animation face as a background color.

An emotion expression animation face display program causing a computer to execute:
an emotion estimation procedure of estimating the emotion of each of a plurality of users, each of whom is on a call with a respective one of a plurality of operators using at least voice, based on the user's voice used in the call;
an animation face synthesis procedure of synthesizing, for each user, an animation face expressing the estimated emotion, based on the emotion estimated by the emotion estimation procedure; and
a display procedure of displaying the animation faces of the respective users synthesized by the animation face synthesis procedure side by side so that they can be viewed together.