JP5613102B2

JP5613102B2 - CONFERENCE DEVICE, CONFERENCE METHOD, AND CONFERENCE PROGRAM

Info

Publication number: JP5613102B2
Application number: JP2011110379A
Authority: JP
Inventors: 秀和玉木; 東野　豪
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-05-17
Filing date: 2011-05-17
Publication date: 2014-10-22
Anticipated expiration: 2031-05-17
Also published as: JP2012244285A

Description

The present invention relates to a conference device, a conference method, and a conference program for holding a remote conference via a network.

Demand for remote conferencing is growing, driven by environmental concerns and by the aim of reducing business-travel expenses in corporate activities. Conventional conference systems include, for example, telephone (audio) conference systems and Web (video) conference systems that run in a browser. With a Web conference system, participants can hold a conference while watching video of one another simply by launching a browser, setting up a Web camera, and putting on a headset. For example, Patent Document 1 describes a multipoint communication system in which the right to speak can be acquired and transferred.

Patent Document 1: JP 2004-248145 A

Such conference systems have the drawback that video and audio quality is low. For example, the video resolution is low, the displayed video size is limited by the small size of the display, and the transmission of video and audio is subject to delay. This makes it difficult to read the facial expressions and gestures of the other, remote conference participants.

One resulting problem is that participants cannot tell who is likely to speak next, so utterance collisions occur, in which several conference participants start speaking at the same time. In a conference where utterance collisions occur frequently, mental stress builds up in the participants and the progress of the conference stalls.

The technique of Patent Document 1 requires a conscious action to acquire the right to speak, which may hinder the smooth progress of the conference. In face-to-face communication, a person sometimes speaks after explicitly indicating the intention to do so, and sometimes speaks after conveying the desire to speak to those around them through natural gestures (an exchange of non-verbal information). Patent Document 1 does not consider the latter case.

The present invention has been made in view of the above circumstances, and its object is to provide a conference device, a conference method, and a conference program that allow each participant's desire to speak to be recognized easily even when the participants' facial expressions and gestures cannot be read.

To achieve the above object, the present invention provides a conference device for holding a remote conference via a network, comprising: video acquisition means for acquiring video data of a self-participant who uses the conference device; motion detection means for detecting, from the video data of the self-participant, a predetermined motion assumed to indicate a desire to speak; utterance desire level calculation means for calculating an utterance desire level based on the motion detected by the motion detection means; receiving means for receiving video data and utterance desire levels of other participants from other conference devices via the network; video generation means for generating a conference video in which the video data of the self-participant and the other participants is arranged; video editing means for setting, in the conference video, information indicating the utterance desire levels of the self-participant and the other participants; and display means for displaying the conference video edited by the video editing means.

The present invention also provides a conference method for holding a remote conference via a network, in which a conference device performs: a video acquisition step of acquiring video data of a self-participant who uses the conference device; a motion detection step of detecting, from the video data of the self-participant, a predetermined motion assumed to indicate a desire to speak; an utterance desire level calculation step of calculating an utterance desire level based on the motion detected in the motion detection step; a receiving step of receiving video data and utterance desire levels of other participants from other conference devices via the network; a video generation step of generating a conference video in which the video data of the self-participant and the other participants is arranged; a video editing step of setting, in the conference video, information indicating the utterance desire levels of the self-participant and the other participants; and a display step of displaying the conference video edited in the video editing step.

The present invention is also a conference program for causing a computer to execute the conference method.

According to the present invention, it is possible to provide a conference device, a conference method, and a conference program that allow each participant's desire to speak to be recognized easily even when the participants' facial expressions and gestures cannot be read.

FIG. 1 is an overall configuration diagram of a conference system according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the embodiment.
FIG. 3 is a diagram showing an example of a conference video.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is an overall configuration diagram of a conference system according to an embodiment of the present invention. The conference system of this embodiment allows a plurality of conference participants at remote locations (hereinafter, "participants") to hold a remote conference using a plurality of conference devices 1 connected to a network 9.

The illustrated conference system includes a plurality of conference devices 1, one used by each of the participants, and a conference server 8 connected to these conference devices 1 via the network 9.

The conference device 1 is, for example, a PC, and includes a video camera 11 that captures video of the self-participant using the conference device 1, a microphone 12 that picks up the self-participant's speech, a display device (display) 13 that displays the conference video, and a speaker 14 that outputs the voices of the other participants transmitted from the conference server 8.

The conference device 1 also includes a video acquisition unit 21, a motion detection unit 22, an utterance desire level calculation unit 23, a memory 24, an audio acquisition unit 25, a backchannel (aizuchi) detection unit 26, a video generation unit 27, a video editing unit 28, and a communication unit 29.

The video acquisition unit 21 takes in the video data captured by the video camera 11. The motion detection unit 22 detects, from the self-participant's video data, predetermined motions that are assumed to indicate a desire to speak.

The audio acquisition unit 25 takes in the audio data picked up by the microphone 12. The backchannel detection unit 26 detects backchannel responses (aizuchi, the short interjections a listener makes while someone else is speaking) from the self-participant's audio data.

The utterance desire level calculation unit 23 calculates an utterance desire level based on the motions detected by the motion detection unit 22 and the backchannel responses (aizuchi) detected by the backchannel detection unit 26, and stores the level in the memory 24.

The communication unit 29 transmits the video data taken in by the video acquisition unit 21 to the conference server 8 via the network 9 and also passes it to the video generation unit 27. The communication unit 29 likewise transmits the audio data taken in by the audio acquisition unit 25 to the conference server 8 via the network 9. It receives the other participants' audio data from the conference server 8 via the network 9 and plays it back through the speaker 14, and it receives the other participants' video data from the conference server 8 via the network 9 and passes it to the video generation unit 27. In addition, the communication unit 29 exchanges the information stored in the memory 24 with the other conference devices 1 via the network 9 and the conference server 8.

The video generation unit 27 generates a conference video in which the self-participant's video data taken in by the video acquisition unit 21 and the other participants' video data received by the communication unit 29 are arranged. The video editing unit 28 sets information indicating the utterance desire levels of the self-participant and the other participants in the conference video generated by the video generation unit 27, and displays the result on the display device 13.

The conference server 8 receives, from each conference device 1, the video data and audio data of the participant using that device together with the information in its memory 24, and transmits them to the other conference devices 1.
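The relay role of the conference server 8 can be pictured with a small in-memory sketch. This is only an illustration under assumptions: the class `RelayServer`, its `join` and `relay` methods, and the use of per-device inbox lists in place of real network connections are all hypothetical and do not come from the patent.

```python
from collections import defaultdict


class RelayServer:
    """Hypothetical stand-in for the conference server 8 (no real networking)."""

    def __init__(self) -> None:
        self.devices: set[str] = set()
        # Each device's inbox stands in for its network connection.
        self.inboxes: dict[str, list] = defaultdict(list)

    def join(self, device_id: str) -> None:
        self.devices.add(device_id)

    def relay(self, sender_id: str, payload) -> None:
        """Forward a payload (video, audio, or memory 24 information)
        to every conference device except the one that sent it."""
        for device_id in self.devices:
            if device_id != sender_id:
                self.inboxes[device_id].append((sender_id, payload))
```

The sender is excluded because, in the description, each device already has its own video and its own utterance desire level locally.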

The conference device 1 and the conference server 8 can each be implemented with, for example, a general-purpose computer system including a CPU, a memory, an external storage device such as an HDD, an input device, and an output device. In such a computer system, the functions of each device are realized by the CPU executing a predetermined program loaded into the memory. That is, the functions of the conference device 1 are realized by the CPU of the conference device 1 executing the program for the conference device 1, and the functions of the conference server 8 are realized by the CPU of the conference server 8 executing the program for the conference server 8.

The program for the conference device 1 and the program for the conference server 8 can be stored on a computer-readable recording medium such as a hard disk, flexible disk, CD-ROM, MO, or DVD-ROM, or distributed via a network.

Next, the operation of this embodiment is described.

Each participant sits in front of a conference device 1 and takes part in a remote conference via the network.

The video acquisition unit 21 takes in the video data of the self-participant captured by the video camera 11 and passes it to the motion detection unit 22.

The motion detection unit 22 detects predetermined motions of the self-participant from the video data. The motions detected here are ones assumed to indicate a desire to speak, for example "moving a hand to the mouth", "raising a hand", "nodding", and "moving the body sideways".

The motion detection unit 22 detects these motions by applying image processing to the video data. For example, a "nodding" motion is identified by detecting the face orientation and finding that it has moved up and down. A "moving a hand to the mouth" motion is identified, for example, by treating a skin-colored region that is not adjacent to the face region as the hand region and finding that this hand region has come to overlap the face region.

A "raising a hand" motion is identified, for example, by treating a skin-colored region not adjacent to the face region as the hand region and finding that the centroid of this hand region has risen above a predetermined height (threshold) along the y-axis of the image (video data). A "moving the body sideways" motion is identified, for example, by performing face detection and finding that the center point of the detected face region has moved along the x-axis of the image by more than a predetermined threshold per unit time.
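The four decision rules described above can be sketched as simple geometric tests. This is a minimal sketch, assuming the face and hand regions have already been extracted as bounding boxes by some earlier image-processing stage (face detection and skin-color segmentation are not shown). The `Box` type, all function names, and all thresholds are hypothetical, the box center is used as a stand-in for the centroid of the hand region, and nodding is approximated by the swing of a face pitch angle over recent frames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in image coordinates (y grows downward)."""
    x: float
    y: float
    w: float
    h: float

    @property
    def center(self) -> tuple[float, float]:
        return (self.x + self.w / 2, self.y + self.h / 2)

    def overlaps(self, other: "Box") -> bool:
        return (self.x < other.x + other.w and other.x < self.x + self.w
                and self.y < other.y + other.h and other.y < self.y + self.h)


def is_hand_to_mouth(face: Box, hand: Box) -> bool:
    """The hand region has come to overlap the face region."""
    return hand.overlaps(face)


def is_raised_hand(hand: Box, y_threshold: float) -> bool:
    """The hand center is above the threshold line (smaller y is higher)."""
    return hand.center[1] < y_threshold


def is_lateral_move(prev_face: Box, curr_face: Box,
                    dt: float, speed_threshold: float) -> bool:
    """The face center moved along x faster than the threshold per unit time."""
    dx = abs(curr_face.center[0] - prev_face.center[0])
    return dx / dt > speed_threshold


def is_nod(pitch_samples: list[float], min_swing: float) -> bool:
    """The face orientation moved up and down by more than min_swing."""
    return (len(pitch_samples) >= 2
            and max(pitch_samples) - min(pitch_samples) > min_swing)
```

Because image coordinates usually place the origin at the top-left corner, "above a predetermined height" becomes a smaller-than comparison on y in this sketch.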

The audio acquisition unit 25 takes in the self-participant's voice picked up by the microphone 12 and passes it to the backchannel detection unit 26. When audio data is input and the self-participant is not the speaker, the backchannel detection unit 26 treats all of the input audio data as backchannel responses (aizuchi). For example, each conference device stores in its memory 24 a speaker flag used to determine whether the self-participant is the speaker. The initial value of the speaker flag is "0" (non-speaker); when it is "1" (speaker), the self-participant is determined to be the speaker.

The conference server 8 manages the speaker flags of all conference devices (all participants) and decides who the current speaker is. For example, if audio data is input from a conference device while the speaker flags of all participants are "0", the server designates the participant of that conference device as the speaker and sets that participant's speaker flag to "1". Specifically, the conference server 8 updates the speaker flag held in the conference server 8 and sends a speaker designation notification to the designated conference device, causing the speaker flag in that device's memory 24 to be updated from "0" to "1". When audio data input from that participant ceases, the designation is cancelled and the speaker flag is updated from "1" to "0".
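The server-side speaker bookkeeping can be sketched as a small state holder. This is an illustrative sketch only: the class `SpeakerRegistry` and its method names are hypothetical, the speaker designation notification is represented merely by a boolean return value, and how the start and end of audio input are detected is left out.

```python
from typing import Optional


class SpeakerRegistry:
    """Hypothetical model of the speaker flags held by the conference server 8."""

    def __init__(self, participant_ids) -> None:
        # Every flag starts at 0 (non-speaker).
        self.flags = {pid: 0 for pid in participant_ids}

    def current_speaker(self) -> Optional[str]:
        for pid, flag in self.flags.items():
            if flag == 1:
                return pid
        return None

    def on_audio_start(self, pid: str) -> bool:
        """Designate pid as speaker only if all flags are 0.

        Returns True when a new designation was made, which is the point
        at which a speaker designation notification would be sent.
        """
        if self.current_speaker() is None:
            self.flags[pid] = 1
            return True
        return False

    def on_audio_stop(self, pid: str) -> None:
        """Cancel the designation when the speaker's audio input ceases."""
        if self.flags.get(pid) == 1:
            self.flags[pid] = 0
```

Audio from a second participant while someone already holds the flag leaves all flags unchanged, which matches the description's treatment of such audio as a backchannel response rather than a new utterance.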

The backchannel detection unit 26 determines that input audio data is a backchannel response (aizuchi) when it arrives while the self-participant's speaker flag in the memory 24 is "0" (non-speaker) and the speaker flag of some other participant is "1". In that case the self-participant's speaker flag remains "0". Concretely, when audio data is input, the backchannel detection unit 26 asks the conference server 8 whether there is another participant whose speaker flag is "1", and if there is, it determines that the audio is a backchannel response.

If audio data is input while the self-participant's speaker flag in the memory 24 is "1" (speaker), the backchannel detection unit 26 determines that it is not a backchannel response (aizuchi), without querying the conference server 8. In that case the speaker flag in the memory 24 remains "1".
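The decision made by detection unit 26 can be written as one small function. This is a sketch under assumptions: the function name and parameters are hypothetical, and the query to the conference server 8 is modeled as a callable so that the "no query when the self-participant is the speaker" rule is visible in the control flow.

```python
from typing import Callable


def is_backchannel(own_speaker_flag: int,
                   audio_present: bool,
                   other_speaker_exists: Callable[[], bool]) -> bool:
    """Classify incoming audio from the self-participant.

    other_speaker_exists stands in for asking the conference server 8
    whether another participant currently has speaker flag 1.
    """
    if not audio_present:
        return False
    if own_speaker_flag == 1:
        # The self-participant is the speaker: not a backchannel response,
        # and the server is not queried.
        return False
    return other_speaker_exists()
```

If no other participant is speaking, the audio is not classified as a backchannel response here; in the description that situation is the one in which the server designates the participant as the new speaker.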

The information on the motions detected by the motion detection unit 22 and on the backchannel responses (aizuchi) detected by detection unit 26 is then passed to the utterance desire level calculation unit 23.

FIG. 2 is a flowchart showing the processing of the utterance desire level calculation unit 23. The utterance desire level calculation unit 23 repeats the processing shown in FIG. 2 at predetermined timings.

When the motion detection unit 22 has detected any of the motions "moving a hand to the mouth", "nodding", or "moving the body sideways" (S11: YES), the utterance desire level calculation unit 23 adds "1" to the value of a counter held in a storage unit (not shown) (S12). The initial value of the counter is "0".

When the motion detection unit 22 has detected the "raising a hand" motion (S13: YES), the utterance desire level calculation unit 23 adds "5" to the counter value (S14).

When detection unit 26 has detected a backchannel response (aizuchi) (S15: YES), the utterance desire level calculation unit 23 adds "3" to the counter value (S16).

The utterance desire level calculation unit 23 then calculates the utterance desire level from the counter value accumulated in S11 to S16. For example, if the utterance desire level is divided into six levels, level 0 to level 5, the counter value is converted to one of these levels and stored in the memory 24 (S17).
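Steps S11 to S17 can be sketched as follows. The increments 1, 5, and 3 are the ones stated in the description. Everything else is an assumption made for illustration: the motion labels, the function names, and in particular the counter-to-level mapping, which the description does not define. Here a hypothetical rule of one level per three counts, capped at level 5, is used.

```python
# Increments stated in the description (steps S12, S14, S16).
MINOR_MOTION_POINTS = 1
RAISED_HAND_POINTS = 5
BACKCHANNEL_POINTS = 3

MAX_LEVEL = 5
COUNTS_PER_LEVEL = 3  # hypothetical: the source does not define the mapping

# Hypothetical labels for the motions checked in S11.
MINOR_MOTIONS = frozenset({"hand_to_mouth", "nod", "lateral_move"})


def update_counter(counter: int, motions: set[str], backchannel: bool) -> int:
    """One pass through S11 to S16 for the motions seen in this cycle."""
    if motions & MINOR_MOTIONS:      # S11: any of the three minor motions
        counter += MINOR_MOTION_POINTS
    if "raised_hand" in motions:     # S13
        counter += RAISED_HAND_POINTS
    if backchannel:                  # S15
        counter += BACKCHANNEL_POINTS
    return counter


def counter_to_level(counter: int) -> int:
    """S17: convert the counter to a level from 0 to MAX_LEVEL (assumed rule)."""
    return min(MAX_LEVEL, counter // COUNTS_PER_LEVEL)
```

With this assumed mapping, a raised hand plus one backchannel response in the same cycle (8 counts) gives level 2, and the level saturates at 5 from 15 counts upward. The description also does not say whether or when the counter is reset, so no reset is modeled.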

If the converted level value is equal to or greater than a predetermined value (for example, "5") (S18: YES), the utterance desire level calculation unit 23 decides to display a predetermined mark and stores mark display information in the memory 24 (S19). The predetermined mark can be a figure, a symbol, characters, or the like. For example, a raised-hand mark suggestive of a raised hand may be used.

As described with reference to FIG. 2, the memory 24 stores the self-participant's utterance desire level and, when that level is "5" or higher, mark display information. The communication unit 29 transmits the self-participant's utterance desire level and mark display information to all the other conference devices 1 via the network 9. The communication unit 29 also receives the utterance desire levels and mark display information of the other participants, transmitted from all the other conference devices 1 via the network 9, and stores them in the memory 24. In this way, the utterance desire level and mark display information of every participant are shared among all the conference devices 1.
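The mark decision (S18, S19) and the sharing of per-participant state can be sketched as below. The threshold of 5 follows the example in the description. The `ParticipantState` record, the function names, and the JSON encoding are assumptions chosen for illustration; the description does not specify any message format.

```python
import json
from dataclasses import asdict, dataclass

MARK_THRESHOLD = 5  # example value given in the description


@dataclass
class ParticipantState:
    participant_id: str
    level: int
    show_mark: bool  # stands in for the "mark display information"


def make_state(participant_id: str, level: int) -> ParticipantState:
    """S18/S19: flag the mark when the level reaches the threshold."""
    return ParticipantState(participant_id, level, level >= MARK_THRESHOLD)


def encode(state: ParticipantState) -> str:
    """Serialize the state for transmission (hypothetical wire format)."""
    return json.dumps(asdict(state))


def merge(shared: dict[str, ParticipantState], message: str) -> None:
    """Store a received state, playing the role of the memory 24."""
    state = ParticipantState(**json.loads(message))
    shared[state.participant_id] = state
```

Each device would call `merge` for every message it receives, so that all devices end up holding the same table of levels and mark flags.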

The video generation unit 27 and the video editing unit 28 generate and edit the conference video using the self-participant's video data captured by the video camera 11, the other participants' video data received from the conference server 8 via the network 9, and the utterance desire level and mark display information of each participant stored in the memory 24.

FIG. 3 shows an example of the conference video. As illustrated, the video generation unit 27 arranges the video data of each participant as tiles at predetermined positions. In the illustrated example, four participants are taking part in the conference.

The video editing unit 28 then edits the conference video generated by the video generation unit 27, referring to the memory 24 in which each participant's utterance desire level and mark display information are stored. Specifically, as illustrated, information indicating the utterance desire level (an indicator in the illustrated example) is placed near the corresponding participant's video data according to that participant's utterance desire level. For example, for a participant whose utterance desire level is "2", the two lowest indicator segments are lit; the indicator is lit in accordance with the level in this manner.

For a participant for whom mark display information is stored, a predetermined mark (a raised-hand mark in the illustrated example) is placed near that participant's video data. The display device 13 displays the conference video edited by the video editing unit 28.
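The tile arrangement and the level indicator can be sketched without any drawing library by computing only the geometry and the lit segments. This is an illustration under assumptions: the near-square grid rule, the function names, and the five-segment indicator are hypothetical; the description only says that tiles are placed at predetermined positions and that the indicator lights from the bottom according to the level.

```python
import math


def tile_rects(n: int, width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Return (x, y, w, h) for n participant tiles in a near-square grid."""
    if n <= 0:
        return []
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    tile_w, tile_h = width // cols, height // rows
    return [((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
            for i in range(n)]


def indicator_segments(level: int, max_level: int = 5) -> list[bool]:
    """Lit state of each indicator segment; index 0 is the bottom segment."""
    lit = max(0, min(level, max_level))
    return [i < lit for i in range(max_level)]
```

For four participants on a 640 by 480 canvas this yields a 2 by 2 grid of 320 by 240 tiles, and a level of 2 lights the two bottom segments, as in the example in the description.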

When there is a speaker (a participant whose speaker flag is "1"), the video editing unit 28 may display the speaker conspicuously, for example by surrounding the speaker's video data with a frame of a predetermined color.

In the embodiment described above, displaying information indicating each participant's utterance desire level in a remote conference via a network allows each participant's desire to speak to be recognized easily (naturally). Participants can therefore conduct the conference smoothly, taking one another's desire to speak into account, without having to operate an input device such as a mouse or keyboard to signal their intention to speak. For example, utterance collisions can be prevented and smooth speaker changes achieved.

In this embodiment, "raising a hand" is one of the motions detected by the motion detection unit 22, so a participant who wants to indicate the intention to speak explicitly can raise the utterance desire level by raising a hand, just as in a face-to-face conference.

In this embodiment, displaying a predetermined mark (for example, a raised-hand mark) near the video data of a participant whose utterance desire level is equal to or greater than a predetermined value lets the participants with a strong desire to speak be recognized easily at a glance.

The present invention is not limited to the above embodiment, and various modifications are possible within the scope of its gist. For example, in this embodiment the utterance desire level is calculated based on both the motions detected by the motion detection unit 22 and the backchannel responses (aizuchi) detected by detection unit 26. However, the level may instead be calculated based only on the motions detected by the motion detection unit 22, or based only on the backchannel responses detected by detection unit 26.

1: Conference device
11: Video camera
12: Microphone
13: Display device
14: Speaker
21: Video acquisition unit
22: Motion detection unit
23: Utterance desire level calculation unit
24: Memory
25: Audio acquisition unit
26: Backchannel (aizuchi) detection unit
27: Video generation unit
28: Video editing unit
29: Communication unit
8: Conference server
9: Network

Claims

A conference device for holding a remote conference via a network, comprising:
video acquisition means for acquiring video data of a self-participant who uses the conference device;
motion detection means for detecting, from the video data of the self-participant, a predetermined motion assumed to indicate a desire to speak;
utterance desire level calculation means for calculating an utterance desire level based on the motion detected by the motion detection means;
receiving means for receiving video data and utterance desire levels of other participants from other conference devices via the network;
video generation means for generating a conference video in which the video data of the self-participant and the other participants is arranged;
video editing means for setting, in the conference video, information indicating the utterance desire levels of the self-participant and the other participants; and
display means for displaying the conference video edited by the video editing means.

The conference device according to claim 1,
The video editing means sets a predetermined mark in the vicinity of video data of a participant whose utterance desire level is a predetermined value or more.

The conference device according to claim 1 or 2, further comprising:
audio acquisition means for acquiring audio data of the self-participant who uses the conference device; and
backchannel (aizuchi) detection means for detecting a backchannel response from the audio data of the self-participant,
wherein the utterance desire level calculation means calculates the utterance desire level based on the motion detected by the motion detection means and the backchannel response detected by the backchannel detection means.

A conference method for holding a remote conference via a network,
wherein a conference device performs:
a video acquisition step of acquiring video data of a self-participant who uses the conference device;
a motion detection step of detecting, from the video data of the self-participant, a predetermined motion assumed to indicate a desire to speak;
an utterance desire level calculation step of calculating an utterance desire level based on the motion detected in the motion detection step;
a receiving step of receiving video data and utterance desire levels of other participants from other conference devices via the network;
a video generation step of generating a conference video in which the video data of the self-participant and the other participants is arranged;
a video editing step of setting, in the conference video, information indicating the utterance desire levels of the self-participant and the other participants; and
a display step of displaying the conference video edited in the video editing step.

The conference method according to claim 4,
In the video editing step, a predetermined mark is set in the vicinity of video data of a participant whose utterance desire level is equal to or higher than a predetermined value.

The conference method according to claim 4 or 5, further comprising:
an audio acquisition step of acquiring audio data of the self-participant who uses the conference device; and
a backchannel (aizuchi) detection step of detecting a backchannel response from the audio data of the self-participant,
wherein the utterance desire level calculation step calculates the utterance desire level based on the motion detected in the motion detection step and the backchannel response detected in the backchannel detection step.

A conference program for causing a computer to execute the conference method according to claim 4.