JP5534517B2

JP5534517B2 - Utterance learning support device and program thereof

Info

Publication number: JP5534517B2
Application number: JP2010190481A
Authority: JP
Inventors: 光穗山田; 拓尾上; 篤之笠原; 翼齋藤
Original assignee: Tokai University Educational Systems
Current assignee: Tokai University Educational Systems
Priority date: 2010-08-27
Filing date: 2010-08-27
Publication date: 2014-07-02
Anticipated expiration: 2030-08-27
Also published as: JP2012047998A

Description

The present invention is used mainly for utterance learning in language study, and relates to an utterance learning support device, and a program therefor, that compares a speaker's lip movement during utterance with an exemplary lip movement and indicates points for improvement in the speaker's lip movement.

Conventionally, in Japan, a widely used method for acquiring correct pronunciation in language utterance learning has been for the speaker to imitate the lip and tongue movements described in a textbook for each vowel while listening to an instructor's pronunciation and studying alone, or to repeat the pronunciation in front of an instructor so as to approach the instructor's pronunciation.
When a deaf speaker learns pronunciation, the speaker cannot hear his or her own voice. A widely used method has therefore been for the speaker to repeat the pronunciation in front of an instructor while imitating lip and tongue movements, with the instructor listening and advising on points for improvement, so that the speaker approaches the instructor's pronunciation.

However, with the learning methods described above, a speaker who wants to check whether his or her pronunciation is correct must record the voice on tape, listen to it later, and judge for himself or herself, and it is difficult for the speaker to assess his or her own proficiency accurately. A deaf speaker, moreover, cannot hear his or her own voice and therefore cannot objectively judge how to change the lip movement to improve pronunciation.

In view of this, learning support systems for acquiring correct pronunciation have been developed that compare the audio data of a speaker pronouncing a word with the audio data of an instructor pronouncing the same word, and then score or evaluate the speaker's pronunciation or indicate points for improvement.

For example, Patent Document 1 discloses a learning server device that provides language lessons to a speaker, in which the speaker's language-learning speech transmitted from a PC (Personal Computer) is analyzed and the speaker's pronunciation is scored.
Patent Document 2 discloses a language learning device that acquires a user's voice, compares it with a model voice stored in advance, extracts the differences between the model voice and the user's voice based on the comparison result, generates emphasis instruction data specifying how the portions containing the extracted differences should be emphasized, and outputs the model voice in the manner specified by the generated emphasis instruction data.
Patent Document 3 discloses a pronunciation rating device that stores one or more items of instructor data, which is data concatenated phoneme by phoneme, and that, on receiving a speaker's voice input, divides the voice into frames, acquires one or more items of per-frame voice data, rates the speaker's voice based on the instructor data and the per-frame voice data, and outputs the rating result.

Patent Document 1: JP 2007-212558 A
Patent Document 2: JP 2007-139868 A
Patent Document 3: JP 2006-227587 A

In the English-speaking countries of Britain and America, on the other hand, acquiring correct lip movement has been regarded as important for acquiring correct pronunciation in language learning.
In those countries, an established and effective teaching method is therefore for the instructor to show the speaker the lip movement made when pronouncing a word, have the speaker imitate it, and then evaluate the speaker's lip movement against the instructor's or advise on points for improvement, bringing the speaker's lip movement closer to the instructor's so that the speaker acquires correct pronunciation.

However, the conventional utterance learning support systems described in Patent Documents 1 to 3 evaluate the speaker's pronunciation by analyzing the speech produced by the speaker, and therefore cannot indicate points for improvement in the speaker's lip movement. A technique that lets speakers objectively recognize the points for improvement in their own lip movement has therefore been desired.

The present invention has been made to solve the above problem of the prior art, and its object is to provide an utterance learning support device, and a program therefor, capable of objectively indicating points for improvement in a speaker's lip movement based on the result of comparing the speaker's lip movement with an instructor's lip movement.

To solve the above problem, the invention according to claim 1 is an utterance learning support device that obtains a speaker's lip movement from images of the lip portion captured by imaging means while the speaker utters a pre-designated word of a certain language, and that indicates points for improvement in the speaker's lip movement based on the result of comparing this lip movement with an exemplary lip movement made when an instructor uttered the word, the device comprising data storage means, utterance content designation means, image processing means, motion measurement means, data conversion means, difference calculation means, correction amount calculation means, and correction information output means.

With this configuration, the utterance learning support device stores, in the data storage means, a plurality of associations, each between at least a word the speaker is to utter and an image of the lip portion recorded when the instructor uttered that word.
The utterance learning support device also uses the utterance content designation means to designate one word for the speaker to utter from among the plurality of words stored in the data storage means, either according to external input or according to a preset order, and instructs the speaker to utter that word. For example, the utterance content designation means reads the word from the data storage means and displays it on the display device, thereby instructing the speaker to utter it.
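The word designation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `UtteranceContentSelector`, the mapping from a word to a model video path, and the wrap-around behavior of the preset order are all hypothetical.

```python
from typing import Dict, List, Optional


class UtteranceContentSelector:
    """Designate one stored word, by external input or by a preset order."""

    def __init__(self, word_to_model_image: Dict[str, str], preset_order: List[str]):
        # word_to_model_image: word -> path of the instructor's lip video
        # (hypothetical layout standing in for the data storage means).
        self.word_to_model_image = word_to_model_image
        # Keep only words that actually have a stored model image.
        self.preset_order = [w for w in preset_order if w in word_to_model_image]
        self._next_index = 0

    def designate(self, requested_word: Optional[str] = None) -> str:
        """Return the requested word, or the next word in the preset order."""
        if requested_word is not None:
            if requested_word not in self.word_to_model_image:
                raise KeyError(f"no model image stored for {requested_word!r}")
            return requested_word
        if not self.preset_order:
            raise ValueError("no words available")
        # Assumption: the preset order wraps around once exhausted.
        word = self.preset_order[self._next_index % len(self.preset_order)]
        self._next_index += 1
        return word

    def prompt(self, word: str) -> str:
        """Text shown on the display device to instruct the utterance."""
        return f'Please say "{word}"'
```

A caller would pass an explicit word when the learner chooses one, and call `designate()` with no argument to step through the preset order.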

The utterance learning support device also uses the image processing means to extract, from the images of the lip portion captured while the speaker utters the word designated by the utterance content designation means, the positions of a plurality of preset feature points that serve as references for identifying the speaker's lip movement.

Further, the utterance learning support device uses the motion measurement means to measure the change in position of each feature point extracted by the image processing means as a motion history, that is, a history of the lip movement.
The utterance learning support device then uses the data conversion means to numerically analyze the motion history measured for each feature point by the motion measurement means, converting it, for each feature point, into a motion spectrum expressed by a plurality of preset spectral components.
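The text above only says that the motion history is "numerically analyzed" into preset spectral components. As one possible reading, the sketch below assumes a discrete Fourier transform and keeps the magnitudes of the first few bins. The function names, the one-dimensional position series, and the normalization by the frame count are assumptions made for illustration.

```python
import cmath
import math
from typing import List, Sequence


def motion_history(positions: Sequence[float]) -> List[float]:
    """Displacement of one feature point relative to its first-frame position."""
    origin = positions[0]
    return [p - origin for p in positions]


def motion_spectrum(history: Sequence[float], n_components: int) -> List[float]:
    """Magnitudes of the first n_components DFT bins, normalized by frame count.

    Assumption: the 'preset spectral components' are the lowest DFT bins;
    bin 0 is the magnitude of the mean displacement.
    """
    n = len(history)
    spectrum = []
    for k in range(n_components):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(history)
        )
        spectrum.append(abs(coeff) / n)
    return spectrum
```

For a position series covering one full cosine cycle over eight frames, the energy appears in bin 1, with bin 0 reflecting the offset introduced by measuring from the first frame.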

The utterance learning support device then uses the difference calculation means to calculate, for each feature point, the difference between the motion spectrum obtained by the data conversion means and the exemplary motion spectrum obtained in advance from the images of the exemplary lip portion.
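The per-feature-point difference can be sketched as a component-wise subtraction. This is an illustration only: the feature point names, the dictionary layout, and the sign convention (learner minus model) are assumptions, not details given in the text.

```python
from typing import Dict, List

Spectrum = List[float]


def spectrum_differences(
    learner: Dict[str, Spectrum], model: Dict[str, Spectrum]
) -> Dict[str, List[float]]:
    """Component-wise difference (learner minus model) for each feature point."""
    diffs: Dict[str, List[float]] = {}
    for point, learner_spec in learner.items():
        model_spec = model[point]  # KeyError if the model lacks this point
        if len(model_spec) != len(learner_spec):
            raise ValueError(f"spectrum length mismatch for {point!r}")
        diffs[point] = [l - m for l, m in zip(learner_spec, model_spec)]
    return diffs
```

With this sign convention, a positive value means the learner's movement has more of that spectral component than the exemplary movement.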

Further, the utterance learning support device uses the correction amount calculation means to compare, for each feature point, the absolute value of the difference calculated by the difference calculation means with a predetermined threshold, and, when there is a feature point for which the absolute value of the difference exceeds the threshold, calculates by a predetermined correction function a correction amount specifying the direction and magnitude by which the movement of that feature point should be corrected.
The utterance learning support device then uses the correction information output means to output correction information corresponding to the correction amount calculated by the correction amount calculation means to the display device.
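The threshold test and correction function can be sketched as below. The actual correction function is not given in this part of the text, so `linear_correction` is a placeholder. Treating the difference as one scalar per feature point, and encoding the direction as "increase" or "decrease", are likewise assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Correction:
    feature_point: str
    direction: str    # "increase" or "decrease" the movement (assumed encoding)
    magnitude: float


def linear_correction(diff: float, gain: float = 1.0) -> float:
    """Placeholder correction function: magnitude proportional to |diff|."""
    return gain * abs(diff)


def correction_amounts(
    diffs: Dict[str, float],
    threshold: float,
    correction_fn: Callable[[float], float] = linear_correction,
) -> List[Correction]:
    """Return a correction for every feature point whose |difference| exceeds the threshold."""
    corrections = []
    for point, diff in diffs.items():
        if abs(diff) > threshold:
            # diff is assumed to be learner minus model, so a positive value
            # means the learner moves too much and should decrease the movement.
            direction = "decrease" if diff > 0 else "increase"
            corrections.append(Correction(point, direction, correction_fn(diff)))
    return corrections
```

Feature points whose difference stays within the threshold produce no correction, which matches the condition stated above that a correction amount is calculated only when the threshold is exceeded.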

In this way, the speaker's lip movement is compared with the instructor's lip movement to obtain a correction amount for the speaker's lip movement, and correction information corresponding to that correction amount is displayed on the display device, so that the speaker can objectively recognize the points for improvement in the lip movement.

The utterance learning support device according to claim 2 is the utterance learning support device according to claim 1, wherein the correction information output means superimposes, on the image of the speaker's lip portion at the position corresponding to the feature point for which the correction amount calculation means calculated the correction amount, an image indicating the direction and magnitude by which the movement of that feature point should be corrected, and outputs the composite image to the display device.

For example, CG (Computer Graphics) of a figure (for example, an arrow) specifying the direction and magnitude of the correction can be composited and displayed on the image of the speaker's lip portion at the position corresponding to the feature point whose movement is to be corrected. The CG of this figure may be stored in advance in suitable storage means, or the correction information output means may generate it each time based on the correction amount calculated by the correction amount calculation means.
This makes it easier for the speaker to grasp intuitively how to correct his or her lip movement.

The utterance learning support device according to claim 3 is the utterance learning support device according to claim 1 or claim 2, wherein the correction information output means outputs to the display device text specifying the feature point for which the correction amount calculation means calculated the correction amount and the direction and magnitude by which the movement of that feature point should be corrected.
For example, the text may be converted to synthesized speech and played through a speaker of the display device, or the text may be displayed on the display screen of the display device.
Because the text states which feature point's movement should be corrected and by how much, the speaker can more readily understand the points for improvement in his or her lip movement.
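Generating such correction text could look like the following sketch. The message wording, the "increase"/"decrease" direction labels, and the unitless magnitude are hypothetical, since the text does not specify the format of the message.

```python
def correction_text(feature_point: str, direction: str, magnitude: float) -> str:
    """Build a one-line instruction naming the feature point, direction and size."""
    # Assumed direction labels; any other value is rejected rather than guessed.
    phrases = {"increase": "Move more", "decrease": "Move less"}
    if direction not in phrases:
        raise ValueError(f"unknown direction: {direction!r}")
    return f"{phrases[direction]}: {feature_point} (correction amount {magnitude:.2f})"
```

The resulting string could be shown on the display screen or passed to a speech synthesizer, as the paragraph above describes.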

The utterance learning support program according to claim 4 causes a computer to function as utterance instruction means, image processing means, motion measurement means, data conversion means, difference calculation means, correction amount calculation means, and correction information output means, in order to obtain a speaker's lip movement from images of the lip portion captured by imaging means while the speaker utters a pre-designated word of a certain language, and to indicate points for improvement in the speaker's lip movement based on the result of comparing this lip movement with an exemplary lip movement made when an instructor uttered the word.

With this configuration, the utterance learning support program uses the utterance instruction means to select one word for the speaker to utter, either according to external input or according to a preset order, from among the plurality of words stored in data storage means that stores a plurality of associations, each between at least a word the speaker is to utter and an image of the lip portion recorded when the instructor uttered that word, and instructs the speaker to utter the selected word.

The utterance learning support program uses the image processing means to extract, from the images of the lip portion captured while the speaker utters the word indicated by the utterance instruction means, the positions of a plurality of preset feature points that serve as references for identifying the speaker's lip movement.

The utterance learning support program uses the motion measurement means to measure the change in position of each feature point extracted by the image processing means as a motion history, that is, a history of the lip movement.

The utterance learning support program uses the data conversion means to numerically analyze the motion history measured for each feature point by the motion measurement means, converting it, for each feature point, into a motion spectrum expressed by a plurality of preset spectral components.

The utterance learning support program uses the difference calculation means to calculate, for each feature point, the difference between the motion spectrum obtained by the data conversion means and the exemplary motion spectrum obtained in advance from the images of the exemplary lip portion.

The utterance learning support program uses the correction amount calculation means to compare, for each feature point, the absolute value of the difference calculated by the difference calculation means with a predetermined threshold, and, when there is a feature point for which the absolute value of the difference exceeds the threshold, calculates by a predetermined correction function a correction amount specifying the direction and magnitude by which the movement of that feature point should be corrected.

The utterance learning support program then uses the correction information output means to output correction information corresponding to the correction amount calculated by the correction amount calculation means to the display device.

The utterance learning support device and the utterance learning support program according to the present invention provide the following advantageous effects.
According to the inventions of claims 1 and 4, a correction amount is calculated based on the result of comparing the speaker's lip movement with an exemplary lip movement, and correction information corresponding to the correction amount is displayed on the display device. The points for improvement in the speaker's own lip movement can thus be shown to the speaker objectively, so that correct pronunciation can be learned effectively.
According to the inventions of claims 2 and 3, the speaker can more readily understand the points at which the lip movement should be corrected.

A diagram conceptually illustrating how a speaker learns with the utterance learning support device of the present invention, showing an example of the screen displayed before learning starts and an example of the screen displayed after learning starts.
A block diagram showing the configuration of the utterance learning support device of the present invention.
A diagram conceptually showing an example of the data structure stored by the data storage means of the utterance learning support device of the present invention.
A diagram showing the feature points extracted by the image processing means of the utterance learning support device of the present invention.
A diagram showing an example of a motion history graph generated by the motion measurement means of the utterance learning support device of the present invention.
A diagram showing an example of a motion spectrum graph of a feature point generated by the data conversion means of the utterance learning support device of the present invention.
A diagram showing an example of a motion spectrum graph of a feature point when a word is uttered that, among English words, has a vowel sequence close to Japanese vowels.
A diagram showing a comparative example of motion spectrum graphs of a feature point between English vowels in English words having vowel sequences close to Japanese vowels.
A diagram in which (a) shows the ratio of spectral components in the motion spectrum of a feature point when an instructor utters a word containing a certain English vowel, (b) shows the ratio when a speaker (learner) who has just begun learning utters a word containing the same English vowel as in (a), and (c) shows the ratio when the same speaker (learner) as in (b) utters a word containing the same English vowel as in (a) and (b) after learning has progressed to some extent.
A conceptual diagram for explaining the correction function used by the correction amount calculation means of the utterance learning support device of the present invention.
A diagram showing an example of the screen displayed when the correction information output means of the utterance learning support device of the present invention displays the points at which the speaker's lip movement should be corrected.
A flowchart showing the operation of the utterance learning support device of the present invention.
A block diagram showing the configuration of an utterance learning support device according to a modification of the embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.
[Outline of the utterance learning support device]
First, an outline of the utterance learning support device 1 of the present invention is given with reference to FIG. 1.
The utterance learning support device 1 indicates points for improvement in a speaker's lip movement based on the result of comparing the lip movement made when the speaker utters a designated word of the language being learned with the exemplary lip movement made when a teacher of that language, a native speaker whose mother tongue is that language, or the like (hereinafter simply called the instructor) uttered the same word in advance.

As shown on the right side of FIG. 1, the utterance learning support device 1 displays, on the display screen 31 of the display device 30, text data Ta indicating the word the speaker is to utter and an image TG of the exemplary lip portion recorded when the instructor uttered that word. The image TG is held in advance by the utterance learning support device 1 and is a moving image showing the sequence of movements of the instructor's lip portion as the instructor utters the word.

As shown on the left side of FIG. 1, the utterance learning support device 1 also displays on the display screen 31, as an initial screen, a screen provided with, for example, a language selection button, a difficulty selection button, and a learning start button. This initial screen is stored in advance in storage means (not shown) of the utterance learning support device 1 and is displayed on the display screen 31 of the display device 30 when the speaker starts the utterance learning support system. When the display device 30 receives, for example via a mouse, the speaker's selection of the language to learn, selection of the difficulty level, or decision to start learning, it outputs a signal indicating that input to the utterance learning support device 1. After the speaker selects the language to learn with the language selection button or the like on this initial screen, the display may transition to a vowel selection screen. This lets the speaker concentrate on vowels he or she finds difficult, which improves convenience.

When the utterance learning support device 1 receives from the display device 30 a signal indicating, for example, that the speaker selected "English" as the learning language and "normal" as the difficulty level, it displays on the display screen 31, as illustrated on the right side of FIG. 1, the text 'Please say "aunt"', which indicates the word the speaker is to utter, together with the image Gt, held in advance, of the lip portion recorded when the instructor uttered "aunt".

When the speaker utters the word designated by the utterance learning support device 1 and displayed on the display screen 31, while referring to the image TG of the exemplary lip movement displayed on the display screen 31, the imaging means 20 continuously captures the speaker's lip portion from the start of the utterance to its end.

The utterance learning support device 1 then acquires and analyzes the images of the speaker's lip portion captured by the imaging means 20, calculates a correction amount for the speaker's lip movement based on the result of comparing that analysis with the analysis of the image TG of the lip portion recorded when the instructor uttered the same word, and displays correction information corresponding to the correction amount on the display screen 31 of the display device 30.

Thus, because the utterance learning support device 1 calculates a correction amount for the speaker's lip movement and displays correction information corresponding to that amount on the display screen 31 of the display device 30, it can objectively show the speaker the points for improvement in the lip movement.

So that the speaker can check the exemplary lip movement repeatedly, the moving image may be set in advance to play back a predetermined number of times, or it may be played back again whenever the speaker inputs a signal from outside.

[Configuration of the utterance learning support device]
Next, the configuration of the utterance learning support device 1 of the present invention is described with reference to FIG. 2.
As shown in FIG. 2, the utterance learning support system includes the utterance learning support device 1, the imaging means 20, and the display device 30.
Before the configuration of the utterance learning support device 1 is described, the imaging means 20 and the display device 30 are described with reference to FIG. 1 as appropriate.

The imaging means 20 captures the lip portion of the speaker while the speaker is uttering. The images of the lip portion captured by the imaging means 20 are output to the utterance learning support device 1. The imaging means 20 may be, for example, an ordinary camera, or a stereo camera capable of detecting displacement of the lip portion in the depth direction. The images of the lip portion are captured continuously by the imaging means 20 for as long as the speaker is uttering. The lip portion is not limited to the area around the lips and may include the lower part of the speaker's face (from below the nose to the lower jaw). When the image processing means 13, described later, performs image processing using an image of the speaker's entire face, the imaging means 20 may capture the entire face. As shown in FIG. 1, the imaging means 20 may be built into the display device 30.

The display device 30 receives from the utterance learning support device 1 the text data instructing the utterance of a word and the image of the exemplary lip movement made when the instructor uttered that word, and displays them to the speaker. It has a display screen 31. For example, the display device 30 may be a PC (Personal Computer) and the display screen 31 the monitor of the PC. For convenience of illustration, FIG. 1 shows a single scene of the exemplary lip movement on the display screen 31, but in practice the exemplary lip movement is displayed as a moving image. The display device 30 may further include a speaker (not shown).

The display screen 31 is not limited to one that displays images two-dimensionally; one that displays images three-dimensionally may also be used. When the display screen 31 can display images three-dimensionally, the moving image of the exemplary lip motion is preferably displayed three-dimensionally. Three-dimensional display is well suited to learning words that contain vowels pronounced with the lips protruded, such as "u" and "o".

The utterance learning support device 1 obtains the lip motion of the speaker from the image of the lip portion, captured by the photographing means 20, while the speaker utters a pre-designated word of a given language, and indicates points for improvement in the speaker's lip motion based on the result of comparing this lip motion with an exemplary lip motion held in advance. The utterance learning support device 1 includes a data storage means 11, an utterance content designation means 12, an image processing means 13, a motion measuring means 14, a data conversion means 15, a difference calculation means 16, a correction amount calculation means 17, and a correction information output means 18. In this embodiment, the utterance learning support device 1 does not take the speaker's voice as input; it obtains the speaker's lip motion solely from the moving image captured by the photographing means 20.

The data storage means 11 stores a plurality of associations, each linking at least a word to be uttered by the speaker with an image of the lip portion captured when an instructor uttered that word. It is a general storage medium such as a nonvolatile memory (NVRAM) or a hard disk.

An example of the data structure in the data storage means 11 will now be described with reference to FIG. 3. FIG. 3 conceptually represents part of the data structure in the data storage means 11.
As shown in FIG. 3, the data storage means 11 here stores a plurality of data sets (data 1, data 2, ..., data n), classified by vowel of each language. Each data set groups together a word existing in the language, text data instructing the speaker to utter the word (text 1, text 2, ..., text n), an image of the lip portion captured when the instructor uttered the word (image 1, image 2, ..., image n), an exemplary motion spectrum for each feature point obtained from the image of the instructor's lip portion (spectrum 1, spectrum 2, ..., spectrum n), the ratio of the first component of that motion spectrum to the entire motion spectrum (ratio 1, ratio 2, ..., ratio n), and the difficulty of pronouncing the word (difficult, normal, easy). The motion spectrum and the ratio are described in detail later.

Because at least one word per vowel is sufficient, the data storage means 11 need only store at least one data set per vowel. However, as shown in FIG. 3, it is preferable to store a plurality of data sets per vowel, because the speaker can then practice a single vowel by uttering a variety of words, which improves the speaker's learning efficiency. Although FIG. 3 shows an example in which two data sets are stored per vowel, the number is not limited to two and may be chosen freely.

For example, when the language is Japanese, there are five vowels, "a, i, u, e, o", so the data storage means 11 stores predetermined Japanese words classified into words containing the vowel "a", words containing the vowel "i", and so on through words containing the vowel "o". In addition, the data storage means 11 stores data sets for each vowel that exists in the language, such as long vowels, short vowels, compound vowels, semivowels, and weak vowels.

Within a single vowel, the data storage means 11 may store the plurality of data sets arranged in a predetermined order (for example, in order of pronunciation difficulty). The data structure of FIG. 3 is only an example and is not limiting; for example, the difficulty need not be stored. The data in the data sets stored in the data storage means 11 in this way are read out as needed by the utterance content designation means 12 or the difference calculation means 16.
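The data structure outlined for FIG. 3 could be modelled roughly as follows. This is only a sketch: the class name `DataSet`, its field names, the `"en-GB"` and `"a-like"` keys, the file names, and all numeric values are hypothetical placeholders, not data from the embodiment. Only the grouping (word, text, image, spectrum, ratio, difficulty, classified by vowel) follows the description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataSet:
    """One data set (data 1 ... data n); field names are hypothetical."""
    word: str
    text: str                         # prompt text shown to the speaker
    image_path: str                   # instructor's lip-portion video (placeholder path)
    spectrum: Dict[str, List[float]]  # exemplary motion spectrum per feature point
    ratio: Dict[str, float]           # first-component ratio per feature point
    difficulty: str = "normal"        # "difficult", "normal" or "easy"


# language -> vowel class -> list of data sets (all values are placeholders)
store: Dict[str, Dict[str, List[DataSet]]] = {
    "en-GB": {
        "a-like": [
            DataSet("father", "Say: father", "father.avi",
                    {"lower_lip": [0.25, 0.75]}, {"lower_lip": 0.25}),
            DataSet("cup", "Say: cup", "cup.avi",
                    {"lower_lip": [0.5, 0.5]}, {"lower_lip": 0.5}, "easy"),
        ],
    },
}


def data_sets_for(language: str, vowel: str) -> List[DataSet]:
    """Return the data sets stored for one vowel, or an empty list if none exist."""
    return store.get(language, {}).get(vowel, [])
```

Keeping the first-component ratio alongside the spectrum mirrors the description, in which the ratio is stored in advance so that it can simply be read out when a comparison is made.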

The utterance content designation means 12 designates, from among the words of the plurality of data sets stored in the data storage means 11, one word to be uttered by the speaker, either according to an external input or according to a preset order, and instructs the speaker to utter that word.
Here, the utterance content designation means 12 also has a function of controlling what is displayed on the display screen 31 of the display device 30. The utterance content designation means 12 reads an initial screen from a storage means (not shown) and displays it on the display screen 31. When it receives from the display device 30 a signal indicating that the speaker has selected items on the initial screen and a signal indicating the decision to start learning, it selects one word to be uttered by the speaker from among the plurality of data sets stored per vowel in the data storage means 11, reads out the text data instructing utterance of that word and the image of the lip portion captured when the instructor uttered the word beforehand, and displays them on the display screen 31.

The image of the lip portion captured when the instructor uttered the word beforehand shows the instructor's exemplary lip motion. Displaying this image on the display screen 31 lets the speaker utter the word while referring to the exemplary lip motion, which improves the speaker's learning efficiency.

When designating the word to be uttered by the speaker, the utterance content designation means 12 may select a word at random from the plurality of data sets stored per vowel in the data storage means 11. For example, when the data structure shown in FIG. 3 is stored in the data storage means 11, the utterance content designation means 12 can select words at random, for instance first "aunt" and then "cup".

However, when the speaker has designated a difficulty on the initial screen, the utterance content designation means 12 refers to the difficulty in the data sets stored in the data storage means 11 and selects a word that matches the designated difficulty. Likewise, when the speaker has designated a vowel on the initial screen, the utterance content designation means 12 selects a word as appropriate from the data sets stored in the data storage means 11 in association with the designated vowel.
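The selection rules above (random by default, narrowed by a designated vowel or difficulty) might be sketched as below. The function name `select_word`, the dictionary keys, the vowel labels, and the difficulty assigned to each word are assumptions made for illustration; the description does not give them.

```python
import random

# Placeholder entries; the difficulty values are invented for the example.
ENTRIES = [
    {"word": "father", "vowel": "a-like", "difficulty": "normal"},
    {"word": "cup", "vowel": "a-like", "difficulty": "easy"},
    {"word": "look", "vowel": "u-like", "difficulty": "normal"},
    {"word": "luke", "vowel": "u-like", "difficulty": "difficult"},
]


def select_word(entries, vowel=None, difficulty=None, rng=random):
    """Pick one word at random among entries matching the optional filters.

    Returns None when no stored entry satisfies the designated vowel/difficulty.
    """
    candidates = [
        e for e in entries
        if (vowel is None or e["vowel"] == vowel)
        and (difficulty is None or e["difficulty"] == difficulty)
    ]
    if not candidates:
        return None
    return rng.choice(candidates)["word"]
```

Passing `rng=random.Random(0)` or a similar seeded generator makes the choice reproducible, which can be convenient when testing a lesson sequence.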

The image processing means 13 extracts, from the moving image of the lip portion captured by the photographing means 20, the positions of preset feature points in the lip portion.
The feature points will now be described with reference to FIG. 4. As shown in FIG. 4, the feature points may be four points, namely point A at the upper end of the lip portion (the midpoint of the two upper end points a_1 and a_2 of the upper lip), point B at the lower end, point C at the left end, and point D at the right end, or five points that additionally include the apex of the lower jaw. The four points are taken from the pixels at which the difference in pixel value between pixels having the red values of the lip portion and pixels having the skin-color values of the face portion is greatest (the boundary pixels), that is, the pixels located at the extreme coordinates of the upper lip and the lower lip. To additionally extract the apex of the lower jaw, it suffices to extract the lowermost pixel among those at which the difference in pixel value (difference in luminance) between skin-colored pixels of the jaw and skin-colored pixels of the neck is greatest.
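As a rough illustration of the four-point extraction, the sketch below starts from a binary lip mask (1 = lip pixel) instead of raw colour values. That mask, the function name `extract_feature_points`, and the use of the top-row extremes as stand-ins for a_1 and a_2 are all simplifying assumptions; the colour-difference boundary detection itself is not reproduced here.

```python
def extract_feature_points(mask):
    """Return points A (top), B (bottom), C (left), D (right) as (x, y) tuples.

    `mask` is a list of rows of 0/1 values; y grows downward as in image
    coordinates. Returns None when the mask contains no lip pixels.
    """
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pixels:
        return None

    top_y = min(y for _, y in pixels)
    top_xs = [x for x, y in pixels if y == top_y]
    # Midpoint of the outermost top-row pixels, standing in for a_1 and a_2.
    point_a = ((min(top_xs) + max(top_xs)) / 2, top_y)

    bottom_y = max(y for _, y in pixels)
    bottom_xs = [x for x, y in pixels if y == bottom_y]
    point_b = ((min(bottom_xs) + max(bottom_xs)) / 2, bottom_y)

    point_c = min(pixels, key=lambda p: p[0])  # leftmost lip pixel
    point_d = max(pixels, key=lambda p: p[0])  # rightmost lip pixel
    return {"A": point_a, "B": point_b, "C": point_c, "D": point_d}


example_mask = [
    [0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
points = extract_feature_points(example_mask)
```

In the example mask the two peaks in the top row play the role of a_1 and a_2, so point A falls midway between them.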

As long as the moving image of the lip portion continues to be input from the photographing means 20, the image processing means 13 keeps extracting the feature points from each image. That is, the image processing means 13 extracts four or five points from each image, does so for each of the images that make up the moving image (normally 30 frames per second), and outputs the extracted feature points as time-series data following the time at which the moving image of the lip portion was captured. The image processing means 13 outputs the positions of the extracted feature points to the motion measuring means 14.

Although the image processing means 13 extracts four or five feature points here, the number is not limited to this, and any number of feature points (six or more) may be extracted.

The motion measuring means 14 measures, for the positions of the feature points extracted by the image processing means 13, the change in position of each feature point as a motion history, that is, a history of the lip motion.
The motion measuring means 14 measures a motion history representing the change of each of the four or five feature points extracted by the image processing means 13. When the moving image of the lip portion is continuously input to the image processing means 13 and the feature points are extracted, but the feature points do not change for a fixed time (for example, 2 seconds), that is, when there is no change (no movement) in the moving image of the lip portion, the motion measuring means 14 determines that the speaker is not speaking.
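A minimal sketch of the motion history and the "not speaking" decision could look like this. The 30 frames per second and the 2-second window come from the description; the function names, measuring displacement relative to the first frame, and the `tolerance` parameter (how much jitter still counts as "no change") are assumptions.

```python
def motion_history(positions):
    """Displacement of one feature-point coordinate relative to the first frame."""
    start = positions[0]
    return [p - start for p in positions]


def is_speaking(positions, fps=30, still_seconds=2.0, tolerance=0.5):
    """Return False when the coordinate stayed within `tolerance` for the last
    `still_seconds` of frames, True otherwise.

    With fewer frames than one full window there is not enough evidence of
    stillness, so the speaker is treated as still speaking.
    """
    window = int(fps * still_seconds)
    if len(positions) < window:
        return True
    recent = positions[-window:]
    return max(recent) - min(recent) > tolerance
```

In practice the check would presumably be applied to every feature point, with the speaker judged silent only when none of them has moved.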

An example of a motion history (motion history graph) will now be described with reference to FIG. 5. The motion history graph shown in FIG. 5 shows the change of point B (see FIG. 4) on the lower lip, with time t (ms) on the horizontal axis and displacement y (mm) on the vertical axis. As this example shows, the lower lip first moves down over time (reaching its lowest position, approximately 51 mm, at 14 ms) and then moves back up. The motion history (motion history graph) measured by the motion measuring means 14 is output to the data conversion means 15.

Returning to FIG. 2, the data conversion means 15 numerically analyzes the motion history (motion history graph) measured by the motion measuring means 14 and thereby converts it into a motion spectrum graph represented by a plurality of preset spectral components. The data conversion means 15 converts each of the four or five motion histories measured by the motion measuring means 14 into a motion spectrum graph. In this embodiment a Fourier transform is used for the numerical analysis, but other methods may be adopted instead, such as function approximation, in which the motion history is treated as a function and approximated, or analysis of the motion history based on the amount of movement of the feature point in each minute time interval.
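The conversion from a motion history to a motion spectrum can be illustrated with a plain discrete Fourier transform. The sketch below uses only the standard library; removing the mean, scaling by the number of samples, and keeping components 1 to n/2 are choices made for this example, and the function name `motion_spectrum` is hypothetical.

```python
import cmath
import math


def motion_spectrum(history):
    """Amplitude of DFT components 1..n//2 of a motion history.

    Index 0 of the result is the first component (one cycle per analysed
    window); the mean is removed so a constant offset does not contribute.
    """
    n = len(history)
    mean = sum(history) / n
    centered = [v - mean for v in history]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        spectrum.append(abs(coeff) / n)
    return spectrum


# One open-and-close cycle of a feature point over 32 frames (synthetic data).
one_cycle = [-math.cos(2 * math.pi * t / 32) for t in range(32)]
spectrum = motion_spectrum(one_cycle)
```

For this synthetic single cycle almost all of the amplitude lands in the first component, which is the kind of characteristic peak the description attributes to the open-and-close motion of a vowel.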

The motion spectrum graph thus has a characteristic peak; in this example, the detected spectrum has a peak at frequency "1". This characteristic peak arises because the lip portion moves as follows when the speaker speaks. The lip portion is first opened at the start of the utterance, deformed into a certain shape, and then closed after a single vowel has been pronounced, and this series of motions is repeated. In pronouncing a single vowel, the series of motions from the opening of the lip portion to its closing makes the motion history graph highly periodic, although there are some individual differences. As a result, a characteristic frequency always appears when a vowel is pronounced.

When the data conversion means 15 uses the Fourier transform to identify the motion spectrum graph, the various patterns that appear (spectral patterns) differ only in the position of the frequency, and the shape of the pattern does not change, even if the speed of motion (speech rate) of the speaker becomes faster or slower. In other words, the data conversion means 15 can obtain a motion spectrum graph corresponding to the utterance content regardless of the speech rate at which the speaker speaks.

Because the four feature points at the upper, lower, left, and right ends of the lip portion, or the five feature points including the lower jaw, each move in a distinctive way, the motion histories of the four or five feature points differ greatly depending on the vowel being pronounced. Consequently, the motion spectrum graphs obtained by converting the motion history graphs of these feature points also differ greatly depending on the vowel, and can therefore be distinguished according to the vowel being pronounced. For example, British English is said to have 24 vowels, and it has been confirmed that, for each of these British English vowels (hereinafter also referred to as "English vowels"), the motion history of the lip motion can be converted into a motion spectrum graph by spectral analysis (Onoe et al., "Lip motion analysis for British English vowels," Annual Convention of the Institute of Image Information and Television Engineers, 2009, 7-3; Onoe et al., "Lip motion analysis for British English vowels (Part 2)," Winter Convention of the Institute of Image Information and Television Engineers, 2009, 5-9).

As shown in FIG. 7, in the motion spectrum graph of the left end of the lips, a peak exists at frequency "1" for both the vowel series resembling "a" and the vowel series resembling "u", but the vowel series resembling "u" shows a strong motion spectrum whereas the vowel series resembling "a" shows a weak one. The motion spectrum graph of the left end of the lips thus differs greatly between the vowel series resembling "a" and the vowel series resembling "u". Although not illustrated here, for "look" and "luke" in the vowel series resembling "u", the motion spectrum graphs of the upper lip differ. In this way, the feature point at which a characteristic spectrum is obtained differs from vowel to vowel, and a different motion spectrum is obtained for each.

Next, with reference to FIG. 8, examples of the motion spectrum graphs obtained when different words are uttered will be described for each of the English vowels close to the Japanese "a, i, u, e, o". In FIG. 8, frequency is taken on the vertical axis, and word pairs containing English vowels are arranged along the horizontal axis in order from the left, from the English vowel close to "a" to the English vowel close to "o". For example, the word pair containing the English vowel close to "a" is "father" and "cup". In FIG. 8, the labels "lower lip", "left end of lips", and "lower jaw" shown beneath each word pair indicate the feature point at which a characteristic motion spectrum appears for that word pair.

Returning to FIG. 2, the difference calculation means 16 calculates the difference between the motion spectrum, analyzed by the data conversion means 15, obtained when the speaker utters a given word and the exemplary motion spectrum obtained when the instructor utters the same word.
Among the plurality of spectral components of the motion spectrum, the lip motion is reflected mainly in the first component. In other words, for lip motion, the first component of the motion spectrum is information indicating the magnitude of the movement of the feature point.

As shown in FIG. 9(a), in the motion spectrum graph of a native speaker, the first component accounts for approximately 25 percent of the motion spectrum when the entire motion spectrum is taken as 100 percent. In contrast, as shown in FIG. 9(b), in the motion spectrum graph of a speaker who has just begun learning, the first component accounts for approximately 60 percent of the entire motion spectrum. As shown in FIG. 9(c), in the motion spectrum graph of the same speaker as in (b) after learning has progressed to some extent, the proportion of the entire motion spectrum accounted for by the first component has decreased to approximately 40 percent.

FIGS. 9(a) to 9(c) show that, even when words containing the same vowel are uttered, differences in lip motion between speakers produce differences in the ratio of the first component to the entire motion spectrum. They also show that, as learning progresses and the lip motion improves, the ratio of the first component to the entire motion spectrum for the speaker approaches the corresponding ratio for the native speaker.

As described above, the difference between the lip motion of the speaker and the lip motion of the instructor appears, when words containing the same vowel are uttered, as a difference in the ratio of the first component of each motion spectrum to the entire motion spectrum.
Because the difference between the speaker's ratio and the instructor's ratio represents the difference in pronunciation between the speaker and the instructor, this difference can be used to determine the amount by which the speaker's lip motion should be corrected.

Accordingly, for each feature point, the difference calculation means 16 calculates the difference obtained by subtracting the ratio of the first component to the entire motion spectrum when the instructor utters the word from the ratio of the first component to the entire motion spectrum when the speaker utters the same word, the word being the one designated by the utterance content designation means 12.

When the difference calculated by subtracting the instructor's ratio from the speaker's ratio is negative, the speaker opens the mouth too little relative to the instructor's exemplary lip motion. Likewise, when the calculated difference is positive, the speaker opens the mouth too much relative to the instructor's exemplary lip motion. The sign of the difference therefore also indicates the direction in which the lip motion should be corrected (toward larger or smaller movement).
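The ratio and its difference can be written out in a few lines. In this sketch the ratio is taken as the first component's amplitude divided by the sum of all component amplitudes; whether amplitudes or powers are summed is not stated in the description, so that choice, like the function names, is an assumption. The two example spectra are synthetic values chosen only to reproduce the approximate 25 percent and 60 percent figures cited for FIG. 9.

```python
def first_component_ratio(spectrum):
    """Share of the whole spectrum carried by its first component (0.0 to 1.0)."""
    total = sum(spectrum)
    return spectrum[0] / total if total else 0.0


def ratio_difference(speaker_spectrum, instructor_spectrum):
    """Speaker's ratio minus instructor's ratio for one feature point.

    Positive: movement too large; negative: movement too small.
    """
    return (first_component_ratio(speaker_spectrum)
            - first_component_ratio(instructor_spectrum))


native = [25.0, 25.0, 25.0, 25.0]   # first component is about 25 % of the total
beginner = [60.0, 20.0, 20.0]       # first component is about 60 % of the total
delta = ratio_difference(beginner, native)
```

Here `delta` comes out positive, which under the sign convention above would call for a smaller movement of that feature point.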

In the following, where the differences for the individual feature points need to be distinguished, the difference for the upper lip is denoted ΔU, the difference for the lower lip ΔD, the difference for the left end of the lips ΔL, the difference for the right end of the lips ΔR, and the difference for the lower jaw ΔJ. Further, when the image processing means 13 has extracted a feature point in the depth direction at the center of the lips, the difference in the depth direction at the center of the lips is denoted ΔDepth.
The difference for each feature point calculated by the difference calculation means 16 is output to the correction amount calculation means 17.

The instructor's exemplary motion spectrum can be obtained in advance as follows. The instructor is asked to utter all the vowels of the language (or words containing those vowels), and the instructor's lip portion is photographed each time by the photographing means 20. The image processing means 13 extracts the positions of the preset feature points from the images acquired by the photographing means 20, the motion measuring means 14 measures the motion histories of the feature points, and the data conversion means 15 numerically analyzes the motion histories to obtain a motion spectrum for each vowel.
The difference calculation means 16 can then calculate, for each vowel, the ratio of the first component of the motion spectrum to the entire motion spectrum.

Here, as described above, the instructor's ratio is stored in the data storage means 11 in association with the motion spectrum, so the difference calculation means 16 reads it out as needed when the speaker's motion spectrum is input.

The correction amount calculation means 17 calculates the correction amount for the speaker's lip motion based on the difference calculated by the difference calculation means 16.
The correction amount calculation means 17 compares the absolute value of the difference for each feature point, calculated by the difference calculation means 16, with a threshold predetermined for each feature point. For a feature point whose absolute difference is determined to be larger than the threshold, it calculates the correction amount according to a correction function predetermined for each feature point.
For a feature point whose absolute difference is determined to be smaller than the threshold, on the other hand, the motion is regarded as sufficiently close to that of the instructor's feature point, and no correction of the speaker's feature point motion is instructed. One threshold value is determined in advance for each vowel, for example by experiment.
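The threshold comparison amounts to a per-feature-point filter. In the sketch below the keys `"U"`, `"D"`, `"L"`, `"R"`, `"J"` echo the ΔU, ΔD, ΔL, ΔR, ΔJ notation, while the function name and every numeric value are hypothetical; real thresholds would be the experimentally determined ones.

```python
def points_needing_correction(differences, thresholds):
    """Keep only the feature points whose |difference| exceeds its threshold."""
    return {
        name: delta
        for name, delta in differences.items()
        if abs(delta) > thresholds[name]
    }


# Illustrative values only.
differences = {"U": 0.25, "D": -0.03, "L": -0.2, "R": 0.05, "J": 0.0}
thresholds = {"U": 0.1, "D": 0.1, "L": 0.1, "R": 0.1, "J": 0.1}
to_correct = points_needing_correction(differences, thresholds)
```

Feature points filtered out here are the ones regarded as already close enough to the instructor's motion, so no instruction would be issued for them.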

The correction function applied in this embodiment will now be described with reference to FIG. 10.
FIG. 10 is a graph of the correction function showing the relationship between the difference ΔU for the upper lip and the correction amount f1 for the motion of the upper lip, with the correction amount f1 on the vertical axis and the difference ΔU on the horizontal axis. In FIG. 10, f1max indicates the correction amount for the motion of the speaker's upper lip when the upper lip is moved the most in pronouncing the designated vowel, f1min indicates the correction amount when the upper lip is moved the least in pronouncing the designated vowel, and f1(ΔU) is the correction formula that connects these two points with a straight line. f1max in FIG. 10 is the correction amount when the upper lip is moved the most, and it differs from person to person. f1min corresponds to the correction amount when the upper lip is not moved. These values need not be measured in order to carry out the present invention; f1max and f1min conceptually indicate that (ΔU) has an upper limit and a lower limit. Note, however, that for each of the feature points (left end of the lips, right end of the lips, upper lip, lower lip, and lower jaw), f1min does not necessarily correspond to a state of no movement and also covers movement in the opposite direction, for example pursing the mouth for a sound that calls for spreading the mouth sideways.

In FIG. 10, -Thu is the threshold for determining whether to correct the upper lip toward opening wider, and +Thu is the threshold for determining whether to correct the upper lip toward opening less. When absΔU, the absolute value of the difference ΔU, is smaller than the threshold Thu (that is, when ΔU falls within the range from -Thu to +Thu), either the corresponding correction amounts f1(+Thu) to f1(-Thu) are not presented to the speaker as correction amounts, or the processing of substituting ΔU into the correction function is not performed.

Because the physical size of the lips and the amount of lip motion differ from person to person, when the correction function described above is applied to a speaker, normalization is performed with reference to the instructor's amount of lip motion stored in advance in the data storage means 11. As one example of the method, the distance between the left end and the right end of the lips with the mouth naturally closed, that is, the width, and the distance between the upper lip and the lower lip, that is, the height, are obtained for both the instructor and the speaker, and the instructor's and speaker's widths and heights are compared with each other to obtain a ratio for each. The ratios obtained in this way are then used to normalize the amount of lip motion of the speaker measured during utterance learning.
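One way to read this normalization is as a pair of scale factors, one horizontal and one vertical, derived from the closed-mouth dimensions. The sketch below assumes the factors are instructor size divided by speaker size and that the measured displacement is multiplied by them; the description fixes neither detail, and the function names and numbers are illustrative.

```python
def normalization_ratios(instructor_width, instructor_height,
                         speaker_width, speaker_height):
    """Horizontal and vertical scale factors from closed-mouth lip dimensions."""
    return (instructor_width / speaker_width,
            instructor_height / speaker_height)


def normalize_motion(dx, dy, ratios):
    """Rescale a measured speaker displacement to the instructor's lip size."""
    ratio_x, ratio_y = ratios
    return dx * ratio_x, dy * ratio_y


# Illustrative closed-mouth sizes in millimetres (not measured data).
ratios = normalization_ratios(50.0, 20.0, 40.0, 25.0)
scaled = normalize_motion(4.0, 5.0, ratios)
```

The ratios would only need to be computed once per speaker, before the lesson, and then reused for every measured frame.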

When correction is performed linearly as shown in FIG. 10, in accordance with the correction function f1(ΔU) normalized for the speaker, and the difference ΔU is larger than the predetermined threshold +Thu, the correction amount for the upper lip movement is increased, in the direction of reducing the movement of the speaker's upper lip, as ΔU moves away from +Thu. Conversely, when ΔU is smaller than the threshold −Thu, the correction amount is increased, in the direction of enlarging the upper lip movement, as ΔU moves away from −Thu. In this way, a correction amount that specifies the direction and magnitude of the correction needed to bring the speaker's upper lip movement closer to the instructor's can be calculated in accordance with the correction function f1(ΔU).
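As a rough sketch of the linear correction with a threshold band, the function below returns no correction inside −Th to +Th and a signed, linearly growing correction outside it. The sign convention (difference = speaker minus instructor), the `gain` slope, and the symmetric `limit` standing in for the conceptual bounds f1min and f1max are assumptions made for illustration; the function name is hypothetical.

```python
def correction_amount(delta: float, threshold: float,
                      gain: float = 1.0, limit: float = 1.0) -> float:
    """Signed correction for one feature point.

    delta:     speaker value minus instructor value (assumed convention)
    threshold: half-width of the band, e.g. Thu for the upper lip
    Returns 0.0 inside the band, a negative value when the movement
    should be reduced, and a positive value when it should be enlarged.
    """
    if abs(delta) <= threshold:
        return 0.0                        # within -Th..+Th: nothing to indicate
    raw = -gain * delta                   # straight line through the origin
    return max(-limit, min(limit, raw))   # clamp to the assumed bounds


for d in (-0.4, -0.05, 0.05, 0.4):
    print(d, correction_amount(d, threshold=0.1))
```

The same function can be reused for the other feature points by passing the threshold that applies to each of them.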

As another example, when the correction amount for the movement of the left end of the lips is obtained by linear correction as in FIG. 10, in accordance with the correction function for the left end of the lips normalized for the speaker, and the difference ΔL (not shown) is larger than a predetermined threshold Thl (not shown), the correction amount for the movement of the left end of the lips is increased, in the direction of reducing that movement, as ΔL moves away from Thl. Conversely, when ΔL is smaller than the predetermined threshold −Thl (not shown), the correction amount is increased, in the direction of enlarging the movement of the left end of the lips, as ΔL moves away from −Thl.

As a further example, when the correction amount for the depth-direction movement of the center of the lips is obtained by linear correction as in FIG. 10, in accordance with the depth-direction correction function for the center of the lips normalized for the speaker, and the difference ΔDepth (not shown) is larger than a predetermined threshold Thdepth (not shown), the correction amount for the depth-direction movement of the center of the lips is increased, in the direction of drawing the center of the speaker's lips in, as ΔDepth moves away from Thdepth. Conversely, when ΔDepth is smaller than the predetermined threshold −Thdepth (not shown), the correction amount is increased, in the direction of protruding the center of the lips, as ΔDepth moves away from −Thdepth. Correction amounts for the other feature points can be calculated in the same way, each in accordance with its corresponding correction function. The correction amount calculation means 17 outputs the calculated correction amount to the correction information output means 18.

The correction information output means 18 outputs the correction amount calculated by the correction amount calculation means 17 as correction information in a form that the speaker can recognize. This correction information indicates the direction and magnitude of the correction for a feature point. Here, patterns generated in advance according to the correction amount are stored in a storage means (not shown) and read out as needed by the correction information output means 18. However, the correction information output means 18 may instead have a function of generating a pattern each time according to the correction amount calculated by the correction amount calculation means 17.

The direction of correction is indicated as follows. For the upper lip, an upward arrow indicates that it should be moved further upward, and a downward arrow indicates that its upward movement should be reduced. Likewise, for the lower lip, a downward arrow indicates that it should be moved further downward, and an upward arrow indicates that its downward movement should be reduced. For the left end of the lips, a rightward arrow indicates that it should be moved in the direction of pursing the mouth, and a leftward arrow indicates that it should be moved in the opening direction. Conversely, for the right end of the lips, a leftward arrow indicates movement in the pursing direction and a rightward arrow indicates movement in the opening direction. For the lower jaw, a downward arrow indicates that it should be moved further downward, and an upward arrow indicates that its downward movement should be reduced. For the center of the lips, a correction toward protruding more is indicated, on a flat display, by an arrow drawn in perspective so that it appears to protrude and, on a stereoscopic display, by an arrow that pops out with parallax. A correction toward protruding less is indicated by an arrow in the opposite direction. The size of the arrow varies with the correction amount and may be determined empirically, for example in three levels (large, medium, small) or two levels (large, small).
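The arrow conventions above lend themselves to a simple lookup table. The sketch below is illustrative only: the feature and action labels, the `cutoffs` values for the three size levels, and the function names are hypothetical, since the text leaves the size levels to be determined empirically.

```python
# (feature point, requested change) -> arrow direction, as listed in the text.
ARROW_DIRECTION = {
    ("upper_lip", "more"): "up",
    ("upper_lip", "less"): "down",
    ("lower_lip", "more"): "down",
    ("lower_lip", "less"): "up",
    ("left_end", "purse"): "right",
    ("left_end", "open"): "left",
    ("right_end", "purse"): "left",
    ("right_end", "open"): "right",
    ("lower_jaw", "more"): "down",
    ("lower_jaw", "less"): "up",
}


def arrow_size(amount: float, cutoffs=(0.1, 0.3)) -> str:
    """Bucket a correction amount into three levels (cutoffs are assumed)."""
    magnitude = abs(amount)
    if magnitude < cutoffs[0]:
        return "small"
    if magnitude < cutoffs[1]:
        return "medium"
    return "large"


def arrow_for(feature: str, action: str, amount: float):
    """Return (direction, size) of the arrow for one feature point."""
    return ARROW_DIRECTION[(feature, action)], arrow_size(amount)


print(arrow_for("left_end", "purse", 0.5))
```

The depth-direction arrows for the center of the lips are omitted here because their rendering depends on whether the display is flat or stereoscopic.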

This pattern may take any form as long as the speaker can comprehend it and it can indicate the direction and magnitude of the correction for the feature point. For example, the pattern may be audio that can be played back by a loudspeaker (not shown) of the display device 30, or text that can be displayed on the display screen 31 of the display device 30.

As another example, the pattern may be a CG (computer graphics) figure, such as an arrow, that indicates the direction and magnitude of the correction, and the correction information may be a composite image in which this CG figure is composited onto the image of the speaker's lip portion at the position corresponding to the feature point for which the correction amount calculation means 17 calculated the correction amount. When the figure is an arrow, its orientation can indicate the direction of the correction and its length can indicate the magnitude: the longer the arrow, the larger the correction amount, and the shorter the arrow, the smaller the correction amount. When the correction amount is expressed by arrow length, the arrow length is set in advance according to the correction amount.
When the CG arrow is composited onto the image of the speaker's lip portion, the base of the arrow should be aligned with the position corresponding to the feature point whose movement is to be corrected. This makes it easier for the speaker to grasp intuitively, from the orientation of the arrow (whether it points toward the inside or the outside of the lips), which part of the lips should be corrected and how.

If the display screen 31 can display images stereoscopically, the pattern may be a three-dimensional CG figure. In that case, when the correction amount calculation means 17 has calculated a correction amount for the depth-direction movement of the center of the lips, the correction information output means 18 can composite a three-dimensional CG arrow indicating the direction and magnitude of the correction onto the center of the lips to form the correction information. The direction of the correction can then be indicated by whether the tip of the arrow points toward the front of the screen or toward the back.

When the correction information is a composite image as described above, the correction information output means 18, on receiving the correction amount of a feature point from the correction amount calculation means 17, acquires an image of the speaker's lip portion from the photographing means 20, reads the CG figure corresponding to that correction amount from the storage means (not shown), and generates the composite image by compositing the CG figure onto the image of the speaker's lip portion. When the correction information is a pattern stored in advance in the storage means (not shown), the correction information output means 18, on receiving the correction amount of a feature point from the correction amount calculation means 17, reads the pattern (audio, text, or the like) corresponding to that correction amount from the storage means.
The correction information generated by the correction information output means 18, or read from the storage means (not shown), is then output to the display device 30.

The correction information output from the correction information output means 18 to the display device 30 is displayed on the display screen 31 of the display device 30 or played back by a loudspeaker (not shown) of the display device 30, which allows the speaker to recognize objectively the points for improvement in his or her lip movement.

The screen layout displayed when the correction information output means 18 outputs correction information to the display device 30 and that information is shown on the display screen 31 will now be described with reference to FIG. 11. In this example, it is assumed that the utterance learning support device 1 has determined, by the correction amount calculation means 17, that the movement of the left end of the speaker's lips is larger than that of the instructor, has calculated a correction amount for reducing that movement, and has output correction information corresponding to this correction amount to the display device 30 by the correction information output means 18.

It is also assumed here that the utterance learning support device 1 has output to the display device 30, by the correction information output means 18, two items of correction information: an image in which a CG figure (here, an arrow) indicating the magnitude and direction of the correction is composited at the position corresponding to the left end of the lips in the image of the speaker's lip portion, and text indicating the magnitude and direction of the correction.

As shown in FIG. 11, the display screen 31 displays, as correction information output from the correction information output means 18 of the utterance learning support device 1, an image Gs in which an arrow Y of a predetermined length, extending from the left end of the lips toward the right end, is composited at the position corresponding to the left end of the lips in the image of the speaker's lip portion. In addition, below the image Gs, text data Tb reading "Please move the left end of your lips a little less" is displayed as correction information output from the correction information output means 18. Displaying the correction information on the display screen 31 of the display device 30 in this way allows the speaker to recognize objectively the points for improvement in his or her own lip movement. Note that correcting the movement of one feature point may cause the movement of another feature point to deviate from the instructor's movement. In that case, it is determined whether the difference for that feature point is within the threshold. If it is not, a correction amount is calculated and shown to the speaker as correction information, and when the speaker utters the word again after checking the correction information, it is determined once more whether the difference is within the threshold. The speaker continues training until the differences between the movements of all feature points and the movements of the instructor's corresponding feature points fall within the thresholds.

Each means of the utterance learning support device 1 described above can also be implemented on a computer as a function program, and the function programs can be combined and run as an utterance learning support program.

[Operation of the utterance learning support device]
Next, the operation of the utterance learning support system including the utterance learning support device 1 will be described with reference to FIG. 12.
In the utterance learning support system, the utterance content designation means 12 of the utterance learning support device 1 receives from the display device 30 a signal indicating that the speaker has decided to start learning a certain language (step S11). In response to the signal received in step S11, the utterance content designation means 12 then selects one word for the speaker to utter from among the words contained in the plurality of data sets for that language stored in the data storage means 11, reads the text data instructing utterance of the word and the image of the instructor's lip portion when uttering the word, and displays them on the display screen 31 of the display device 30 (step S12). If audio data for the word is also stored in the data storage means 11, the utterance content designation means 12 may additionally read that audio data and play it back from a loudspeaker (not shown) of the display device 30.

Next, the utterance learning support system photographs, by the photographing means 20, the speaker's lip portion while the speaker utters the word instructed in step S12 (step S13).
When the image processing means 13 of the utterance learning support device 1 receives the image of the speaker's lip portion photographed in step S13, it extracts from that image the positions of the preset feature points that serve as references for the lip movement (step S14).

The utterance learning support system then measures, by the motion measurement means 14 of the utterance learning support device 1, the motion history of the feature points whose positions were extracted in step S14 (step S15). The utterance learning support device 1 then performs, by the data conversion means 15, Fourier analysis of the motion history of each feature point measured in step S15 and generates a motion spectrum for each feature point (step S16).
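Step S16 can be illustrated with a plain discrete Fourier transform of one feature point's position trace. This is only a sketch under stated assumptions: the trace is a uniformly sampled 1-D coordinate, the "first component" is taken to be the lowest non-DC frequency, and the function name `motion_spectrum` is hypothetical. The source does not specify the transform length or how components are selected.

```python
import cmath
import math


def motion_spectrum(history, n_components):
    """Amplitudes of the first n_components non-DC DFT terms of a 1-D trace."""
    n = len(history)
    mean = sum(history) / n
    centered = [x - mean for x in history]   # remove the resting position
    amplitudes = []
    for k in range(1, n_components + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        amplitudes.append(2 * abs(coeff) / n)  # valid for k < n / 2
    return amplitudes


# Hypothetical trace: one open-close cycle of the upper lip over 32 frames.
trace = [3.0 + math.sin(2 * math.pi * t / 32) for t in range(32)]
print([round(a, 3) for a in motion_spectrum(trace, 4)])
```

A library FFT would be the usual choice in practice; the direct sum is used here only to keep the example dependency-free.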

Further, the utterance learning support system obtains, by the difference calculation means 16 of the utterance learning support device 1, the ratio of the first component of the motion spectrum to the entire motion spectrum in the motion spectrum graph of each feature point generated in step S16, and calculates, for each feature point, the difference between this ratio and the corresponding ratio obtained from the instructor's exemplary motion spectrum graph stored in advance in the data storage means 11 (step S17).
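The ratio comparison of step S17 might look like the following sketch. It assumes that a spectrum is a list of non-negative amplitudes with the first component at index 0, and that the difference is taken as speaker minus instructor; both the sign convention and the function names are assumptions for illustration.

```python
def first_component_ratio(spectrum):
    """Share of the whole spectrum taken up by the first component."""
    total = sum(spectrum)
    if total == 0:
        return 0.0            # no movement at all: define the ratio as zero
    return spectrum[0] / total


def ratio_difference(speaker_spectrum, instructor_spectrum):
    """Speaker ratio minus instructor ratio for one feature point."""
    return (first_component_ratio(speaker_spectrum)
            - first_component_ratio(instructor_spectrum))


# Hypothetical spectra for one feature point.
print(ratio_difference([6.0, 2.0, 2.0], [4.0, 3.0, 3.0]))
```

Because both ratios lie between 0 and 1, the resulting difference lies between −1 and +1, which makes a single threshold per feature point straightforward to apply.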

The utterance learning support system then determines for each feature point, by the correction amount calculation means 17 of the utterance learning support device 1, whether the absolute value of the difference calculated in step S17 is larger than the threshold predetermined for the vowel of the word uttered by the speaker (step S18). If it is determined that there is a feature point whose absolute difference is larger than the predetermined threshold (Yes in step S18), the correction amount calculation means 17 substitutes the difference for that feature point into the predetermined correction function, calculates a correction amount that specifies the direction and magnitude of the correction according to the difference (step S19), and the process proceeds to step S20.
If, on the other hand, it is determined that there is no feature point whose absolute difference is larger than the predetermined threshold (No in step S18), the utterance learning support device 1 ends the process without calculating a correction amount by the correction amount calculation means 17.

Then, for each feature point whose correction amount was calculated in step S19, the utterance learning support system, by the correction information output means 18 of the utterance learning support device 1, either reads correction information corresponding to the correction amount from the storage means (not shown) or generates correction information corresponding to the correction amount, and outputs it to the display device 30 (step S20).

The display device 30 of the utterance learning support system then receives the correction information output from the utterance learning support device 1 in step S20. If the received correction information is an image or text, the display device 30 displays it on the display screen 31; if it is audio, the display device 30 plays it back from a loudspeaker (not shown) (step S21).
The utterance learning support system then returns to step S13 and photographs, by the photographing means 20, the speaker's lip portion while the speaker, referring for example to the correction information displayed on the display screen 31, again utters the word already designated by the utterance content designation means 12. In this way, the utterance learning support system repeats steps S13 to S21 until it is determined in step S18 that no feature point has an absolute difference larger than the predetermined threshold.
As described above, the utterance learning support system can present the correction amount for the lip movement to the speaker in a form that the speaker can recognize objectively.
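The repetition of steps S13 to S21 can be summarized by the loop below. It is a sketch only: `measure_differences` and `show_correction` are hypothetical callables standing in for steps S13 to S17 and steps S20 to S21, the correction is a plain sign-flipped difference, and the `max_rounds` cap is an added safeguard that the source does not describe.

```python
def training_loop(measure_differences, thresholds, show_correction, max_rounds=10):
    """Repeat until no feature point exceeds its threshold.

    Returns the number of the round in which every difference fell within
    its threshold, or None if max_rounds was reached first.
    """
    for round_no in range(1, max_rounds + 1):
        diffs = measure_differences()                    # S13 to S17
        over = {name: d for name, d in diffs.items()
                if abs(d) > thresholds[name]}            # S18
        if not over:
            return round_no                              # No in S18: finish
        for name, d in over.items():
            show_correction(name, -d)                    # S19 to S21
    return None


# Hypothetical learner whose upper lip difference shrinks with each attempt.
attempts = iter([{"upper_lip": 0.3}, {"upper_lip": 0.15}, {"upper_lip": 0.05}])
print(training_loop(lambda: next(attempts), {"upper_lip": 0.1},
                    lambda name, amount: print("correct", name, amount)))
```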

According to the utterance learning support device 1 of the present embodiment, the correction amount for the speaker's lip movement, obtained as a result of comparing the speaker's lip movement with the instructor's, can be presented in a form that the speaker can recognize objectively. The speaker can therefore recognize objectively the points for improvement in his or her own lip movement, which contributes to effective learning and teaching.
The utterance learning support device 1 of the present embodiment can also help people who are deaf or hard of hearing from birth, such as deaf speakers, to acquire correct pronunciation when learning oral speech.
Furthermore, the utterance learning support device 1 of the present embodiment can be used for lip movement rehabilitation, aimed at pronouncing words correctly, by patients whose movement of the left or right part of the lips has been impaired by hemiplegia resulting from cerebral infarction, stroke, or the like.

An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment and can be modified as appropriate without departing from the spirit of the invention.
For example, in the embodiment the correction function applied when the correction amount calculation means 17 obtains the correction amount is a linear expression as shown in FIG. 10, but the function is not limited to this. To make the correction amount large when the absolute value of the difference is far from the threshold and small when it is close to the threshold, a second-order polynomial may be used. To give priority to correcting subtle lip movements when the absolute value of the difference is close to the threshold, a square root (an expression of order 0.5) or a logarithmic polynomial may be used.
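The effect of swapping the linear correction function for a quadratic, square-root, or logarithmic one can be compared with a small sketch. The function name, the `shape` labels, the use of `log1p` for the logarithmic variant, and the assumption that differences lie between −1 and +1 are all illustrative choices, not details taken from the source.

```python
import math


def shaped_correction(delta: float, threshold: float, shape: str = "linear") -> float:
    """Signed correction whose growth with |delta| depends on `shape`."""
    if abs(delta) <= threshold:
        return 0.0
    magnitude = abs(delta)
    if shape == "linear":
        amount = magnitude
    elif shape == "quadratic":
        amount = magnitude ** 2          # emphasizes large deviations
    elif shape == "sqrt":
        amount = math.sqrt(magnitude)    # emphasizes small deviations
    elif shape == "log":
        amount = math.log1p(magnitude)   # also flattens large deviations
    else:
        raise ValueError(f"unknown shape: {shape}")
    return -math.copysign(amount, delta)  # oppose the sign of the difference


for shape in ("linear", "quadratic", "sqrt", "log"):
    near = shaped_correction(0.2, 0.1, shape)
    far = shaped_correction(0.8, 0.1, shape)
    print(shape, round(near, 3), round(far, 3))
```

Comparing the near-threshold and far-from-threshold outputs for each shape shows which deviations a given function stresses.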

In the embodiment, words are stored in text form in the data storage means 11, and the word to be uttered is indicated to the speaker by having the utterance content designation means 12 read the text data of the word from the data storage means 11 and display it on the display screen 31 of the display device 30. However, the invention is not limited to this. Audio data obtained by speech synthesis of each word may be stored in the data storage means 11 in association with the word, and the utterance content designation means 12 may read this audio data and play it back from a loudspeaker (not shown) built into the display device 30 to indicate the word to be uttered. Both audio and text may also be used.

When a stereo camera is used as the photographing means 20, for example, the horizontal coordinate difference of the lower lip between the left and right cameras constituting the stereo camera is calculated as the parallax, and camera parameters known in advance, such as the inter-camera distance of the stereo camera, are applied to calculate the depth-direction movement of the lower lip by the principle of stereo measurement. The depth-direction movement of the top of the lips can be calculated in the same way.
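For a rectified, parallel stereo pair, the standard triangulation relation is Z = f·B/d, where f is the focal length in pixels, B the inter-camera distance, and d the horizontal parallax. The sketch below applies that relation to one feature point; the rectified-camera assumption, the parameter values, and the function name are illustrative and are not specified in the source.

```python
def depth_from_disparity(x_left: float, x_right: float,
                         focal_px: float, baseline_m: float) -> float:
    """Distance (in metres) to a point seen at x_left and x_right (pixels)."""
    disparity = x_left - x_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity


# Hypothetical lower-lip coordinates in two consecutive frames.
z_before = depth_from_disparity(400.0, 304.0, focal_px=800.0, baseline_m=0.06)
z_after = depth_from_disparity(402.0, 302.0, focal_px=800.0, baseline_m=0.06)
print(z_before, z_after, z_after - z_before)  # negative change: moved toward the camera
```

Tracking the change in Z over successive frames gives the depth-direction motion history for the feature point.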

Next, an utterance learning support device 1B according to a modification of the embodiment will be described with reference to FIG. 13. In addition to the configuration of the utterance learning support device 1 of the embodiment, the utterance learning support device 1B of the modification includes an evaluation value calculation means 19. In the following description, components that are the same as in the embodiment are given the same reference numerals, and their description is omitted.

As shown in FIG. 13, the utterance learning support device 1B includes the data storage means 11, the utterance content designation means 12, the image processing means 13, the motion measurement means 14, the data conversion means 15, the difference calculation means 16, the correction amount calculation means 17, the correction information output means 18, and the evaluation value calculation means 19.

The evaluation value calculation means 19 calculates an evaluation value of the speaker's lip movement based on the result of comparing the speaker's lip movement with the instructor's lip movement.
Here, as shown in equation (1), the evaluation value calculation means 19 multiplies every spectral component of the speaker's motion spectrum graph obtained by the data conversion means 15 by its respective weight to compute a total spectrum, likewise multiplies every spectral component of the instructor's motion spectrum graph stored in the data storage means 11 by its respective weight to compute a total spectrum, and calculates, as the evaluation value X, the proportion of the total spectrum occupied by the first component of the speaker's motion spectrum relative to that of the instructor.

In equation (1), T_1 denotes the first component of the instructor's motion spectrum, and P_1 denotes the first component of the motion spectrum of the speaker (learner). The weights W_i in equation (1) can be set as appropriate, but the weight W_1 applied to the first component of the motion spectrum is set larger than the weights of the other components.

By calculating the evaluation value X with the evaluation value calculation means 19 in this way, the speaker can recognize objectively, from the change in the evaluation value before and after practice, whether his or her lip movement is approaching that of the instructor.
The evaluation value X calculated by the evaluation value calculation means 19 is output to the display device 30 and displayed on the display screen 31. The evaluation value X may be displayed on the display screen 31 together with the correction information output from the correction information output means 18 to the display device 30, or only one of the two may be displayed. Displaying the evaluation value X together with the correction information is preferable because it makes it easier for the speaker to recognize the points for improvement in his or her lip movement.

DESCRIPTION OF SYMBOLS
1, 1B Utterance learning support device
11 Data storage means
12 Utterance content designation means
13 Image processing means
14 Motion measurement means
15 Data conversion means
16 Difference calculation means
17 Correction amount calculation means
18 Correction information output means
19 Evaluation value calculation means
20 Photographing means
30 Display device
31 Display screen

Claims

An utterance learning support device that obtains the lip movement of a speaker from images, captured by photographing means, of the lip portion when the speaker utters a predetermined word in a given language, and that indicates points for improvement in the speaker's lip movement based on the result of comparing that lip movement with the exemplary lip movement of an instructor uttering the word, the device comprising:
data storage means for storing a plurality of associations between at least a word to be uttered by the speaker and images of the lip portion when the instructor utters the word;
utterance content designation means for designating, from among the plurality of words stored in the data storage means, a word to be uttered by the speaker, either by external input or in a preset order, and for instructing the speaker to utter the word;
image processing means for extracting, from the images of the lip portion when the speaker utters the word designated by the utterance content designation means, the positions of a plurality of preset feature points that serve as references for specifying the speaker's lip movement;
motion measurement means for measuring the change in position of each feature point extracted by the image processing means as a motion history, that is, a history of the lip movement;
data conversion means for converting the motion history of each feature point measured by the motion measurement means into a motion spectrum represented by a plurality of preset spectral components for each feature point;
difference calculation means for calculating the difference between the motion spectrum of each feature point obtained by the data conversion means and the exemplary motion spectrum of each feature point obtained in advance from images of the exemplary lip portion;
correction amount calculation means for comparing, for each feature point, the absolute value of the difference calculated by the difference calculation means with a predetermined threshold and, when there is a feature point for which the absolute value of the difference is larger than the predetermined threshold, calculating, by a predetermined correction function, a correction amount specifying a direction and magnitude for correcting the motion of that feature point; and
correction information output means for outputting, to a display device, correction information corresponding to the correction amount calculated by the correction amount calculation means.
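The chain from motion history to correction amount recited above can be sketched for a single feature point as follows. This is an illustration under stated assumptions, not the claimed implementation: the motion spectrum is assumed to be the magnitudes of the first few discrete Fourier transform components of a one-dimensional position history, the difference is taken on the first component only, and the correction function is assumed to be a simple linear gain. All function names, the threshold, and the sample data are hypothetical.

```python
import cmath
import math
from typing import List, Optional, Sequence, Tuple


def motion_spectrum(history: Sequence[float], n_components: int) -> List[float]:
    """Magnitudes of DFT components 1..n_components of a position history.

    The mean is removed first so that the resting position does not
    contribute; this is an assumed convention, not taken from the source.
    """
    n = len(history)
    mean = sum(history) / n
    centered = [x - mean for x in history]
    spectrum = []
    for k in range(1, n_components + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered)
        )
        spectrum.append(abs(coeff) / n)
    return spectrum


def correction_amount(
    learner: Sequence[float],
    model: Sequence[float],
    threshold: float,
    gain: float = 1.0,
) -> Optional[Tuple[str, float]]:
    """Return (direction, magnitude), or None when within the threshold."""
    diff = learner[0] - model[0]  # first component only (assumption)
    if abs(diff) <= threshold:
        return None
    direction = "increase" if diff < 0 else "decrease"
    return direction, gain * abs(diff)  # hypothetical linear correction function


# Hypothetical vertical positions of one feature point over 8 frames.
frames = range(8)
learner_history = [2.0 * math.cos(2 * math.pi * t / 8) for t in frames]
model_history = [4.0 * math.cos(2 * math.pi * t / 8) for t in frames]

learner_spec = motion_spectrum(learner_history, 3)
model_spec = motion_spectrum(model_history, 3)
print(correction_amount(learner_spec, model_spec, threshold=0.5))
```

Here the learner moves the feature point with half the amplitude of the model, so the sketch reports that the motion should be increased; a learner whose spectrum lies within the threshold of the model yields no correction.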

The utterance learning support device according to claim 1, wherein the correction information output means synthesizes, onto the image of the speaker's lip portion, an image indicating the direction and magnitude for correcting the motion of the feature point, at the position corresponding to the feature point for which the correction amount was calculated by the correction amount calculation means, and outputs the result to the display device.

The utterance learning support device according to claim 1 or claim 2, wherein the correction information output means outputs, to the display device, text specifying the feature point for which the correction amount was calculated by the correction amount calculation means and the direction and magnitude for correcting the motion of that feature point.
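A text form of the correction information could look like the following minimal sketch. The function name, the feature point label, the direction wording, and the unit are all hypothetical; the source states only that the text identifies the feature point and the direction and magnitude of the correction.

```python
def correction_text(
    feature_point: str, direction: str, magnitude: float, unit: str = "mm"
) -> str:
    """Format one correction as a line of text for the display device."""
    return f"{feature_point}: move {direction} by {magnitude:.1f} {unit}"


# Hypothetical feature point label and correction values.
print(correction_text("upper lip center", "up", 2.5))
```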

An utterance learning support program that, in order to obtain the lip movement of a speaker from images, captured by photographing means, of the lip portion when the speaker utters a predetermined word in a given language, and to indicate points for improvement in the speaker's lip movement based on the result of comparing that lip movement with the exemplary lip movement of an instructor uttering the word, causes a computer to function as:
utterance content designation means for designating, from among a plurality of words stored in data storage means that stores a plurality of associations between at least a word to be uttered by the speaker and images of the lip portion when the instructor utters the word, one word to be uttered by the speaker, either by external input or in a preset order, and for instructing the speaker to utter the word;
image processing means for extracting, from the images of the lip portion when the speaker utters the word designated by the utterance content designation means, the positions of a plurality of preset feature points that serve as references for specifying the speaker's lip movement;
motion measurement means for measuring the change in position of each feature point extracted by the image processing means as a motion history, that is, a history of the lip movement;
data conversion means for converting the motion history of each feature point measured by the motion measurement means into a motion spectrum represented by a plurality of preset spectral components for each feature point;
difference calculation means for calculating the difference between the motion spectrum of each feature point obtained by the data conversion means and the exemplary motion spectrum of each feature point obtained in advance from images of the exemplary lip portion;
correction amount calculation means for comparing, for each feature point, the absolute value of the difference calculated by the difference calculation means with a predetermined threshold and, when there is a feature point for which the absolute value of the difference is larger than the predetermined threshold, calculating, by a predetermined correction function, a correction amount specifying a direction and magnitude for correcting the motion of that feature point; and
correction information output means for outputting, to a display device, correction information corresponding to the correction amount calculated by the correction amount calculation means.