JP7381054B2

JP7381054B2 - Speech training system, speech training method and program

Info

Publication number: JP7381054B2
Application number: JP2019148071A
Authority: JP
Inventors: Tatsuya Kitamura (北村 達也)
Original assignee: Konan University
Current assignee: Konan University
Priority date: 2019-08-09
Filing date: 2019-08-09
Publication date: 2023-11-15
Anticipated expiration: 2039-08-09
Also published as: JP2019197236A

Description

The present invention relates to a speech training system, a speech training method, and a program.

Japanese Unexamined Patent Application Publication No. H07-319380 (Patent Document 1) discloses a vocal training device. In this device, instructions based on the difference between the articulation of the trainee's utterance and the articulation of a model utterance are fed back to the trainee. By following these instructions during training, the trainee can effectively correct his or her articulation (see Patent Document 1).

Patent Document 1: Japanese Unexamined Patent Application Publication No. H07-319380

In the vocal training device disclosed in Patent Document 1, feedback to the trainee is based on the voice uttered by the trainee. However, the present inventor has found that feedback based only on the trainee's voice does not necessarily result in effective speech training.

The present invention has been made to solve this problem, and its object is to provide a speech training system, a speech training method, and a program that enable more effective speech training of a trainee.

A speech training system according to one aspect of the present invention is used for speech training of a trainee. The speech training system includes imaging means and display means. The imaging means captures images of the trainee's face and generates moving image data. The display means displays the moving image represented by the moving image data, and superimposes on that moving image an image indicating the amount of the trainee's mouth movement.

The present inventor has found that when speech training is performed with a conscious effort to move the muscles around the mouth widely, the range of motion of the speech organs expands and the clarity of the trainee's speech improves. With this speech training system, an image indicating the amount of the trainee's mouth movement is superimposed on the moving image, so the trainee can see whether the mouth movement is insufficient. As a result, the trainee can train while consciously moving the muscles around the mouth widely, and the speech training becomes more effective.

In the above speech training system, the display means may further display a sentence for the trainee to read aloud.

With this speech training system, the sentence to be read aloud is displayed, so the trainee can perform speech training simply by reading the displayed sentence aloud.

In the above speech training system, the display means may further display an evaluation result regarding the trainee's speech.

With this speech training system, the evaluation result regarding the trainee's speech is displayed, so the trainee can perform speech training while checking the evaluation result.

In the above speech training system, the display means may further display a warning message when the trainee's speech does not meet a predetermined requirement.

With this speech training system, a warning message is displayed when the trainee's speech does not meet the predetermined requirement, so the trainee can see that his or her speech does not meet the requirement.

In the above speech training system, the image indicating the amount of mouth movement may be a line indicating the trajectory along which the mouth has moved.

The above speech training system may further include calculation means that calculates an optical flow based on the moving image data, and generation means that generates the image indicating the amount of mouth movement based on the optical flow.

A speech training method according to another aspect of the present invention trains a trainee in speech. The speech training method includes the steps of capturing images of the trainee's face and generating moving image data, displaying the moving image represented by the moving image data, and displaying an image indicating the amount of the trainee's mouth movement superimposed on the moving image.

With this speech training method, an image indicating the amount of the trainee's mouth movement is superimposed on the moving image, so the trainee can see whether the mouth movement is insufficient. As a result, the trainee's speech training can be performed more effectively.

A program according to another aspect of the present invention is used for speech training of a trainee. The program causes a computer to execute the steps of causing imaging means to capture images of the trainee's face and generate moving image data, causing display means to display the moving image represented by the moving image data, and causing the display means to display an image indicating the amount of the trainee's mouth movement superimposed on the moving image.

When this program is executed by a computer, an image indicating the amount of the trainee's mouth movement is superimposed on the moving image, so the trainee can see whether the mouth movement is insufficient. As a result, the trainee's speech training can be performed more effectively.

According to the present invention, it is possible to provide a speech training system, a speech training method, and a program that enable more effective speech training of a trainee.

FIG. 1 is a diagram showing an example of a speech training scene using a smartphone.
FIG. 2 is a diagram showing an example of the hardware configuration of the smartphone.
FIG. 3 is a diagram showing an example of the relationship between the software modules realized by the control unit.
FIG. 4 is a flowchart showing the execution procedure of the moving image display processing.
FIG. 5 is a diagram showing an example of an image displayed on the display.
FIG. 6 is a flowchart showing the execution procedure of the optical flow display processing.
FIG. 7 is a flowchart showing the execution procedure of the muscle activity amount display processing.
FIG. 8 is a flowchart showing the execution procedure of the voice feature amount display processing.
FIG. 9 is a flowchart showing the execution procedure of the warning message display processing.
FIG. 10 is a diagram showing an example of an image displayed on the display.
FIG. 11 is a diagram showing the amplitude of speech recorded before and after training.
FIG. 12 is a diagram showing the range of change in the fundamental frequency of speech recorded before and after training.
FIG. 13 is a diagram showing the VAS measured before and after training.

Embodiments of the present invention are described in detail below with reference to the drawings. Identical or corresponding parts in the drawings are given the same reference numerals, and their description is not repeated.

[1. Overview]
A survey by the present inventor found that approximately 30% of healthy university and graduate students are aware of having difficulty speaking. When the present inventor tried various speech training methods, it was found that a high training effect may be obtained when the trainee performs speech training while holding a thin stick in the mouth. In particular, the present inventor found that a high training effect is obtained when the trainee speaks loudly and moves the facial muscles firmly during speech training.

FIG. 1 is a diagram showing an example of a speech training scene using a smartphone 100 according to the present embodiment. As shown in FIG. 1, during speech training the trainee 10 speaks while holding a stick 20 in the mouth. The trainee 10 performs speech training while viewing the images displayed on the smartphone 100. As described in detail later, the smartphone 100 displays images that prompt the trainee 10 to speak loudly and to move the facial muscles firmly. The smartphone 100 is described in detail below.

[2. Hardware configuration]
FIG. 2 is a diagram showing an example of the hardware configuration of the smartphone 100. As shown in FIG. 2, the smartphone 100 includes a camera 130, a display 140, a microphone 150, a speaker 160, a control unit 170, a storage unit 180, and a communication module 190. The components of the smartphone 100 are electrically connected via a bus.

The camera 130 is configured to capture an image of a subject and generate image data. For example, the camera 130 captures images of the trainee 10 (FIG. 1) and generates moving image data. The camera 130 includes an image sensor such as a CMOS image sensor or a CCD image sensor.

The display 140 is configured to display images. For example, the display 140 displays the moving image represented by the moving image data generated by the camera 130. The display 140 is, for example, a liquid crystal display or an organic EL display.

The microphone 150 is configured to generate audio data based on the sound around the microphone 150. For example, the microphone 150 generates audio data based on the voice uttered by the trainee 10.

The speaker 160 is configured to output the sound represented by audio data. For example, the speaker 160 outputs the sound represented by audio data generated from the voice of the trainee 10.

The control unit 170 includes a CPU (Central Processing Unit) 172, a RAM (Random Access Memory) 174, a ROM (Read Only Memory) 176, and the like, and is configured to control each component according to information processing.

The storage unit 180 is, for example, a memory such as a flash memory. The storage unit 180 is configured to store, for example, a control program 181. The control program 181 is a program for controlling the smartphone 100 and is executed by the control unit 170. When the control unit 170 executes the control program 181, the control program 181 is loaded into the RAM 174. The control unit 170 then controls each component by having the CPU 172 interpret and execute the control program 181 loaded into the RAM 174.

The communication module 190 is configured to communicate with external devices. The communication module 190 includes, for example, an LTE (Long Term Evolution) module and a wireless LAN module.

[3. Software configuration]
FIG. 3 is a diagram showing an example of the relationship between the software modules realized by the control unit 170. As shown in FIG. 3, a face region extraction unit 131, a pixel movement amount calculation unit 132, a face movement amount correction unit 133, a muscle activity amount estimation unit 134, a first determination unit 135, a voice feature extraction unit 151, and a second determination unit 152 are each software modules realized by the control unit 170 executing the control program 181.

The face region extraction unit 131 is configured to extract the region corresponding to the face of the trainee 10 based on the moving image data generated by the camera 130. Any of various known methods may be used to extract the face region.

The pixel movement amount calculation unit 132 is configured to calculate the optical flow of each region based on the moving image data generated by the camera 130. Any of various known methods may be used to calculate the optical flow. Each region may consist of a single pixel of the image or of a plurality of pixels. The pixel movement amount calculation unit 132 also generates, for each region, an image indicating the magnitude of the calculated optical flow, and outputs the generated image to the display 140.
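The description leaves the optical flow method open ("various known methods"). As one possible illustration only, the sketch below uses exhaustive block matching on small grayscale frames, treating each block of pixels as one region. The function name `block_flow` and the parameters `block` and `search` are assumptions introduced here and do not appear in the source.

```python
def block_flow(prev, curr, block=4, search=2):
    """Estimate one (dx, dy) motion vector per block by exhaustive block matching.

    prev, curr: grayscale frames given as equally sized lists of rows of numbers.
    Returns a dict mapping each block's top-left corner (x, y) to its (dx, dy).
    This is an illustrative stand-in for "a known optical flow method".
    """
    height, width = len(prev), len(prev[0])
    flow = {}
    for y0 in range(0, height - block + 1, block):
        for x0 in range(0, width - block + 1, block):
            best_key, best_vec = None, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    # Skip candidate positions that fall outside the frame.
                    if not (0 <= y0 + dy <= height - block
                            and 0 <= x0 + dx <= width - block):
                        continue
                    # Sum of absolute differences between the two blocks.
                    cost = sum(
                        abs(prev[y0 + j][x0 + i] - curr[y0 + dy + j][x0 + dx + i])
                        for j in range(block)
                        for i in range(block)
                    )
                    # On equal cost, prefer the smaller displacement.
                    key = (cost, abs(dx) + abs(dy))
                    if best_key is None or key < best_key:
                        best_key, best_vec = key, (dx, dy)
            flow[(x0, y0)] = best_vec
    return flow
```

A production implementation would more likely rely on an optimized dense or sparse optical flow routine from an image processing library; block matching is shown only because it fits in a few lines and makes the "one flow vector per region" idea concrete.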

The face movement amount correction unit 133 is configured to calculate the amount and direction of movement of the face region extracted by the face region extraction unit 131, and to subtract that movement from the optical flow calculated by the pixel movement amount calculation unit 132. This yields an optical flow that reflects the movement of the facial muscles with the movement of the face as a whole removed.
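A minimal sketch of this correction follows. It assumes that the movement of the face region is measured as the displacement of the centre of the face bounding box between two frames; the source does not specify how the movement is measured, and the function names are hypothetical.

```python
def face_motion(prev_box, curr_box):
    """Displacement of the face-box centre between two frames.

    Boxes are (x, y, width, height). Using the box centre is an assumption
    made for this sketch, not something the description prescribes.
    """
    px, py, pw, ph = prev_box
    cx, cy, cw, ch = curr_box
    return (cx + cw / 2 - (px + pw / 2), cy + ch / 2 - (py + ph / 2))


def correct_for_face_motion(flow, motion):
    """Subtract the whole-face motion vector from every region's flow vector."""
    fx, fy = motion
    return {region: (dx - fx, dy - fy) for region, (dx, dy) in flow.items()}
```

For example, if the face box moves by (2, 1) and a region near the mouth has a raw flow of (3, 1), the corrected flow for that region is (1, 0), which is the part of the motion attributable to the facial muscles.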

The muscle activity amount estimation unit 134 is configured to estimate the amount of facial muscle movement of the trainee 10 by calculating the sum of the optical flow magnitudes of the regions. In other words, the muscle activity amount estimation unit 134 is configured to estimate the amount of mouth movement of the trainee 10. The estimated amount of facial muscle movement (the sum of the optical flow magnitudes of the regions) is output to the display 140.
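The estimate described here is a sum of vector magnitudes, which can be sketched directly. The function name `muscle_activity` and the dictionary representation of the per-region flow are assumptions made for illustration.

```python
import math


def muscle_activity(flow):
    """Sum of the optical flow magnitudes over all regions.

    flow: dict mapping a region identifier to its (dx, dy) flow vector,
    ideally after whole-face motion has been subtracted.
    """
    return sum(math.hypot(dx, dy) for dx, dy in flow.values())
```

With flow vectors (3, 4), (0, 0) and (-6, 8), the magnitudes are 5, 0 and 10, so the estimated amount is 15.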

The first determination unit 135 is configured to determine whether the amount of facial muscle movement estimated by the muscle activity amount estimation unit 134 has remained smaller than a first predetermined amount for a predetermined time. The first predetermined amount is the value below which the expected speech training effect is not obtained. When the amount of facial muscle movement has remained smaller than the first predetermined amount for the predetermined time, a first warning image is output to the display 140.
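The "below a threshold for a predetermined time" test can be sketched as a small stateful detector. The class name, the use of explicit timestamps in seconds, and the concrete threshold values in the usage note are assumptions; the source gives no numeric values.

```python
class SustainedLowDetector:
    """Reports True once a value has stayed below a threshold long enough."""

    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._below_since = None  # time at which the value first dropped below

    def update(self, value, now):
        """Feed one measurement taken at time `now` (seconds)."""
        if value >= self.threshold:
            # Back at or above the threshold: restart the timer.
            self._below_since = None
            return False
        if self._below_since is None:
            self._below_since = now
        return now - self._below_since >= self.hold_seconds
```

The same detector would serve for the second determination unit 152 by feeding it the voice feature amount instead of the muscle activity amount. For instance, with a hypothetical threshold of 10.0 and a hold time of 2 seconds, values of 5 at t = 0, 1 and 2 seconds produce False, False and True.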

The voice feature extraction unit 151 is configured to extract a feature amount of the voice uttered by the trainee 10 based on the audio data generated by the microphone 150. For example, the voice feature extraction unit 151 extracts the loudness of the voice uttered by the trainee 10. The voice feature extraction unit 151 also generates an image indicating the extracted loudness and outputs the generated image to the display 140.
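The source names loudness as an example feature but does not say how it is computed. One common choice, assumed here, is the root-mean-square (RMS) level of a short block of samples expressed in decibels relative to full scale. The function name and parameters are hypothetical.

```python
import math


def rms_level_db(samples, full_scale=1.0, floor_db=-120.0):
    """RMS level of a block of audio samples in dB relative to full scale.

    samples: sequence of floats, nominally in [-full_scale, full_scale].
    Returns floor_db for empty or silent input to avoid log10(0).
    """
    if not samples:
        return floor_db
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms <= 0.0:
        return floor_db
    return max(floor_db, 20.0 * math.log10(rms / full_scale))
```

A full-scale square wave gives 0 dB, and a signal at one tenth of that amplitude gives about -20 dB. Perceptual loudness measures exist as well, but a plain RMS level is usually sufficient for a simple level meter.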

The second determination unit 152 is configured to determine whether the voice feature amount extracted by the voice feature extraction unit 151 has remained smaller than a second predetermined amount for a predetermined time. The second predetermined amount is the value below which the expected speech training effect is not obtained. When the voice feature amount has remained smaller than the second predetermined amount for the predetermined time, a second warning image is output to the display 140.

[4. Operation]
In the smartphone 100 according to the present embodiment, the control unit 170 executes the moving image display processing, optical flow display processing, muscle activity amount display processing, voice feature amount display processing, and warning message display processing in parallel. Each process is described in turn below.

(4-1. Moving image display processing)
FIG. 4 is a flowchart showing the execution procedure of the moving image display processing. The processing shown in this flowchart is executed at a predetermined cycle.

Referring to FIG. 4, the control unit 170 controls the camera 130 to capture a moving image including the face of the trainee 10 and generate moving image data, and controls the microphone 150 to generate audio data including the voice of the trainee 10 (step S100). The control unit 170 extracts the face region included in the moving image based on the generated moving image data (step S110). The control unit 170 controls the display 140 to display the sentence to be read by the trainee 10 and a frame surrounding the extracted face region superimposed on the moving image represented by the moving image data (step S120). Text data representing the sentence to be read by the trainee 10 is stored in advance in, for example, the storage unit 180 (FIG. 2).

FIG. 5 is a diagram showing an example of an image displayed on the display 140. As shown in FIG. 5, the display 140 shows a moving image including the trainee 10, a face frame 200 surrounding the face region of the trainee 10, and a sentence 210 to be read by the trainee 10. Because the smartphone 100 displays the sentence 210 on the display 140, the trainee 10 can perform speech training simply by reading the displayed sentence aloud.

(4-2. Optical flow display processing)
FIG. 6 is a flowchart showing the execution procedure of the optical flow display processing. The processing shown in this flowchart is executed at a predetermined cycle.

Referring to FIG. 6, the control unit 170 calculates the optical flow of each region based on the moving image data generated in the moving image display processing (step S200). The control unit 170 generates, for each region, an image indicating the magnitude and direction of the optical flow (step S210). The control unit 170 controls the display 140 to display the generated image superimposed on the moving image (step S220).
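Step S210 can be pictured as turning each region's flow vector into a line segment to be drawn over the frame. The sketch below only computes segment end points and leaves the actual drawing to whatever graphics layer is used. The function name, the block size, the display `scale` factor, and the `min_length` cut-off are all assumptions made for illustration.

```python
def flow_to_segments(flow, block=4, scale=3.0, min_length=0.5):
    """Convert per-region flow vectors into line segments for overlay drawing.

    flow: dict mapping a block's top-left corner (x, y) to its (dx, dy).
    Each segment starts at the block centre and points along the flow,
    lengthened by `scale` so that small motions stay visible.
    Vectors shorter than `min_length` are skipped to reduce clutter.
    """
    segments = []
    for (x0, y0), (dx, dy) in flow.items():
        if (dx * dx + dy * dy) ** 0.5 < min_length:
            continue
        cx, cy = x0 + block / 2, y0 + block / 2
        segments.append(((cx, cy), (cx + scale * dx, cy + scale * dy)))
    return segments
```

A region at (0, 0) with flow (1, 0) yields a segment from (2.0, 2.0) to (5.0, 2.0), while a stationary region yields no segment.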

Referring again to FIG. 5, the display 140 shows an image of lines 240 (optical flow) indicating the trajectory along which the mouth of the trainee 10 has moved, superimposed on the moving image. Because the smartphone 100 superimposes an image indicating the amount of mouth movement of the trainee 10 on the moving image, the trainee 10 can see whether the mouth movement is insufficient. As a result, the trainee 10 can train while consciously moving the muscles around the mouth widely, and the speech training of the trainee 10 becomes more effective.

(4-3. Muscle activity amount display processing)
FIG. 7 is a flowchart showing the execution procedure of the muscle activity amount display processing. The processing shown in this flowchart is executed at a predetermined cycle.

Referring to FIG. 7, the control unit 170 extracts the face region of the trainee 10 based on the moving image data generated in the moving image display processing, and extracts the movement (magnitude and direction) of the face region (step S300). The control unit 170 corrects the optical flow by subtracting the movement of the face region extracted in step S300 from the optical flow calculated in the optical flow display processing (step S310). The control unit 170 estimates the amount of facial muscle movement of the trainee 10 (hereinafter also referred to as the "muscle activity amount") by calculating the sum of the corrected optical flow magnitudes of the regions (step S320). The control unit 170 generates an image indicating the estimated amount of facial muscle movement (the sum of the optical flow magnitudes of the regions) and controls the display 140 to display the image (step S330).

Referring again to FIG. 5, the display 140 shows an image indicating the amount of facial muscle movement, such as a level meter 220, superimposed on the moving image. Because the smartphone 100 displays an evaluation result regarding the speech of the trainee 10 (for example, the amount of movement of the facial muscles including the mouth) on the display 140, the trainee 10 can perform speech training while checking the evaluation result.
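How the level meter maps a measured amount to what is drawn is not specified. A minimal sketch, assuming a meter with a fixed number of segments and a hypothetical `full_scale` value at which the meter is full, could look like this; a text rendering stands in for the graphical meter.

```python
def meter_level(value, full_scale, segments=10):
    """Number of lit meter segments for a measured value.

    Values are clamped to the range [0, full_scale]; full_scale is a
    hypothetical calibration constant, not a value given in the source.
    """
    if full_scale <= 0:
        raise ValueError("full_scale must be positive")
    ratio = min(max(value / full_scale, 0.0), 1.0)
    return int(ratio * segments)


def render_meter(level, segments=10):
    """Text stand-in for a graphical level meter."""
    return "[" + "#" * level + "-" * (segments - level) + "]"
```

A muscle activity amount of 15 with a full scale of 30 lights 5 of 10 segments and renders as `[#####-----]`. The same mapping could drive a second meter for the voice feature amount.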

(4-4. Voice feature amount display processing)
FIG. 8 is a flowchart showing the execution procedure of the voice feature amount display processing. The processing shown in this flowchart is executed at a predetermined cycle.

Referring to FIG. 8, the control unit 170 extracts a feature amount (for example, loudness) of the voice of the trainee 10 based on the audio data generated in the moving image display processing (step S400). The control unit 170 generates an image indicating the extracted voice feature amount and controls the display 140 to display the image (step S410).

Referring again to FIG. 5, the display 140 shows an image indicating the voice feature amount, such as a level meter 230, superimposed on the moving image. Because the smartphone 100 displays an evaluation result regarding the speech of the trainee 10 (for example, the loudness of the voice) on the display 140, the trainee 10 can perform speech training while checking the evaluation result.

(4-5. Warning message display processing)
FIG. 9 is a flowchart showing the execution procedure of the warning message display processing. The processing shown in this flowchart is executed at a predetermined cycle.

Referring to FIG. 9, the control unit 170 determines whether the muscle activity amount estimated in the muscle activity amount display processing has remained smaller than the first predetermined amount for the predetermined time (step S500). If it is determined that the muscle activity amount is equal to or greater than the first predetermined amount (NO in step S500), the process moves to step S520. If it is determined that the muscle activity amount has remained smaller than the first predetermined amount for the predetermined time (YES in step S500), the control unit 170 controls the display 140 to display the first warning image (step S510).

FIG. 10 is a diagram showing an example of an image displayed on the display 140. As shown in FIG. 10, when the muscle activity amount has remained smaller than the first predetermined amount for the predetermined time, a first warning image 250 ("Move your mouth more!") is displayed on the display 140. Because the smartphone 100 displays the first warning image 250 when the speech of the trainee 10 does not meet the predetermined requirement, the trainee 10 can see that his or her speech does not meet the requirement.

Referring again to FIG. 9, the control unit 170 next determines whether the voice feature amount extracted in the voice feature amount display processing has remained smaller than the second predetermined amount for the predetermined time (step S520). If it is determined that the voice feature amount is equal to or greater than the second predetermined amount (NO in step S520), the process moves to step S500. If it is determined that the voice feature amount has remained smaller than the second predetermined amount for the predetermined time (YES in step S520), the control unit 170 controls the display 140 to display the second warning image (step S530).

Referring again to FIG. 10, when the voice feature amount has remained smaller than the second predetermined amount for the predetermined time, a second warning image 260 ("Speak louder!") is displayed on the display 140. Because the smartphone 100 displays the second warning image 260 when the speech of the trainee 10 does not meet the predetermined requirement, the trainee 10 can see that his or her speech does not meet the requirement.
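One pass through the warning flow (steps S500 to S530) can be sketched as two independent checks, each asking whether a measured quantity has stayed below its threshold for long enough. The history format, the function names, and all numeric values in the usage note are assumptions; only the two message texts come from the description.

```python
def low_for(history, threshold, hold_seconds):
    """True if the latest run of samples below threshold spans hold_seconds.

    history: list of (timestamp_seconds, value) pairs, oldest first.
    """
    if not history or history[-1][1] >= threshold:
        return False
    start = history[-1][0]
    for t, v in reversed(history):
        if v >= threshold:
            break
        start = t  # extend the run of low samples further into the past
    return history[-1][0] - start >= hold_seconds


def warning_pass(muscle_history, voice_history,
                 first_amount, second_amount, hold_seconds):
    """One pass of the warning flow; returns the messages to display."""
    messages = []
    if low_for(muscle_history, first_amount, hold_seconds):   # S500 YES
        messages.append("Move your mouth more!")              # S510
    if low_for(voice_history, second_amount, hold_seconds):   # S520 YES
        messages.append("Speak louder!")                      # S530
    return messages
```

For example, if the muscle activity amount has been below a hypothetical threshold of 5 for the last 2 seconds while the voice level dropped below its threshold only at the most recent sample, a pass with a 2 second hold time returns only the first message.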

[5. Features]
As described above, in the smartphone 100 according to the present embodiment, the display 140 displays an image indicating the amount of mouth movement of the trainee 10 superimposed on the moving image. The trainee 10 can therefore see whether the mouth movement is insufficient. As a result, the trainee 10 can train while consciously moving the muscles around the mouth widely, and the speech training of the trainee 10 becomes more effective.

Note that the smartphone 100 is an example of a "speech training system," the camera 130 is an example of an "imaging means," and the display 140 is an example of a "display means." The pixel movement amount calculation unit 132 is an example of a "calculation means" and a "generating means."

[6. Experiment]
The present inventor conducted the following experiment. The experiment was carried out in a soundproof room. Before the experiment, the procedure was explained to the participants. The significance of speech training was then explained to them, and they were asked to take part with motivation. They were instructed to use the voice volume and speaking rate they would use when reading aloud in a high school classroom. Speech recording and speech training were performed in a standing position. During training, a PC (personal computer) display (EIZO EV2450) was placed directly in front of the participant's face so that the participant could practice while facing forward. In this experiment, the application implemented on the smartphone 100 according to the above embodiment was installed on the PC.

In the experiment, each participant's pre-training speech was first recorded, and the participant self-evaluated on a VAS (visual analog scale) how well the utterance had been produced. The participant then practiced for three minutes using the above PC (speech training system) while holding disposable chopsticks between the front teeth. After that, the post-training speech was recorded and the VAS was measured again. Speech was recorded with a condenser microphone (Sony ECM-77B) and a recorder (Marantz PMD671) at a sampling frequency of 16 kHz with 16-bit quantization.

FIG. 11 is a diagram showing the amplitude of the speech recorded before and after training. FIG. 12 is a diagram showing the range of variation of the fundamental frequency of the speech recorded before and after training. These results show the distribution of each participant's mean value over 14 sentences. As shown in FIGS. 11 and 12, both the amplitude and the range of variation of the fundamental frequency tended to increase after training with the speech training system. Comparing the medians before and after training, the amplitude increased by 4.3 dB and the range of variation of the fundamental frequency increased by 1.19 semitones. Although not shown in a figure, the mean fundamental frequency also tended to increase after training.
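For readers less familiar with the units in these results, the following Python sketch shows the standard conversions between amplitude ratios and decibels and between fundamental frequency ranges and semitones. The function names are hypothetical, and the description does not state how the amplitude or the fundamental frequency range was actually computed from the recordings, so this only illustrates what the units mean.

```python
import math


def amplitude_ratio_to_db(ratio: float) -> float:
    """Amplitude ratio expressed in decibels (20 * log10)."""
    return 20.0 * math.log10(ratio)


def db_to_amplitude_ratio(db: float) -> float:
    """Inverse of amplitude_ratio_to_db."""
    return 10.0 ** (db / 20.0)


def f0_range_semitones(f_min_hz: float, f_max_hz: float) -> float:
    """Width of a fundamental frequency range in semitones (12 per octave)."""
    return 12.0 * math.log2(f_max_hz / f_min_hz)
```

Under these definitions, a 4.3 dB increase corresponds to an amplitude roughly 1.64 times larger, and a range from 220 Hz to 440 Hz spans 12 semitones (one octave).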

FIG. 13 is a diagram showing the VAS measured before and after training. As shown in FIG. 13, almost all participants felt that the training had enabled them to read better. After the experiment, many positive comments were received, such as "I can now speak crisply," "My sa-row and ta-row sounds improved," and "The muscles at the sides (of my mouth) move more easily." Feeding back facial movement is thus considered to have improved mouth movement even with only three minutes of practice.

[7. Modifications]
Although an embodiment has been described above, the present invention is not limited to the above embodiment, and various changes can be made without departing from its spirit. Modifications are described below.

(7-1)
In the above embodiment, the line 240 is used as the image indicating the muscle activity amount (the image indicating the amount of mouth movement). However, the image indicating the muscle activity amount is not limited to the line 240. It may be, for example, an arrow indicating the direction and magnitude of movement. To indicate the muscle activity amount, parts with a large amount of movement and parts with a small amount of movement may also be shown in different colors. For example, a region with a large amount of movement may be shown in red and a region with a small amount of movement in blue.
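One way to realize the red and blue coloring mentioned above is a linear blend between the two colors. The Python sketch below is an assumption for illustration: the function name, the linear blend, and the normalization by a maximum amount are not taken from the description, which only says that large movement may be shown in red and small movement in blue.

```python
from typing import Tuple


def movement_to_rgb(amount: float, max_amount: float) -> Tuple[int, int, int]:
    """Blend from blue (no movement) to red (movement at or above max_amount)."""
    if max_amount <= 0:
        raise ValueError("max_amount must be positive")
    # Normalize to [0, 1], clamping values outside the expected range.
    t = min(max(amount / max_amount, 0.0), 1.0)
    red = round(255 * t)
    blue = round(255 * (1.0 - t))
    return (red, 0, blue)
```

Clamping keeps the color valid when a region moves more than the assumed maximum.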

(7-2)
In the above embodiment, the optical flow is calculated for the entire moving image captured by the camera 130. However, the range over which the optical flow is calculated is not limited to this. For example, the optical flow may be calculated only for the face region of the trainee 10, only for the lower half of the face of the trainee 10, or only for the mouth region of the trainee 10. Narrowing the region for which the optical flow is calculated reduces the amount of computation performed by the control unit 170.
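The saving from restricting the calculation region can be estimated from the region's share of the frame, since the cost of a dense per-pixel calculation grows roughly with the number of pixels processed. The Python sketch below derives the lower half of a face box and its area fraction. The `Box` type, the function names, and the example coordinates are hypothetical; how the face region is detected is outside this sketch.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates (top-left origin)."""
    x: int
    y: int
    width: int
    height: int


def lower_half(face: Box) -> Box:
    """Returns the lower half of a face box."""
    half = face.height // 2
    return Box(face.x, face.y + half, face.width, face.height - half)


def area_fraction(region: Box, frame_width: int, frame_height: int) -> float:
    """Share of the frame's pixels covered by the region."""
    return (region.width * region.height) / (frame_width * frame_height)
```

For an assumed 640 x 480 frame and a 200 x 300 face box, the lower half of the face covers just under 10% of the frame's pixels.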

(7-3)
In the above embodiment, speech training is performed using both images and voice of the trainee 10. However, the voice of the trainee 10 does not necessarily need to be used for speech training.

(7-4)
In the above embodiment, the optical flow is calculated to obtain the amount of mouth movement of the trainee 10. However, the optical flow does not necessarily need to be calculated. For example, the amount of mouth movement of the trainee 10 may be obtained simply by calculating the difference between frames of the moving image.
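A minimal sketch of the frame difference alternative is shown below in Python. It assumes that frames are grayscale images given as nested lists of intensities and that the movement amount is taken as the mean absolute difference; the description does not specify either choice, and the function name is hypothetical.

```python
from typing import Sequence


def frame_difference(prev: Sequence[Sequence[float]],
                     curr: Sequence[Sequence[float]]) -> float:
    """Mean absolute intensity difference between two equally sized frames."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have the same shape")
    total = 0.0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += abs(c - p)
            count += 1
    # An empty frame has no movement by definition here.
    return total / count if count else 0.0
```

Unlike optical flow, this measure gives no direction of movement, only a scalar amount, which is enough for a threshold comparison.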

(7-5)
In the above embodiment, the speech training system is implemented on a smartphone, but the speech training system according to the present invention may also be implemented on, for example, a PC or a tablet.

(7-6)
In the above embodiment, an instructor's model video may also be displayed on the display 140 during speech training.

(7-7)
In the above embodiment, the optical flow is calculated for each region of the face of the trainee 10. It is therefore also possible, for example, to determine which region of the face of the trainee 10 lacks movement. For example, a warning image indicating which region of the face of the trainee 10 lacks movement may be displayed on the display 140.
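The per-region check described above could be sketched as follows in Python. The region names, the per-region thresholds, and the function name are hypothetical; the description does not say how the face is divided into regions or what counts as insufficient movement.

```python
from typing import Dict, List


def regions_lacking_movement(region_amounts: Dict[str, float],
                             thresholds: Dict[str, float]) -> List[str]:
    """Names of regions whose movement amount is below their threshold."""
    # Regions without a configured threshold are never reported.
    return [
        name
        for name, amount in region_amounts.items()
        if name in thresholds and amount < thresholds[name]
    ]
```

The returned names could then be used to choose which warning image to display.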

(7-8)
In the above embodiment, the speech training history of the trainee 10 may, for example, be stored sequentially in the storage unit 180. This makes it possible, for example, when the trainee 10 performs a new speech training session, to inform the trainee 10 which aspects have improved and which have worsened compared with the previous session.
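A comparison against the previous session could be sketched as below in Python. The metric names, the assumption that a larger value is better for every metric, and the tolerance parameter are all illustrative; the description does not specify which quantities are stored in the history.

```python
from typing import Dict, List, Tuple


def compare_sessions(previous: Dict[str, float],
                     current: Dict[str, float],
                     tolerance: float = 0.0) -> Tuple[List[str], List[str]]:
    """Returns (improved, worsened) metric names, assuming higher is better."""
    improved: List[str] = []
    worsened: List[str] = []
    # Only metrics present in both sessions can be compared.
    for metric in sorted(previous.keys() & current.keys()):
        delta = current[metric] - previous[metric]
        if delta > tolerance:
            improved.append(metric)
        elif delta < -tolerance:
            worsened.append(metric)
    return improved, worsened
```

A nonzero tolerance avoids reporting small session-to-session fluctuations as changes.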

Reference Signs List: 10 trainee, 20 bar, 100 smartphone, 130 camera, 131 face region extraction unit, 132 pixel movement amount calculation unit, 133 face movement amount correction unit, 134 muscle activity amount estimation unit, 135 first determination unit, 140 display, 150 microphone, 151 voice feature extraction unit, 152 second determination unit, 160 speaker, 170 control unit, 172 CPU, 174 RAM, 176 ROM, 180 storage unit, 181 control program, 190 communication module, 200 face frame, 210 text, 220, 230 level meters, 240 line, 250 first warning image, 260 second warning image.

Claims

1. A speech training system used for speech training of a trainee, comprising:
an imaging means for imaging the trainee's face and generating moving image data;
a display means for displaying a moving image indicated by the moving image data;
a muscle activity amount estimation unit that estimates a muscle activity amount, which is the amount of movement of the facial muscles of the trainee, based on the moving image data;
a muscle activity amount determination unit that determines whether a state in which the muscle activity amount estimated by the muscle activity amount estimation unit is smaller than a first predetermined amount has continued for a predetermined time;
a voice feature extraction unit that extracts a feature amount of the voice uttered by the trainee based on voice data generated by a microphone from the voice uttered by the trainee; and
a voice feature determination unit that determines whether a state in which the voice feature amount extracted by the voice feature extraction unit is smaller than a second predetermined amount has continued for the predetermined time,
wherein the display means displays an image indicating the amount of mouth movement of the trainee superimposed on the moving image, and
wherein the display means
further displays a first warning message when the muscle activity amount determination unit determines that the state in which the muscle activity amount is smaller than the first predetermined amount has continued for the predetermined time, and
further displays a second warning message when the voice feature determination unit determines that the state in which the voice feature amount is smaller than the second predetermined amount has continued for the predetermined time.

2. The speech training system according to claim 1, wherein the display means further displays sentences to be read aloud by the trainee.

3. The speech training system according to claim 1, wherein the display means further displays evaluation results regarding the trainee's speech.

4. The speech training system according to any one of claims 1 to 3, wherein the image indicating the amount of mouth movement is a line showing a trajectory of movement of the mouth.

5. The speech training system according to any one of claims 1 to 4, further comprising:
a calculation means for calculating optical flow based on the moving image data; and
a generating means for generating the image indicating the amount of mouth movement based on the optical flow.

6. A speech training method for training a trainee in speech, the method comprising:
capturing an image of the trainee's face and generating moving image data;
displaying a moving image indicated by the moving image data;
estimating a muscle activity amount, which is the amount of movement of the facial muscles of the trainee, based on the moving image data;
a first determination step of determining whether a state in which the muscle activity amount estimated in the step of estimating the muscle activity amount is smaller than a first predetermined amount has continued for a predetermined time;
displaying an image indicating the amount of mouth movement of the trainee superimposed on the moving image;
extracting a feature amount of the voice uttered by the trainee based on voice data generated by a microphone from the voice uttered by the trainee;
a second determination step of determining whether a state in which the extracted voice feature amount is smaller than a second predetermined amount has continued for the predetermined time;
further displaying a first warning message when it is determined in the first determination step that the state in which the muscle activity amount is smaller than the first predetermined amount has continued for the predetermined time; and
further displaying a second warning message when it is determined that the state in which the voice feature amount is smaller than the second predetermined amount has continued for the predetermined time.

7. A program used for speech training of a trainee, the program causing a computer to execute:
causing an imaging means to image the trainee's face and generate moving image data;
displaying a moving image indicated by the moving image data on a display means;
estimating a muscle activity amount, which is the amount of movement of the facial muscles of the trainee, based on the moving image data;
a first determination step of determining whether a state in which the muscle activity amount estimated in the step of estimating the muscle activity amount is smaller than a first predetermined amount has continued for a predetermined time;
displaying on the display means an image indicating the amount of mouth movement of the trainee superimposed on the moving image;
extracting a feature amount of the voice uttered by the trainee based on voice data generated by a microphone from the voice uttered by the trainee;
a second determination step of determining whether a state in which the extracted voice feature amount is smaller than a second predetermined amount has continued for the predetermined time;
further displaying a first warning message when it is determined in the first determination step that the state in which the muscle activity amount is smaller than the first predetermined amount has continued for the predetermined time; and
further displaying a second warning message when it is determined that the state in which the voice feature amount is smaller than the second predetermined amount has continued for the predetermined time.