JP2016157097A - Reading-aloud evaluation device, reading-aloud evaluation method, and program

Info

Publication number: JP2016157097A
Application number: JP2015132623A
Authority: JP
Inventors: Koichi Hayashi (林 宏一)
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2015-02-24
Filing date: 2015-07-01
Publication date: 2016-09-01

Abstract

PROBLEM TO BE SOLVED: To provide a reading-aloud evaluation device, a reading-aloud evaluation method, and a program capable of presenting to a speaker a comprehensive evaluation that includes reading speed and pausing.

SOLUTION: A reading-aloud evaluation device compares the time length of a speaker phrase section specified based on speaker voice waveform data with the time length of a model phrase section specified based on model voice waveform data to evaluate the reading speed for each phrase, and compares the time length of a speaker interval section specified based on the speaker voice waveform data with the time length of a model interval section specified based on the model voice waveform data to evaluate the pausing. The reading-aloud evaluation device performs a comprehensive evaluation of the reading aloud of the sentence based on at least the evaluation of the reading speed and the evaluation of the pausing, and displays at least one of the evaluation of the reading speed, the evaluation of the pausing, and the comprehensive evaluation.

SELECTED DRAWING: Figure 5

Description

The present invention relates to the technical field of systems and the like that evaluate the reading aloud of a sentence based on the speech uttered when a speaker reads the sentence aloud.

In recent years, various systems intended to support language learning and the like have been proposed. These systems support learning mainly by focusing on pitch (intonation) and articulation in units of words or sentences. For example, Patent Document 1 discloses a technique that displays the pitch of a model voice together with the pitch of the learner's voice for each word, so that the learner can easily see where his or her voice differs from the model voice.

Patent Document 1: JP 2007-139868 A

However, when learning to make announcements, give readings, and the like, not only intonation and articulation but also the speaking (reading-aloud) speed and the way pauses are taken are important, and the conventional techniques do not evaluate these elements.

The present invention has been made in view of the above points, and provides a reading-aloud evaluation device, a reading-aloud evaluation method, and a program capable of presenting to a speaker a comprehensive evaluation that includes reading speed and pausing.

In order to solve the above problem, the invention according to claim 1 is characterized by comprising: storage means for storing first phrase sections and a first interval section, each first phrase section being specified for each phrase based on first speech waveform data representing the waveform of a model voice for reading aloud a sentence that includes a plurality of phrases and extending from the start timing to the end timing of the phrase, and the first interval section being specified based on the first speech waveform data and extending from the end timing of any one of the plurality of phrases to the start timing of the next phrase; input means for inputting second speech waveform data representing the waveform of the voice uttered when a speaker reads the sentence aloud; specifying means for specifying, based on the second speech waveform data, a second phrase section from the start timing to the end timing of the phrase for each phrase, and a second interval section from the end timing of any one of the plurality of phrases to the start timing of the next phrase; speed evaluation means for evaluating, for each phrase, the speed at which the sentence is read aloud by comparing the time length of the first phrase section stored in the storage means with the time length of the second phrase section specified by the specifying means; pause evaluation means for evaluating the pausing when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified by the specifying means; comprehensive evaluation means for performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation by the speed evaluation means and the pausing evaluation by the pause evaluation means; and display control means for causing at least one of the speed evaluation by the speed evaluation means, the pausing evaluation by the pause evaluation means, and the comprehensive evaluation by the comprehensive evaluation means to be displayed.
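
As an illustration only, the comparison of time lengths described for claim 1 could be sketched in Python as follows. The text here does not give a scoring formula, so the `closeness` ratio, the unweighted mean used for the overall score, and all names (`Section`, `closeness`, `evaluate`) are assumptions made for this sketch, not the method of the invention.

```python
from dataclasses import dataclass


@dataclass
class Section:
    """A time span in seconds, from a start timing to an end timing."""
    start: float
    end: float

    @property
    def length(self) -> float:
        return self.end - self.start


def closeness(model_len: float, speaker_len: float) -> float:
    """Score in [0, 1]; 1.0 means the speaker's time length equals the model's.

    Hypothetical metric: the ratio of the shorter length to the longer one.
    """
    longer = max(model_len, speaker_len)
    if longer <= 0:
        return 1.0  # both sections are empty, so there is nothing to penalize
    return min(model_len, speaker_len) / longer


def evaluate(model_phrases, speaker_phrases, model_intervals, speaker_intervals):
    """Return (per-phrase speed scores, per-interval pausing scores, overall)."""
    # Speed: compare phrase-section lengths, phrase by phrase.
    speed = [closeness(m.length, s.length)
             for m, s in zip(model_phrases, speaker_phrases)]
    # Pausing: compare interval-section lengths, interval by interval.
    pausing = [closeness(m.length, s.length)
               for m, s in zip(model_intervals, speaker_intervals)]
    # Overall: assumed here to be the plain mean of all individual scores.
    scores = speed + pausing
    overall = sum(scores) / len(scores) if scores else 0.0
    return speed, pausing, overall
```

In this sketch a speaker who takes twice as long as the model over a phrase receives a speed score of 0.5 for that phrase. A real implementation might weight the items differently or map the scores to grades.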

The invention according to claim 2 is the reading-aloud evaluation device according to claim 1, further comprising: first calculation means for calculating, at predetermined time intervals, sound information of at least one of pitch and volume based on the first speech waveform data; and second calculation means for calculating, at predetermined time intervals, sound information of at least one of pitch and volume based on the second speech waveform data, wherein the display control means causes all or part of a first graph representing the time-series change of the sound information calculated by the first calculation means and all or part of a second graph representing the time-series change of the sound information calculated by the second calculation means to be displayed on one screen so that they can be compared.
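
Calculating sound information at predetermined time intervals can be pictured with the following minimal Python sketch, which covers volume only. It assumes volume is measured as the root mean square (RMS) of the samples in each window; the function name `volume_series`, the default 10 ms step, and the RMS choice are illustrative assumptions, since the text does not specify how volume is computed.

```python
import math


def volume_series(samples, sample_rate, step_sec=0.01):
    """Return the RMS volume of each consecutive window of step_sec seconds."""
    step = max(1, int(sample_rate * step_sec))  # samples per window
    series = []
    for i in range(0, len(samples), step):
        window = samples[i:i + step]  # the last window may be shorter
        series.append(math.sqrt(sum(x * x for x in window) / len(window)))
    return series
```

Applying the same function to the model waveform and to the speaker waveform would yield two time series that could be plotted as the first and second graphs.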

The invention according to claim 3 is the reading-aloud evaluation device according to claim 2, wherein the display control means causes a first graph portion corresponding to at least one first phrase section in the first graph and a second graph portion corresponding to at least one second phrase section in the second graph to be displayed on one screen so that they can be compared, and, in response to a display switching instruction, switches the first graph portion displayed on the one screen to a graph portion that is not displayed on the one screen and corresponds to another first phrase section in the first graph, and switches the second graph portion displayed on the one screen to a graph portion that is not displayed on the one screen and corresponds to another second phrase section in the second graph.

The invention according to claim 4 is the reading-aloud evaluation device according to claim 3, wherein the display control means causes the first graph portion and the second graph portion to be displayed on one screen so that the first interval section stored in the storage means and the second interval section specified by the specifying means can be compared.

The invention according to claim 5 is the reading-aloud evaluation device according to claim 2, wherein the display control means causes the first graph portion and the second graph portion to be displayed on one screen so that a first graph portion corresponding to at least one first phrase section in the first graph, together with the first interval section following that first phrase section, can be compared with a second graph portion corresponding to at least one second phrase section in the second graph, together with the second interval section following that second phrase section.

The invention according to claim 6 is the reading-aloud evaluation device according to any one of claims 3 to 5, wherein the display control means causes the first graph portion and the second graph portion to be displayed on one screen with the start position of the first phrase section and the start position of the second phrase section aligned in a predetermined direction.
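
One way to picture the alignment of start positions is to shift each graph portion so that its phrase start falls at time zero on a shared axis. The Python sketch below assumes the predetermined direction is the time axis and that each graph portion is a list of (time, value) points; the function name `align_to_phrase_start` is hypothetical.

```python
def align_to_phrase_start(points, phrase_start):
    """Shift (time, value) points so that the phrase start lands at t = 0.

    Points before the phrase start are dropped, since they belong to an
    earlier section.
    """
    return [(t - phrase_start, v) for t, v in points if t >= phrase_start]
```

With both portions shifted this way, a speaker phrase that ends later on the shared axis than the model phrase was read more slowly, which is what makes the speed difference visible at a glance.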

The invention according to claim 7 is the reading-aloud evaluation device according to any one of claims 1 to 6, wherein the display control means causes the comprehensive evaluation for one of the sections and the comprehensive evaluation for a plurality of the sections to be displayed on a screen at the same hierarchy level.

The invention according to claim 8 is the reading-aloud evaluation device according to any one of claims 1 to 6, wherein the display control means causes the comprehensive evaluation for one of the sections and the comprehensive evaluation for a plurality of the sections to be displayed on screens at different hierarchy levels.

The invention according to claim 9 is the reading-aloud evaluation device according to any one of claims 1 to 8, wherein the storage means stores the first speech waveform data, and the specifying means specifies, based on the first speech waveform data stored in the storage means, the first phrase section for each phrase and also the first interval section.

The invention according to claim 10 is a reading-aloud evaluation method executed by one or more computers, characterized by including: a storing step of storing, in storage means, first phrase sections and a first interval section, each first phrase section being specified for each phrase based on first speech waveform data representing the waveform of a model voice for reading aloud a sentence that includes a plurality of phrases and extending from the start timing to the end timing of the phrase, and the first interval section being specified based on the first speech waveform data and extending from the end timing of any one of the plurality of phrases to the start timing of the next phrase; an input step of inputting second speech waveform data representing the waveform of the voice uttered when a speaker reads the sentence aloud; a specifying step of specifying, based on the second speech waveform data, a second phrase section from the start timing to the end timing of the phrase for each phrase, and a second interval section from the end timing of any one of the plurality of phrases to the start timing of the next phrase; a speed evaluation step of evaluating, for each phrase, the speed at which the sentence is read aloud by comparing the time length of the first phrase section stored in the storage means with the time length of the second phrase section specified in the specifying step; a pause evaluation step of evaluating the pausing when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified in the specifying step; a comprehensive evaluation step of performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation in the speed evaluation step and the pausing evaluation in the pause evaluation step; and a control step of causing at least one of the speed evaluation in the speed evaluation step, the pausing evaluation in the pause evaluation step, and the comprehensive evaluation in the comprehensive evaluation step to be displayed.

The invention according to claim 11 is characterized by causing a computer to execute: a storing step of storing, in storage means, first phrase sections and a first interval section, each first phrase section being specified for each phrase based on first speech waveform data representing the waveform of a model voice for reading aloud a sentence that includes a plurality of phrases and extending from the start timing to the end timing of the phrase, and the first interval section being specified based on the first speech waveform data and extending from the end timing of any one of the plurality of phrases to the start timing of the next phrase; an input step of inputting second speech waveform data representing the waveform of the voice uttered when a speaker reads the sentence aloud; a specifying step of specifying, based on the second speech waveform data, a second phrase section from the start timing to the end timing of the phrase for each phrase, and a second interval section from the end timing of any one of the plurality of phrases to the start timing of the next phrase; a speed evaluation step of evaluating, for each phrase, the speed at which the sentence is read aloud by comparing the time length of the first phrase section stored in the storage means with the time length of the second phrase section specified in the specifying step; a pause evaluation step of evaluating the pausing when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified in the specifying step; a comprehensive evaluation step of performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation in the speed evaluation step and the pausing evaluation in the pause evaluation step; and a control step of causing at least one of the speed evaluation in the speed evaluation step, the pausing evaluation in the pause evaluation step, and the comprehensive evaluation in the comprehensive evaluation step to be displayed.

The invention according to claim 12 is characterized by comprising: storage means for storing first sentence element sections and a first interval section, each first sentence element section being specified for each sentence element based on first speech waveform data representing the waveform of a model voice for reading aloud a sentence that includes a plurality of sentence elements and extending from the start timing to the end timing of the sentence element, and the first interval section being specified based on the first speech waveform data and extending from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element; input means for inputting second speech waveform data representing the waveform of the voice uttered when a speaker reads the sentence aloud; specifying means for specifying, based on the second speech waveform data, a second sentence element section from the start timing to the end timing of the sentence element for each sentence element, and a second interval section from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element; speed evaluation means for evaluating, for each sentence element, the speed at which the sentence is read aloud by comparing the time length of the first sentence element section stored in the storage means with the time length of the second sentence element section specified by the specifying means; pause evaluation means for evaluating the pausing when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified by the specifying means; comprehensive evaluation means for performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation by the speed evaluation means and the pausing evaluation by the pause evaluation means; and display control means for causing at least one of the speed evaluation by the speed evaluation means, the pausing evaluation by the pause evaluation means, and the comprehensive evaluation by the comprehensive evaluation means to be displayed.

The invention according to claim 13 is a reading-aloud evaluation method executed by one or more computers, characterized by including: a storing step of storing, in storage means, first sentence element sections and a first interval section, each first sentence element section being specified for each sentence element based on first speech waveform data representing the waveform of a model voice for reading aloud a sentence that includes a plurality of sentence elements and extending from the start timing to the end timing of the sentence element, and the first interval section being specified based on the first speech waveform data and extending from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element; an input step of inputting second speech waveform data representing the waveform of the voice uttered when a speaker reads the sentence aloud; a specifying step of specifying, based on the second speech waveform data, a second sentence element section from the start timing to the end timing of the sentence element for each sentence element, and a second interval section from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element; a speed evaluation step of evaluating, for each sentence element, the speed at which the sentence is read aloud by comparing the time length of the first sentence element section stored in the storage means with the time length of the second sentence element section specified in the specifying step; a pause evaluation step of evaluating the pausing when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified in the specifying step; a comprehensive evaluation step of performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation in the speed evaluation step and the pausing evaluation in the pause evaluation step; and a control step of causing at least one of the speed evaluation in the speed evaluation step, the pausing evaluation in the pause evaluation step, and the comprehensive evaluation in the comprehensive evaluation step to be displayed.

The invention according to claim 14 is characterized by causing a computer to execute: a storing step of storing, in storage means, first sentence element sections and a first interval section, each first sentence element section being specified for each sentence element based on first speech waveform data representing the waveform of a model voice for reading aloud a sentence that includes a plurality of sentence elements and extending from the start timing to the end timing of the sentence element, and the first interval section being specified based on the first speech waveform data and extending from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element; an input step of inputting second speech waveform data representing the waveform of the voice uttered when a speaker reads the sentence aloud; a specifying step of specifying, based on the second speech waveform data, a second sentence element section from the start timing to the end timing of the sentence element for each sentence element, and a second interval section from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element; a speed evaluation step of evaluating, for each sentence element, the speed at which the sentence is read aloud by comparing the time length of the first sentence element section stored in the storage means with the time length of the second sentence element section specified in the specifying step; a pause evaluation step of evaluating the pausing when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified in the specifying step; a comprehensive evaluation step of performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation in the speed evaluation step and the pausing evaluation in the pause evaluation step; and a control step of causing at least one of the speed evaluation in the speed evaluation step, the pausing evaluation in the pause evaluation step, and the comprehensive evaluation in the comprehensive evaluation step to be displayed.

According to the inventions of claims 1, 9, 10, 11, 12, 13, and 14, a speaker engaged in language learning or in vocal training such as announcing or reading practice can be presented with not only intonation, articulation, and the like, but also at least one of an evaluation of reading speed, an evaluation of pausing, and a comprehensive evaluation that takes these evaluations into account. As a result, the speaker can be made aware of whether his or her reading speed and pausing are appropriate and can practice and train effectively.

According to the invention of claim 2, the speaker can be made aware of whether pitch and volume, in addition to reading speed and pausing, are appropriate, and can practice and train effectively.

According to the invention of claim 3, even when the first graph and the second graph do not fit on one screen in the time-axis direction, switching the display lets the speaker grasp over a long span of time whether pitch and volume are appropriate.

According to the inventions of claims 4 and 5, the first graph and the second graph let the speaker grasp at a glance whether his or her pausing is appropriate.

According to the invention of claim 6, the speaker can grasp at a glance whether the reading speed is appropriate.

According to the invention of claim 7, the speaker can grasp the comprehensive evaluation for one section and the comprehensive evaluation for a plurality of sections at the same time.

According to the invention of claim 8, the speaker can grasp the comprehensive evaluation for one section and the comprehensive evaluation for a plurality of sections separately and in an easily understood manner.

A diagram showing an example of the schematic configuration of the reading-aloud evaluation device S according to the present embodiment.
A diagram showing an example of model phrase sections F11 to F18 and model interval sections I11 to I17, and of speaker phrase sections F21 to F26 and speaker interval sections I21 to I25.
(A) is a diagram showing an example of comparing the time length of a model phrase section with the time length of a speaker phrase section, and (B) is a diagram showing an example of comparing the time length of a model interval section with the time length of a speaker interval section.
A flowchart showing the evaluation display process performed by the control unit 3 in the reading-aloud evaluation device S.
A diagram showing an example of a section scroll screen.
A diagram showing an example of a section selection screen.
A diagram showing an example of a screen that displays details of a selected phrase section.
A diagram showing an example of a screen that displays details of a selected interval section.

Hereinafter, a first embodiment of the present invention will be described with reference to the drawings.

[1. Configuration and functions of the reading-aloud evaluation device S]
First, the configuration and functions of the reading-aloud evaluation device S according to the first embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a diagram showing an example of the schematic configuration of the reading-aloud evaluation device S according to the present embodiment. Examples of the reading-aloud evaluation device include a personal computer and a portable information terminal (such as a smartphone). As shown in FIG. 1, the reading-aloud evaluation device S includes a communication unit 1, a storage unit 2, a control unit 3, an operation unit 4, an interface (IF) unit 5, and the like, and these components are connected to a bus 6. The operation unit 4 receives operation instructions from the user and outputs a signal corresponding to the received operation to the control unit 3. A microphone M, a display D, and the like are connected to the interface unit 5. The microphone M picks up the voice uttered when a speaker who is engaged in language learning or in vocal training such as announcing or reading practice reads aloud a sentence (text) that includes a plurality of phrases. Here, a phrase is the unit read in one breath when reading a text; it may run up to a full stop ("。" or "．") marking the end of a sentence, or only up to a comma ("、" or "，"). A phrase is sometimes also called a syllable. Examples of sentences to be read aloud include text used in language learning, announcement training, or reading training, and lyrics used for singing. The display D displays, on a screen, the speech information to be provided to the speaker in accordance with display commands from the control unit 3. The speech information is information such as evaluations of the reading aloud of the sentence and graphs representing the time-series change of sound information of at least one of pitch (also referred to as intonation) and volume. The evaluations include an evaluation of reading speed, an evaluation of pausing, an evaluation of pitch, an evaluation of volume, an evaluation of articulation, and a comprehensive evaluation. Here, reading speed, pausing, pitch, volume, and articulation are referred to as evaluation items. The microphone M and the display D may be integrated with the reading-aloud evaluation device S or may be separate from it.

The communication unit 1 connects to a network (not shown) by wire or wirelessly and communicates with a server or the like. The storage unit 2 is composed of, for example, a hard disk drive, and stores an OS (operating system), an evaluation display processing program (an example of the program of the present invention), and the like. The evaluation display processing program causes the control unit 3, acting as a computer, to execute the evaluation display process described later. The evaluation display processing program may be downloaded as an application from a predetermined server, or may be provided stored on a recording medium such as a CD or DVD. The storage unit 2 also stores text data of a sentence containing a plurality of phrases, and first voice waveform data (hereinafter referred to as "model voice waveform data") representing the waveform of a voice that serves as a model for reading the sentence aloud. Here, the text data includes, for example, the pronunciation timing of each character (for example, the elapsed time from the start of pronunciation), associated character by character. The model voice waveform data is stored in a predetermined audio file format.

The control unit 3 is composed of a CPU (Central Processing Unit) serving as a computer, a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. By executing the evaluation display processing program, the control unit 3 functions as a voice processing unit 31, a reading-aloud evaluation unit 32, and a display processing unit 33. The voice processing unit 31 is an example of the input means, the specifying means, the first calculation means, and the second calculation means of the present invention. The reading-aloud evaluation unit 32 is an example of the speed evaluation means, the pause (interval) evaluation means, and the overall evaluation means. The display processing unit 33 is an example of the display control means of the present invention. The storage unit 2 or the RAM in the control unit 3 is an example of the storage means of the present invention.

The voice processing unit 31 receives as input, from the storage unit 2, the model voice waveform data stored in the predetermined audio file format. The voice processing unit 31 also receives as input second voice waveform data (hereinafter referred to as "speaker voice waveform data") representing the waveform of the voice that the speaker uttered when reading the sentence aloud and that was collected by the microphone M. The model voice waveform data and the speaker voice waveform data are collectively referred to as voice waveform data. The voice waveform data is discretized time-series sound pressure waveform data, for example monaural waveform data with a sampling rate of 44.1 kHz and 16-bit quantization. Sound pressure refers to the change in air pressure (Pa) caused by a sound wave. In the present embodiment, the sound pressure level (dB) is used as the sound pressure; it expresses the magnitude of the effective sound pressure (Pa), which is the root mean square (RMS) of the instantaneous sound pressure (Pa), as a value that is easy to handle in calculation. In a broad sense, the sound pressure level (dB) is also called volume.
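As a non-limiting illustration, the conversion from instantaneous sound pressure to sound pressure level described above might be sketched in Python as follows. The reference pressure of 20 µPa (the conventional reference for sound in air) and the function name are assumptions introduced for this sketch; the embodiment does not specify them. The sketch also assumes that the samples have already been converted to pascals, which for raw 16-bit samples would require a calibration that is not described here.

```python
import math

# Assumed reference sound pressure in Pa (conventional value for air; not stated in the embodiment).
P_REF = 20e-6

def sound_pressure_level_db(pressures_pa):
    """Return the sound pressure level (dB) of a block of instantaneous sound pressures (Pa).

    The effective sound pressure is the root mean square (RMS) of the block.
    """
    if not pressures_pa:
        raise ValueError("empty block")
    rms = math.sqrt(sum(p * p for p in pressures_pa) / len(pressures_pa))
    if rms == 0.0:
        return float("-inf")  # complete silence has no finite level
    return 20.0 * math.log10(rms / P_REF)
```

With uncalibrated samples, the same formula with a different reference (for example, full scale) would yield a relative level, which may be sufficient when only the model and the speaker are compared with each other.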

From data cut out of the voice waveform data at predetermined time intervals (for example, every 10 ms), the voice processing unit 31 calculates the sound pressure level (dB) as the sound pressure for each predetermined time. The voice processing unit 31 also calculates the fundamental frequency (Hz) from the data cut out of the voice waveform data at the predetermined time intervals, and uses the calculated fundamental frequency (Hz) as the pitch for each predetermined time. A known method such as the zero-cross method or vector autocorrelation can be applied to calculate the pitch. In addition, the voice processing unit 31 calculates, for each phrase, a feature quantity (acoustic characteristic) representing the vocal tract characteristics used for evaluating articulation. For example, the voice processing unit 31 cuts out the voice waveform data for each phrase (phrase section), divides the cut-out phrase data by windowing (also called framing; for example, into frames of 25 ms), and obtains an amplitude spectrum by Fourier analysis (FFT). The voice processing unit 31 then applies a mel filter bank to the obtained amplitude spectrum and applies a discrete cosine transform (DCT) to the logarithm of the mel filter bank output, thereby calculating MFCCs (mel-frequency cepstral coefficients) for each phrase as the feature quantity representing the vocal tract characteristics.
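The cutting out of data at predetermined time intervals and a pitch estimate by the zero-cross method might be sketched as follows. This is a simplified illustration only: the function names are hypothetical, the frames are assumed not to overlap, and the period is estimated from the spacing of upward zero crossings within one frame, which is one simple variant of the zero-cross method and not necessarily the one used in the embodiment.

```python
import math

def split_frames(samples, sample_rate, frame_ms=10):
    """Split a sample sequence into consecutive, non-overlapping frames of frame_ms milliseconds.

    A trailing remainder shorter than one frame is discarded.
    """
    size = int(sample_rate * frame_ms / 1000)
    if size <= 0:
        raise ValueError("frame too short for this sample rate")
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def zero_cross_f0(frame, sample_rate):
    """Estimate the fundamental frequency (Hz) of one frame from its upward zero crossings.

    Returns None when fewer than two upward crossings are found (e.g. silence).
    """
    crossings = [i for i in range(1, len(frame)) if frame[i - 1] < 0 <= frame[i]]
    if len(crossings) < 2:
        return None
    # Average spacing between successive upward crossings, in samples.
    mean_period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return sample_rate / mean_period
```

A 10 ms frame contains only a few periods of a typical speaking voice, so an estimate of this kind is coarse; autocorrelation-based methods are generally more robust to noise and harmonics.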

Based on the model voice waveform data, the voice processing unit 31 also specifies, for each phrase, a first phrase section (hereinafter referred to as "model phrase section") extending from the start timing to the end timing of the phrase, and specifies a first interval section (hereinafter referred to as "model interval section") extending from the end timing of any one of the plurality of phrases to the start timing of the next phrase. The data of the model phrase sections and model interval sections specified in this way is stored in the storage unit 2 in association with, for example, the audio file of the model voice waveform data. A phrase section or an interval section is expressed, for example, as a time range measured from the start of the waveform (for example, 01:00-03:00). Likewise, based on the speaker voice waveform data, the voice processing unit 31 specifies, for each phrase, a second phrase section (hereinafter referred to as "speaker phrase section") extending from the start timing to the end timing of the phrase, and specifies a second interval section (hereinafter referred to as "speaker interval section") extending from the end timing of any one of the plurality of phrases to the start timing of the next phrase.

Here, the start timing and the end timing may each be recognized from the voice waveform, or from the sound pressure level (dB) calculated as described above. For example, the voice processing unit 31 recognizes the time point at which the amplitude of the voice waveform becomes equal to or greater than a predetermined value as the start timing. Alternatively, the voice processing unit 31 recognizes the time point at which the sound pressure level (dB) becomes equal to or greater than a predetermined value as the start timing. Similarly, the voice processing unit 31 recognizes, for example, the time point at which the amplitude of the voice waveform falls below a predetermined value as the end timing. Alternatively, the voice processing unit 31 recognizes the time point at which the sound pressure level (dB) falls below a predetermined value as the end timing.

FIG. 2 is a diagram showing an example of model phrase sections F11 to F18 and model interval sections I11 to I17, and of speaker phrase sections F21 to F26 and speaker interval sections I21 to I25. FIG. 2 is an example in which the start timing and the end timing are determined using a predetermined sound pressure level as a threshold. In the example of FIG. 2, the horizontal axis (X) represents time and the vertical axis (Y) represents sound pressure, and the model phrase sections F11 to F18, the model interval sections I11 to I17, the speaker phrase sections F21 to F26, and the speaker interval sections I21 to I25 are specified from the time-series change in sound pressure. Serial numbers are assigned to the specified model phrase sections F11 to F18 and to the model interval sections I11 to I17, for example in order from the beginning. Similarly, serial numbers are assigned to the speaker phrase sections F21 to F26 and to the speaker interval sections I21 to I25, for example in order from the beginning. This makes it possible to identify the speaker phrase section (for example, F21) corresponding to a model phrase section (for example, F11), and the speaker interval section (for example, I21) corresponding to a model interval section (for example, I11).
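The threshold-based specification of phrase sections and interval sections described above might be sketched as follows. The function name and the use of frame indices as the time unit are assumptions made for this sketch; the position of each section in the returned list plays the role of the serial number. A practical implementation would probably also require a minimum duration below the threshold before ending a phrase, so that brief dips within a phrase are not treated as intervals; such a rule is not described in the embodiment and is omitted here.

```python
def find_sections(levels_db, threshold_db):
    """Specify phrase and interval sections from per-frame sound pressure levels.

    Returns (phrases, intervals); each section is a (start_frame, end_frame) pair
    with an exclusive end, and its list index serves as its serial number.
    """
    phrases = []
    start = None
    for i, level in enumerate(levels_db):
        if level >= threshold_db and start is None:
            start = i                      # level reached the threshold: start timing
        elif level < threshold_db and start is not None:
            phrases.append((start, i))     # level fell below the threshold: end timing
            start = None
    if start is not None:                  # a phrase still open when the data ends
        phrases.append((start, len(levels_db)))
    # An interval runs from the end of one phrase to the start of the next.
    intervals = [(phrases[k][1], phrases[k + 1][0]) for k in range(len(phrases) - 1)]
    return phrases, intervals
```

Multiplying a frame index by the frame period (for example, 10 ms) gives the corresponding time from the start of the waveform.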

Next, the reading-aloud evaluation unit 32 compares the time length (duration) of a model phrase section with the time length of the speaker phrase section, and evaluates the speed at which the sentence is read aloud (reading speed) for each phrase (that is, for each phrase section). FIG. 3(A) is a diagram showing an example of the comparison between the time lengths of the model phrase sections and those of the speaker phrase sections. In the example of FIG. 3(A), the time length of the model phrase section F11 is compared with that of the speaker phrase section F21 corresponding to the model phrase section F11 (for example, the section whose serial number matches); the time length of the model phrase section F12 is compared with that of the corresponding speaker phrase section F22; and the time length of the model phrase section F13 is compared with that of the corresponding speaker phrase section F23. The phrase sections following the model phrase section F13 and those following the speaker phrase section F23 are compared in the same way. As the result of the comparison, the reading-aloud evaluation unit 32 calculates, for example, the time difference between the time length of the model phrase section and that of the speaker phrase section for each phrase (phrase section), and evaluates the reading speed by calculating an evaluation score based on the absolute value of this time difference. For example, the score is calculated so that the closer the absolute value of the time difference is to 0, the higher the evaluation (that is, the higher the score). In other words, the faster or slower the speaker reads relative to the model, the larger the absolute value of the time difference and the lower the evaluation. Conversely, the closer the speaker's reading speed is to that of the model, the smaller the absolute value of the time difference and the higher the evaluation. In this way, the reading speed is evaluated (that is, an evaluation score is calculated) for each phrase. The reading-aloud evaluation unit 32 also evaluates the reading speed over all phrase sections based on the evaluations for the individual phrase sections. In this evaluation, for example, the average of the reading-speed scores calculated for the individual phrase sections is calculated as the total reading-speed score over all phrase sections.
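The embodiment states only that the score becomes higher as the absolute value of the time difference approaches 0 and does not give a concrete formula. Purely as an illustration, one possible mapping onto a 0 to 100 scale, together with the averaging into a total score, might look as follows. The linear mapping, the normalization by the model duration, the 100-point scale, and the function names are all assumptions made for this sketch.

```python
def duration_score(model_len_s, speaker_len_s):
    """Score (0 to 100) that rises as the absolute time difference approaches 0.

    Assumed mapping: the difference is normalized by the model duration and
    subtracted linearly from 100, with a floor at 0.
    """
    diff = abs(model_len_s - speaker_len_s)
    return max(0.0, 100.0 * (1.0 - diff / model_len_s))

def speed_scores(model_phrases, speaker_phrases):
    """Return (per-phrase scores, total score) for (start_s, end_s) phrase sections.

    Sections are paired by serial number, i.e. by list position.
    """
    per_phrase = [
        duration_score(m_end - m_start, s_end - s_start)
        for (m_start, m_end), (s_start, s_end) in zip(model_phrases, speaker_phrases)
    ]
    total = sum(per_phrase) / len(per_phrase)  # average over all paired phrase sections
    return per_phrase, total
```

Note that `zip` silently stops at the shorter list, so when the number of speaker phrase sections differs from the number of model phrase sections, the unpaired sections are simply ignored in this sketch.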

The reading-aloud evaluation unit 32 also compares the time length of a model interval section with the time length of the speaker interval section, and evaluates the pausing when the sentence was read aloud. FIG. 3(B) is a diagram showing an example of the comparison between the time lengths of the model interval sections and those of the speaker interval sections. In the example of FIG. 3(B), the time length of the model interval section I11 is compared with that of the speaker interval section I21 corresponding to the model interval section I11 (for example, the section whose serial number matches); the time length of the model interval section I12 is compared with that of the corresponding speaker interval section I22; and the time length of the model interval section I13 is compared with that of the corresponding speaker interval section I23. The model interval sections following the model interval section I13 and the speaker interval sections following the speaker interval section I23 are compared in the same way. As the result of the comparison, the reading-aloud evaluation unit 32 calculates, for example, the time difference between the time length of the model interval section and that of the speaker interval section, and evaluates the pausing by calculating an evaluation score based on the absolute value of this time difference. As with the reading-speed evaluation, this score is calculated so that, for example, the closer the absolute value of the time difference is to 0, the higher the evaluation (that is, the higher the score). In other words, the longer or shorter the speaker's pauses are relative to those of the model, the larger the absolute value of the time difference and the lower the evaluation. Conversely, the closer the speaker's pauses are to those of the model, the smaller the absolute value of the time difference and the higher the evaluation. In this way, the pausing is evaluated (that is, an evaluation score is calculated) for each interval section. The reading-aloud evaluation unit 32 also evaluates the pausing over all interval sections based on the evaluations for the individual interval sections. In this evaluation, for example, the average of the pause scores calculated for the individual interval sections is calculated as the total pause score over all interval sections.

The reading-aloud evaluation unit 32 also compares the pitch in a model phrase section with the pitch in the speaker phrase section corresponding to that model phrase section, and evaluates the pitch for each phrase. As the result of the comparison, the reading-aloud evaluation unit 32 calculates, for example, the difference between the pitch in the model phrase section and the pitch in the speaker phrase section, and evaluates the pitch by calculating an evaluation score based on this difference. This score is calculated so that, for example, the closer the difference is to 0, the higher the evaluation (that is, the higher the score). In other words, the higher or lower the speaker's pitch is relative to that of the model, the larger the difference and the lower the evaluation. Conversely, the closer the speaker's pitch is to that of the model, the smaller the difference and the higher the evaluation. In this way, the pitch is evaluated (that is, an evaluation score is calculated) for each phrase section. When the pitches calculated at the predetermined time intervals within each phrase section are compared, the pitch is preferably evaluated after aligning the start positions of the model phrase section and the speaker phrase section and stretching or compressing the phrase section (phrase unit) so that the time lengths of the two phrase sections match. The lengths may be matched by simple uniform stretching or compression, or a technique such as DP matching may be used to dynamically align the positions to be evaluated within the phrase section. Alternatively, the pitch to be compared may be the average of the pitches calculated at the predetermined time intervals within each phrase section, in which case neither the start positions nor the time lengths of the phrase sections need to be matched. The reading-aloud evaluation unit 32 also evaluates the pitch over all phrase sections based on the evaluations for the individual phrase sections. In this evaluation, for example, the average of the pitch scores calculated for the individual phrase sections is calculated as the total pitch score over all phrase sections.
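As one concrete form of DP matching, dynamic time warping (DTW) could be used to align two pitch sequences of different lengths before their differences are accumulated. The following is a minimal sketch under that assumption; the embodiment does not state which DP matching variant is used, and the function name and the absolute-difference cost are hypothetical choices. Frames without a detectable pitch (for example, unvoiced sounds) are assumed to have been removed beforehand.

```python
def dtw_cost(model_pitch, speaker_pitch):
    """Accumulated absolute pitch difference (Hz) along the best DTW alignment.

    Each frame of one sequence may be matched to one or more consecutive frames
    of the other, which absorbs local differences in reading speed.
    """
    inf = float("inf")
    n, m = len(model_pitch), len(speaker_pitch)
    # cost[i][j]: best accumulated cost aligning the first i model frames
    # with the first j speaker frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(model_pitch[i - 1] - speaker_pitch[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

A smaller accumulated cost corresponds to a pitch contour closer to the model, so a score that rises as the cost approaches 0 would be consistent with the evaluation described above. The same alignment could be applied to per-frame volume values.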

The reading-aloud evaluation unit 32 also compares the volume in a model phrase section with the volume in the speaker phrase section corresponding to that model phrase section, and evaluates the volume for each phrase. For example, the sound pressure level (dB) described above is used as the volume to be compared. As the result of the comparison, the reading-aloud evaluation unit 32 calculates, for example, the difference between the volume in the model phrase section and the volume in the speaker phrase section, and evaluates the volume by calculating an evaluation score based on this difference. This score is calculated so that, for example, the closer the difference is to 0, the higher the evaluation (that is, the higher the score). In other words, the louder or quieter the speaker's voice is relative to that of the model, the larger the difference and the lower the evaluation. Conversely, the closer the speaker's volume is to that of the model, the smaller the difference and the higher the evaluation. In this way, the volume is evaluated (that is, an evaluation score is calculated) for each phrase section. As with the pitch, the volume is preferably evaluated after aligning the start positions of the model phrase section and the speaker phrase section and stretching or compressing the phrase section (phrase unit) so that the time lengths of the two phrase sections match. Alternatively, the volume to be compared may be the average of the volumes calculated at the predetermined time intervals within each phrase section. The reading-aloud evaluation unit 32 also evaluates the volume over all phrase sections based on the evaluations for the individual phrase sections. In this evaluation, for example, the average of the volume scores calculated for the individual phrase sections is calculated as the total volume score over all phrase sections.

The reading-aloud evaluation unit 32 also compares the articulation in a model phrase section with the articulation in the speaker phrase section corresponding to that model phrase section, and evaluates the articulation for each phrase. For the evaluation of articulation, the reading-aloud evaluation unit 32 uses, for example, the feature quantity (MFCC) representing the vocal tract characteristics calculated for each phrase. As the result of the comparison, the reading-aloud evaluation unit 32 calculates, for example, the degree of similarity between the feature quantity of the model phrase section and that of the speaker phrase section, and evaluates the articulation by calculating an evaluation score based on this similarity. This score is calculated so that, for example, the higher the similarity, the higher the evaluation (that is, the higher the score). In this way, the articulation is evaluated (that is, an evaluation score is calculated) for each phrase section. The reading-aloud evaluation unit 32 also evaluates the articulation over all phrase sections based on the evaluations for the individual phrase sections. In this evaluation, for example, the average of the articulation scores calculated for the individual phrase sections is calculated as the total articulation score over all phrase sections.
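The embodiment does not specify how the similarity between the feature quantities is calculated. As one hypothetical possibility, the MFCC vectors of the frames in a phrase could be averaged into a single vector per phrase, and the cosine similarity of the model and speaker vectors could be used as the similarity. The following sketch assumes exactly that; the averaging step, the choice of cosine similarity, and the function names are not taken from the embodiment.

```python
import math

def mean_vector(frames):
    """Average a list of equal-length feature vectors (one per frame) into one vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 1.0 means the same direction, 0.0 orthogonal."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treated as no similarity
    return dot / (norm_u * norm_v)

def articulation_similarity(model_mfcc_frames, speaker_mfcc_frames):
    """Similarity between the phrase-level MFCC vectors of the model and the speaker."""
    return cosine_similarity(mean_vector(model_mfcc_frames), mean_vector(speaker_mfcc_frames))
```

Averaging over a phrase discards the temporal order of the frames, so a frame-by-frame comparison after time alignment may distinguish articulation more finely, at a higher computational cost.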

The reading-aloud evaluation unit 32 then performs an overall evaluation of the reading of the sentence based on, for example, at least the evaluation of reading speed and the evaluation of pausing. In this overall evaluation, for example, the average (or alternatively the sum) of the reading-speed score for one phrase section and the pause score for at least one of the interval sections immediately before and after that phrase section is calculated as the overall score for one section. Alternatively, the reading-aloud evaluation unit 32 may perform the overall evaluation for one section based on at least two of the following evaluations: the evaluation of reading speed in one phrase section, the evaluation of pausing in at least one of the interval sections immediately before and after that phrase section, the evaluation of pitch in that phrase section, the evaluation of volume in that phrase section, and the evaluation of articulation in that phrase section. In this overall evaluation, for example, the average (or alternatively the sum) of the scores calculated in the at least two evaluations is calculated as the overall score for one section.

The reading-aloud evaluation unit 32 also performs an overall evaluation of the reading of the sentence based on, for example, at least the evaluation of reading speed over all phrase sections and the evaluation of pausing over all interval sections. In this overall evaluation, for example, the average (or alternatively the sum) of the total reading-speed score over all phrase sections and the total pause score over all interval sections is calculated as the overall score for all sections. Alternatively, the reading-aloud evaluation unit 32 may perform the overall evaluation of the reading of the sentence based on at least two of the following evaluations: the evaluation of reading speed over all phrase sections, the evaluation of pausing over all interval sections, the evaluation of pitch over all phrase sections, the evaluation of volume over all phrase sections, and the evaluation of articulation over all phrase sections. In this overall evaluation, for example, the average (or alternatively the sum) of the total scores calculated in the at least two evaluations is calculated as the overall score for all sections.
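The combination of two or more evaluation scores into an overall score, by average or by sum, might be sketched as follows. The function name, the dictionary keys, and the assumption that all scores share the same scale are introduced for this illustration only. The same function could be applied to the scores for one section or to the total scores over all sections.

```python
def overall_score(scores, use_sum=False):
    """Combine at least two evaluation scores into an overall score.

    scores maps an evaluation item name to its score; the average is returned
    by default, or the sum when use_sum is True.
    """
    if len(scores) < 2:
        raise ValueError("at least two evaluations are required")
    total = sum(scores.values())
    return total if use_sum else total / len(scores)
```

For example, `overall_score({"speed": 80.0, "pause": 60.0})` averages the reading-speed and pause scores; further items such as pitch, volume, and articulation can be added to the dictionary in the same way.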

The display processing unit 33 displays at least one of the evaluation of reading speed, the evaluation of pausing, and the overall evaluation on the screen of the display D. This makes the speaker aware of whether the reading speed and pausing are appropriate, enabling effective practice and training. For example, as the evaluation of reading speed, the reading-speed score for each phrase section may be displayed, or the total reading-speed score over all phrase sections may be displayed. As the evaluation of pausing, the pause score for each interval section may be displayed, or the total pause score over all interval sections may be displayed. As the overall evaluation, the overall score for one section may be displayed, or the overall score for all sections may be displayed. In addition to these evaluations, the display processing unit 33 may display at least one of the evaluation of pitch, the evaluation of volume, and the evaluation of articulation on the screen of the display D. In this case too, as the evaluation of pitch, volume, or articulation, the score for each phrase section may be displayed, or the total score over all phrase sections may be displayed.

Furthermore, the display processing unit 33 displays on a single screen, in a manner that allows comparison, all or part of a first graph (hereinafter referred to as "model graph") and all or part of a second graph (hereinafter referred to as "speaker graph"). The model graph represents the time-series change of at least one of the pitch and the volume calculated at the predetermined time intervals based on the model voice waveform data, and the speaker graph represents the time-series change of at least one of the pitch and the volume calculated at the predetermined time intervals based on the speaker voice waveform data. This makes the speaker aware of whether the pitch and volume, in addition to the reading speed and pausing, are appropriate, enabling effective practice and training.

[2. Example of operation of the reading-aloud evaluation device S]
Next, an example of the operation of the reading-aloud evaluation device S according to the first embodiment will be described with reference to FIG. 4 and other figures. FIG. 4 is a flowchart showing the evaluation display process performed by the control unit 3 of the reading-aloud evaluation device S. As a premise of the evaluation display process shown in FIG. 4, the following data are assumed to be stored in the storage unit 2, for example in association with the audio file of the model voice waveform data: the data of the model phrase sections and model interval sections specified from the model voice waveform data, the sound pressure and pitch data calculated at predetermined time intervals from the model voice waveform data, and the vocal tract feature (MFCC) data calculated for each model phrase section from the model voice waveform data.

The process shown in FIG. 4 starts, for example, when the speaker designates, via the operation unit 4, a desired audio file that serves as the model for reading aloud and issues an instruction to start reading. When the process starts, the control unit 3 turns on the microphone input and reads from the storage unit 2 the model phrase section, model interval section, sound pressure, pitch, and vocal tract feature (MFCC) data associated with the designated audio file (step S1). The read data is stored in the RAM. A serial number is assigned to each speaker phrase section and each speaker interval section. When the speaker starts reading the sentence aloud, the voice uttered during the reading is picked up by the microphone M, and speaker voice waveform data representing the waveform of the picked-up voice is input to the reading-aloud evaluation device S via the interface unit 5.

While storing (recording) the input speaker voice waveform data in the storage unit 2, the control unit 3 of the reading-aloud evaluation device S sequentially specifies the speaker phrase sections and speaker interval sections from that data, as described above (step S2). The data of each specified speaker phrase section and speaker interval section is given a serial number and stored in the RAM. The stored data of each speaker phrase section and each speaker interval section is used in the evaluations described later.
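The section specification in step S2 can be illustrated with a minimal sketch. It assumes that per-frame sound pressure levels are already available and that a section boundary is recognized only when the level stays below a threshold for at least a minimum silent time, as the description of start and end timings suggests. The names `Section` and `find_sections`, the frame length, and the threshold values are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class Section:
    kind: str     # "phrase" or "interval"
    serial: int   # serial number within its kind, starting at 1
    start: float  # seconds
    end: float    # seconds


def find_sections(levels_db, frame_sec=0.01, threshold_db=-40.0,
                  min_silence_sec=0.3):
    """Split per-frame sound pressure levels into phrase and interval sections.

    All default values are illustrative assumptions.
    """
    min_gap = round(min_silence_sec / frame_sec)

    # Collect runs of frames at or above the threshold as [start, end) pairs.
    runs = []
    start = None
    for i, level in enumerate(levels_db):
        if level >= threshold_db and start is None:
            start = i
        elif level < threshold_db and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(levels_db)])

    # A silence shorter than the minimum gap does not split a phrase.
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # Emit phrase sections and the interval sections between them.
    sections = []
    for n, (s, e) in enumerate(merged, start=1):
        if n > 1:
            prev_end = merged[n - 2][1]
            sections.append(
                Section("interval", n - 1, prev_end * frame_sec, s * frame_sec))
        sections.append(Section("phrase", n, s * frame_sec, e * frame_sec))
    return sections
```

With these assumed defaults, a 20 ms dip inside a phrase is ignored, while a 400 ms silence produces an interval section between two phrase sections.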

Next, based on the input speaker voice waveform data, the control unit 3 calculates the sound pressure and pitch at predetermined time intervals and calculates the vocal tract feature (MFCC) for each model phrase section, as described above (step S3). The calculated sound pressure, pitch, and vocal tract feature (MFCC) data is stored in the RAM and is used in the evaluations described later.

Next, the control unit 3 evaluates the reading speed by comparing the time length of each model phrase section with the time length of the corresponding speaker phrase section in serial-number order (step S4). As the evaluation result, for example, the reading-speed score for each phrase section and the total reading-speed score over all phrase sections are calculated, as described above.

Next, the control unit 3 evaluates the interval by comparing the time length of each model interval section with the time length of the corresponding speaker interval section in serial-number order (step S5). As the evaluation result, for example, the interval score for each interval section and the total interval score over all interval sections are calculated, as described above.
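Steps S4 and S5 both compare section time lengths in serial-number order. The following is a minimal sketch of such a comparison. The scoring formula (100 points minus the relative deviation from the model length) and the use of a plain mean as the total score are assumptions made for illustration; this passage does not fix either. The function names are hypothetical.

```python
def duration_score(model_sec, speaker_sec):
    """Return a 0-100 score; 100 when the speaker's length equals the model's.

    The formula is an illustrative assumption, not the patented one.
    """
    if model_sec <= 0:
        raise ValueError("model section length must be positive")
    deviation = abs(speaker_sec - model_sec) / model_sec
    return max(0.0, 100.0 * (1.0 - deviation))


def evaluate_by_serial(model_lengths, speaker_lengths):
    """Score sections pairwise in serial-number order and average them."""
    per_section = [duration_score(m, s)
                   for m, s in zip(model_lengths, speaker_lengths)]
    total = sum(per_section) / len(per_section) if per_section else 0.0
    return per_section, total
```

The same function can serve both steps: phrase section lengths yield reading-speed scores, and interval section lengths yield interval scores.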

Next, the control unit 3 evaluates the pitch by comparing the pitch of each model phrase section with the pitch of the corresponding speaker phrase section in serial-number order (step S6). As the evaluation result, for example, the pitch score for each phrase section and the total pitch score over all phrase sections are calculated, as described above.

Next, the control unit 3 evaluates the volume by comparing the volume of each model phrase section with the volume of the corresponding speaker phrase section in serial-number order (step S7). As the evaluation result, for example, the volume score for each phrase section and the total volume score over all phrase sections are calculated, as described above.

Next, the control unit 3 evaluates the articulation by comparing the feature (MFCC) representing the vocal tract characteristics of each model phrase section with the feature (MFCC) representing the vocal tract characteristics of the corresponding speaker phrase section in serial-number order (step S8). As the evaluation result, for example, the articulation score for each phrase section and the total articulation score over all phrase sections are calculated, as described above.

Next, the control unit 3 performs an overall evaluation of the reading of the sentence based on the reading-speed, interval, pitch, volume, and articulation evaluation results (step S9). As the overall evaluation result, for example, the overall score for one section and the overall score for all sections are calculated, as described above.
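How the individual scores are combined in step S9 is not spelled out in this passage. As a hedged illustration only, the sketch below combines whichever scores are available with a weighted mean. The function name `overall_score`, the dictionary keys, and the default equal weights are assumptions, and the example values shown on the screens in the figures are not necessarily reproduced by this rule.

```python
def overall_score(scores, weights=None):
    """Combine named scores into one overall score by a weighted mean.

    `scores` maps an evaluation name to its 0-100 score. Equal weights are
    assumed when `weights` is omitted (an illustrative choice).
    """
    if not scores:
        raise ValueError("at least one score is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight
```

When only one score is supplied, for instance the interval score of an interval section, the overall score equals that score.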

Next, the control unit 3 displays the section scroll screen (step S10). The section scroll screen displays the model graph and the speaker graph in a scrollable manner and also displays information based on the evaluation results of steps S4 to S9.

FIG. 5 is a diagram showing an example of the section scroll screen. The section scroll screen shown in FIG. 5 includes a graph display field 51, an evaluation result display field 52, a "Switch" key 53, a scroll bar 54, and a "Close" key 55. The upper part of the graph display field 51 shows the pitch model graph (line) and the volume model graph (bars) for some of the model phrase sections in the model graph, and the lower part of the field 51 shows the pitch speaker graph (line) and the volume speaker graph (bars) for some of the speaker phrase sections in the speaker graph. Volume and pitch are assigned to the vertical axis (Y axis) of the graph display field 51, and time is assigned to the horizontal axis (X axis). The display area in the graph display field 51 scrolls along the time axis in response to a display switching instruction given, for example, by operating the scroll bar 54. The upper and lower display areas of the graph display field 51 may also be separated so that each scrolls independently.

In the graph display field 51, the overall evaluation for each phrase section is displayed as pictograms 51a and 51b. The pictogram 51a is a double circle, which means that the overall score for the section is, for example, between 70 and 100 points. The pictogram 51b is a triangle, which means that the overall score for the section is, for example, between 30 and 49 points. The overall score for one section may instead be represented by weather pictograms such as sunny, cloudy, and rainy, or the overall evaluation may be shown as the numerical score itself. Each model phrase section and speaker phrase section in the graph display field 51 is displayed together with the character string contained in the phrase. The position of each character displayed in association with a speaker phrase section is determined, for example, by a labeling process. Based on the text data of the sentence containing the phrases, the speaker voice waveform data, and the frequency spectrogram of the speaker voice waveform data, the labeling process assigns phoneme labels that match the content read aloud (uttered) and specifies the boundary positions between phonemes. Since various known techniques can be applied to the labeling process, a detailed description is omitted.
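The mapping from a section's overall score to a pictogram can be sketched as a simple range lookup. Only the two example ranges given in the text are encoded; scores outside them return `None`, because the text does not state which pictograms apply there. The names `SYMBOL_RANGES` and `symbol_for` are hypothetical.

```python
# Example ranges taken from the text; other ranges are left unspecified.
SYMBOL_RANGES = [
    (70, 100, "double circle"),  # pictogram 51a
    (30, 49, "triangle"),        # pictogram 51b
]


def symbol_for(score):
    """Return the pictogram name for an overall score, or None if unknown."""
    for low, high, symbol in SYMBOL_RANGES:
        if low <= score <= high:
            return symbol
    return None  # range not covered by the examples in the text
```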

The evaluation result display field 52 shows the overall score for all sections (Overall: 78), the total pitch score (Pitch: 82), the total volume score (Volume: 89), the total articulation score (Articulation: 68), the total reading-speed score (Speed: 81), and the total interval score (Interval: 90). That is, in this case, on the section scroll screen shown in FIG. 5, the overall evaluation (52) of all sections (an example of a plurality of sections) and the overall evaluation (51a or 51b) of one section are displayed on a screen of the same level (that is, without any key operation). This lets the speaker grasp the overall evaluation of all sections and the overall evaluation of a single section at the same time.

While the section scroll screen is displayed, the control unit 3 determines whether the "Switch" key 53 has been designated (step S11). If the control unit 3 determines that the "Switch" key 53 has been designated, for example because the speaker designated it via the operation unit 4 (step S11: YES), the process proceeds to step S15. If the control unit 3 determines that the "Switch" key 53 has not been designated (step S11: NO), the process proceeds to step S12.

In step S12, the control unit 3 determines whether the scroll bar 54 has been operated. If the control unit 3 determines that the scroll bar 54 has been operated, for example because the speaker operated it (moved it horizontally) via the operation unit 4 (step S12: YES), it updates the display area in the graph display field 51 (step S13) and displays (scrolls) the section scroll screen with the updated display area (step S10). That is, the control unit 3 displays, on a single screen and in a comparable manner, a model graph portion corresponding to at least one model phrase section in the model graph and a speaker graph portion corresponding to at least one speaker phrase section in the speaker graph. At this time, the control unit 3 displays the model graph portion and the speaker graph portion on the single screen so that the model interval sections and the speaker interval sections can also be compared. Then, in response to a display switching instruction given by operating the scroll bar 54, the control unit 3 switches the model graph portion displayed on the screen to a model graph portion not yet displayed (that is, a model graph portion corresponding to another model phrase section in the model graph) and switches the speaker graph portion displayed on the screen to a speaker graph portion not yet displayed (that is, a speaker graph portion corresponding to another speaker phrase section in the speaker graph). Thus, even when the model graph and the speaker graph do not fit on one screen along the time axis, switching the display lets the speaker check over a long period whether the pitch and volume are appropriate. If the control unit 3 determines that the scroll bar has not been operated (step S12: NO), the process proceeds to step S14.

In step S14, the control unit 3 determines whether the "Close" key 55 has been designated. If the control unit 3 determines that the "Close" key 55 has not been designated (step S14: NO), the process returns to step S11. If it determines that the "Close" key 55 has been designated (step S14: YES), the process shown in FIG. 4 ends.

In step S15, the control unit 3 displays the section selection screen. The section selection screen displays the phrase sections and interval sections in a selectable manner and also displays information based on the evaluation results of steps S4 to S9.

FIG. 6 is a diagram showing an example of the section selection screen. The section selection screen shown in FIG. 6 includes a graph display field 61, an evaluation result display field 62, a "Switch" key 63, a "Details" key 64, and a "Close" key 65. The graph display field 61 shows the pitch model graph (line) and the volume model graph (bars) for all model phrase sections in the model graph, and the lower part of the field 61 shows the pitch speaker graph (line) and the volume speaker graph (bars) for all speaker phrase sections in the speaker graph. Each phrase section and each interval section in the display area of the graph display field 61 can be selected, for example, by the speaker with a cursor via the operation unit 4. In the example of FIG. 6, the second model phrase section 61a from the beginning is selected by the cursor; moving the cursor selects, for example, the second model interval section 61b or the third model phrase section 61c. The overall evaluation for each phrase section is displayed as a pictogram at the bottom edge 61d of the graph display field 61, but it need not be displayed. The content of the evaluation result display field 62 is the same as that of the evaluation result display field 52.

While the section selection screen is displayed, the control unit 3 determines whether the "Switch" key 63 has been designated (step S16). If the control unit 3 determines that the "Switch" key 63 has been designated (step S16: YES), the process returns to step S10 and the section scroll screen is displayed. If the control unit 3 determines that the "Switch" key 63 has not been designated (step S16: NO), the process proceeds to step S17. In step S17, the control unit 3 determines whether the "Details" key 64 has been designated. If the control unit 3 determines that the "Details" key 64 has been designated, for example because the speaker selected one of the sections with the cursor via the operation unit 4 and then designated the "Details" key 64 (step S17: YES), the process proceeds to step S19. If the control unit 3 determines that the "Details" key 64 has not been designated (step S17: NO), the process proceeds to step S18.

In step S18, the control unit 3 determines whether the "Close" key 65 has been designated. If the control unit 3 determines that the "Close" key 65 has not been designated (step S18: NO), the process returns to step S16. If it determines that the "Close" key 65 has been designated (step S18: YES), the process shown in FIG. 4 ends.

In step S19, the control unit 3 displays the section detail screen. There are two types of section detail screen: one that shows the details of a selected phrase section and one that shows the details of a selected interval section. FIG. 7 is a diagram showing an example of the screen that shows the details of a selected phrase section, and FIG. 8 is a diagram showing an example of the screen that shows the details of a selected interval section.

The section detail screen shown in FIG. 7 includes a graph display field 71, an evaluation result display field 72, a "Next" key 73, a "Previous" key 74, and a "Back" key 75. The upper part of the graph display field 71 shows the pitch model graph portion (line) and the volume model graph portion (bars) corresponding to the model phrase section F12 selected in the graph display field 61. The lower part of the field 71 shows the pitch speaker graph portion (line) and the volume speaker graph portion (bars) corresponding to the speaker phrase section F22, which corresponds to the selected model phrase section F12. In this display, as shown in FIG. 7, the control unit 3 aligns the start position 71b of the model phrase section with the start position 71b of the speaker phrase section along the time axis. This lets the speaker see at a glance whether the reading speed is appropriate.
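Aligning the two start positions along the time axis amounts to shifting each graph portion so that its section start maps to the same origin. A minimal sketch, assuming each graph portion is a list of `(time, value)` points in seconds; the function name `align_to_start` and the data layout are hypothetical.

```python
def align_to_start(points, section_start):
    """Shift (time, value) points so that the section start maps to time 0.

    Points before the section start are dropped.
    """
    return [(t - section_start, v) for t, v in points if t >= section_start]
```

After both the model and the speaker portions are shifted this way, a longer or shorter speaker phrase shows up directly as a difference in where the two graphs end.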

The evaluation result display field 72 shows the overall score for the one section corresponding to the selected model phrase section F12 (Overall: 52), the pitch score for that section (Pitch: 78), the volume score for that section (Volume: 56), the articulation score for that section (Articulation: 68), the reading-speed score for that section (Speed: 78), and the interval score for that section (Interval: 90). That is, in this case, the overall evaluation (62) of all sections (an example of a plurality of sections) on the section selection screen shown in FIG. 6 and the overall evaluation (72) of one section on the section detail screen shown in FIG. 7 are displayed on screens of different levels. This lets the speaker grasp the overall evaluation of all sections and the overall evaluation of a single section separately and clearly.

While such a section selection screen is displayed, the control unit 3 determines whether the "Next" key 73 has been designated (step S20). If the control unit 3 determines that the "Next" key 73 has been designated (step S20: YES), it selects the next section (a phrase section in the example of FIG. 7) (step S21) and displays the section detail screen corresponding to the selected section (the model phrase section F13 and the speaker phrase section F23 in the example of FIG. 7) (step S19). If the control unit 3 determines that the "Next" key 73 has not been designated (step S20: NO), the process proceeds to step S22.

In step S22, the control unit 3 determines whether the "Previous" key 74 has been designated. If the control unit 3 determines that the "Previous" key 74 has been designated (step S22: YES), it selects the previous section (a phrase section in the example of FIG. 7) (step S23) and displays the section detail screen corresponding to the selected section (the model phrase section F11 and the speaker phrase section F21 in the example of FIG. 7) (step S19). If the control unit 3 determines that the "Previous" key 74 has not been designated (step S22: NO), the process proceeds to step S24.

In step S24, the control unit 3 determines whether the "Back" key 75 has been designated. If the control unit 3 determines that the "Back" key 75 has not been designated (step S24: NO), the process returns to step S20. If it determines that the "Back" key 75 has been designated (step S24: YES), it displays the section selection screen again (step S15).

The section detail screen shown in FIG. 8, on the other hand, includes a graph display field 81, an evaluation result display field 82, a "Next" key 83, a "Previous" key 84, and a "Back" key 85. The upper part of the graph display field 81 shows the pitch model graph portion (line) and the volume model graph portion (bars) before and after the model interval section I11 selected in the graph display field 61. The lower part of the field 81 shows the pitch speaker graph portion (line) and the volume speaker graph portion (bars) before and after the speaker interval section I21, which corresponds to the selected model interval section I11. That is, the control unit 3 displays the model graph portion and the speaker graph portion on a single screen so that the model graph portion corresponding to a model phrase section, together with the model interval section that follows it, can be compared with the speaker graph portion corresponding to a speaker phrase section, together with the speaker interval section that follows it. In this display, as shown in FIG. 8, the control unit 3 aligns the start position 81b of the model interval section with the start position 81b of the speaker interval section along the time axis. This lets the speaker see at a glance whether the pausing is appropriate.

The evaluation result display field 82 shows the overall score for the one section corresponding to the selected model interval section I11 (Overall: 62) and the interval score for that section (Interval: 62). In this case, the overall score is calculated from the interval evaluation alone and therefore equals the interval score.

While such a section selection screen is displayed, if the control unit 3 determines that the "Next" key 83 has been designated (step S20: YES), it selects the next section (an interval section in the example of FIG. 8) (step S21) and displays the section detail screen corresponding to the selected section (the model interval section I12 and the speaker interval section I22 in the example of FIG. 8) (step S19). After this, if, for example, the control unit 3 determines that the "Previous" key 84 has been designated (step S22: YES), it selects the previous section (an interval section in the example of FIG. 8) (step S23) and displays the section detail screen corresponding to the selected section (step S19). If the control unit 3 determines that the "Back" key 85 has been designated (step S24: YES), it displays the section selection screen again (step S15).
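The key-driven transitions among the three screens (steps S11 to S24) can be summarized as a small state machine. This is a sketch under stated assumptions: the state and key names are hypothetical, the scroll bar handling of steps S12 and S13 is omitted, and clamping the section index at the first and last section is an assumption, since the text does not say what happens at the ends.

```python
SCROLL, SELECT, DETAIL, CLOSED = "scroll", "select", "detail", "closed"


def next_state(state, key, index, count):
    """Return (new_state, new_index) after a key is designated.

    `index` is the selected section (0-based) and `count` the number of
    sections of that kind. Unhandled keys leave the state unchanged.
    """
    if state == SCROLL:
        if key == "switch":              # S11: YES -> S15
            return SELECT, index
        if key == "close":               # S14: YES -> end
            return CLOSED, index
    elif state == SELECT:
        if key == "switch":              # S16: YES -> S10
            return SCROLL, index
        if key == "details":             # S17: YES -> S19
            return DETAIL, index
        if key == "close":               # S18: YES -> end
            return CLOSED, index
    elif state == DETAIL:
        if key == "next":                # S20: YES -> S21 -> S19
            return DETAIL, min(index + 1, count - 1)
        if key == "previous":            # S22: YES -> S23 -> S19
            return DETAIL, max(index - 1, 0)
        if key == "back":                # S24: YES -> S15
            return SELECT, index
    return state, index
```

Keeping the transitions in one pure function makes each branch of the flowchart easy to check against the step numbers in the comments.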

As described above, according to the above embodiment, the reading speed is evaluated for each phrase by comparing the time length of the model phrase section specified from the model voice waveform data with the time length of the speaker phrase section specified from the speaker voice waveform data, and the interval is evaluated by comparing the time length of the model interval section specified from the model voice waveform data with the time length of the speaker interval section specified from the speaker voice waveform data. An overall evaluation of the reading of the sentence is then performed based on at least the reading-speed evaluation and the interval evaluation, and at least one of the reading-speed evaluation, the interval evaluation, and the overall evaluation is displayed on the screen. Therefore, a speaker who is engaged in language learning or in speech training such as announcing or recitation can be presented not only with evaluations of intonation, articulation, and the like, but also with at least one of the reading-speed evaluation, the interval evaluation, and an overall evaluation that takes these into account. As a result, the speaker can be made aware of whether the reading speed and pausing are appropriate, enabling effective practice and training.

[3. Second embodiment]
Next, the configuration, functions, and the like of the reading-aloud evaluation device S according to the second embodiment of the present invention will be described. Since the basic configuration and functions of the reading-aloud evaluation device S according to the first embodiment also apply to the device according to the second embodiment, the following mainly describes the configuration and functions that differ from the first embodiment. In the second embodiment, the voice processing unit 31 of the reading-aloud evaluation device S may be configured to specify, for each sentence element, a model sentence element section (an example of a first sentence element section) from the start timing to the end timing of a sentence element constituting the sentence based on the model voice waveform data, and to specify, for each sentence element, a speaker sentence element section (an example of a second sentence element section) from the start timing to the end timing of the sentence element based on the speaker voice waveform data. Here, a sentence element is a unit that makes up a sentence. Examples of sentence elements include the phrases described above, clauses (bunsetsu), words, and combined phrases. A combined phrase is formed by joining a plurality of phrases. A phrase in other embodiments consists of one or more clauses; that is, one phrase may consist of a single clause or of a plurality of clauses. A clause is, for example, a group of one or more words. Words include independent words such as nouns, verbs, adjectives, adverbs, and conjunctions (parts of speech that can form a clause on their own) and dependent words such as auxiliary verbs and particles (parts of speech that cannot form a clause on their own). Examples of sentences to be read aloud include texts used in language learning, announcing, and recitation.

In the second embodiment as well, the start timing and the end timing may each be recognized from the voice waveform or from the sound pressure level (dB) calculated as described above. For example, the voice processing unit 31 recognizes as the start timing the point at which the amplitude of the voice waveform becomes equal to or greater than a predetermined value, or alternatively the point at which the sound pressure level (dB) becomes equal to or greater than a predetermined value. Likewise, the voice processing unit 31 recognizes as the end timing the point at which the amplitude of the voice waveform falls below a predetermined value, or alternatively the point at which the sound pressure level (dB) falls below a predetermined value. Preferably, the point at which the sound pressure level (dB) falls below the predetermined value is recognized as an end timing, and the point at which it next becomes equal to or greater than the predetermined value is recognized as a start timing, only when the time between those two points (the silent time) is equal to or greater than a threshold (the same applies to the amplitude of the voice waveform). The intent is that a sentence element is not divided at a silence shorter than the threshold.

Suppose, for example, that the model voice waveform data is a slow reading with a pause after each phrase: "in the car (pause) mobile phones (pause) after setting silent mode (pause) please refrain from calls" (車内では／携帯電話は／マナーモードに設定の上／通話はご遠慮下さい). Recognizing the start and end timings by the above method yields model sentence element sections for four phrases: "in the car", "mobile phones", "after setting silent mode", and "please refrain from calls". If the speaker reads the same sentence with the same pauses as the model, the above method likewise yields speaker sentence element sections for the same four phrases. In contrast, if the speaker reads "after setting silent mode" and "please refrain from calls" quickly in a single breath, that part becomes one phrase, and the above method specifies a single speaker sentence element section for it without any division. When a plurality of phrases read by the model thus correspond to one phrase read by the speaker, it becomes difficult to compare the model sentence element sections with the speaker sentence element section. In such a case, the voice processing unit 31 preferably specifies the speaker sentence element sections by dividing the phrase read by the speaker ("after setting silent mode please refrain from calls") into a plurality of clauses or words so that it matches the phrases read by the model.
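The timing recognition described above can be sketched in a few lines. The following is a minimal illustration, not the device's actual implementation: it assumes the sound pressure level has already been computed once per fixed-length frame, and the function name `find_sections` and its parameters are hypothetical.

```python
from typing import List, Tuple


def find_sections(levels_db: List[float], frame_ms: int,
                  level_threshold_db: float,
                  min_silence_ms: int) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) for each voiced section.

    A frame counts as voiced when its level is >= level_threshold_db.
    A silent run shorter than min_silence_ms does not split a section.
    """
    sections = []
    start = None        # first frame of the section being built
    last_voiced = None  # most recent voiced frame seen so far
    for i, level in enumerate(levels_db):
        if level < level_threshold_db:
            continue
        if start is None:
            start = i
        elif (i - last_voiced - 1) * frame_ms >= min_silence_ms:
            # The silence was long enough: close the section and open a new one.
            sections.append((start * frame_ms, (last_voiced + 1) * frame_ms))
            start = i
        last_voiced = i
    if start is not None:
        sections.append((start * frame_ms, (last_voiced + 1) * frame_ms))
    return sections


levels = [-60, -20, -20, -60, -20, -60, -60, -60, -20, -20]
print(find_sections(levels, frame_ms=100, level_threshold_db=-40,
                    min_silence_ms=250))
```

In this example the single silent frame at index 3 is shorter than the 250 ms threshold and is absorbed into the first section, while the three silent frames at indices 5 to 7 separate two sections.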

More specifically, the voice processing unit 31, for example, calculates in advance, for each model sentence element section, a cumulative value of the sound pressure (hereinafter, "model cumulative sound pressure") from the voice waveform indicated by the stored model voice waveform data, and stores it in a RAM or the like in association with that section. The model cumulative sound pressures are calculated in the order in which the phrases are read aloud, for example "in the car" → "mobile phones" → "after setting silent mode" → "please refrain from calls"; that is, as the first model cumulative sound pressure, the second model cumulative sound pressure, and so on. The voice processing unit 31 also calculates in advance, for each model sentence element section, the number of phonemes (hereinafter, "model phonemes") from the same voice waveform, and stores it in a RAM or the like in association with that section. Phonemes are of three kinds: a vowel alone, a consonant alone, and a combination of a consonant and a vowel. There are five vowels: a (あ), i (い), u (う), e (え), and o (お). Consonants are the sound components other than vowels (for example, k, s, t, n, h, m, y, r, w). The numbers of model phonemes are likewise determined in the order in which the phrases are read aloud, as the first number of model phonemes, the second number of model phonemes, and so on. Methods for identifying phonemes are well known from labeling techniques and the like, so a detailed description is omitted.

The voice processing unit 31 then receives speaker voice waveform data indicating the waveform of the voice uttered when the speaker reads the sentence aloud and, from that waveform, accumulates the sound pressure in time series while also identifying the number of phonemes in time series. During this period, the voice processing unit 31 specifies a division timing (corresponding to a start timing or an end timing) by jointly considering how large the accumulated sound pressure has become, how many phonemes have been identified, and so on. For example, the voice processing unit 31 compares the accumulated sound pressure with the model cumulative sound pressures (in the order first, second, and so on) to obtain a first timing at which the difference falls within a threshold, compares the number of identified phonemes with the numbers of model phonemes (in the same order) to obtain a second timing at which the difference falls within a threshold, and sequentially specifies the division timings from the first and second timings. For example, a time between the first timing (expressed, for example, as the elapsed time from the start of the voice waveform) and the second timing is specified as the division timing. The voice processing unit 31 then sequentially specifies the final speaker sentence element sections by further dividing, at the division timings, the speaker sentence element sections specified from the sound pressure level or the like as described above. In this case, for example, the part "after setting silent mode please refrain from calls" is divided into "after setting silent mode" and "please refrain from calls", and a speaker sentence element section is specified for each. That is, the division timing serves as the end timing of the speaker sentence element section for "after setting silent mode" and as the start timing of the speaker sentence element section for "please refrain from calls". Accordingly, the text representing "after setting silent mode please refrain from calls" is split into the text representing "after setting silent mode" and the text representing "please refrain from calls". As a result, the model sentence element section corresponding to a phrase read by the model is compared with the speaker sentence element section corresponding to, for example, a clause read by the speaker (a clause obtained by dividing a phrase into several parts).
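The division step above can be illustrated with a small sketch. It rests on assumptions not stated in the text: the sound pressure is given as one value per frame, phoneme times come from an external labeler, accumulation starts at the beginning of the undivided speaker section, and the division timing is taken as the midpoint of the first and second timings (the text only says it lies between them). All function and parameter names are hypothetical.

```python
from typing import List, Optional


def first_time_within(values: List[float], times: List[float],
                      target: float, tol: float) -> Optional[float]:
    """Earliest time at which a running value is within tol of target."""
    for value, time in zip(values, times):
        if abs(value - target) <= tol:
            return time
    return None


def division_timing(frame_pressure: List[float], frame_sec: float,
                    phoneme_times: List[float],
                    model_cum_pressure: float, model_phoneme_count: int,
                    pressure_tol: float,
                    phoneme_tol: int = 0) -> Optional[float]:
    # Running sum of the speaker's sound pressure, one value per frame.
    running, total = [], 0.0
    for pressure in frame_pressure:
        total += pressure
        running.append(total)
    frame_times = [(i + 1) * frame_sec for i in range(len(frame_pressure))]

    # First timing: accumulated pressure comes close to the model value.
    t1 = first_time_within(running, frame_times,
                           model_cum_pressure, pressure_tol)
    # Second timing: phoneme count comes close to the model count.
    counts = [float(n) for n in range(1, len(phoneme_times) + 1)]
    t2 = first_time_within(counts, phoneme_times,
                           float(model_phoneme_count), float(phoneme_tol))
    if t1 is None or t2 is None:
        return None
    return (t1 + t2) / 2.0  # assumed rule: midpoint of the two timings


print(division_timing([1.0] * 10, 0.5,
                      [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0],
                      model_cum_pressure=4.0, model_phoneme_count=5,
                      pressure_tol=0.5))
```

Here the accumulated pressure reaches the model value at 2.0 s and the fifth phoneme is identified at 2.5 s, so the sketch places the division at 2.25 s. A real system would also need to normalize for overall loudness differences between model and speaker, which this sketch ignores.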

Conversely, suppose that the model voice waveform data reads part of the sentence quickly, for example "in the car (pause) mobile phones after setting silent mode please refrain from calls". Recognizing the start and end timings by the above method yields model sentence element sections for two phrases: "in the car" and "mobile phones after setting silent mode please refrain from calls". In contrast, if the speaker reads slowly with a pause after each phrase, "in the car (pause) mobile phones (pause) after setting silent mode (pause) please refrain from calls", the above method yields speaker sentence element sections for four phrases: "in the car", "mobile phones", "after setting silent mode", and "please refrain from calls". When one phrase read by the model thus corresponds to a plurality of phrases read by the speaker, it is again difficult to compare the model sentence element section with the speaker sentence element sections. In such a case, the voice processing unit 31 preferably specifies a speaker sentence element section corresponding to a combined phrase containing the three phrases "mobile phones", "after setting silent mode", and "please refrain from calls", so that it matches the phrase read by the model.

In this case as well, the voice processing unit 31 stores, for example, the model cumulative sound pressure and the number of model phonemes for each model sentence element section. The voice processing unit 31 then receives speaker voice waveform data indicating the waveform of the voice uttered when the speaker reads the sentence aloud and, from that waveform, accumulates the sound pressure in time series while also identifying the number of phonemes in time series. The voice processing unit 31 specifies the end timing of the combined phrase from, for example, the timing at which the difference between the accumulated sound pressure and the model cumulative sound pressure falls within a threshold (for example, the timing at which the accumulated sound pressure reaches the model cumulative sound pressure) and the timing at which the difference between the number of identified phonemes and the number of model phonemes falls within a threshold (for example, the timing at which the number of identified phonemes reaches the number of model phonemes). The start timing of the combined phrase corresponds to the start timing of the first phrase it contains. The voice processing unit 31 then specifies the section from the start timing to the end timing of the combined phrase as the final speaker sentence element section. In this case, for example, the three phrases "mobile phones", "after setting silent mode", and "please refrain from calls" are treated as the combined phrase "mobile phones after setting silent mode please refrain from calls", and its speaker sentence element section is specified. Accordingly, the texts representing "mobile phones", "after setting silent mode", and "please refrain from calls" are joined into the text representing "mobile phones after setting silent mode please refrain from calls". As a result, the model sentence element section corresponding to the phrase read by the model is compared with the speaker sentence element section corresponding to the combined phrase read by the speaker.
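A simplified sketch of this merging step follows. It is an illustration under stated assumptions rather than the described method in full: only the phoneme-count criterion is used (the cumulative sound pressure check is omitted for brevity), each speaker section is assumed to carry its own phoneme count, and the name `merge_to_model` and the counts in the example are invented for illustration.

```python
from typing import List, Tuple

Section = Tuple[float, float, int]  # (start_sec, end_sec, phoneme_count)


def merge_to_model(speaker_sections: List[Section],
                   model_phoneme_counts: List[int],
                   tol: int = 0) -> List[Section]:
    """Join consecutive speaker sections so that each merged section
    holds about as many phonemes as the matching model section."""
    merged = []
    i = 0
    for target in model_phoneme_counts:
        if i >= len(speaker_sections):
            break
        # The combined phrase starts where its first phrase starts.
        start, end, count = speaker_sections[i]
        i += 1
        # Absorb following sections until the count is within tol of target.
        while count < target - tol and i < len(speaker_sections):
            _, end, extra = speaker_sections[i]
            count += extra
            i += 1
        merged.append((start, end, count))
    return merged


speaker = [(0.0, 0.8, 6), (1.2, 2.0, 8), (2.4, 3.6, 12), (4.0, 5.0, 11)]
print(merge_to_model(speaker, model_phoneme_counts=[6, 31]))
```

With the model split into two sections of 6 and 31 phonemes, the last three speaker sections are joined into one section running from 1.2 s to 5.0 s, which can then be compared with the model's second section.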

In the second embodiment, as in the first embodiment, the voice processing unit 31 calculates the sound pressure level (dB) as the sound pressure at predetermined time intervals from data cut out of the voice waveform data at, for example, those intervals. The voice processing unit 31 also calculates the fundamental frequency (Hz) from the data cut out at the predetermined intervals and uses it as the pitch at each interval. In addition, the voice processing unit 31 calculates, for each sentence element, a feature quantity (acoustic characteristic) representing the vocal tract characteristics used to evaluate articulation. For example, the voice processing unit 31 cuts out the voice waveform data for each sentence element section, divides the cut-out data by windowing, and obtains the amplitude spectrum by Fourier analysis. It then applies a mel filter bank to the amplitude spectrum and applies a discrete cosine transform (DCT) to the logarithm of the filter bank outputs, thereby calculating MFCCs (mel-frequency cepstral coefficients) for each sentence element as the feature quantity representing the vocal tract characteristics.
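The MFCC pipeline described above (windowing, Fourier analysis, mel filter bank, logarithm, DCT) can be sketched for a single frame using only the standard library. This is a minimal illustration under assumptions not taken from the text: a Hamming window, 20 triangular filters, 12 coefficients, and the common 2595·log10(1 + f/700) mel formula. How per-frame coefficients are aggregated into one feature per sentence element is not specified in the text and is left out here.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def amplitude_spectrum(frame):
    """Apply a Hamming window and return |DFT| for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n)
                 for i, x in enumerate(windowed))
        im = -sum(x * math.sin(2 * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(math.hypot(re, im))
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2)
    # n_filters + 2 edge frequencies, expressed as fractional bin positions.
    edges = [mel_to_hz(mel_max * j / (n_filters + 1)) * n_fft / sample_rate
             for j in range(n_filters + 2)]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising slope
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, n_filters=20, n_coeffs=12):
    spectrum = amplitude_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    # A small floor avoids log(0) when a filter captures no energy.
    log_out = [math.log(max(sum(w * s for w, s in zip(row, spectrum)), 1e-10))
               for row in bank]
    # DCT-II of the log filter bank outputs.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_out))
            for c in range(n_coeffs)]


frame = [math.sin(2 * math.pi * 440 * i / 8000) for i in range(256)]
print(len(mfcc(frame, 8000)))
```

The direct DFT used here is quadratic in the frame length and is only meant to keep the sketch dependency-free; a practical implementation would use an FFT routine from a numerical library.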

The voice processing unit 31 also specifies, based on the model voice waveform data, a model sentence element section from the start timing to the end timing of each sentence element, and a model interval section from the end timing of any one of the sentence elements to the start timing of the next sentence element. Likewise, based on the speaker voice waveform data, the voice processing unit 31 specifies a speaker sentence element section from the start timing to the end timing of each sentence element, and a speaker interval section from the end timing of any one of the sentence elements to the start timing of the next sentence element.
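Once the sentence element sections are known, the interval sections follow directly from them. A minimal sketch, assuming each section is a `(start, end)` pair in seconds and using the hypothetical name `interval_sections`:

```python
from typing import List, Tuple

Section = Tuple[float, float]  # (start_sec, end_sec)


def interval_sections(element_sections: List[Section]) -> List[Section]:
    """Each interval runs from the end of one sentence element
    to the start of the next one."""
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _)
            in zip(element_sections, element_sections[1:])]


print(interval_sections([(0.0, 0.8), (1.2, 2.0), (2.5, 3.5)]))
```

The same function applies to both the model sections and the speaker sections; N sentence elements produce N - 1 intervals.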

Next, the reading-aloud evaluation unit 32 compares the time length of each model sentence element section with the time length of the corresponding speaker sentence element section and evaluates the speed at which the sentence is read aloud (the reading speed) for each sentence element section. For example, as the result of the comparison, the reading-aloud evaluation unit 32 calculates, for each sentence element section, the difference between the time length of the model sentence element section and that of the speaker sentence element section, and evaluates the reading speed by calculating an evaluation score from the absolute value of this difference. The reading-aloud evaluation unit 32 also evaluates the reading speed over all sentence element sections based on the per-section evaluations. For example, the mean of the reading-speed scores calculated for the individual sentence element sections is calculated as the total reading-speed score for all sentence element sections.
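The text does not give the formula that turns the absolute time difference into a score. The sketch below assumes a 100-point scale with a linear penalty per second of difference; that mapping, the `points_per_second` parameter, and the function names are illustrative assumptions only.

```python
from typing import List, Tuple

Section = Tuple[float, float]  # (start_sec, end_sec)


def speed_scores(model_sections: List[Section],
                 speaker_sections: List[Section],
                 points_per_second: float = 50.0) -> List[float]:
    """Per-section reading-speed scores from absolute duration differences."""
    scores = []
    for (m_start, m_end), (s_start, s_end) in zip(model_sections,
                                                  speaker_sections):
        diff = abs((m_end - m_start) - (s_end - s_start))
        # Assumed mapping: start at 100 and subtract a linear penalty.
        scores.append(max(0.0, 100.0 - points_per_second * diff))
    return scores


def total_score(scores: List[float]) -> float:
    """Total score over all sections: the mean of the per-section scores."""
    return sum(scores) / len(scores)


model = [(0.0, 1.0), (1.5, 3.5)]
speaker = [(0.0, 1.5), (2.0, 4.0)]
per_section = speed_scores(model, speaker)
print(per_section, total_score(per_section))
```

Because the score depends on the absolute difference, reading a section too fast and reading it too slowly by the same amount are penalized equally, which matches the description.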

In the second embodiment, as in the first embodiment, the reading-aloud evaluation unit 32 also compares the time length of the model interval section with the time length of the speaker interval section and evaluates the intervals (pauses) in the reading of the sentence. The reading-aloud evaluation unit 32 further compares the pitch in each model sentence element section with the pitch in the corresponding speaker sentence element section and evaluates the pitch for each sentence element. For example, as the result of the comparison, it calculates the difference between the pitch of the model sentence element section and the pitch of the speaker sentence element section and evaluates the pitch by calculating an evaluation score from this difference. It also evaluates the pitch over all sentence element sections based on the per-section evaluations; for example, the mean of the per-section pitch scores is calculated as the total pitch score for all sentence element sections. Similarly, the reading-aloud evaluation unit 32 compares the volume in each model sentence element section with the volume in the corresponding speaker sentence element section and evaluates the volume for each sentence element. It also evaluates the volume over all sentence element sections based on the per-section evaluations; for example, the mean of the per-section volume scores is calculated as the total volume score for all sentence element sections.

The reading-aloud evaluation unit 32 also compares the articulation in each model sentence element section with the articulation in the corresponding speaker sentence element section and evaluates the articulation for each sentence element. The articulation evaluation uses, for example, the feature quantity (MFCC) representing the vocal tract characteristics calculated for each sentence element. For example, as the result of the comparison, the reading-aloud evaluation unit 32 calculates the similarity between the feature quantity of the model sentence element section and that of the speaker sentence element section and evaluates the articulation by calculating an evaluation score from this similarity. It also evaluates the articulation over all sentence element sections based on the per-section evaluations; for example, the mean of the per-section articulation scores is calculated as the total articulation score for all sentence element sections.
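The text does not say which similarity measure is applied to the feature quantities. As one possible illustration, the sketch below assumes cosine similarity between two equal-length feature vectors and maps it to a 0 to 100 score; both choices, and the function names, are assumptions.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; treat as none
    return dot / (norm_a * norm_b)


def articulation_score(model_feature: List[float],
                       speaker_feature: List[float]) -> float:
    """Assumed mapping: negative similarity clips to 0, full similarity is 100."""
    return max(0.0, cosine_similarity(model_feature, speaker_feature)) * 100.0


print(articulation_score([3.0, 4.0], [3.0, 4.0]))
print(articulation_score([1.0, 0.0], [0.0, 1.0]))
```

Other measures, such as a distance computed after time-aligning the frame sequences, would fit the description equally well.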

As in the first embodiment, the reading-aloud evaluation unit 32 then performs a comprehensive evaluation of the reading of the sentence based on, for example, at least the reading-speed evaluation and the interval evaluation. In this comprehensive evaluation, for example, the mean (or alternatively the sum) of the reading-speed score for one sentence element section and the interval score for at least one of the interval sections before and after that sentence element section is calculated as the comprehensive score for one section. Alternatively, the reading-aloud evaluation unit 32 may perform the comprehensive evaluation for one section based on at least any two of the following: the reading-speed evaluation for one sentence element section, the interval evaluation for at least one of the interval sections before and after that sentence element section, the pitch evaluation for that sentence element section, the volume evaluation for that sentence element section, and the articulation evaluation for that sentence element section. In this case, for example, the mean (or alternatively the sum) of the scores calculated in those at least two evaluations is calculated as the comprehensive score for one section.

As in the first embodiment, the reading-aloud evaluation unit 32 also performs a comprehensive evaluation of the reading of the sentence based on, for example, at least the reading-speed evaluation over all sentence element sections and the interval evaluation over all interval sections. In this comprehensive evaluation, for example, the mean (or alternatively the sum) of the total reading-speed score for all sentence element sections and the total interval score for all interval sections is calculated as the comprehensive score for all sections. Alternatively, the reading-aloud evaluation unit 32 may perform the comprehensive evaluation based on at least any two of the following: the reading-speed evaluation over all sentence element sections, the interval evaluation over all interval sections, the pitch evaluation over all sentence element sections, the volume evaluation over all sentence element sections, and the articulation evaluation over all sentence element sections. In this case, for example, the mean (or alternatively the sum) of the total scores calculated in those at least two evaluations is calculated as the comprehensive score for all sections.
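The comprehensive score described above is a mean (or a sum) over two or more individual scores. A minimal sketch, assuming the individual scores are passed in a dictionary keyed by hypothetical names such as "speed" and "interval":

```python
from typing import Dict


def overall_score(scores: Dict[str, float], use_sum: bool = False) -> float:
    """Combine two or more evaluation scores into a comprehensive score."""
    if len(scores) < 2:
        raise ValueError("at least two evaluations are required")
    total = sum(scores.values())
    return total if use_sum else total / len(scores)


print(overall_score({"speed": 80.0, "interval": 90.0}))
print(overall_score({"speed": 80.0, "interval": 90.0, "pitch": 70.0}))
```

The same function serves for the one-section case (per-section scores as input) and the all-sections case (total scores as input).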

As in the first embodiment, the display processing unit 33 displays at least one of the reading-speed evaluation, the interval evaluation, and the comprehensive evaluation on the screen of the display D. This makes the speaker aware of whether the reading speed and the pausing are appropriate and enables effective practice and training. For example, as the reading-speed evaluation, the reading-speed score for each sentence element section may be displayed, or the total reading-speed score for all sentence element sections may be displayed. As the interval evaluation, the interval score for each interval section may be displayed, or the total interval score for all interval sections may be displayed. As the comprehensive evaluation, the comprehensive score for one section may be displayed, or the comprehensive score for all sections may be displayed. In addition to the above evaluations, the display processing unit 33 may display at least one of the pitch evaluation, the volume evaluation, and the articulation evaluation on the screen of the display D. In this case too, the score for each sentence element section may be displayed as the pitch, volume, or articulation evaluation, or the total score for all sentence element sections may be displayed. Furthermore, as in the first embodiment, the display processing unit 33 displays on one screen, in a comparable manner, all or part of a model graph representing the time-series change in at least one of the pitch and the volume calculated at predetermined time intervals from the model voice waveform data, and all or part of a speaker graph representing the time-series change in at least one of the pitch and the volume calculated at predetermined time intervals from the speaker voice waveform data. This makes the speaker aware of whether the pitch and volume, in addition to the reading speed and the pausing, are appropriate and enables effective practice and training.

In addition, using the same display processing as in the first embodiment, the display processing unit 33 displays on one screen, so that they can be compared, a model graph portion corresponding to at least one model sentence element section in the model graph and a speaker graph portion corresponding to at least one speaker sentence element section in the speaker graph. In doing so, the display processing unit 33 displays the model graph portion and the speaker graph portion on the one screen so that the model interval section and the speaker interval section can be compared. Then, in response to a display switching instruction, the display processing unit 33 switches the model graph portion displayed on the screen to a model graph portion that is not displayed (that is, a model graph portion corresponding to another model sentence element section in the model graph), and switches the speaker graph portion displayed on the screen to a speaker graph portion that is not displayed (that is, a speaker graph portion corresponding to another speaker sentence element section in the speaker graph).
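The switching behavior described above can be sketched as a small pager that keeps the model and speaker graph portions in step. This is only an illustration under stated assumptions: the class name `GraphPager`, the representation of sections as `(start, end)` index ranges into the graph samples, and the clamping at the first and last section are hypothetical and not taken from the specification.

```python
from typing import List, Sequence, Tuple

Section = Tuple[int, int]  # (start index, end index), end exclusive; assumed representation


class GraphPager:
    """Holds paired model/speaker graphs and exposes one sentence element section at a time."""

    def __init__(
        self,
        model_graph: Sequence[float],
        speaker_graph: Sequence[float],
        model_sections: List[Section],
        speaker_sections: List[Section],
    ) -> None:
        if len(model_sections) != len(speaker_sections):
            raise ValueError("model and speaker must have the same number of sections")
        self.model_graph = model_graph
        self.speaker_graph = speaker_graph
        self.model_sections = model_sections
        self.speaker_sections = speaker_sections
        self.index = 0  # section currently shown on the screen

    def visible(self) -> Tuple[Sequence[float], Sequence[float]]:
        """Return the model and speaker graph portions for the current section."""
        ms, me = self.model_sections[self.index]
        ss, se = self.speaker_sections[self.index]
        return self.model_graph[ms:me], self.speaker_graph[ss:se]

    def switch(self, step: int = 1) -> Tuple[Sequence[float], Sequence[float]]:
        """Handle a display switching instruction; both portions change together."""
        last = len(self.model_sections) - 1
        self.index = max(0, min(last, self.index + step))
        return self.visible()


if __name__ == "__main__":
    pager = GraphPager(
        model_graph=[1, 2, 3, 4, 5, 6],
        speaker_graph=[10, 20, 30, 40, 50, 60, 70],
        model_sections=[(0, 2), (3, 6)],
        speaker_sections=[(0, 3), (4, 7)],
    )
    print(pager.visible())
    print(pager.switch(1))
```

Because each slice starts at its own section start, drawing both returned portions from the same origin also lines up their start positions, which makes differences in section length easy to see.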

DESCRIPTION OF SYMBOLS
1 Communication unit
2 Storage unit
3 Control unit
4 Operation unit
5 Interface unit
6 Bus
31 Voice processing unit
32 Reading aloud evaluation unit
33 Display processing unit
S Reading aloud evaluation apparatus

Claims

A reading aloud evaluation apparatus comprising:
storage means for storing a first phrase section and a first interval section, the first phrase section being specified for each phrase based on first voice waveform data indicating a waveform of a model voice for reading aloud a sentence including a plurality of phrases and extending from the start timing to the end timing of the phrase, and the first interval section being specified based on the first voice waveform data and extending from the end timing of any one of the plurality of phrases to the start timing of the next phrase;
input means for inputting second voice waveform data indicating a waveform of a voice uttered when a speaker reads the sentence aloud;
specifying means for specifying, based on the second voice waveform data, a second phrase section from the start timing to the end timing of each phrase, and a second interval section from the end timing of any one of the plurality of phrases to the start timing of the next phrase;
speed evaluation means for evaluating, for each phrase, the speed at which the sentence is read aloud by comparing the time length of the first phrase section stored in the storage means with the time length of the second phrase section specified by the specifying means;
interval evaluation means for evaluating the pauses taken when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified by the specifying means;
comprehensive evaluation means for performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation by the speed evaluation means and the interval evaluation by the interval evaluation means; and
display control means for displaying at least one of the speed evaluation by the speed evaluation means, the interval evaluation by the interval evaluation means, and the comprehensive evaluation by the comprehensive evaluation means.

The reading aloud evaluation apparatus according to claim 1, further comprising:
first calculation means for calculating, at predetermined time intervals, sound information of at least one of pitch and volume based on the first voice waveform data; and
second calculation means for calculating, at predetermined time intervals, sound information of at least one of pitch and volume based on the second voice waveform data,
wherein the display control means displays on one screen, so that they can be compared, all or part of a first graph representing the time-series change of the sound information calculated by the first calculation means and all or part of a second graph representing the time-series change of the sound information calculated by the second calculation means.

The reading aloud evaluation apparatus according to claim 2, wherein the display control means displays on one screen, so that they can be compared, a first graph portion corresponding to at least one first phrase section in the first graph and a second graph portion corresponding to at least one second phrase section in the second graph, and, in response to a display switching instruction, switches the first graph portion displayed on the one screen to a graph portion that is not displayed on the one screen and corresponds to another first phrase section in the first graph, and switches the second graph portion displayed on the one screen to a graph portion that is not displayed on the one screen and corresponds to another second phrase section in the second graph.

The reading aloud evaluation apparatus according to claim 3, wherein the display control means displays the first graph portion and the second graph portion on the one screen so that the first interval section stored in the storage means and the second interval section specified by the specifying means can be compared.

The reading aloud evaluation apparatus according to claim 2, wherein the display control means displays on one screen, so that they can be compared, a first graph portion of the first graph corresponding to at least one first phrase section and the first interval section following that first phrase section, and a second graph portion of the second graph corresponding to at least one second phrase section and the second interval section following that second phrase section.

The reading aloud evaluation apparatus according to claim 3, wherein the display control means displays the first graph portion and the second graph portion on the one screen with the start position of the first phrase section and the start position of the second phrase section aligned in a predetermined direction.

The reading aloud evaluation apparatus according to any one of claims 1 to 6, wherein the display control means displays the comprehensive evaluation for one section and the comprehensive evaluation for a plurality of sections on a screen of the same hierarchy.

The reading aloud evaluation apparatus according to any one of claims 1 to 6, wherein the display control means displays the comprehensive evaluation for one section and the comprehensive evaluation for a plurality of sections on screens of different hierarchies.

The reading aloud evaluation apparatus according to any one of claims 1 to 8, wherein the storage means stores the first voice waveform data, and
the specifying means specifies, based on the first voice waveform data stored in the storage means, the first phrase section for each phrase and the first interval section.

A reading aloud evaluation method executed by one or more computers, the method comprising:
a storage step of storing, in storage means, a first phrase section and a first interval section, the first phrase section being specified for each phrase based on first voice waveform data indicating a waveform of a model voice for reading aloud a sentence including a plurality of phrases and extending from the start timing to the end timing of the phrase, and the first interval section being specified based on the first voice waveform data and extending from the end timing of any one of the plurality of phrases to the start timing of the next phrase;
an input step of inputting second voice waveform data indicating a waveform of a voice uttered when a speaker reads the sentence aloud;
a specifying step of specifying, based on the second voice waveform data, a second phrase section from the start timing to the end timing of each phrase, and a second interval section from the end timing of any one of the plurality of phrases to the start timing of the next phrase;
a speed evaluation step of evaluating, for each phrase, the speed at which the sentence is read aloud by comparing the time length of the first phrase section stored in the storage means with the time length of the second phrase section specified in the specifying step;
an interval evaluation step of evaluating the pauses taken when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified in the specifying step;
a comprehensive evaluation step of performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation in the speed evaluation step and the interval evaluation in the interval evaluation step; and
a control step of displaying at least one of the speed evaluation in the speed evaluation step, the interval evaluation in the interval evaluation step, and the comprehensive evaluation in the comprehensive evaluation step.

A program that causes a computer to execute:
a storage step of storing, in storage means, a first phrase section and a first interval section, the first phrase section being specified for each phrase based on first voice waveform data indicating a waveform of a model voice for reading aloud a sentence including a plurality of phrases and extending from the start timing to the end timing of the phrase, and the first interval section being specified based on the first voice waveform data and extending from the end timing of any one of the plurality of phrases to the start timing of the next phrase;
an input step of inputting second voice waveform data indicating a waveform of a voice uttered when a speaker reads the sentence aloud;
a specifying step of specifying, based on the second voice waveform data, a second phrase section from the start timing to the end timing of each phrase, and a second interval section from the end timing of any one of the plurality of phrases to the start timing of the next phrase;
a speed evaluation step of evaluating, for each phrase, the speed at which the sentence is read aloud by comparing the time length of the first phrase section stored in the storage means with the time length of the second phrase section specified in the specifying step;
an interval evaluation step of evaluating the pauses taken when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified in the specifying step;
a comprehensive evaluation step of performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation in the speed evaluation step and the interval evaluation in the interval evaluation step; and
a control step of displaying at least one of the speed evaluation in the speed evaluation step, the interval evaluation in the interval evaluation step, and the comprehensive evaluation in the comprehensive evaluation step.

A reading aloud evaluation apparatus comprising:
storage means for storing a first sentence element section and a first interval section, the first sentence element section being specified for each sentence element based on first voice waveform data indicating a waveform of a model voice for reading aloud a sentence including a plurality of sentence elements and extending from the start timing to the end timing of the sentence element, and the first interval section being specified based on the first voice waveform data and extending from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element;
input means for inputting second voice waveform data indicating a waveform of a voice uttered when a speaker reads the sentence aloud;
specifying means for specifying, based on the second voice waveform data, a second sentence element section from the start timing to the end timing of each sentence element, and a second interval section from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element;
speed evaluation means for evaluating, for each sentence element, the speed at which the sentence is read aloud by comparing the time length of the first sentence element section stored in the storage means with the time length of the second sentence element section specified by the specifying means;
interval evaluation means for evaluating the pauses taken when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified by the specifying means;
comprehensive evaluation means for performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation by the speed evaluation means and the interval evaluation by the interval evaluation means; and
display control means for displaying at least one of the speed evaluation by the speed evaluation means, the interval evaluation by the interval evaluation means, and the comprehensive evaluation by the comprehensive evaluation means.

A reading aloud evaluation method executed by one or more computers, the method comprising:
a storage step of storing, in storage means, a first sentence element section and a first interval section, the first sentence element section being specified for each sentence element based on first voice waveform data indicating a waveform of a model voice for reading aloud a sentence including a plurality of sentence elements and extending from the start timing to the end timing of the sentence element, and the first interval section being specified based on the first voice waveform data and extending from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element;
an input step of inputting second voice waveform data indicating a waveform of a voice uttered when a speaker reads the sentence aloud;
a specifying step of specifying, based on the second voice waveform data, a second sentence element section from the start timing to the end timing of each sentence element, and a second interval section from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element;
a speed evaluation step of evaluating, for each sentence element, the speed at which the sentence is read aloud by comparing the time length of the first sentence element section stored in the storage means with the time length of the second sentence element section specified in the specifying step;
an interval evaluation step of evaluating the pauses taken when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified in the specifying step;
a comprehensive evaluation step of performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation in the speed evaluation step and the interval evaluation in the interval evaluation step; and
a control step of displaying at least one of the speed evaluation in the speed evaluation step, the interval evaluation in the interval evaluation step, and the comprehensive evaluation in the comprehensive evaluation step.

A program that causes a computer to execute:
a storage step of storing, in storage means, a first sentence element section and a first interval section, the first sentence element section being specified for each sentence element based on first voice waveform data indicating a waveform of a model voice for reading aloud a sentence including a plurality of sentence elements and extending from the start timing to the end timing of the sentence element, and the first interval section being specified based on the first voice waveform data and extending from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element;
an input step of inputting second voice waveform data indicating a waveform of a voice uttered when a speaker reads the sentence aloud;
a specifying step of specifying, based on the second voice waveform data, a second sentence element section from the start timing to the end timing of each sentence element, and a second interval section from the end timing of any one of the plurality of sentence elements to the start timing of the next sentence element;
a speed evaluation step of evaluating, for each sentence element, the speed at which the sentence is read aloud by comparing the time length of the first sentence element section stored in the storage means with the time length of the second sentence element section specified in the specifying step;
an interval evaluation step of evaluating the pauses taken when the sentence is read aloud by comparing the time length of the first interval section stored in the storage means with the time length of the second interval section specified in the specifying step;
a comprehensive evaluation step of performing a comprehensive evaluation of the reading aloud of the sentence based on at least the speed evaluation in the speed evaluation step and the interval evaluation in the interval evaluation step; and
a control step of displaying at least one of the speed evaluation in the speed evaluation step, the interval evaluation in the interval evaluation step, and the comprehensive evaluation in the comprehensive evaluation step.