JP2011242465A

JP2011242465A - Speech element database creating device, alternative speech model creating device, speech synthesizer, speech element database creating method, alternative speech model creating method, program

Info

Publication number: JP2011242465A
Application number: JP2010112373A
Authority: JP
Inventors: Mitsuaki Isogai; 光昭磯貝; Hideyuki Mizuno; 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-05-14
Filing date: 2010-05-14
Publication date: 2011-12-01
Anticipated expiration: 2030-05-14
Also published as: JP5449022B2

Abstract

PROBLEM TO BE SOLVED: To create a complete speech element database by creating alternative speech models even when not all of the speech models can be created from a speech waveform database. SOLUTION: A speech element database creating device comprises: a phoneme-diphone section converting section 1200 receiving the input of speech waveform data and outputting a diphone section and a diphone label; a speech parameter series converting section 1300 receiving the input of the speech waveform data and outputting a speech parameter series; a speech model creating section 1400 receiving the input of the speech parameter series and outputting a speech model; a missing diphone label output section 1600 receiving the input of the diphone label and a defined diphone label list 1500 and outputting a missing diphone label; a half-phone creating section 1800 receiving the input of the speech model and the diphone label and outputting a half-phone; and an alternative speech model creating section 1900 receiving the input of the half-phone and the missing diphone label and outputting an alternative speech model.

Description

The present invention relates to a speech segment database creation device that creates, from a speech waveform database of recorded human speech, a speech segment database usable for text-to-speech synthesis.

A known technique generates synthesized speech by concatenating speech synthesis units, that is, units of speech shorter than a word such as phonemes, syllables, or phoneme sequences. The speech synthesis units are collected from a speech waveform database that records a speaker reading a sentence list aloud. The following speech synthesis units are known (see Non-Patent Documents 1 and 2).

Phoneme unit: Vowels and consonants are used as speech synthesis units. Only a small total number of speech synthesis units has to be collected. However, because the units carry no coarticulation information, the quality of the synthesized speech is low.

Syllable (CV) unit: A combination of a consonant and a vowel is used as a speech synthesis unit. This unit suits Japanese syllables and preserves the coarticulation of the transition from consonant to vowel. Only a small total number of speech synthesis units has to be collected. However, because coarticulation information before and after the syllable (CV) unit is not included, the quality of the synthesized speech is still low.

Diphone unit: A combination of two phonemes, such as CV, VC, or VV, is used as a speech synthesis unit. Speech synthesis units are concatenated at the centers of phonemes. Because the units cover all coarticulation that occurs in Japanese, more speech synthesis units are required than with syllable (CV) units, but the synthesized speech is of high quality.

Combined use of phoneme units and diphone units (see Non-Patent Document 3): Diphone units are used as speech synthesis units where vowels are joined to each other, and phoneme units are used for all other joins. When diphone units are used, the speech synthesis units are concatenated at the phoneme center. When phoneme units are used, they are concatenated at the phoneme boundary. Where vowels are joined, concatenating at the phoneme center gives a smoother result than concatenating at the phoneme boundary. This method therefore concatenates speech synthesis units at whichever of the phoneme boundary and the phoneme center gives the smoother join. Synthesis is more natural than when only phoneme units are used, and the synthesized speech is of high quality. However, the amount of segment search processing during speech synthesis is large.

In addition, context-dependent phonemes (triphones) and variable-length synthesis units, in which an appropriate speech synthesis unit is selected from a speech corpus at each synthesis, have been proposed.

Non-Patent Document 1: Masanobu Abe, "Trends in Corpus-Based Speech Synthesis Technology [II]", Journal of the IEICE, The Institute of Electronics, Information and Communication Engineers, February 2004, Vol. 87, No. 2, pp. 129-134.
Non-Patent Document 2: Tsunehiko Koike, "Speech Information Engineering", NTT Advanced Technology, 1987, pp. 66-67.
Non-Patent Document 3: Tomoki Toda, Hisashi Kawai, Minoru Tsuzaki, Kiyohiro Shikano, "Segment Selection Based on Phoneme Units and Diphone Units in Concatenative Japanese Text-to-Speech Synthesis", IEICE Transactions, The Institute of Electronics, Information and Communication Engineers, December 2002, D-II, Vol. J85-D-II, No. 12, pp. 1760-1770.

When diphone units are used as speech synthesis units, high-quality synthesized speech can be obtained without the total number of required units growing much beyond that needed for syllable (CV) units. However, when the units required for diphone-based synthesis are collected from a speech waveform database that records a speaker reading a sentence list aloud, the sentence list may not be large enough, and it may not be possible to collect all of the required speech synthesis units from the database. In that case, a speech segment database holding all required speech synthesis units as speech models cannot be created, and with such an incomplete speech segment database it is not possible to produce synthesized speech without dropouts.

With the combined phoneme and diphone method described above, diphone units are needed only for vowel-to-vowel joins, so the total number of speech synthesis units is small. Even with a small sentence list, it is therefore easy to obtain all of the required speech synthesis units. However, the range of the segment search during speech synthesis widens, which increases the amount of segment search processing.

The present invention provides a speech segment database creation device that, when not all of the required speech models can be generated from the speech waveform database, generates alternative speech models and thereby produces a complete speech segment database. The speech segment database creation device of the present invention comprises a phoneme-diphone section conversion unit, a speech parameter series conversion unit, a speech model generation unit, a missing diphone label output unit, a half-phone generation unit, and an alternative speech model generation unit.

The phoneme-diphone section conversion unit receives speech waveform data in which a phoneme label is assigned to each phoneme section. For any two adjacent phoneme sections, it joins the second half of the preceding phoneme section to the first half of the following phoneme section to form a diphone section, joins the phoneme label of the preceding section to the phoneme label of the following section to form a diphone label, and outputs the diphone section in association with the diphone label.

The speech parameter series conversion unit receives the speech waveform data, the diphone labels, and the diphone sections. For each diphone section, it converts the speech waveform data into a speech parameter for every frame of a fixed length, treats the sequence of speech parameters for the diphone section as a speech parameter series, and outputs the speech parameter series in association with the diphone section.

The speech model generation unit receives the speech parameter series, the diphone labels, and the diphone sections. For each diphone section, it selects one or more speech parameters from the speech parameter series associated with that section as representative patterns, generates a speech model consisting of those representative patterns, and outputs the speech model in association with the diphone label of that diphone section.

The missing diphone label output unit receives the diphone labels and a defined diphone label list, and outputs as missing diphone labels those labels that are present in the defined diphone label list but were not input as diphone labels.

The half-phone generation unit receives the speech models and the diphone labels and divides each speech model into a first half and a second half, each of which becomes a half-phone. It takes the first half of the diphone label associated with the divided speech model as a half-phone label and outputs it in association with the half-phone formed from the first half of the model. Likewise, it takes the second half of the diphone label as a half-phone label and outputs it in association with the half-phone formed from the second half of the model.

The alternative speech model generation unit receives the half-phones, the half-phone labels, and the missing diphone labels. For any missing diphone label, it concatenates a half-phone whose half-phone label is identical or similar to the first half of the missing diphone label with a half-phone whose half-phone label is identical or similar to the second half of the missing diphone label, and outputs the result as an alternative speech model.
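The half-phone generation and concatenation steps described above can be sketched as follows. This is only an illustration under stated assumptions: a speech model is represented as a list of state vectors, the split point for an odd number of states (the extra state goes to the first half) is a guess, and the first available candidate is used. None of the function names come from the patent.

```python
from typing import Dict, List, Optional, Tuple

Model = List[List[float]]  # one vector per state (assumed representation)

def make_halfphones(models: Dict[str, Model]) -> Tuple[Dict[str, List[Model]], Dict[str, List[Model]]]:
    """Split every diphone model into a first-half and a second-half half-phone.

    The first half is indexed by the left phoneme label of the diphone,
    the second half by the right phoneme label.
    """
    first: Dict[str, List[Model]] = {}
    second: Dict[str, List[Model]] = {}
    for label, states in models.items():
        left, right = label.split(":")
        cut = (len(states) + 1) // 2  # assumption: extra state goes to the first half
        first.setdefault(left, []).append(states[:cut])
        second.setdefault(right, []).append(states[cut:])
    return first, second

def make_alternative(missing_label: str,
                     first: Dict[str, List[Model]],
                     second: Dict[str, List[Model]]) -> Optional[Model]:
    """Concatenate a matching first-half and second-half half-phone."""
    left, right = missing_label.split(":")
    heads, tails = first.get(left), second.get(right)
    if not heads or not tails:
        return None  # no identical label available; a similar label would be needed
    return heads[0] + tails[0]  # assumption: simply take the first candidate

models = {"a:k": [[1.0], [2.0], [3.0]], "s:o": [[4.0], [5.0], [6.0]]}
first, second = make_halfphones(models)
alt = make_alternative("a:o", first, second)
```

Here the missing diphone "a:o" is assembled from the first half of "a:k" and the second half of "s:o".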

Because alternative speech models are generated by concatenating half-phones, a complete speech segment database can be produced even when not all of the required speech models can be generated from the speech waveform database, and synthesized speech without dropouts can be created with that database. In addition, because suitable alternative speech models are generated in advance when the speech segment database is created, an increase in the amount of segment search processing is avoided.

In the speech segment database creation device of the present invention, when there is more than one half-phone whose half-phone label is identical to the first half of a missing diphone label, more than one half-phone whose half-phone label is identical to the second half of that missing diphone label, or both, the alternative speech model generation unit may select, as the pair to be concatenated for that missing diphone label, the combination that minimizes the F0 gap between the first-half half-phone and the second-half half-phone.

This reduces the change in F0 at the join of the alternative speech model, so that synthesized speech using the alternative speech model is of higher quality.
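A minimal sketch of the minimum-F0-gap selection follows. How the F0 gap is measured is not spelled out in this passage, so the sketch assumes each half-phone carries a single representative F0 value in Hz and takes the gap to be the absolute difference; the candidate identifiers and F0 values are made up for illustration.

```python
from itertools import product
from typing import List, Tuple

Candidate = Tuple[str, float]  # (half-phone id, representative F0 in Hz), assumed form

def select_min_f0_gap(first: List[Candidate], second: List[Candidate]) -> Tuple[Candidate, Candidate]:
    """Return the (first-half, second-half) pair with the smallest F0 gap."""
    return min(product(first, second), key=lambda pair: abs(pair[0][1] - pair[1][1]))

first_half = [("a1", 120.0), ("a2", 180.0)]
second_half = [("o1", 150.0), ("o2", 175.0)]
best = select_min_f0_gap(first_half, second_half)
```

With these values the pair ("a2", "o2") is chosen, because its gap of 5 Hz is the smallest of the four combinations.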

Alternatively, when there is more than one half-phone whose half-phone label is identical to the first half of a missing diphone label, more than one half-phone whose half-phone label is identical to the second half of that missing diphone label, or both, the alternative speech model generation unit may classify the first-half half-phones and the second-half half-phones into two or more categories delimited by predefined F0 ranges, and select, as pairs to be concatenated for that missing diphone label, combinations of a first-half half-phone and a second-half half-phone that fall into the same or adjacent categories.

This reduces the change in F0 at the join of the alternative speech model, so that synthesized speech using the alternative speech model is of higher quality. A speech segment database created in this way is also more likely to hold, for the same missing diphone label, several alternative speech models with different average F0 values (at most as many as there are categories). When speech is synthesized with this database, it is therefore more likely that a speech model with an F0 value close to the F0 produced by prosody generation can be selected from the database, and the smaller F0 change further improves the quality of the synthesized speech. Moreover, because only half-phones classified into the same or adjacent categories are combined into alternative speech models, the number of alternative speech models stored in the speech segment database is far smaller than if every combination of half-phones were stored, which avoids increases in segment search time and database size.
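The category-based pairing can be sketched as below. The boundary values, the candidate F0 values, and the `max_distance` parameter (0 for the same category only, 1 to also allow adjacent categories) are assumptions introduced for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

Candidate = Tuple[str, float]  # (half-phone id, representative F0 in Hz), assumed form

def f0_category(f0: float, boundaries: List[float]) -> int:
    """Index of the F0 range containing f0; boundaries must be sorted ascending."""
    return bisect_right(boundaries, f0)

def pair_by_category(first: List[Candidate], second: List[Candidate],
                     boundaries: List[float], max_distance: int = 0) -> List[Tuple[str, str]]:
    """Pair half-phones whose F0 categories differ by at most max_distance."""
    pairs = []
    for head_id, head_f0 in first:
        for tail_id, tail_f0 in second:
            gap = abs(f0_category(head_f0, boundaries) - f0_category(tail_f0, boundaries))
            if gap <= max_distance:
                pairs.append((head_id, tail_id))
    return pairs

boundaries = [130.0, 170.0]  # three categories: below 130, 130 to 170, 170 and above
first_half = [("a1", 120.0), ("a2", 180.0)]
second_half = [("o1", 150.0), ("o2", 175.0)]
same_only = pair_by_category(first_half, second_half, boundaries)
same_or_adjacent = pair_by_category(first_half, second_half, boundaries, max_distance=1)
```

Only pairs that pass the category test are stored, which is what keeps the number of alternative speech models well below the full cross product.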

Further, when no half-phone exists whose half-phone label is identical to the first half of a missing diphone label, or none exists whose half-phone label is identical to the second half of a missing diphone label, the alternative speech model generation unit may select, in place of the non-existent half-phone, the half-phone whose distance in a predefined inter-phoneme distance matrix is smallest as the half-phone to be concatenated for that missing diphone label.

Alternative speech models can thus be generated even for phonemes that do not occur at all in the speech waveform database. A speech segment database holding all of the required speech models can therefore be produced, and synthesized speech without dropouts can be created with that database.
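The fallback to a similar phoneme can be sketched as a nearest-neighbour lookup in the distance matrix. The matrix is represented here as a dictionary keyed by label pairs, and the labels and distance values are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

def nearest_available_label(target: str,
                            available: Iterable[str],
                            distance: Dict[Tuple[str, str], float]) -> Optional[str]:
    """Return target if a half-phone with that label exists, otherwise the
    available label with the smallest inter-phoneme distance to target."""
    labels = set(available)
    if target in labels:
        return target
    candidates = [label for label in labels if (target, label) in distance]
    if not candidates:
        return None
    return min(candidates, key=lambda label: distance[(target, label)])

# Hypothetical distances from the phoneme "ky" to other phonemes.
distance = {("ky", "k"): 0.2, ("ky", "g"): 0.6, ("ky", "t"): 0.5}
substitute = nearest_available_label("ky", ["g", "t"], distance)
```

When no half-phone labeled "ky" exists and only "g" and "t" are available, "t" is chosen because its distance is smaller.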

If speech models that already carry diphone labels are supplied as input, the phoneme-diphone section conversion unit, the speech parameter series conversion unit, and the speech model generation unit can be omitted. The present invention therefore also provides an alternative speech model creation device comprising a missing diphone label output unit, a half-phone generation unit, and an alternative speech model generation unit. Each unit of this alternative speech model creation device operates in the same way as the unit of the same name in the speech segment database creation device.

The present invention also provides a speech synthesizer that synthesizes speech from text using a speech segment database created by the speech segment database creation device or the alternative speech model creation device described above. The speech synthesizer has a text analysis unit, a prosody generation unit, a speech model selection unit, and a speech synthesis unit.

The text analysis unit receives text and outputs a reading, accents, and a phoneme sequence. The prosody generation unit receives the reading and accents and outputs F0, power, and phoneme durations. The speech model selection unit receives the F0, power, and phoneme sequence, and selects and outputs speech models from the speech segment database. The speech synthesis unit receives the speech models, F0, power, and phoneme durations and outputs synthesized speech.
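The selection step can be illustrated with a small sketch. The passage does not state the selection criterion, so the sketch assumes that, among the models stored for a diphone label, the one whose average F0 is closest to the target F0 from prosody generation is chosen; power is ignored, and the database layout and entry names are hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[float, str]  # (average F0 in Hz, model identifier), assumed layout

def select_models(diphone_labels: List[str],
                  target_f0: List[float],
                  database: Dict[str, List[Entry]]) -> List[str]:
    """For each diphone, pick the stored model whose average F0 is nearest the target."""
    chosen = []
    for label, f0 in zip(diphone_labels, target_f0):
        entries = database[label]  # a complete database has an entry for every label
        chosen.append(min(entries, key=lambda entry: abs(entry[0] - f0))[1])
    return chosen

database = {
    "a:o": [(120.0, "a:o_low"), (180.0, "a:o_high")],
    "o:Sil": [(150.0, "o:Sil_mid")],
}
selected = select_models(["a:o", "o:Sil"], [170.0, 100.0], database)
```

Because every required diphone label has at least one model, including alternative speech models, the lookup never has to search other unit types at synthesis time.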

Because the speech segment database is created by concatenating half-phones into alternative speech models, a complete speech segment database can be produced even when not all of the required speech models can be generated from the speech waveform database, and synthesized speech without dropouts can be created with that database. In addition, because suitable alternative speech models are generated in advance when the speech segment database is created, an increase in the amount of segment search processing is avoided.

As described above, according to the present invention, even when not all of the required speech models can be generated from the speech waveform database, a complete speech segment database can be produced by generating alternative speech models, and synthesized speech without dropouts can be created with that database. In addition, because suitable alternative speech models are generated in advance when the speech segment database is created, an increase in the amount of segment search processing is avoided.

FIG. 1 is a block diagram showing the configuration of the speech segment database creation device and the alternative speech model creation device.
FIG. 2 is a flowchart showing the operation of the speech segment database creation device and the alternative speech model creation device.
FIG. 3 is a diagram showing an example of the output of the phoneme segmentation unit and the phoneme-diphone section conversion unit of the speech segment database creation device.
FIG. 4 is a diagram showing an example of the output of the speech parameter series conversion unit and the speech model generation unit of the speech segment database creation device.
FIG. 5 is a block diagram showing the configuration of the alternative speech model generation unit of the speech segment database creation device.
FIG. 6 is a diagram showing an example of the output of the half-phone selection means of the alternative speech model generation unit of the speech segment database creation device.
FIG. 7 is a block diagram showing the configuration of the speech synthesizer.
FIG. 8 is a flowchart showing the operation of the speech synthesizer.
FIG. 9 is a diagram showing an example of the output of the speech synthesis unit of the speech synthesizer.
FIG. 10 is a table showing examples of the speech models and alternative speech models stored in the speech segment database.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference number, and duplicate description is omitted.

The speech segment database creation device and the speech segment database creation method of the present invention are described with reference to FIGS. 1 to 6 and FIG. 10. The speech segment database creation device 1000 shown in FIG. 1 has a phoneme segmentation unit 1100, a phoneme-diphone section conversion unit 1200, a speech parameter series conversion unit 1300, a speech model generation unit 1400, a defined diphone label list 1500, a missing diphone label output unit 1600, a half-phone generation unit 1800, and an alternative speech model generation unit 1900. The phoneme segmentation unit 1100 shown in FIG. 3 has phoneme section dividing means 1110 and phoneme label assigning means 1120. The phoneme-diphone section conversion unit 1200 has diphone section dividing means 1210 and diphone label assigning means 1220. The alternative speech model generation unit 1900 shown in FIG. 5 has half-phone placement means 1910, a missing diphone label list 1920, decision tree determination means 1930, an inter-phoneme distance matrix table 1940, half-phone selection means 1950, and half-phone concatenation means 1960.

Referring to FIGS. 2 and 3, in the phoneme segmentation unit 1100, the phoneme section dividing means 1110 receives speech waveform data 1111 from the speech waveform database 91, divides the speech waveform data 1111 into phoneme sections 1112, and outputs the speech waveform data 1111 in association with the phoneme sections 1112. The phoneme label assigning means 1120 receives the speech waveform data 1111 and the phoneme sections 1112, assigns a phoneme label 1121 to each phoneme section 1112, and outputs the speech waveform data 1111, the phoneme sections 1112, and the phoneme labels 1121 in association with one another (S1100). A conventional method known for automatic segmentation (Reference Patent Document 1: JP 2004-77901 A) can be used for this processing.

In the phoneme-diphone section conversion unit 1200, the diphone section dividing means 1210 receives the speech waveform data 1111 in which a phoneme label 1121 is assigned to each phoneme section 1112. For any two adjacent phoneme sections, it joins the second half of the preceding phoneme section to the first half of the following phoneme section to form a diphone section 1212, and outputs the diphone section 1212 and the phoneme labels 1121 in association with the speech waveform data 1111. The diphone label assigning means 1220 receives the diphone section 1212, the phoneme labels 1121, and the speech waveform data 1111, joins the phoneme label of the preceding phoneme section to the phoneme label of the following phoneme section to form a diphone label 1221, and outputs the diphone section 1212 in association with the diphone label 1221 (S1200).

FIG. 3 illustrates the operation of the phoneme segmentation unit 1100 and the phoneme-diphone section conversion unit 1200 when the speech waveform data "ONSe" is input to the phoneme segmentation unit 1100. "Sil" in FIG. 3 is the phoneme label of a silent section. A silent section labeled "Sil" is not divided into a first half and a second half; instead, the entire silent section is joined to the first half of the following phoneme section or to the second half of the preceding phoneme section to form a diphone section 1212. "Sil:O", "O:N", "N:S", "S:e", and "e:Sil" in FIG. 3 are all diphone labels 1221.
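The conversion from labeled phoneme sections to diphone sections, including the special handling of silence, can be sketched as follows. The sketch assumes that "half" means the temporal midpoint of a phoneme section and that times are integer milliseconds; the section boundaries below are invented for the "ONSe" example.

```python
from typing import List, Tuple

Section = Tuple[str, int, int]  # (label, start_ms, end_ms)

def phonemes_to_diphones(phonemes: List[Section], silence: str = "Sil") -> List[Section]:
    """Join the second half of each phoneme section to the first half of the next.

    Silent sections are not split: the whole silence is attached to the
    neighbouring half phoneme.
    """
    diphones = []
    for (l1, s1, e1), (l2, s2, e2) in zip(phonemes, phonemes[1:]):
        start = s1 if l1 == silence else (s1 + e1) // 2  # assumed midpoint split
        end = e2 if l2 == silence else (s2 + e2) // 2
        diphones.append((f"{l1}:{l2}", start, end))
    return diphones

phonemes = [("Sil", 0, 100), ("O", 100, 300), ("N", 300, 400),
            ("S", 400, 500), ("e", 500, 700), ("Sil", 700, 800)]
diphones = phonemes_to_diphones(phonemes)
```

For this input the labels come out as "Sil:O", "O:N", "N:S", "S:e", and "e:Sil", matching the labels listed for FIG. 3.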

Referring to FIGS. 2 and 4, the speech parameter series conversion unit 1300 receives the speech waveform data 1111, the diphone labels 1221, and the diphone sections 1212. For each diphone section 1212, it converts the speech waveform data 1111, frame by frame at a fixed frame length (for example, 5 ms), into a speech parameter series 1301-1 to 1301-N consisting of N speech parameters, and outputs the speech parameter series 1301-1 to 1301-N in association with the diphone section 1212 (S1300). The speech parameters can be represented, for example, as a cepstrum (see Non-Patent Document 2).
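The frame-by-frame conversion can be sketched as below. The analysis function is left pluggable; `log_energy` is only a stand-in used to keep the sketch self-contained and is not the cepstral analysis mentioned in the text. The sample rate and the handling of a trailing partial frame (dropped) are assumptions.

```python
import math
from typing import Callable, List, Sequence

def log_energy(frame: Sequence[float]) -> List[float]:
    """Placeholder analysis: one-dimensional log energy (not a cepstrum)."""
    return [math.log(sum(x * x for x in frame) + 1e-10)]

def to_parameter_series(samples: Sequence[float], start: int, end: int,
                        sample_rate: int, frame_ms: float = 5.0,
                        analyze: Callable[[Sequence[float]], List[float]] = log_energy
                        ) -> List[List[float]]:
    """Cut samples[start:end] into fixed-length frames and analyze each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    series = []
    # A trailing partial frame is dropped (assumption).
    for pos in range(start, end - frame_len + 1, frame_len):
        series.append(analyze(samples[pos:pos + frame_len]))
    return series

samples = [1.0] * 1600                                # 100 ms of dummy audio at 16 kHz
series = to_parameter_series(samples, 0, 400, 16000)  # one 25 ms diphone section
```

At 16 kHz a 5 ms frame holds 80 samples, so a 400-sample diphone section yields N = 5 speech parameters.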

The speech model generation unit 1400 receives the speech parameter series 1301-1 to 1301-N, the diphone labels 1221, and the diphone sections 1212. For each diphone section 1212, it divides the speech parameter series associated with that section into states 1 to 3 and selects one speech parameter from each state as representative patterns 1401-1 to 1401-3. It generates a three-state speech model 1402 consisting of these three representative patterns and outputs the speech model 1402 in association with the diphone label 1221 of the diphone section 1212 (S1400). Although this embodiment uses three states, other numbers of states are possible. For example, a diphone section containing a long phoneme such as a long vowel may be given five states, with five representative patterns selected to generate a five-state speech model. The states may be obtained by dividing the diphone section evenly, or by a non-uniform division, for example one that divides the central portion more finely where the speech parameters change rapidly. A representative pattern for each state can be chosen by selecting the speech parameter of the frame at the temporal center of the state, by using the average of the speech parameters of all frames in the state, or by selecting from the state the speech parameter closest to that average.
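The even division into states and the three ways of choosing a representative pattern can be sketched as follows. Speech parameters are represented as plain lists of floats, the series is assumed to contain at least as many frames as states, and the function names are illustrative only.

```python
from typing import Callable, List

Vector = List[float]

def split_states(series: List[Vector], n_states: int = 3) -> List[List[Vector]]:
    """Divide a parameter series evenly into n_states consecutive states."""
    n = len(series)
    bounds = [i * n // n_states for i in range(n_states + 1)]
    return [series[bounds[i]:bounds[i + 1]] for i in range(n_states)]

def center_frame(state: List[Vector]) -> Vector:
    """Parameter of the frame at the temporal center of the state."""
    return state[len(state) // 2]

def mean_vector(state: List[Vector]) -> Vector:
    """Average of the parameters of all frames in the state."""
    return [sum(frame[d] for frame in state) / len(state) for d in range(len(state[0]))]

def closest_to_mean(state: List[Vector]) -> Vector:
    """Parameter in the state that lies nearest to the state average."""
    mean = mean_vector(state)
    return min(state, key=lambda f: sum((a - b) ** 2 for a, b in zip(f, mean)))

def build_model(series: List[Vector], n_states: int = 3,
                pick: Callable[[List[Vector]], Vector] = center_frame) -> List[Vector]:
    """Speech model as one representative pattern per state."""
    return [pick(state) for state in split_states(series, n_states)]

series = [[0.0], [1.0], [5.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]]
model_center = build_model(series)
model_mean = build_model(series, pick=mean_vector)
model_nearest = build_model(series, pick=closest_to_mean)
```

The three selection methods give different patterns for the first state here (center frame 1.0, average 2.0, nearest to average 1.0), which shows how the choice of method affects the stored model.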

FIG. 4 illustrates the operation of the speech parameter series conversion unit 1300 and the speech model generation unit 1400 when the speech waveform data "ONSe", divided into diphone sections, and the corresponding diphone labels "Sil:O", "O:N", "N:S", "S:e", and "e:Sil" are input, taking as an example the speech parameter series 1301-1 to 1301-N associated with the diphone label "N:S".

Referring to FIGS. 1 and 2, the missing diphone label output unit 1600 receives the diphone labels 1221 and the defined diphone label list 1500 as input, and outputs as missing diphone labels those diphone labels that exist in the defined diphone label list 1500 but have not been input as diphone labels 1221 (S1600). The defined diphone label list 1500 is a list, generated in advance, of the diphone labels of all diphones required for speech synthesis.

The purpose of the missing diphone label output unit 1600 is to identify, when the speech waveform database 91 is too small to contain every required diphone, the diphones that are not contained and to output them as missing diphone labels. Note therefore that the missing diphone label output unit 1600 identifies the missing diphones by comparing all diphone labels associated with all speech waveform data in the speech waveform database 91 against the defined diphone label list 1500, and outputs them as missing diphone labels.
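The comparison performed in S1600 amounts to a set difference taken over the whole database. A minimal sketch, assuming labels are plain strings such as "N:S" and using a hypothetical function name:

```python
def missing_diphone_labels(defined_labels, observed_labels):
    """Return the defined labels that never occur in the database, in list order."""
    observed = set(observed_labels)
    return [label for label in defined_labels if label not in observed]
```

The observed labels must be collected from every utterance in the database before the comparison; running it per utterance would report almost every diphone as missing.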

If a missing diphone label exists, the operation of the half-phone generation unit 1800 (S1800) and the operation of the alternative speech model generation unit 1900 (S1900) are executed; if no missing diphone label exists, S1800 and S1900 are not performed (S1700). The operations of the half-phone generation unit 1800 and the alternative speech model generation unit 1900 when a missing diphone label exists are described below.

The half-phone generation unit 1800 receives the speech model 1402 and the diphone label 1221 as input and divides the speech model 1402 into a first half and a second half, each of which becomes a half-phone. It outputs the first half of the diphone label associated with the divided speech model as a half-phone label in association with the half-phone formed from the first half of the speech model, and outputs the second half of that diphone label as a half-phone label in association with the half-phone formed from the second half of the speech model (S1800). For example, when the speech model 1402 has L states, the first half-phone is generated by keeping the representative patterns of the 1st through (L/2)-th states (L/2 rounded up) and deleting the remaining states, and the second half-phone is generated by keeping the representative patterns of the ((L/2)+1)-th through L-th states (L/2 rounded down) and deleting the remaining states. When L is odd, the representative pattern of the state exactly in the middle of the L states is therefore kept in both the first and the second half-phone. Accordingly, when the speech model 1402 has 3 states, the first half-phone keeps the representative patterns of states 1 and 2, the second half-phone keeps those of states 2 and 3, and the representative pattern of state 2 is held by both. Note that the half-phone generation unit 1800 generates half-phones by dividing the speech models generated from all speech waveform data in the speech waveform database 91, so that half-phones corresponding to all speech waveform data in the speech waveform database 91 are generated.
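The rounding rule for the split can be checked with a short sketch. The function name and the "first:second" string form of a diphone label are assumptions made for illustration.

```python
import math

def split_into_halfphones(diphone_label, states):
    """Split a speech model (list of per-state patterns) into two labelled half-phones."""
    first_label, second_label = diphone_label.split(":")
    n = len(states)
    first = states[:math.ceil(n / 2)]   # states 1 .. ceil(L/2)
    second = states[n // 2:]            # states floor(L/2)+1 .. L
    return (first_label, first), (second_label, second)
```

For L = 3 the middle state appears in both halves; for L = 4 the halves are disjoint.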

Referring to FIGS. 2 and 5, the alternative speech model generation unit 1900 receives the half-phones, the half-phone labels, and the missing diphone labels as input. For any missing diphone label, it concatenates a half-phone associated with a half-phone label identical or similar to the first half of the missing diphone label with a half-phone associated with a half-phone label identical or similar to the second half of the missing diphone label, and outputs the result as an alternative speech model (S1900).

In detail, in the alternative speech model generation unit 1900, the half-phone placement means 1910 receives the half-phones and the half-phone labels as input and places the input half-phones in decision trees prepared in advance. The decision trees are generated using the phoneme environment before and after each phoneme as context, and one tree is prepared for each phoneme. Each half-phone is placed at a leaf node of the decision tree prepared for the phoneme indicated by its half-phone label. The half-phone placement means 1910 places all half-phones generated by the half-phone generation unit 1800 at leaf nodes of the decision trees prepared for each phoneme. Note that the leaf nodes of the decision trees thus hold all half-phones generated from all speech waveform data in the speech waveform database 91.

The missing diphone label list 1920 receives the missing diphone labels as input and stores them. The decision tree determination means 1930 receives the missing diphone label list 1920 and the decision trees in which the half-phones have been placed, determines the decision tree to be referenced for the first half and for the second half of every missing diphone label, and outputs these as the first-half half-phone decision tree and the second-half half-phone decision tree in association with the missing diphone label. If no half-phone has been placed at any leaf node of the decision tree to be referenced, the inter-phoneme distance matrix 1940 is consulted, and the decision tree of the phoneme with the smallest inter-phoneme distance from the phoneme of the decision tree to be referenced is chosen as a substitute decision tree. The inter-phoneme distance matrix is a matrix table defined in advance in consideration of distinctive features such as place of articulation and manner of articulation.
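The fallback through the inter-phoneme distance matrix can be sketched as a nearest-neighbor lookup. The function name, the nested-dictionary form of the matrix, and the distance values used in any example are hypothetical; the source does not give the contents of matrix 1940.

```python
def choose_tree_phoneme(target, populated, distance):
    """Return the phoneme whose decision tree should be referenced for `target`.

    populated: set of phonemes whose trees hold at least one half-phone.
    distance:  nested dict, distance[a][b] = predefined distance between phonemes a and b.
    """
    if target in populated:
        return target
    if not populated:
        raise ValueError("no decision tree holds any half-phone")
    # sorted() only makes tie-breaking deterministic; it is not part of the method.
    return min(sorted(populated), key=lambda p: distance[target][p])
```

With made-up distances in which /S/ is the closest populated phoneme to an absent /Z/, the call returns "S" and the /S/ tree is used as the substitute.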

Using decision trees in this way has the advantage of making it easy to narrow down candidates with context information taken into account. If, instead of using decision trees, the half-phones were simply grouped into one set per phoneme, half-phones whose preceding and following phoneme environments match very poorly could be selected for concatenation. Decision trees are therefore used to cluster the half-phones according to the surrounding phoneme environment and to narrow down the concatenation candidates. Because the decision trees are created per phoneme, as many decision trees are created as there are phonemes in the speech waveform database 91, one for each phoneme type. The questions at the branches of a decision tree use phoneme context information (such as whether the preceding and following phoneme environments match). The questions are arranged to narrow down gradually from coarse to fine classification, for example "Does it correspond to the first half of a diphone?", "Is the following phoneme a vowel?", "Is the following phoneme /A/?", "Is the preceding phoneme a plosive consonant?", and "Is the preceding phoneme /P/?".

Referring to FIG. 6, the half-phone selection means 1950 receives the missing diphone label list 1920, the first-half half-phone decision tree 1951, and the second-half half-phone decision tree 1952 as input. For each missing diphone label, it determines as concatenation targets one half-phone from each of the first-half half-phone decision tree 1951 and the second-half half-phone decision tree 1952, taken from the leaf node whose context before and after the half-phone matches, and outputs them in association with the missing diphone label.

If the leaf node to be referenced in at least one of the first-half half-phone decision tree 1951 and the second-half half-phone decision tree 1952 contains a plurality of half-phones, and the phonemes of both half-phone labels are voiced, the combination that minimizes the F0 gap between a half-phone in the first-half half-phone decision tree and a half-phone in the second-half half-phone decision tree may be selected as the half-phones to be concatenated for the missing diphone label. The F0 gap is obtained from the difference between the average F0 values of the two half-phones. The average F0 value of a first half-phone is the mean of the F0 values of the representative patterns of its states, and the average F0 value of a second half-phone is obtained in the same way from its states.
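The minimum-F0-gap selection is an exhaustive search over candidate pairs from the two leaf nodes. A minimal sketch, assuming each half-phone is a dictionary with a hypothetical "f0" field holding the per-state F0 values in Hz:

```python
from itertools import product
from statistics import fmean

def mean_f0(halfphone):
    """Average of the F0 values of the half-phone's representative patterns."""
    return fmean(halfphone["f0"])

def min_f0_gap_pair(first_candidates, second_candidates):
    """Return the (first, second) pair whose average F0 values differ least."""
    return min(product(first_candidates, second_candidates),
               key=lambda pair: abs(mean_f0(pair[0]) - mean_f0(pair[1])))
```

The search cost is the product of the two leaf-node sizes, which stays small because the decision trees have already narrowed the candidates by context.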

Alternatively, the half-phones in the leaf nodes to be referenced may be classified into two or more F0 categories 1953 delimited by predefined F0 ranges, and combinations of half-phones classified into the same or neighboring categories may be selected as the concatenation targets for the missing diphone label. An F0 category is a class obtained by quantizing F0 values with a quantization width D. With D set to 50 Hz, for example, six categories can be used: below 100 Hz, 100 Hz to below 150 Hz, 150 Hz to below 200 Hz, 200 Hz to below 250 Hz, 250 Hz to below 300 Hz, and 300 Hz or above. As another approach, the quantization width D may be set in the log-F0 domain. The quantization width D is chosen so that the classification suits the F0 modification tolerance of the signal processing method used in speech synthesis. When multiple half-phones fall in the same F0 category, their average F0 values are compared and the combination with the smallest difference in average F0 is selected for concatenation. When one of the two half-phones is absent from a given F0 category, half-phones from mutually adjacent F0 categories are selected for concatenation. In this way, at least one and at most as many half-phone combinations as there are F0 categories are obtained. The selection of concatenation targets may also be based on a difference in acoustic parameters (for example, spectral distance) instead of average F0.
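The category-based selection can be sketched as below for the 50 Hz example. This is one possible reading of the procedure, not the definitive one: the function names are hypothetical, half-phones are reduced to their average F0 values, and the adjacent-category fallback is applied only when exactly one side is empty in a category. Under this reading the result can be empty if the two sides lie more than one category apart, a case the text does not spell out.

```python
def f0_category(f0_hz, width=50.0, lowest=100.0, n_categories=6):
    """Category 0: below 100 Hz, 1: [100, 150), ..., 5: 300 Hz or above."""
    if f0_hz < lowest:
        return 0
    return min(int((f0_hz - lowest) // width) + 1, n_categories - 1)

def select_pairs(first_means, second_means, n_categories=6):
    """Pick at most one (first, second) pair of average F0 values per category."""
    def bucket(values):
        buckets = {c: [] for c in range(n_categories)}
        for v in values:
            buckets[f0_category(v, n_categories=n_categories)].append(v)
        return buckets

    fb, sb = bucket(first_means), bucket(second_means)
    pairs = []
    for c in range(n_categories):
        firsts, seconds = fb[c], sb[c]
        # If one side is empty in this category, borrow from adjacent categories.
        if firsts and not seconds:
            seconds = [v for d in (c - 1, c + 1) for v in sb.get(d, [])]
        elif seconds and not firsts:
            firsts = [v for d in (c - 1, c + 1) for v in fb.get(d, [])]
        if firsts and seconds:
            pair = min(((f, s) for f in firsts for s in seconds),
                       key=lambda p: abs(p[0] - p[1]))
            if pair not in pairs:
                pairs.append(pair)
    return pairs
```

Because at most one pair is produced per category, the number of alternative speech models per missing diphone label is bounded by the number of categories.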

Note that the half-phones to be concatenated are determined on the basis of average F0 only when both of them are voiced. When either or both of the segments to be combined are unvoiced, the centroid of the set of all half-phones in the leaf node (the single speech model closest to the mean, that is, the center of gravity) is used as the representative pattern. Alternatively, among all half-phones in the leaf nodes, the combination whose acoustic parameters (for example, spectral distance) are closest may be selected. Unvoiced sounds involve no periodic vibration and therefore carry no F0 information. Since the F0 gap need not be considered for them, the concatenation method differs between concatenation of two voiced sounds and concatenation involving an unvoiced sound.

Returning to FIG. 5, the half-phone concatenation means 1960 receives the half-phones to be concatenated and the missing diphone label list 1920 as input, concatenates the target half-phones for each missing diphone label, and outputs the result as an alternative speech model in association with the missing diphone label.

When the alternative speech model after concatenation has an odd number of states, the half-phones are concatenated using the representative patterns of the first state through the second-to-last state of the first half-phone and the representative patterns of the second state through the last state of the second half-phone. For the state that lies exactly in the middle after concatenation, which corresponds to both the last state of the first half-phone and the first state of the second half-phone, a value obtained by weighted addition of the two states using an interpolation ratio is used. A sigmoid function, for example, can be used for the interpolation ratio.

例えば音声モデルの代表パタン数が３であった場合、前半部のハーフフォンの最初の状態の代表パタンが連結後の代替音声モデルの第１状態の代表パタンとして用いられ、後半部のハーフフォンの最後の状態の代表パタンが連結後の代替音声モデルの第３状態の代表パタンとして用いられる。連結後の代替音声モデルの第２状態については、前半部のハーフフォンの最後の状態と、後半部のハーフフォンの最初の状態との内分値を用いた重みづけ加算により求めた値を用いる。なお、代替音声モデルの状態数が偶数である場合には、前半ハーフフォンの各状態、後半ハーフフォンの各状態の代表パタンをそれぞれ用いて連結すればよい。 For example, when the number of representative patterns of the speech model is 3, the representative pattern of the first state of the first half phone is used as the representative pattern of the first state of the alternative speech model after connection, The representative pattern in the last state is used as the representative pattern in the third state of the alternative speech model after connection. For the second state of the alternative speech model after concatenation, a value obtained by weighted addition using the internal values of the last state of the first half phone and the first state of the second half phone is used. . When the number of states of the alternative speech model is an even number, connection may be performed using the representative patterns of the states of the first half phone and the states of the second half phone.
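The concatenation rule for odd and even state counts can be sketched as follows. The function names and the `weight` parameter are assumptions; the source says only that a sigmoid may supply the interpolation ratio and does not say what its argument is, so the default here is the neutral value sigmoid(0) = 0.5.

```python
import math

def sigmoid(x):
    """Logistic function, one possible source of the interpolation ratio."""
    return 1.0 / (1.0 + math.exp(-x))

def concatenate_halfphones(first, second, n_states, weight=0.5):
    """Join two half-phones (lists of per-state vectors) into an n_states model.

    weight is the share given to the second half-phone in the shared middle state.
    """
    total = len(first) + len(second)
    if total == n_states:
        # Even case: the halves do not overlap, so join them as they are.
        return first + second
    if total != n_states + 1:
        raise ValueError("half-phone lengths do not match the target state count")
    # Odd case: blend the last state of the first half with the first state of the second.
    boundary = [(1.0 - weight) * a + weight * b for a, b in zip(first[-1], second[0])]
    return first[:-1] + [boundary] + second[1:]
```

For two 2-state half-phones and a 3-state target, states 1 and 3 are copied unchanged and state 2 is the weighted sum of the two boundary states.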

図１、図１０を参照して、音声素片データベース９２は、音声モデルと、代替音声モデルとを入力とし、入力された音声モデルと、代替音声モデルとを記憶する（Ｓ９２）。図１０は、音声素片データベース９２に記憶された音声モデル、代替音声モデルを例示した表である。音声素片データベース９２には、ダイフォンラベルごとに平均周波数Ｆ０（Ｈｚ）、平均周波数Ｆ０の傾斜（Ｈｚ／ｍｓ）、パワー（ｄＢ）、音声パラメータが記憶される。 Referring to FIGS. 1 and 10, the speech unit database 92 receives a speech model and an alternative speech model, and stores the input speech model and the alternative speech model (S92). FIG. 10 is a table illustrating a speech model and an alternative speech model stored in the speech segment database 92. The speech segment database 92 stores an average frequency F0 (Hz), a slope (Hz / ms) of the average frequency F0, power (dB), and speech parameters for each diphone label.

When speech models to which diphone labels have already been assigned are used as input, the alternative speech models may instead be created by an alternative speech model creation device 10000 that comprises only the defined diphone label list 1500, the missing diphone label output unit 1600, the half-phone generation unit 1800, and the alternative speech model generation unit 1900.

The function of each unit in the alternative speech model creation device 10000 is the same as that of the unit of the same name in the speech segment database creation device 1000. The diphone-labeled speech models input to the alternative speech model creation device 10000 are assumed to have been created in advance by a separate device using all speech waveform data in a speech waveform database prepared in advance. The missing diphone label output unit 1600 of the alternative speech model creation device 10000 works in the same way as that of the speech segment database creation device 1000: it takes as input all diphone labels generated from all speech waveform data in the prepared speech waveform database, identifies the missing diphones by comparison with the defined diphone label list 1500, and outputs them as missing diphone labels. Likewise, note that the half-phone generation unit 1800 of the alternative speech model creation device 10000 generates half-phones using all speech waveform data in the speech waveform database, and that the half-phone placement means 1910 of the alternative speech model creation device 10000 places all half-phones generated by the half-phone generation unit 1800 at leaf nodes of the decision trees prepared for each phoneme.

According to the speech segment database creation device 1000 of this embodiment, half-phones are concatenated to generate alternative speech models. A complete speech segment database can therefore be generated even when not all required speech models could be generated from the speech waveform database, and synthesized speech without missing sounds can be produced using that speech segment database. In addition, because appropriate alternative speech models are generated in advance when the speech segment database is created, an increase in the amount of segment search processing can be avoided.

Furthermore, when the combination of half-phones with the smallest F0 gap is selected as the half-phones to be concatenated for a missing diphone label, the change in F0 at the junction of the alternative speech model is reduced, and synthesized speech that uses the alternative speech model is of high quality.

Furthermore, when the half-phones are classified into two or more F0 categories delimited by predefined F0 ranges and combinations of half-phones classified into the same or neighboring categories are selected as the concatenation targets for a missing diphone label, the change in F0 at the junction of the alternative speech model is reduced, and synthesized speech that uses the alternative speech model is of high quality. A speech segment database created in this way is also more likely to hold, for the same missing diphone label, multiple alternative speech models with different average F0 values (at most as many as the number of categories). Consequently, when speech synthesis is performed using this speech segment database, it is more likely that a speech model whose F0 value approximates the F0 produced by prosody generation can be selected from the database, and the resulting reduction in F0 modification further improves the quality of the synthesized speech. Moreover, because only half-phones classified into the same or neighboring categories are combined into alternative speech models, the total number of alternative speech models stored in the speech segment database is kept far smaller than if every combination of half-phones were stored, which avoids increases in segment search time and database size. At the same time, because the F0 categories are set with a quantization width appropriate to the F0 modification tolerance, the alternative speech models best suited to speech synthesis are still stored in the speech segment database, which yields high-quality synthesized speech.

Furthermore, suppose that when no half-phone exists at all for a given phoneme, a half-phone of the phoneme at the smallest distance in the predefined inter-phoneme distance matrix is selected, in place of the nonexistent half-phone, as the concatenation target for the missing diphone label. Alternative speech models can then be generated even for phonemes that do not occur at all in the speech waveform database. A speech segment database holding all required speech models can thus be generated, and synthesized speech without missing sounds can be produced using that speech segment database.

図７〜９を参照して本発明の音声合成装置および、音声合成方法を説明する。図７に示す音声合成装置７０００は、テキスト解析部７１００と、テキスト解析用辞書７２００と、韻律生成部７３００と、音声モデル選択部７４００と、音声合成部７６００とを有する。テキスト解析部７１００は、テキストを入力とし、テキスト解析用辞書７２００を用いて、読み、アクセント、音韻系列を出力する（Ｓ７１００）。韻律生成部７３００は、読み、アクセントを入力とし、Ｆ０、パワー、音韻長を出力する（Ｓ７３００）。音声モデル選択部７４００は、Ｆ０、パワー、音韻系列を入力とし、音声素片データベースから音声モデルを選択して出力する（Ｓ７４００）。音声合成部７６００は、音声モデル、Ｆ０、パワー、音韻長を入力とし、合成音声を出力する（Ｓ７６００）。詳細には、図９に示す音声合成部７６００は、音声パラメータ系列生成手段７６１０と、音声パラメータ系列補間手段７６２０と、合成音声波形生成手段７６３０とを有する。 The speech synthesis apparatus and speech synthesis method of the present invention will be described with reference to FIGS. A speech synthesizer 7000 shown in FIG. 7 includes a text analysis unit 7100, a text analysis dictionary 7200, a prosody generation unit 7300, a speech model selection unit 7400, and a speech synthesis unit 7600. The text analysis unit 7100 receives the text, and outputs a reading, accent, and phoneme sequence using the text analysis dictionary 7200 (S7100). The prosody generation unit 7300 receives readings and accents, and outputs F0, power, and phoneme length (S7300). The speech model selection unit 7400 receives F0, power, and phoneme series as input, and selects and outputs a speech model from the speech unit database (S7400). The speech synthesizer 7600 receives the speech model, F0, power, and phoneme length, and outputs synthesized speech (S7600). Specifically, the speech synthesis unit 7600 shown in FIG. 9 includes speech parameter series generation means 7610, speech parameter series interpolation means 7620, and synthesized speech waveform generation means 7630.

The speech parameter series generation means 7610 repeats and concatenates each representative pattern of the input speech model 1402 according to the input phoneme duration. In the example of FIG. 9, each of the three representative patterns (speech parameters) of the speech model 1402 is replicated for one third of the phoneme duration and the results are concatenated. This replicate-and-concatenate processing is performed for every input speech model, and the resulting speech parameter series of the individual speech models are all concatenated in the order of the corresponding diphone labels. For example, when the speech model has P states and the number of frames computed from the phoneme duration is Q, the speech parameter that is the representative pattern of the j-th state is repeated from frame number (j−1)×(Q/P)+1 through frame number j×(Q/P) and concatenated.
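The frame assignment of the j-th state can be sketched as below. The function name is hypothetical, and the use of integer division to handle a Q that is not a multiple of P is an assumption; the text gives the frame range only in terms of Q/P.

```python
def expand_states(states, n_frames):
    """Repeat each state's representative pattern over its share of n_frames frames."""
    n_states = len(states)
    frames = []
    for j, pattern in enumerate(states, start=1):
        start = (j - 1) * n_frames // n_states  # 0-based index of the state's first frame
        end = j * n_frames // n_states          # 0-based index one past its last frame
        frames.extend([pattern] * (end - start))
    return frames
```

When Q is a multiple of P this reproduces the 1-based range (j−1)×(Q/P)+1 through j×(Q/P) exactly; otherwise the leftover frames are distributed so that the total is still Q. The stepwise series produced here is what the interpolation means 7620 subsequently smooths.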

The speech parameter series interpolation means 7620 interpolates the speech parameter series so that it transitions smoothly. As the interpolation method, for example, a method of generating a maximum-likelihood parameter sequence from a sequence of speech parameter distributions can be applied (Reference Non-Patent Document 1: Keiichi Tokuda, Takashi Masuko, Takao Kobayashi, Satoshi Imai, "Speech parameter generation algorithm from HMM using dynamic features", Journal of the Acoustical Society of Japan, Acoustical Society of Japan, March 1997, Vol. 53, No. 3, pp. 192-200).

The synthesized speech waveform generation means 7630 generates a synthesized speech waveform from the speech parameter series. As the waveform generation method, for example, the STRAIGHT method can be used (Reference Non-Patent Document 2: Hideki Kawahara, Ikuyo Masuda-Katsuse and Alain de Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds", Speech Communication, 27, 3-4, pp. 187-207 (1999)).

According to the speech synthesizer 7000 of this embodiment, the speech segment database is created by concatenating half-phones to generate alternative speech models. A complete speech segment database can therefore be generated even when not all required speech models could be generated from the speech waveform database, and synthesized speech without missing sounds can be produced using that speech segment database. In addition, because appropriate alternative speech models are generated in advance when the speech segment database is created, an increase in the amount of segment search processing can be avoided.

また、上述の各種の処理は、記載に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。その他、本発明の趣旨を逸脱しない範囲で適宜変更が可能であることはいうまでもない。 In addition, the various processes described above are not only executed in time series according to the description, but may be executed in parallel or individually according to the processing capability of the apparatus that executes the processes or as necessary. Needless to say, other modifications are possible without departing from the spirit of the present invention.

また、上述の構成をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、上記処理機能がコンピュータ上で実現される。 Further, when the above-described configuration is realized by a computer, processing contents of functions that each device should have are described by a program. The processing functions are realized on the computer by executing the program on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or transferred from a server computer in its own storage device. When executing processing, the computer reads the program stored in its own recording medium and performs processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and perform processing according to it, or it may perform processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be performed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

Claims

A speech unit database creation device for creating a speech unit database from speech waveform data to which a phoneme label is assigned for each phoneme section length,
A phoneme-diphone section conversion unit that receives the speech waveform data as input, concatenates the second half of the preceding phoneme section and the first half of the following phoneme section of any two adjacent phoneme sections to form a diphone section, concatenates the phoneme label of the preceding phoneme section and the phoneme label of the following phoneme section to form a diphone label, and outputs the diphone section and the diphone label in association with each other;
A speech parameter series conversion unit that receives the speech waveform data, the diphone label, and the diphone section as input, converts the speech waveform data into speech parameters for each diphone section at a fixed frame length, takes the sequence of speech parameters for each diphone section as a speech parameter series, and outputs the speech parameter series in association with the diphone section;
A speech model generation unit that receives the speech parameter series, the diphone label, and the diphone section as input, selects, for each diphone section, one or more speech parameters from the speech parameter series associated with that diphone section as representative patterns, generates a speech model consisting of the representative patterns, and outputs the speech model together with the diphone label associated with the diphone section;
A missing diphone label output unit that receives the diphone labels and a predefined diphone label list as input and outputs, as a missing diphone label, each diphone label that is present in the predefined diphone label list but was not input as a diphone label;
A half-phone generation unit that receives the speech model and the diphone label as input, divides the speech model into a first half and a second half, each of which becomes a half-phone, outputs the first half of the diphone label associated with the divided speech model as a half-phone label in association with the half-phone consisting of the first half of the divided speech model, and outputs the second half of that diphone label as a half-phone label in association with the half-phone consisting of the second half of the divided speech model;
An alternative speech model generation unit that receives the half-phones, the half-phone labels, and the missing diphone label as input, concatenates a half-phone associated with a half-phone label identical or similar to the first half of any missing diphone label with a half-phone associated with a half-phone label identical or similar to the second half of that missing diphone label, and outputs the result as an alternative speech model;
A speech unit database creation apparatus comprising:
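
The last three units of the claim above (missing diphone label output, half-phone generation, alternative speech model generation) can be pictured with a minimal sketch. It is not the patented implementation: it assumes a speech model is an ordered list of parameter vectors split at its midpoint, that a diphone label is a pair of phoneme labels, and that only identical half-phone labels are matched, with the first candidate taken. All names (`HalfPhone`, `build_alternative_models`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Frame = List[float]             # one speech-parameter vector (assumed layout)
DiphoneLabel = Tuple[str, str]  # (preceding phoneme, following phoneme)


@dataclass
class HalfPhone:
    phoneme: str          # half-phone label: one phoneme of the diphone label
    frames: List[Frame]   # the half of the speech model it was cut from


def missing_diphone_labels(defined: Set[DiphoneLabel],
                           observed: Set[DiphoneLabel]) -> Set[DiphoneLabel]:
    """Labels in the predefined list that never appeared in the data."""
    return defined - observed


def split_into_half_phones(label: DiphoneLabel,
                           model: List[Frame]) -> Tuple[HalfPhone, HalfPhone]:
    """Cut a speech model at its midpoint into two labelled half-phones."""
    mid = len(model) // 2
    return HalfPhone(label[0], model[:mid]), HalfPhone(label[1], model[mid:])


def build_alternative_models(
        models: Dict[DiphoneLabel, List[Frame]],
        defined: Set[DiphoneLabel]) -> Dict[DiphoneLabel, List[Frame]]:
    """Concatenate existing half-phones to cover missing diphone labels."""
    firsts: Dict[str, List[HalfPhone]] = {}
    seconds: Dict[str, List[HalfPhone]] = {}
    for label, model in models.items():
        first, second = split_into_half_phones(label, model)
        firsts.setdefault(first.phoneme, []).append(first)
        seconds.setdefault(second.phoneme, []).append(second)

    alternatives: Dict[DiphoneLabel, List[Frame]] = {}
    for left, right in sorted(missing_diphone_labels(defined, set(models))):
        # Simplification: exact label match only, first candidate wins.
        if left in firsts and right in seconds:
            alternatives[(left, right)] = (firsts[left][0].frames
                                           + seconds[right][0].frames)
    return alternatives
```

With models for ("a", "k") and ("i", "s") and a defined list that also contains ("a", "s"), the sketch joins the first half of the "a-k" model to the second half of the "i-s" model. A missing label with no matching half-phone is simply left uncovered here.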

The speech unit database creation device according to claim 1,
wherein the alternative speech model generation unit,
when there are a plurality of at least one of the half-phones associated with the same half-phone label as the first half of any missing diphone label and the half-phones associated with the same half-phone label as the second half of that missing diphone label, selects, as the concatenation targets for the missing diphone label, the combination that minimizes the F0 gap between the first-half half-phone and the second-half half-phone.
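
The selection rule in the claim above can be sketched as a search over all candidate pairs. This is a sketch under assumptions: the F0 gap is taken to be the absolute difference between one representative F0 value per half-phone (for instance the F0 at the joining boundary, in Hz), which the claim itself does not spell out. The function name `select_min_f0_gap` is hypothetical.

```python
from itertools import product
from typing import List, Tuple


def select_min_f0_gap(first_f0s: List[float],
                      second_f0s: List[float]) -> Tuple[int, int]:
    """Return indices (i, j) of the first-half and second-half candidates
    whose representative F0 values differ the least."""
    pairs = product(range(len(first_f0s)), range(len(second_f0s)))
    return min(pairs, key=lambda p: abs(first_f0s[p[0]] - second_f0s[p[1]]))
```

For first-half candidates at 120 Hz and 180 Hz and second-half candidates at 200 Hz, 175 Hz, and 90 Hz, the pair (180 Hz, 175 Hz) has the smallest gap, so the function returns `(1, 1)`.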

The speech unit database creation device according to claim 1,
wherein the alternative speech model generation unit,
when there are a plurality of at least one of the half-phones associated with the same half-phone label as the first half of any missing diphone label and the half-phones associated with the same half-phone label as the second half of that missing diphone label, classifies the first-half half-phones and the second-half half-phones into two or more categories delimited by predefined F0 ranges, and selects, as the concatenation targets for the missing diphone label, a combination of a first-half half-phone and a second-half half-phone that are classified into the same or adjacent categories.
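
The category-based rule in the claim above can be sketched by binning each candidate's F0 and keeping pairs whose bins are equal or neighbouring. The boundary values and the use of one representative F0 per half-phone are illustrative assumptions, and the names `f0_category` and `compatible_pairs` are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def f0_category(f0: float, boundaries: List[float]) -> int:
    """Index of the F0 range that contains f0; boundaries must be sorted."""
    return bisect_right(boundaries, f0)


def compatible_pairs(first_f0s: List[float], second_f0s: List[float],
                     boundaries: List[float]) -> List[Tuple[int, int]]:
    """All (i, j) pairs whose F0 categories are the same or adjacent."""
    return [(i, j)
            for i, f in enumerate(first_f0s)
            for j, s in enumerate(second_f0s)
            if abs(f0_category(f, boundaries) - f0_category(s, boundaries)) <= 1]
```

Compared with an exhaustive gap search, this only needs a category index per half-phone, which can be computed once when the half-phones are generated.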

The speech unit database creation device according to any one of claims 1 to 3,
wherein the alternative speech model generation unit,
when there is no half-phone associated with the same half-phone label as the first half of any missing diphone label, or no half-phone associated with the same half-phone label as the second half of that missing diphone label, selects, in place of the nonexistent half-phone, the half-phone whose distance in an inter-phoneme distance matrix is smallest as a concatenation target for the missing diphone label.
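
The fallback in the claim above amounts to a nearest-neighbour lookup in a phoneme distance table. The sketch below assumes the distance matrix is available as a nested dictionary; the distance values in the usage note are invented for illustration, and `nearest_available_phoneme` is a hypothetical name.

```python
from typing import Dict, Set


def nearest_available_phoneme(target: str, available: Set[str],
                              distance: Dict[str, Dict[str, float]]) -> str:
    """Pick the available phoneme closest to the target phoneme.

    Candidates are sorted first so that ties resolve deterministically."""
    return min(sorted(available), key=lambda p: distance[target][p])
```

For example, with made-up distances `{"o": {"a": 0.4, "u": 0.2, "k": 0.9}}` and available half-phone labels `{"a", "u", "k"}`, a missing "o" half-phone would be replaced by a "u" half-phone.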

An alternative speech model creation device for creating an alternative speech model from a speech model assigned a diphone label,
A missing diphone label output unit that receives the diphone labels and a predefined diphone label list as input and outputs, as a missing diphone label, each diphone label that is present in the predefined diphone label list but was not input as a diphone label;
A half-phone generation unit that receives the speech model and the diphone label as input, divides the speech model into a first half and a second half, each of which becomes a half-phone, outputs the first half of the diphone label associated with the divided speech model as a half-phone label in association with the half-phone consisting of the first half of the divided speech model, and outputs the second half of that diphone label as a half-phone label in association with the half-phone consisting of the second half of the divided speech model;
An alternative speech model generation unit that receives the half-phones, the half-phone labels, and the missing diphone label as input, concatenates a half-phone associated with a half-phone label identical or similar to the first half of any missing diphone label with a half-phone associated with a half-phone label identical or similar to the second half of that missing diphone label, and outputs the result as an alternative speech model;
An alternative speech model creation device comprising:

A speech synthesizer that synthesizes speech from text,
A text analysis unit that takes text as input and outputs a reading, accent, and phoneme sequence;
A prosody generation unit that takes the reading and accent as input and outputs F0, power, and phoneme length;
A speech model selection unit that takes F0, power, and the phoneme sequence as input, and selects and outputs a speech model from a speech unit database;
A speech synthesis unit that takes the speech model, F0, power, and phoneme length as input and outputs synthesized speech;
wherein the speech unit database is created by the apparatus according to any one of claims 1 to 5.

A speech unit database creation method for creating a speech unit database from speech waveform data to which a phoneme label is assigned for each phoneme section length,
A phoneme-diphone section conversion step of receiving the speech waveform data as input, concatenating the second half of the preceding phoneme section and the first half of the following phoneme section of any two adjacent phoneme sections to form a diphone section, concatenating the phoneme label of the preceding phoneme section and the phoneme label of the following phoneme section to form a diphone label, and outputting the diphone section and the diphone label in association with each other;
A speech parameter series conversion step of receiving the speech waveform data, the diphone label, and the diphone section as input, converting the speech waveform data into speech parameters for each diphone section at a fixed frame length, taking the sequence of speech parameters for each diphone section as a speech parameter series, and outputting the speech parameter series in association with the diphone section;
A speech model generation step of receiving the speech parameter series, the diphone label, and the diphone section as input, selecting, for each diphone section, one or more speech parameters from the speech parameter series associated with that diphone section as representative patterns, generating a speech model consisting of the representative patterns, and outputting the speech model together with the diphone label associated with the diphone section;
A missing diphone label output step of receiving the diphone labels and a predefined diphone label list as input and outputting, as a missing diphone label, each diphone label that is present in the predefined diphone label list but was not input as a diphone label;
A half-phone generation step of receiving the speech model and the diphone label as input, dividing the speech model into a first half and a second half, each of which becomes a half-phone, outputting the first half of the diphone label associated with the divided speech model as a half-phone label in association with the half-phone consisting of the first half of the divided speech model, and outputting the second half of that diphone label as a half-phone label in association with the half-phone consisting of the second half of the divided speech model;
An alternative speech model generation step of receiving the half-phones, the half-phone labels, and the missing diphone label as input, concatenating a half-phone associated with a half-phone label identical or similar to the first half of any missing diphone label with a half-phone associated with a half-phone label identical or similar to the second half of that missing diphone label, and outputting the result as an alternative speech model;
A speech unit database creation method comprising:

An alternative speech model creation method for creating an alternative speech model from a speech model assigned a diphone label,
A missing diphone label output step of receiving the diphone labels and a predefined diphone label list as input and outputting, as a missing diphone label, each diphone label that is present in the predefined diphone label list but was not input as a diphone label;
A half-phone generation step of receiving the speech model and the diphone label as input, dividing the speech model into a first half and a second half, each of which becomes a half-phone, outputting the first half of the diphone label associated with the divided speech model as a half-phone label in association with the half-phone consisting of the first half of the divided speech model, and outputting the second half of that diphone label as a half-phone label in association with the half-phone consisting of the second half of the divided speech model;
An alternative speech model generation step of receiving the half-phones, the half-phone labels, and the missing diphone label as input, concatenating a half-phone associated with a half-phone label identical or similar to the first half of any missing diphone label with a half-phone associated with a half-phone label identical or similar to the second half of that missing diphone label, and outputting the result as an alternative speech model;
An alternative speech model creation method comprising:

A program for causing a computer to function as the apparatus according to any one of claims 1 to 6.