JP2015068863A

JP2015068863A - Voice synthesis device, voice synthesis method, and voice synthesis program

Info

Publication number: JP2015068863A
Application number: JP2013200684A
Authority: JP
Inventors: 康行三井; Yasuyuki Mitsui; 玲史近藤; Reishi Kondou
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-09-27
Filing date: 2013-09-27
Publication date: 2015-04-13

Abstract

PROBLEM TO BE SOLVED: To provide a voice synthesis device capable of generating a synthetic voice having a stable prosody close to that of a natural voice. SOLUTION: A voice synthesis device 1 includes: approximate pitch pattern output means 11 that receives a pitch pattern including a fine structure and generates an approximate pitch pattern approximately expressing the pitch pattern; and pitch pattern connection means 12 that receives the approximate pitch pattern and an original utterance pitch pattern based on a registered pitch pattern, determines a connection point between the pitch pattern including the fine structure and the original utterance pitch pattern on the basis of the approximate pitch pattern, and connects the pitch pattern including the fine structure with the original utterance pitch pattern at the connection point to generate a connected pitch pattern.

Description

The present invention relates to a speech synthesizer, a speech synthesis method, and a speech synthesis program that use a waveform editing method, and more particularly to a speech synthesizer, a speech synthesis method, and a speech synthesis program capable of generating natural prosody.

Speech synthesis based on the waveform editing method synthesizes speech by editing short-time waveforms called segment waveforms. In this technique, prosodic information expressed by a pitch pattern, phoneme durations, and the like is generated, and unit waveforms are edited so as to reproduce that prosodic information, forming the overall waveform. To convey linguistic meaning correctly through synthesized speech, it is important to reproduce the prosodic information accurately. In general, reproducing prosodic information accurately requires changing the pitch frequency of segment waveforms extracted from recorded speech, and this is known to degrade the sound quality of the resulting synthesized speech.

Patent Document 1 discloses a speech synthesizer that prevents the loss of sound quality caused by changing the pitch frequency by collating the text data to be synthesized with the content of the original utterances of the data stored in a segment waveform database. In sections where the data stored in the segment waveform database matches the text data to be synthesized, the speech synthesizer of Patent Document 1 connects segment waveforms while editing the pitch pattern as little as possible. In sections where the stored data and the text data to be synthesized do not match, it generates synthesized speech using a standard pitch pattern.

With a statistical method such as the HMM (Hidden Markov Model) described in Non-Patent Document 1, it is possible to generate a pitch pattern that expresses even the fine structure. The fine structure of a pitch pattern refers to steep variations in pitch frequency, typified by microprosody, that are influenced by the manner of articulation of phonemes, the shape of the speaker's vocal tract, and the like.

Patent Document 1: JP 2009-535999 A

Non-Patent Document 1: Keiichi Tokuda, "Application of Hidden Markov Model to Speech Synthesis", IEICE Technical Report, SP99-61, pp. 47-54, 1999.

The speech synthesizer of Patent Document 1 can generate synthesized speech of high sound quality while maintaining the naturalness and stability of the prosody of the whole sentence. However, in sections where the text data to be synthesized does not match the original utterance, a standard pitch pattern with the fine structure smoothed out is used. As a result, the sound quality of the synthesized speech in those sections is lower than in sections that reproduce the original utterance, and the naturalness of the prosody is no longer uniform across all sections.

According to Non-Patent Document 1, uniform prosodic naturalness can be obtained by generating a pitch pattern that includes the fine structure and using it in the sections that do not match the original utterance.

However, a pitch pattern that includes fine structure information has a less stable contour within each accent phrase than a standard pitch pattern. For example, in standard Japanese accent the pitch peak generally falls on the first or second mora of an accent phrase, but because the fine structure is expressed, a local peak may appear further back. If such a pitch pattern is partially connected to a pitch pattern that reproduces the original utterance, the contour of the pitch pattern within the accent phrase breaks down, and the synthesized speech is heard with the wrong accent. In addition, a pitch pattern generated by an HMM or the like lacks information for unvoiced sections, so in those sections there is no information about the contour at all.

An object of the present invention is to provide a speech synthesizer capable of generating synthesized speech having a stable prosody close to that of a natural voice.

The speech synthesizer of the present invention includes: approximate pitch pattern output means that receives a pitch pattern including a fine structure and generates an approximate pitch pattern approximately representing that pitch pattern; and pitch pattern connection means that receives the approximate pitch pattern and an original utterance pitch pattern based on a registered pitch pattern, determines a connection point between the pitch pattern including the fine structure and the original utterance pitch pattern on the basis of the approximate pitch pattern, and connects the pitch pattern including the fine structure and the original utterance pitch pattern at the connection point to generate a connected pitch pattern.

In the speech synthesis method of the present invention, a pitch pattern including a fine structure is input and an approximate pitch pattern approximately representing that pitch pattern is generated; the approximate pitch pattern and an original utterance pitch pattern based on a registered pitch pattern are input, and a connection point between the pitch pattern including the fine structure and the original utterance pitch pattern is determined on the basis of the approximate pitch pattern; and the pitch pattern including the fine structure and the original utterance pitch pattern are connected at the connection point to generate a connected pitch pattern.

The speech synthesis program of the present invention causes a computer to execute: a process of receiving a pitch pattern including a fine structure and generating an approximate pitch pattern approximately representing that pitch pattern; and a process of receiving the approximate pitch pattern and an original utterance pitch pattern based on a registered pitch pattern, determining a connection point between the pitch pattern including the fine structure and the original utterance pitch pattern on the basis of the approximate pitch pattern, and connecting the pitch pattern including the fine structure and the original utterance pitch pattern at the connection point to generate a connected pitch pattern.

According to the present invention, a pitch pattern having a fine structure and an original utterance pitch pattern are connected smoothly so as to preserve the contour of the pitch pattern within the accent phrase, which makes it possible to generate synthesized speech having a stable prosody close to that of a natural voice.

FIG. 1 is a functional block diagram of a speech synthesizer according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of a speech synthesizer according to an embodiment of the present invention.
FIG. 3 is a functional block diagram of a speech synthesizer according to a first embodiment of the present invention.
FIG. 4 is a flowchart for explaining the operation of the speech synthesizer according to the first embodiment of the present invention.
FIG. 5 is a functional block diagram of a speech synthesizer according to a second embodiment of the present invention.
FIG. 6 is a flowchart for explaining the operation of the speech synthesizer according to the second embodiment of the present invention.
FIG. 7 is a functional block diagram of a speech synthesizer according to Example 1 of the present invention.
FIG. 8 is a diagram for explaining speech synthesis in the speech synthesizer according to Example 1 of the present invention.
FIG. 9 is a diagram for explaining speech synthesis in the speech synthesizer according to Example 1 of the present invention.
FIG. 10 is a diagram for explaining speech synthesis in the speech synthesizer according to Example 1 of the present invention.
FIG. 11 is a diagram for explaining speech synthesis in the speech synthesizer according to Example 2 of the present invention.
FIG. 12 is a diagram for explaining speech synthesis in the speech synthesizer according to Example 2 of the present invention.
FIG. 13 is a diagram for explaining speech synthesis in the speech synthesizer according to Example 2 of the present invention.
FIG. 14 is a functional block diagram of a speech synthesizer according to Example 3 of the present invention.
FIG. 15 is a functional block diagram of the approximate pitch pattern output means according to Example 3 of the present invention.

Modes for carrying out the present invention are described below with reference to the drawings. The embodiments and examples described below include limitations that are technically preferable for carrying out the present invention, but they do not limit the scope of the invention. In each embodiment, like components are given the same reference numerals, and their description is omitted as appropriate.

FIG. 1 is a block diagram showing the functional configuration of a speech synthesizer 1 according to an embodiment of the present invention.

The speech synthesizer 1 generates an approximate pitch pattern from input pitch pattern information (hereinafter, the pitch pattern). Then, on the basis of the generated approximate pitch pattern, the speech synthesizer 1 connects original utterance pitch pattern information (hereinafter, the original utterance pitch pattern) with the pitch pattern and generates connected pitch pattern information (hereinafter, the connected pitch pattern).

The pitch pattern input to the speech synthesizer 1 is the change over time of the fundamental frequency (pitch frequency). The pitch pattern is generated from context information obtained by linguistic analysis of the input text. The context information combines the phoneme string corresponding to the input text to be synthesized with information attached to each phoneme, such as its surrounding environment and position.

The pitch pattern is a sequence of the patterns corresponding to the individual phonemes of the phoneme string contained in the context information, and it includes a fine structure that changes finely over short intervals. A pitch pattern is obtained for voiced sounds but not for unvoiced sounds.

The original utterance pitch pattern is a pattern that faithfully reproduces the pitch pattern of recorded speech, including minute changes in its pitch frequency. The original utterance pitch pattern is expressed by nodes, each holding a time and a pitch frequency value. The original utterance pitch pattern is stored in advance in a database or the like.
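As an illustration of the node representation described above, the following is a minimal sketch. The node values and the helper name `pitch_at` are hypothetical, and the use of linear interpolation between nodes is an assumption; the text only states that each node holds a time and a pitch frequency.

```python
from bisect import bisect_right

# Hypothetical original utterance pitch pattern: (time in seconds, pitch in Hz).
nodes = [(0.00, 120.0), (0.10, 180.0), (0.25, 150.0), (0.40, 110.0)]

def pitch_at(nodes, t):
    """Return the pitch frequency at time t by linear interpolation between nodes.

    Times outside the node range are clamped to the first or last node.
    """
    times = [time for time, _ in nodes]
    if t <= times[0]:
        return nodes[0][1]
    if t >= times[-1]:
        return nodes[-1][1]
    i = bisect_right(times, t)              # index of the first node after t
    (t0, f0), (t1, f1) = nodes[i - 1], nodes[i]
    return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
```

Storing sparse nodes rather than one value per frame keeps the stored pattern compact while still allowing the pitch to be read at an arbitrary time, such as a connection boundary.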

The speech synthesizer 1 includes approximate pitch pattern output means 11 and pitch pattern connection means 12.

(Approximate pitch pattern output means)
The approximate pitch pattern output means 11 generates an approximate pitch pattern that approximately represents the input pitch pattern (P1).

The approximate pitch pattern output means 11 may have a function of generating the approximate pitch pattern itself, or a function of selecting an approximate pitch pattern stored in a database or the like.

The approximate pitch pattern output means 11 generates the approximate pitch pattern from the pitch pattern by, for example, polynomial approximation or approximation with a spline interpolation curve through representative points. Because no pitch pattern is generated for unvoiced sections, the pitch pattern in those sections is obtained by interpolation.
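The two steps above (filling unvoiced sections by interpolation, then fitting a smooth curve) can be sketched as follows. This is only one possible reading: the per-frame list with `None` for unvoiced frames, linear gap filling, the default polynomial degree, and all function names are assumptions for illustration, not details given in the text.

```python
def fill_unvoiced(f0):
    """Linearly interpolate over unvoiced frames (None); hold the edge values."""
    voiced = [i for i, v in enumerate(f0) if v is not None]
    if not voiced:
        raise ValueError("no voiced frames to interpolate from")
    out = list(f0)
    for i in range(len(f0)):
        if out[i] is not None:
            continue
        prev = max((j for j in voiced if j < i), default=None)
        nxt = min((j for j in voiced if j > i), default=None)
        if prev is None:
            out[i] = f0[nxt]
        elif nxt is None:
            out[i] = f0[prev]
        else:
            w = (i - prev) / (nxt - prev)
            out[i] = f0[prev] + w * (f0[nxt] - f0[prev])
    return out

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first).

    Solves the normal equations by Gauss-Jordan elimination with partial
    pivoting; needs more distinct xs than the polynomial degree.
    """
    n = degree + 1
    m = [[sum(x ** (r + c) for x in xs) for c in range(n)]
         + [sum(y * x ** r for x, y in zip(xs, ys))] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[r][n] / m[r][r] for r in range(n)]

def approximate_pitch_pattern(f0, degree=2):
    """Return a smooth polynomial contour sampled at the same frames as f0."""
    filled = fill_unvoiced(f0)
    xs = [i / max(len(filled) - 1, 1) for i in range(len(filled))]  # time in [0, 1]
    coeffs = polyfit(xs, filled, degree)
    return [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
```

A low polynomial degree is used here so that the fitted curve follows only the overall contour of an accent phrase and ignores the fine structure; a spline through representative points would serve the same purpose.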

When the approximate curve is obtained, prosodic features of the language may be added as constraints. In Japanese, for example, the following feature can be used as a constraint: the pitch rises to a peak within the first or second mora of the accent phrase, then falls gently until the accent fall, and drops sharply near the mora of the accent fall.

When unvoiced sections occur in succession, the position and frequency value of the peak and the pitch frequency value at the start point may be impossible to determine. In such a case, a provisional pitch pattern may be estimated. For example, a provisional pitch pattern can be estimated by replacing each unvoiced phoneme in the input text with a voiced phoneme whose features are close to those of the unvoiced phoneme. Pairs with a similar articulatory structure that differ only in voicing can be chosen, such as the voiceless alveolar plosive "t" and the voiced alveolar plosive "d".

Depending on the language, some unvoiced phonemes may have no voiced phoneme with a similar articulatory structure. In such a case, the unvoiced phoneme may be replaced with a suitable voiced phoneme, such as one that is unlikely to produce steep variations in the pitch pattern (for example, "n"). A devoiced vowel may be replaced with the corresponding voiced vowel.
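The substitution rule described above can be sketched as a small lookup. Only the pair "t"/"d" and the fallback "n" come from the text; the other pairs, the upper-case notation for devoiced vowels, and the function name are hypothetical.

```python
# Voiceless -> voiced pairs with similar articulation ("t" -> "d" is from the
# text; the remaining pairs are illustrative assumptions).
VOICED_COUNTERPART = {"p": "b", "t": "d", "k": "g", "s": "z"}

# Hypothetical notation: devoiced vowels written in upper case.
DEVOICED_VOWEL = {"I": "i", "U": "u"}

# Voiced phoneme that is unlikely to cause steep pitch variation.
FALLBACK = "n"

def substitute_unvoiced(phonemes, unvoiced):
    """Replace unvoiced phonemes so that a provisional pitch pattern can be estimated."""
    result = []
    for p in phonemes:
        if p in DEVOICED_VOWEL:
            result.append(DEVOICED_VOWEL[p])      # devoiced vowel -> voiced vowel
        elif p in unvoiced:
            result.append(VOICED_COUNTERPART.get(p, FALLBACK))
        else:
            result.append(p)                      # already voiced
    return result
```

The substituted phoneme string would be used only to estimate the provisional pitch pattern; the phonemes actually synthesized remain unchanged.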

The approximate pitch pattern output means 11 may also generate the approximate pitch pattern by selecting a pitch pattern prepared in advance. For example, a large amount of speech data is compiled into a database, and a pitch pattern is extracted from the speech data for each accent phrase beforehand. The approximate pitch pattern output means 11 may then generate the approximate pitch pattern from the extracted pitch patterns.

(Pitch pattern connection means)
On the basis of the approximate pitch pattern, the pitch pattern connection means 12 generates a connected pitch pattern in which the original utterance pitch pattern and the pitch pattern are connected, and outputs the generated connected pitch pattern.

On the basis of the approximate pitch pattern, the pitch pattern connection means 12 determines a connection point, which is the frequency value at which the original utterance pitch pattern and the pitch pattern are connected.

For example, when an original utterance pitch pattern is found whose syllable string and accent position match a syllable string contained in the pitch pattern, the boundary between the original utterance pitch pattern and the phoneme to which it is to be connected is taken as a boundary line. The frequency value at the intersection of the approximate pitch pattern and the boundary line can then be taken as the connection point.

Alternatively, the intersection of the approximate pitch pattern and the boundary line may be used as a reference value, and a point different from that intersection may be taken as the connection point. For example, the connection point may be the average of three frequency values on the boundary line: the intersection with the approximate pitch pattern, the value of the pitch pattern, and the value of the original utterance pitch pattern. When the pitch pattern and the original utterance pitch pattern change sufficiently smoothly across the boundary line, the frequency value of either the pitch pattern or the original utterance pitch pattern on the boundary line may be taken as the connection point.
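The options for choosing the connection point can be summarised in a short sketch. The function name, the `mode` argument and its string values are hypothetical; the three inputs are assumed to be the frequency values, in Hz, that each pattern takes on the boundary line.

```python
def connection_point(approx_hz, generated_hz, original_hz, mode="intersection"):
    """Return the connection frequency (Hz) on the boundary line.

    approx_hz    -- value of the approximate pitch pattern on the boundary
    generated_hz -- value of the fine-structure pitch pattern on the boundary
    original_hz  -- value of the original utterance pitch pattern on the boundary
    """
    if mode == "intersection":      # intersection with the approximate pitch pattern
        return approx_hz
    if mode == "average":           # mean of the three boundary values
        return (approx_hz + generated_hz + original_hz) / 3.0
    if mode == "generated":         # patterns already smooth: keep the generated value
        return generated_hz
    if mode == "original":          # patterns already smooth: keep the original value
        return original_hz
    raise ValueError(f"unknown mode: {mode}")
```

For example, with boundary values of 180 Hz, 174 Hz and 186 Hz, the "intersection" mode yields 180 Hz and the "average" mode also yields 180 Hz, whereas the "original" mode keeps 186 Hz.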

The pitch pattern and the original utterance pitch pattern may be connected as they are at the connection point, but a steep change in pitch at the connection point can cause audible artifacts when the waveform is generated later. It is therefore desirable to apply smoothing in the vicinity of the connection point between the pitch pattern and the original utterance pitch pattern. For example, the two phonemes adjacent to the connection point may be set as a smoothing section, and smoothing may be performed within that section. When the pitch pattern and the original utterance pitch pattern change sufficiently smoothly near the boundary line, they may instead be connected without smoothing by translating the pitch pattern or the original utterance pitch pattern along the frequency axis.
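One simple way to realise the smoothing described above is sketched below. The text does not specify the smoothing method; this sketch assumes per-frame contours in Hz and a correction that is largest at the boundary and fades linearly to zero across the smoothing section. The function and parameter names are hypothetical.

```python
def connect_with_smoothing(left, right, connect_hz, n_left, n_right):
    """Join two contours so that both meet at connect_hz on the boundary.

    left, right -- per-frame pitch values (Hz) before and after the boundary
    n_left      -- number of frames in the smoothing section before the boundary
    n_right     -- number of frames in the smoothing section after the boundary
    """
    left, right = list(left), list(right)
    d_left = connect_hz - left[-1]      # correction needed at the end of left
    d_right = connect_hz - right[0]     # correction needed at the start of right
    for k in range(min(n_left, len(left))):
        left[-1 - k] += d_left * (1 - k / n_left)
    for k in range(min(n_right, len(right))):
        right[k] += d_right * (1 - k / n_right)
    return left + right
```

In practice `n_left` and `n_right` would be the frame counts of the two phonemes adjacent to the connection point, so that frames outside the smoothing section keep their original values.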

Furthermore, to balance the pitch pattern of the whole sentence, after connecting the pitch pattern and the original utterance pitch pattern, the pitch pattern connection means 12 corrects the entire connected pitch pattern in the frequency direction on the basis of the peak value of the connected pitch pattern within the accent phrase. Specifically, the entire pitch pattern is translated in the frequency direction by the difference between the peak value of the connected pitch pattern and the peak value of the approximate pitch pattern. This processing may be skipped when, for example, the difference is small.
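The peak-based correction can be sketched as a uniform shift. The direction of the shift (moving the connected pattern so that its peak matches the peak of the approximate pitch pattern) and the threshold `min_diff_hz` below which the correction is skipped are assumptions; the text gives neither.

```python
def shift_to_match_peak(connected, approx, min_diff_hz=1.0):
    """Translate the connected pitch pattern so its peak matches the approximate one.

    connected, approx -- per-frame pitch values (Hz) within one accent phrase
    min_diff_hz       -- hypothetical threshold below which no shift is applied
    """
    diff = max(approx) - max(connected)
    if abs(diff) < min_diff_hz:
        return list(connected)          # difference is small: leave unchanged
    return [f + diff for f in connected]
```

Because every frame is shifted by the same amount, the fine structure and the shape of the contour are preserved and only the overall level changes.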

Next, with reference to FIG. 2, a speech synthesizer 2 is described in which components are added to the approximate pitch pattern output means 11 and the pitch pattern connection means 12 in order to realize the functions of the speech synthesizer 1 according to the embodiment of the present invention. The data flow is omitted from FIG. 2.

The speech synthesizer 2 includes the approximate pitch pattern output means 11 and the pitch pattern connection means 12 described above. The speech synthesizer 2 also includes language analysis means 13, pitch pattern generation means 14, phoneme duration generation means 15, original utterance pitch pattern selection means 16, unit waveform selection means 17, and speech waveform generation means 18. The speech synthesizer 2 further includes pitch pattern generation model storage means 21, phoneme duration generation model storage means 22, original utterance pitch pattern storage means 23, and unit waveform storage means 24.

Each component is described here. Details are given later in the description of the embodiments and examples.

When input text is supplied, the language analysis means 13 uses language analysis technology to generate context information for the input text. The context information combines the phoneme string corresponding to the input text with information attached to each phoneme, such as its surrounding environment and position.

On the basis of the generated context information, the pitch pattern generation means 14 generates pitch pattern information using the pitch pattern generation model stored in the pitch pattern generation model storage means 21.

The phoneme duration generation means 15 generates phoneme durations using the phoneme duration generation model stored in the phoneme duration generation model storage means 22.

The approximate pitch pattern output means 11 generates the approximate pitch pattern on the basis of the pitch pattern information.

The original utterance pitch pattern selection means 16 searches the original utterance pitch patterns on the basis of the syllable string information stored in the original utterance pitch pattern storage means 23, and selects original utterance pitch pattern data containing a pitch pattern that matches at least part of the syllable string information. The original utterance pitch pattern selection means 16 then searches the selected original utterance pitch pattern data for the portion whose syllable string (the syllable in question and those preceding and following it) and accent position match, and takes that portion as the original utterance pitch pattern. Furthermore, on the basis of the search result, the original utterance pitch pattern selection means 16 determines the position at which the original utterance pitch pattern is connected to the pitch pattern having the fine structure.
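The search described above can be sketched as a longest-common-run search over syllables annotated with accent information. The data layout (a list of `(syllable, is_accent_nucleus)` tuples per utterance), the longest-match criterion, the minimum match length, and the function name are all assumptions for illustration.

```python
def find_match(target, entries, min_len=2):
    """Find the longest run of (syllable, accent) pairs shared with a stored utterance.

    target  -- list of (syllable, is_accent_nucleus) for the text to synthesize
    entries -- list of stored utterances in the same format
    Returns (entry_index, target_start, entry_start, length), or None if no
    run of at least min_len syllables matches.
    """
    best = None
    for e, entry in enumerate(entries):
        for i in range(len(target)):
            for j in range(len(entry)):
                k = 0
                while (i + k < len(target) and j + k < len(entry)
                       and target[i + k] == entry[j + k]):
                    k += 1
                if k >= min_len and (best is None or k > best[3]):
                    best = (e, i, j, k)
    return best
```

The returned `target_start` and `length` indicate where in the generated pitch pattern the original utterance pitch pattern would be connected; comparing tuples ensures that a syllable matches only when its accent status matches as well.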

The pitch pattern connection means 12 connects the original utterance pitch pattern selected by the original utterance pitch pattern selection means 16 and the pitch pattern generated by the pitch pattern generation means 14 on the basis of the approximate pitch pattern, and generates a connected pitch pattern. The connected pitch pattern generated by connecting the original utterance pitch pattern and the pitch pattern having the fine structure serves as the target value of the pitch frequency.

The unit waveform selection means 17 selects unit waveform data stored in the unit waveform storage means 24 on the basis of prosodic information that combines the connected pitch pattern with the generated phoneme durations. In waveform-editing speech synthesis with unit selection, the pitch frequency is raised or lowered by adjusting the interval (pitch interval) at which the selected unit waveform data are overlapped.
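The relation between pitch frequency and pitch interval can be illustrated with a short sketch: a pitch frequency of f Hz corresponds to an interval of 1/f seconds between successive unit waveforms. The function name and the callable `f0_at` (assumed to return a positive target pitch in Hz for a given time) are hypothetical.

```python
def pitch_marks(f0_at, duration):
    """Return the times (s) at which unit waveforms are overlapped.

    f0_at    -- function mapping time (s) to the target pitch frequency (Hz, > 0)
    duration -- length of the voiced section in seconds
    """
    marks, t = [], 0.0
    while t < duration:
        marks.append(t)
        t += 1.0 / f0_at(t)     # pitch interval is the reciprocal of the pitch
    return marks
```

A constant target of 100 Hz gives marks 10 ms apart; raising the target pitch shortens the interval, so more unit waveforms are placed in the same duration.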

The speech waveform generation means 18 generates a synthesized speech waveform by editing the selected unit waveform data so as to reproduce the generated prosodic information.

The pitch pattern generation model storage means 21 stores the pitch pattern generation model. The phoneme duration generation model storage means 22 stores the phoneme duration generation model. The pitch pattern generation model and the phoneme duration generation model may be stored in the same database.

The original utterance pitch pattern storage means 23 stores the original utterance pitch patterns and syllable string data indicating the utterance content.

The unit waveform storage means 24 stores unit waveform data.

This concludes the description of the speech synthesizer 1 according to the embodiment of the present invention.

As described above, according to the present embodiment, pitch patterns that include a fine structure can be connected smoothly, and synthesized speech with natural prosodic information can be generated.

Next, speech synthesizers according to embodiments that embody the speech synthesizer 1 of the present embodiment are described.

(First embodiment)
FIG. 3 is a block diagram showing a configuration example of a speech synthesizer 10 according to the first embodiment of the present invention. FIG. 4 is a flowchart showing an example of the operation of this embodiment.

Referring to FIG. 3, the speech synthesizer 10 according to this embodiment includes an approximate pitch pattern generation unit 111, which serves as the approximate pitch pattern output means, and a pitch pattern connection unit 120, which serves as the pitch pattern connection means.

The operation of this embodiment is described with reference to FIGS. 3 and 4.

The approximate pitch pattern generation unit 111 generates an approximate pitch pattern that approximately represents the pitch pattern (P1) (step S11). P1 is a pitch pattern that includes a fine structure, and the approximate pitch pattern is a nonlinear approximate curve that expresses the contour of P1.

The pitch pattern connection unit 120 determines the connection point between P1 and the pitch pattern (P2) to be connected to it (step S12). P2 corresponds to the original utterance pitch pattern.

The pitch pattern connection unit 120 then connects P1 and P2 at the connection point (step S13).

As described above, according to the first embodiment, an approximate pitch pattern is generated so that pitch patterns including a fine structure can be connected smoothly, and synthesized speech with natural prosodic information can be generated.

(Second embodiment)
Next, a second embodiment of the present invention is described.

図５は、本発明の第２の実施形態に係る音声合成装置２０の構成例を示すブロック図である。図６は本実施形態の動作の一例を示すフローチャートである。 FIG. 5 is a block diagram showing a configuration example of the speech synthesizer 20 according to the second embodiment of the present invention. FIG. 6 is a flowchart showing an example of the operation of this embodiment.

図５を参照すると、本実施形態に係る音声合成装置２０は、近似ピッチパタン出力手段である近似ピッチパタン選択部１１２及び近似ピッチパタン記憶部１１３と、ピッチパタン接続手段であるピッチパタン接続部１２０と、を備えて構成されている。第２の実施形態に係る音声合成装置２０は、第１の実施形態に係る音声合成装置１０の近似ピッチパタン生成部１１１を近似ピッチパタン選択部１１２及び近似ピッチパタン記憶部１１３で置換した構成をとる。 Referring to FIG. 5, the speech synthesizer 20 according to the present embodiment includes an approximate pitch pattern selection unit 112 and an approximate pitch pattern storage unit 113, which serve as the approximate pitch pattern output means, and a pitch pattern connection unit 120, which serves as the pitch pattern connection means. The speech synthesizer 20 according to the second embodiment has a configuration in which the approximate pitch pattern generation unit 111 of the speech synthesizer 10 according to the first embodiment is replaced with the approximate pitch pattern selection unit 112 and the approximate pitch pattern storage unit 113.

図５と図６を用いて、本実施形態の動作について説明する。 The operation of this embodiment will be described with reference to FIGS. 5 and 6.

近似ピッチパタン記憶部１１３には、微細構造を含まない近似ピッチパタンが少なくとも１つ以上予め記憶されている。 The approximate pitch pattern storage unit 113 stores in advance at least one approximate pitch pattern that does not include a fine structure.

近似ピッチパタン選択部１１２は、微細構造を含むピッチパタン（Ｐ１）と、近似ピッチパタン記憶部１１３に記憶されている各近似ピッチパタンと、の距離計算を行う（ステップＳ２１）。 The approximate pitch pattern selection unit 112 calculates the distance between the pitch pattern (P1) including the fine structure and each approximate pitch pattern stored in the approximate pitch pattern storage unit 113 (step S21).

近似ピッチパタン選択部１１２は、近似ピッチパタン記憶部１１３に記憶されている各近似ピッチパタンのうち、最も距離の小さいものを近似ピッチパタンとして選択する（ステップＳ２２）。 From among the approximate pitch patterns stored in the approximate pitch pattern storage unit 113, the approximate pitch pattern selection unit 112 selects the one with the smallest distance as the approximate pitch pattern (step S22).
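
Steps S21 and S22 amount to a nearest-neighbour search. The sketch below is one hedged way to express them, assuming each pattern is a list of pitch values in Hz, that patterns are linearly resampled to a common length before comparison, and that the Euclidean distance is used. The names `resample`, `euclidean`, and `select_nearest` and the value of `n_points` are hypothetical.

```python
import math


def resample(pattern, n_points):
    """Linearly resample a pitch pattern to n_points values (n_points >= 2)."""
    if len(pattern) == 1:
        return [pattern[0]] * n_points
    out = []
    for i in range(n_points):
        pos = i * (len(pattern) - 1) / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(pattern) - 1)
        frac = pos - lo
        out.append(pattern[lo] * (1.0 - frac) + pattern[hi] * frac)
    return out


def euclidean(a, b):
    """Euclidean distance between two equal-length patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_nearest(p1, stored, n_points=16):
    """Step S21: compute distances; step S22: return the index of the closest."""
    target = resample(p1, n_points)
    distances = [euclidean(target, resample(c, n_points)) for c in stored]
    return min(range(len(stored)), key=distances.__getitem__)


# Hypothetical stored approximate patterns: flat, rise-fall, falling.
stored = [[100.0, 100.0, 100.0], [100.0, 150.0, 100.0], [200.0, 180.0, 160.0]]
p1 = [101.0, 126.0, 149.0, 124.0, 102.0]  # rise-fall with small jitter
best_index = select_nearest(p1, stored)
```

A different distance measure can be substituted by replacing `euclidean`, without changing the selection logic.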

ピッチパタン接続部１２０は、Ｐ１と接続するピッチパタン（Ｐ２）との接続点を決定する（ステップＳ２３）。Ｐ２は、元発話ピッチパタンに相当する。 The pitch pattern connection unit 120 determines a connection point between P1 and the pitch pattern (P2) to be connected (step S23). P2 corresponds to the original utterance pitch pattern.

そして、ピッチパタン接続部１２０は、接続点においてＰ１とＰ２を接続する（ステップＳ２４）。 And the pitch pattern connection part 120 connects P1 and P2 in a connection point (step S24).

以上のように、第２の実施形態によれば、近似ピッチパタンを逐次生成しなくても、近似ピッチパタンを使用してピッチパタン同士を接続することが可能となる。 As described above, according to the second embodiment, it is possible to connect pitch patterns using approximate pitch patterns without sequentially generating approximate pitch patterns.

続いて、本発明の実施形態に係る音声合成装置及び音声合成システムについて、実施例を挙げて説明する。 Next, the speech synthesis apparatus and speech synthesis system according to the embodiment of the present invention will be described with reference to examples.

（実施例１） (Example 1)
図７は、本発明の第１の実施形態に係る実施例１の音声合成装置１００の概要を示す図である。図７においては、各構成要素間でやり取りするデータ要素を各構成要素間に加えて図示した。 FIG. 7 is a diagram showing an overview of the speech synthesizer 100 of Example 1 according to the first embodiment of the present invention. In FIG. 7, the data elements exchanged between the constituent elements are shown between those elements.

実施例１に係る音声合成装置１００は、近似ピッチパタン生成部１１１と、ピッチパタン接続部１２０と、を備える。また、音声合成装置１００は、言語解析部１３０と、ピッチパタン生成部１４０と、音素継続長生成部１５０と、元発話ピッチパタン選択部１６０と、ピッチパタン生成モデル記憶部２１１と、元発話ピッチパタン記憶部２１３と、音素継続長生成モデル記憶部２１２と、を備える。さらに、音声合成装置１００は、単位波形選択部１７０と、音声波形生成部１８０と、単位波形記憶部２１４と、を備える。なお、実施例１では、近似ピッチパタン生成部１１１について、第１の実施形態と同じ符号を用いる。 The speech synthesizer 100 according to Example 1 includes an approximate pitch pattern generation unit 111 and a pitch pattern connection unit 120. The speech synthesizer 100 also includes a language analysis unit 130, a pitch pattern generation unit 140, a phoneme duration generation unit 150, an original utterance pitch pattern selection unit 160, a pitch pattern generation model storage unit 211, an original utterance pitch pattern storage unit 213, and a phoneme duration generation model storage unit 212. Furthermore, the speech synthesizer 100 includes a unit waveform selection unit 170, a speech waveform generation unit 180, and a unit waveform storage unit 214. In Example 1, the approximate pitch pattern generation unit 111 is denoted by the same reference numeral as in the first embodiment.

実施例１では、韻律情報を基本周波数（ピッチ周波数）の時間変化、すなわちピッチパタン及び音素継続長の２つの情報からなるものとして扱う。なお、韻律情報としては、音声波形の短時間パワーの時間変化等も韻律情報に含めることも考えられる。 In Example 1, prosodic information is treated as consisting of two pieces of information: the pitch pattern, that is, the temporal change of the fundamental frequency (pitch frequency), and the phoneme duration. Prosodic information could also include, for example, the temporal change of the short-time power of the speech waveform.

また、実施例１では、微細構造を含むピッチパタンを生成する技術として、ＨＭＭ音声合成技術を用いることを想定する（ＨＭＭ：ＨｉｄｄｅｎＭａｒｋｏｖＭｏｄｅｌ）。もちろん、微細構造を含むピッチパタンが生成可能であれば、ＨＭＭ音声合成技術ではない音声合成技術を用いても構わない。 In Example 1, it is assumed that HMM (Hidden Markov Model) speech synthesis is used as the technique for generating a pitch pattern including a fine structure. Of course, any speech synthesis technique other than HMM speech synthesis may be used as long as it can generate a pitch pattern including a fine structure.

続いて、実施例１の音声合成装置１００の各構成要素について詳細に説明する。 Next, each component of the speech synthesizer 100 of Example 1 will be described in detail.

言語解析部１３０は、入力テキストが入力されると、入力テキストに関するコンテクスト情報を生成する。 When the input text is input, the language analysis unit 130 generates context information regarding the input text.

ピッチパタン生成部１４０は、生成されたコンテクスト情報に基づいて、ピッチパタン生成モデル記憶部２１１に格納されたピッチパタン生成モデルを用いてピッチパタン情報を生成する。 The pitch pattern generation unit 140 generates pitch pattern information using the pitch pattern generation model stored in the pitch pattern generation model storage unit 211 based on the generated context information.

音素継続長生成部１５０は、音素継続長生成モデル記憶部２１２に格納された音素継続長生成モデルを用いて音素継続長を生成する。 The phoneme duration generation unit 150 generates a phoneme duration by using the phoneme duration generation model stored in the phoneme duration generation model storage unit 212.

なお、実施例１では、ピッチパタンと音素継続長とを別々に生成しているが、両者を併せた形の情報を生成しても構わない。 In Example 1, the pitch pattern and the phoneme duration are generated separately, but information combining the two may be generated instead.

ピッチパタン生成部１４０によって生成されたピッチパタンは、音素内のような短時間で細かく変化するような微細構造を含んでいる。ただし、「ｋ」や「ｓ」といった無声音の区間についてはピッチが存在しないため、ピッチパタンは生成されない。 The pitch pattern generated by the pitch pattern generation unit 140 includes a fine structure that changes finely in a short time as in a phoneme. However, since there is no pitch for the unvoiced sound section such as “k” or “s”, a pitch pattern is not generated.

図８に、入力テキストが「お出かけください」であった場合に生成されたピッチパタンの具体例（実線）を示す。図８においては、有声音の区間である「ｏｄｅ」、「ａ」、「ｅ」、「ｕｄａ」、「ａｉ」については、ピッチパタン（実線）が得られる。それに対し、無声音の区間である「ｋ」、「ｓ」については、ピッチパタンが得られない。 FIG. 8 shows a specific example (solid line) of the pitch pattern generated when the input text is “please go out”. In FIG. 8, pitch patterns (solid lines) are obtained for voiced sound sections “ode”, “a”, “e”, “uda”, and “ai”. On the other hand, a pitch pattern cannot be obtained for “k” and “s” which are unvoiced sound sections.

近似ピッチパタン生成部１１１は、生成されたピッチパタン情報を基にして、近似的に表現した近似ピッチパタンを生成する。 The approximate pitch pattern generation unit 111 generates an approximate pitch pattern that is approximately expressed based on the generated pitch pattern information.

図９に、生成された近似ピッチパタンの具体例を示す。図９においては、有声音の区間で得られたピッチパタンに基づいて、近似ピッチパタン（一点鎖線）が生成されている。 FIG. 9 shows a specific example of the generated approximate pitch pattern. In FIG. 9, an approximate pitch pattern (one-dot chain line) is generated based on the pitch pattern obtained in the voiced sound section.

元発話ピッチパタン記憶部２１３には、元発話ピッチパタン及び発声内容を示す音節列データが記憶されている。元発話ピッチパタンは、収録音声のピッチ周波数の微細変化を含むピッチパタンを忠実に再現するパタンであり、時刻とピッチ周波数の数値を持つ節点により表現される。 The original utterance pitch pattern storage unit 213 stores original utterance pitch patterns together with syllable string data indicating the utterance content. An original utterance pitch pattern faithfully reproduces the pitch pattern of the recorded speech, including minute changes in pitch frequency, and is expressed by nodes that each hold a time and a pitch frequency value.

また、実施例１では、「ｏｈｅＮｊｉｋｕｄａｓａ＠ｉ（おへんじください）」という発話内容の収録音声を表現する元発話ピッチパタンが記憶されているものとする。ここで、「＠」の直前の母音は、標準語におけるアクセント位置を示している。 In Example 1, it is assumed that an original utterance pitch pattern representing recorded speech with the utterance content “o he N ji ku da sa @ i” (おへんじください, “please reply”) is stored. Here, the vowel immediately before “@” indicates the accent position in standard Japanese.

元発話ピッチパタン選択部１６０は、元発話ピッチパタン記憶部２１３に記憶されている音節列情報に基づいて、元発話ピッチパタン記憶部２１３を検索し、音節列情報の少なくとも一部と一致する元発話ピッチパタンを選択する。例えば、図８にあるように「おでかけください（お出かけください）」という入力テキストが入力された場合、前記コンテクスト情報より、音節列は「ｏｄｅｋａｋｅｋｕｄａｓａ＠ｉ」となる。 The original utterance pitch pattern selection unit 160 searches the original utterance pitch pattern storage unit 213 based on the syllable string information stored there, and selects an original utterance pitch pattern that matches at least part of the syllable string information. For example, when the input text おでかけください (お出かけください, “please go out”) is input as in FIG. 8, the syllable string obtained from the context information is “o de ka ke ku da sa @ i”.

元発話ピッチパタン選択部１６０は、元発話ピッチパタン記憶部２１３内の元発話ピッチパタンデータから、当該／前方／後方の音節列及びアクセント位置が一致する部分を検索する。元発話ピッチパタン選択部１６０は、検索結果に基づいて、元発話ピッチパタンを接続する位置を決定する。 The original utterance pitch pattern selection unit 160 searches the original utterance pitch pattern data in the original utterance pitch pattern storage unit 213 for a portion in which the current, preceding, and following syllable strings and the accent position match. Based on the search result, the original utterance pitch pattern selection unit 160 determines the position at which the original utterance pitch pattern is to be connected.

実施例１の場合、「ｏｈｅＮｊｉｋｕｄａｓａ＠ｉ」の「ｄａｓａ＠ｉ」の部分が、音節列及びアクセント位置の両方において、「ｏｄｅｋａｋｅｋｕｄａｓａ＠ｉ」に含まれる音節列に一致している。そのため、「ｄａｓａ＠ｉ」が検索結果に該当し、元発話ピッチパタンとして使用できる。このようにして、当該アクセント句内の元発話ピッチパタンが選択される。 In Example 1, the “da sa @ i” portion of “o he N ji ku da sa @ i” matches, in both syllable string and accent position, a syllable string contained in “o de ka ke ku da sa @ i”. Therefore, “da sa @ i” is the search result and can be used as the original utterance pitch pattern. In this way, the original utterance pitch pattern within the accent phrase is selected.

また、元発話ピッチパタン選択部１６０は、音節列「ｄａｓａ＠ｉ」を接続する境界線を、音節列「ｏｈｅＮｊｉｋｕｄａｓａ＠ｉ」を構成する音節である「ｋｕ」と「ｄａ」との間に決定する。 The original utterance pitch pattern selection unit 160 also sets the boundary line at which the syllable string “da sa @ i” is connected between the syllables “ku” and “da” of the syllable string “o he N ji ku da sa @ i”.
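
The search described above can be illustrated with a small sketch. It rests on one possible reading of the matching rule, stated here as an assumption: the longest run of identical syllables (accent mark included) is found, and an edge syllable is dropped when the syllable adjacent to it outside the run differs between the two strings. Under that reading, “ku” is excluded because it follows “ke” in the input but “ji” in the stored utterance. The function name `usable_span` and the token form `"sa@"` are hypothetical.

```python
def usable_span(target, stored):
    """Return (start_in_target, start_in_stored, length) of the reusable part.

    Assumed rule: take the longest common run of syllables, then trim an edge
    syllable whose outer neighbour differs between target and stored.
    """
    best = (0, 0, 0)
    for i in range(len(target)):
        for j in range(len(stored)):
            k = 0
            while (i + k < len(target) and j + k < len(stored)
                   and target[i + k] == stored[j + k]):
                k += 1
            if k > best[2]:
                best = (i, j, k)
    i, j, k = best
    if k == 0:
        return best
    # Preceding context: None stands for the start of the accent phrase.
    prev_t = target[i - 1] if i > 0 else None
    prev_s = stored[j - 1] if j > 0 else None
    if prev_t != prev_s:
        i, j, k = i + 1, j + 1, k - 1
    # Following context: None stands for the end of the accent phrase.
    next_t = target[i + k] if i + k < len(target) else None
    next_s = stored[j + k] if j + k < len(stored) else None
    if k > 0 and next_t != next_s:
        k -= 1
    return (i, j, k)


# "@" marks the accent position, attached here to the accented syllable.
target = ["o", "de", "ka", "ke", "ku", "da", "sa@", "i"]
stored = ["o", "he", "N", "ji", "ku", "da", "sa@", "i"]
start_t, start_s, length = usable_span(target, stored)
matched = target[start_t:start_t + length]
```

With these inputs the reusable part is “da sa@ i”, and its start index gives the boundary between “ku” and “da”, in line with the example in the text.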

ピッチパタン接続部１２０は、選択された元発話ピッチパタン（以下、Ｐ_ｏ）と、ピッチパタン生成部１４０で生成されたピッチパタン（以下、Ｐ_ｇ）と、を接続し、当該アクセント句の接続ピッチパタンを生成する。 The pitch pattern connection unit 120 connects the selected original utterance pitch pattern (hereinafter, P_o) and the pitch pattern generated by the pitch pattern generation unit 140 (hereinafter, P_g) to generate the connection pitch pattern of the accent phrase.

ここで、ピッチパタン接続部１２０によるピッチパタン接続方法の具体例を、図１０を用いて説明する。 Here, a specific example of the pitch pattern connection method by the pitch pattern connection unit 120 will be described with reference to FIG.

まず、ピッチパタン接続部１２０は、生成された近似ピッチパタンに基づいて、Ｐ_ｏとＰ_ｇとを接続する周波数値である接続点を決定する。 First, based on the generated approximate pitch pattern, the pitch pattern connection unit 120 determines the connection point, which is the frequency value at which P_o and P_g are connected.

元発話ピッチパタン選択部１６０で決定された通り、Ｐ_ｏとＰ_ｇが接続されるのは、音節「ｋｕ」と「ｄａ」との境界線となる。ピッチパタン接続部１２０は、生成された近似ピッチパタンの曲線と境界線との交点の周波数値を接続点として決定する。 As determined by the original utterance pitch pattern selection unit 160, P_o and P_g are connected at the boundary line between the syllables “ku” and “da”. The pitch pattern connection unit 120 determines, as the connection point, the frequency value at the intersection of the generated approximate pitch pattern curve and the boundary line.

あるいは、境界線と近似ピッチパタン曲線の交点を接続点とせずに、交点を参考値として接続点を決定してもよい。例えば、生成された近似ピッチパタンの曲線と境界線との交点、「Ｐ_ｏ」及び「Ｐ_ｇ」の境界線上における周波数値の平均値を接続点としてもよい。また、Ｐ_ｏ及びＰ_ｇが境界線近傍において十分滑らかに変化している場合、Ｐ_ｏあるいはＰ_ｇどちらかの境界線上における周波数値を接続点としてもよい。 Alternatively, instead of using the intersection of the boundary line and the approximate pitch pattern curve directly as the connection point, the connection point may be determined using the intersection as a reference value. For example, the connection point may be the average of three frequency values on the boundary line: the value at the intersection with the approximate pitch pattern curve, the value of P_o, and the value of P_g. In addition, when P_o and P_g change sufficiently smoothly in the vicinity of the boundary line, the frequency value of either P_o or P_g on the boundary line may be used as the connection point.
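
The alternatives for choosing the connection point can be summarized in a short sketch. The function name `connection_frequency` and the `mode` strings are hypothetical; the three inputs are the frequency values, in Hz, of the approximate pitch pattern, P_o, and P_g on the boundary line.

```python
def connection_frequency(approx_hz, p_o_hz, p_g_hz, mode="intersection"):
    """Pick the connection-point frequency on the boundary line.

    mode (illustrative labels, not from the patent):
      "intersection": value of the approximate pitch pattern at the boundary
      "average":      mean of the intersection value, P_o, and P_g
      "p_o" / "p_g":  value of P_o or P_g, for smoothly varying patterns
    """
    if mode == "intersection":
        return approx_hz
    if mode == "average":
        return (approx_hz + p_o_hz + p_g_hz) / 3.0
    if mode == "p_o":
        return p_o_hz
    if mode == "p_g":
        return p_g_hz
    raise ValueError("unknown mode: " + mode)


# Hypothetical boundary values in Hz.
f_conn = connection_frequency(210.0, 200.0, 220.0, mode="average")
```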

なお、接続点において、Ｐ_ｏ及びＰ_ｇをそのまま接続してもよいが、接続点においてピッチが急峻に変わり、後に波形を生成した際に異音が混入する原因ともなることもある。そのため、接続点を含む範囲において、ピッチパタンをスムージングすることが望ましい。 P_o and P_g may be connected as they are at the connection point, but the pitch may then change abruptly at the connection point, which can introduce audible artifacts when the waveform is generated later. It is therefore desirable to smooth the pitch pattern over a range that includes the connection point.

図１０では、接続点に隣接する２音素「ｕ」と「ｄ」とをスムージング区間と設定し、この区間内でスムージングするものとした。なお、Ｐ_ｏ及びＰ_ｇが境界線近傍において十分滑らかに変化している場合等は、スムージングを行わずに、Ｐ_ｏあるいはＰ_ｇを周波数軸方向（縦軸上向方向）に平行移動することで接続しても構わない。 In FIG. 10, the two phonemes “u” and “d” adjacent to the connection point are set as the smoothing interval, and smoothing is performed within this interval. When, for example, P_o and P_g change sufficiently smoothly in the vicinity of the boundary line, they may instead be connected without smoothing by translating P_o or P_g along the frequency axis (the vertical axis).
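
The smoothing method itself is not specified in the text. As one hedged illustration, the sketch below pulls both patterns to the connection frequency at the boundary and lets the correction fade linearly to zero across a window on each side. Frame-based windows stand in for the two-phoneme interval of FIG. 10, and the name `connect_with_smoothing` and the linear fade are assumptions.

```python
def connect_with_smoothing(p_g, p_o, f_conn, window):
    """Join P_g (before the boundary) and P_o (after it) at f_conn.

    p_g, p_o: per-frame pitch values in Hz.
    window:   number of frames on each side to adjust (assumed parameter).
    """
    left = list(p_g)
    right = list(p_o)
    d_left = f_conn - left[-1]    # correction needed at the end of P_g
    d_right = f_conn - right[0]   # correction needed at the start of P_o
    n_left = min(window, len(left))
    for k in range(n_left):
        weight = 1.0 - k / n_left  # 1 at the boundary, fading outward
        left[-1 - k] += weight * d_left
    n_right = min(window, len(right))
    for k in range(n_right):
        weight = 1.0 - k / n_right
        right[k] += weight * d_right
    return left + right


# Hypothetical frames around the boundary between "ku" and "da".
p_g = [200.0, 205.0, 210.0, 220.0]
p_o = [190.0, 185.0, 180.0, 175.0]
joined = connect_with_smoothing(p_g, p_o, f_conn=205.0, window=2)
```

Setting `window` to the full pattern length does not reproduce the parallel translation mentioned in the text, since the correction still fades with distance; a pure translation would add the same offset to every frame.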

さらに、ピッチパタン接続部１２０は、文全体のピッチパタンのバランスを取るために、Ｐ_ｏ及びＰ_ｇを接続した後、アクセント句内における接続ピッチパタンのピーク値に基づいて、接続ピッチパタン全体を周波数方向に補正する処理を行う。 Furthermore, to balance the pitch pattern of the sentence as a whole, after connecting P_o and P_g the pitch pattern connection unit 120 corrects the entire connection pitch pattern in the frequency direction based on the peak value of the connection pitch pattern within the accent phrase.

具体的には、接続ピッチパタンと近似ピッチパタンのピーク値との差分だけ、ピッチパタン全体を周波数方向に平行移動させる。ただし、差分が小さい場合は、この処理を行わなくても構わない。 Specifically, the entire pitch pattern is translated in the frequency direction by the difference between the peak value of the connection pitch pattern and the peak value of the approximate pitch pattern. This processing may be skipped when the difference is small.
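
The peak-based correction can be sketched as follows, assuming the shift is chosen so that the peak of the connection pitch pattern coincides with the peak of the approximate pitch pattern. The name `align_peak` and the threshold `min_shift_hz`, which stands for "the difference is small", are hypothetical.

```python
def align_peak(connected, approximate, min_shift_hz=1.0):
    """Translate the connection pitch pattern so its peak matches the
    peak of the approximate pitch pattern (values in Hz)."""
    shift = max(approximate) - max(connected)
    if abs(shift) < min_shift_hz:
        return list(connected)  # difference is small: leave unchanged
    return [f + shift for f in connected]


# Hypothetical patterns within one accent phrase.
corrected = align_peak([200.0, 230.0, 210.0], [205.0, 220.0, 200.0])
```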

単位波形選択部１７０は、接続ピッチパタンと生成した音素継続長とを併せた韻律情報に基づいて、単位波形記憶部２１４内に記憶された単位波形データを選択する。単位波形データは、収録音声から予め生成されたものであり、合成音を構成する最小単位となる音声波形を指す。ここで、ピッチパタン内の元発話ピッチパタン区間については、対応する元発話の単位波形データを使用する。 The unit waveform selection unit 170 selects unit waveform data stored in the unit waveform storage unit 214 based on the prosodic information that combines the connection pitch pattern and the generated phoneme duration. The unit waveform data is generated in advance from recorded speech and indicates a speech waveform that is a minimum unit constituting a synthesized sound. Here, for the original utterance pitch pattern section in the pitch pattern, the corresponding unit waveform data of the original utterance is used.

最後に、音声波形生成部１８０は、生成した韻律情報を再現するように、選択された単位波形データを編集することによって、合成音声波形を生成する。合成音声波形の生成に関しては、例えば単位波形をピッチパタンに基づいて並べて波形重畳していけばよい。 Finally, the speech waveform generation unit 180 generates a synthesized speech waveform by editing the selected unit waveform data so as to reproduce the generated prosodic information. Regarding the generation of the synthesized speech waveform, for example, unit waveforms may be arranged on the basis of the pitch pattern and superimposed.
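
Arranging unit waveforms according to the pitch pattern and superimposing them can be sketched as a simple overlap-add. The sketch assumes one unit waveform per pitch mark, marks spaced by one pitch period, and no windowing or duration control; `pitch_marks`, `overlap_add`, the sample rate, and the three-sample unit waveform are placeholders for illustration.

```python
def pitch_marks(f0_at, n_samples, sample_rate):
    """Place pitch marks (sample indices) one pitch period apart.

    f0_at: function mapping time in seconds to pitch frequency in Hz.
    """
    marks = []
    pos = 0
    while pos < n_samples:
        marks.append(pos)
        period = round(sample_rate / f0_at(pos / sample_rate))
        pos += max(1, period)
    return marks


def overlap_add(unit, marks, n_samples):
    """Centre a copy of the unit waveform on each mark and sum the overlaps."""
    out = [0.0] * n_samples
    half = len(unit) // 2
    for m in marks:
        for k, v in enumerate(unit):
            idx = m - half + k
            if 0 <= idx < n_samples:
                out[idx] += v
    return out


# Constant 100 Hz pitch at a 16 kHz sample rate gives a 160-sample period.
marks = pitch_marks(lambda t: 100.0, n_samples=800, sample_rate=16000)
wave = overlap_add([0.5, 1.0, 0.5], marks, n_samples=800)
```

In practice a different unit waveform would be selected per mark by the unit waveform selection unit 170; a single fixed unit is used here only to keep the sketch short.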

なお、実施例１において、元発話ピッチパタン選択部１６０は、１つの元発話ピッチパタンを選択しているが、音素列等の条件が適合するデータが複数存在する場合は、複数の元発話ピッチパタンを候補として挙げてもよい。 In the first embodiment, the original utterance pitch pattern selection unit 160 selects one original utterance pitch pattern. However, when there are a plurality of pieces of data satisfying conditions such as phoneme sequences, a plurality of original utterance pitches are selected. A pattern may be cited as a candidate.

音素列等の条件が適合するデータが複数存在する場合、ピッチパタン接続部１２０は、挙げられた元発話ピッチパタン候補の中から１つの最適な元発話ピッチパタンを決定する。候補の中から１つの元発話ピッチパタンを決定する方法は、元発話ピッチパタンが持つピッチ周波数情報に基づいて、接続の際のピッチ周波数変更量が最も少ないものを選択するという方法が考えられる。これは、ピッチ周波数の変更量が少ないほど、音声波形生成部１８０において合成音声波形を生成した際に、音質の劣化を抑えられるためである。 When there are a plurality of pieces of data that satisfy a condition such as a phoneme string, the pitch pattern connection unit 120 determines one optimal original utterance pitch pattern from the listed original utterance pitch pattern candidates. As a method of determining one original utterance pitch pattern from the candidates, a method of selecting the one having the smallest amount of change in pitch frequency at the time of connection can be considered based on the pitch frequency information of the original utterance pitch pattern. This is because the smaller the change amount of the pitch frequency, the more the deterioration of the sound quality can be suppressed when the synthesized waveform is generated by the speech waveform generator 180.
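
Choosing among several candidates can be sketched as a minimum search over the required pitch change. The sketch assumes each candidate joins the generated pattern at its first frame and that the change is measured only at that frame; the dictionary layout and the name `pick_candidate` are hypothetical.

```python
def pick_candidate(candidates, connection_hz):
    """Return the candidate whose pitch at the joining frame is closest
    to the connection-point frequency, i.e. needs the smallest change."""
    return min(candidates, key=lambda c: abs(c["pattern"][0] - connection_hz))


# Hypothetical candidates for the same syllable string.
candidates = [
    {"name": "utterance_a", "pattern": [180.0, 175.0, 170.0]},
    {"name": "utterance_b", "pattern": [204.0, 190.0, 182.0]},
]
chosen = pick_candidate(candidates, connection_hz=205.0)
```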

実施例１によれば、微細構造を持つピッチパタンと、元発話ピッチパタンがアクセント句内のピッチパタン概形を保つようにして滑らかに接続されるため、肉声に近い安定した韻律を持つ合成音声を生成することが可能となる。 According to the first embodiment, since the pitch pattern having a fine structure and the original utterance pitch pattern are smoothly connected so as to maintain the outline of the pitch pattern in the accent phrase, the synthesized speech having a stable prosody close to the real voice Can be generated.

以上が、本発明の第１の実施形態に係る実施例１についての説明である。続いて、本発明の第１の実施形態に係る実施例２について説明する。 The above is the description of Example 1 according to the first embodiment of the present invention. Next, Example 2 according to the first embodiment of the present invention will be described.

（実施例２） (Example 2)
実施例２は、実施例１と同様の構成（図７）を備える音声合成装置１００による。実施例２では、入力テキストが「ひはんしたしすてむが（批判したシステムが）」であった場合を想定する。実施例２においてピッチパタン生成部１４０が生成するピッチパタンの具体例を図１１に示す。 Example 2 uses the speech synthesizer 100 having the same configuration as Example 1 (FIG. 7). In Example 2, the input text is assumed to be ひはんしたしすてむが (批判したシステムが, “hihan shita shisutemu ga”, “the criticized system”). FIG. 11 shows a specific example of the pitch pattern generated by the pitch pattern generation unit 140 in Example 2.

入力テキストに対応する音節列は「ｈＩ＃ｈａＮｓｈＩ＃ｔａ／ｓｈＩ＃ｓＵ＃ｔｅｍｕｇａ」となる。 The syllable string corresponding to the input text is “hI # ha N shI # ta / shI # sU # te muga”.

ここで、「／」は、アクセント句の区切りを示している。また、Ｉ＃などの大文字の母音に＃が付いた音素は、無声化した母音を示している。 Here, “/” indicates an accent phrase boundary. A phoneme written as an uppercase vowel followed by “#”, such as “I#”, denotes a devoiced vowel.

図１１から分かるように、無声音の区間（「ｈＩ＃ｈ」、「ｓｈＩ＃ｔ」、「ｓｈＩ＃ｓＵ＃ｔ」の３ヶ所）にはピッチパタンが生成されておらず、ピークの位置及び周波数値、始点のピッチ周波数値が明らかになっていない。このため、アクセント句全体のピッチパタン概形の情報を得ることができない。そこで、実施例２では、無声音区間の仮のピッチパタンも推定してやる必要がある。 As can be seen from FIG. 11, no pitch pattern is generated in the unvoiced sound section (three places “hI # h”, “shI # t”, and “shI # sU # t”), and the peak position and frequency The pitch frequency value at the start point is not clear. For this reason, it is not possible to obtain information on the overall pitch pattern of the accent phrase. Therefore, in the second embodiment, it is necessary to estimate the temporary pitch pattern of the unvoiced sound section.

無声音区間のピッチパタンの推定方法としては、無声音素を有声音素に置換したうえで生成したピッチパタンを使用するという方法が考えられる。具体的には、まず、入力テキストにおける無声音素を、特徴の近い有声音素に置換したコンテクスト情報（有声化コンテクスト情報）を生成する。特徴の近い音素とは、例えば、無声歯茎破裂音「ｔ」と有声歯茎破裂音「ｄ」といったように、有声／無声で調音構造が似ているものを選択することが考えられる。 One conceivable method of estimating the pitch pattern of an unvoiced section is to use a pitch pattern generated after replacing unvoiced phonemes with voiced phonemes. Specifically, context information (voiced context information) is first generated in which the unvoiced phonemes in the input text are replaced with voiced phonemes having similar characteristics. As a phoneme with similar characteristics, one may select a phoneme that differs in voicing but has a similar articulation, such as the voiceless alveolar plosive “t” and the voiced alveolar plosive “d”.

ただし、言語によっては対応する音素が存在しない場合もある。そのため、無声音素に対応する有声音素が存在しない場合は、例えば「ｎ」などのようなピッチパタンの急峻な変動を含みにくい有声音素で置換する。また、無声化母音については、「Ｉ＃」→「ｉ」のように対応する有声音素を用いることが考えられる。 However, depending on the language, there may be no corresponding phoneme. For this reason, if there is no voiced phoneme corresponding to the unvoiced phoneme, the voiced phoneme is replaced with a voiced phoneme that does not easily contain a sharp change in pitch pattern such as “n”. For the unvoiced vowel, it is conceivable to use a corresponding voiced phoneme as “I #” → “i”.

実施例２においては、例えば「ｎｉｎａＮｚｉｄａ／ｊｉｚｕｄｅｍｕｇａ」が有声化コンテクスト情報となる。ピッチパタン生成部１４０は、生成された有声化コンテクスト情報に基づいて、無声音区間の仮ピッチパタンを生成する。図１２に、生成された仮ピッチパタンの具体例（点線）を示す。 In Example 2, the voiced context information is, for example, “ni na N zi da / ji zu de mu ga”. The pitch pattern generation unit 140 generates a temporary pitch pattern for the unvoiced sections based on the generated voiced context information. FIG. 12 shows a specific example (dotted line) of the generated temporary pitch pattern.
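
The replacement of unvoiced phonemes can be sketched as a table lookup. The mapping table below is illustrative only: the text gives “t” to “d”, “I#” to “i”, and “n” as a fallback, while the remaining pairs, the token forms, and the name `voiced_context` are assumptions. The romanization of the output may also differ from the example string in the text.

```python
# Illustrative mapping from unvoiced phonemes to voiced counterparts.
VOICED_COUNTERPART = {
    "t": "d", "k": "g", "s": "z", "sh": "j", "p": "b",
    "I#": "i", "U#": "u",  # devoiced vowels map to their voiced vowels
}
# Unvoiced phonemes assumed to have no voiced counterpart.
UNVOICED_WITHOUT_COUNTERPART = {"h", "f"}
# Fallback voiced phoneme that is unlikely to cause steep pitch changes.
FALLBACK = "n"


def voiced_context(phonemes):
    """Replace every unvoiced phoneme with a voiced one; keep the rest."""
    out = []
    for ph in phonemes:
        if ph in VOICED_COUNTERPART:
            out.append(VOICED_COUNTERPART[ph])
        elif ph in UNVOICED_WITHOUT_COUNTERPART:
            out.append(FALLBACK)
        else:
            out.append(ph)  # already voiced, or an accent phrase boundary
    return out


# "hI#haNshI#ta/shI#sU#temuga" split into phoneme tokens.
phonemes = ["h", "I#", "h", "a", "N", "sh", "I#", "t", "a", "/",
            "sh", "I#", "s", "U#", "t", "e", "m", "u", "g", "a"]
voiced = "".join(voiced_context(phonemes))
```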

近似ピッチパタン生成部１１１は、前記生成された仮ピッチパタンを近似的に表現するように、近似ピッチパタンを生成する。図１３に、生成された近似ピッチパタンの具体例（一点鎖線）を示す。 The approximate pitch pattern generation unit 111 generates an approximate pitch pattern so as to approximately represent the generated temporary pitch pattern. FIG. 13 shows a specific example (one-dot chain line) of the generated approximate pitch pattern.

音声合成装置１００は、近似ピッチパタンを生成した後、実施例１と同様の処理によって、合成音声を生成する。 After generating the approximate pitch pattern, the speech synthesizer 100 generates synthesized speech by the same processing as in Example 1.

なお、本実施例で示した無声音素と有声音素の置換する処理を使用するか否かを、状況に応じて切り替えてもよい。例えば、入力テキストに対応するコンテクスト情報に含まれる音素のうち、全音素数に対し５０％以上の音素が無声音だった場合に前記の置換処理を行い、５０％未満の場合には置換処理を行わないといった判断基準で切り替えればよい。 Note that whether or not to use the processing for replacing unvoiced phonemes and voiced phonemes shown in this embodiment may be switched depending on the situation. For example, the replacement process is performed when 50% or more of the phonemes included in the context information corresponding to the input text are unvoiced, and the replacement process is not performed when the number is less than 50%. It may be switched according to such a judgment criterion.
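
The switching criterion can be written as a simple ratio test. The sketch assumes that devoiced vowels count as unvoiced and that the accent phrase marker “/” is excluded from the total; the set `UNVOICED` and the name `should_substitute` are hypothetical.

```python
# Phonemes treated as unvoiced in this sketch (assumed inventory).
UNVOICED = {"k", "s", "sh", "t", "h", "p", "f", "I#", "U#"}


def should_substitute(phonemes, threshold=0.5):
    """Return True when the share of unvoiced phonemes reaches the threshold."""
    items = [ph for ph in phonemes if ph != "/"]  # skip phrase boundaries
    if not items:
        return False
    unvoiced = sum(1 for ph in items if ph in UNVOICED)
    return unvoiced / len(items) >= threshold


example = ["h", "I#", "h", "a", "N", "sh", "I#", "t", "a", "/",
           "sh", "I#", "s", "U#", "t", "e", "m", "u", "g", "a"]
use_substitution = should_substitute(example)
```

For this input, 11 of the 19 phonemes are counted as unvoiced, so the replacement processing would be applied under the 50% criterion.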

実施例２によれば、無声音素が多くピッチパタンの概形情報が得にくい場合でも、適切なピッチパタン概形を推測できるため、より安定した韻律を持つ合成音声を生成することが可能となる。 According to the second embodiment, even when there are many unvoiced phonemes and it is difficult to obtain outline information of the pitch pattern, it is possible to estimate an appropriate outline of the pitch pattern. Therefore, it is possible to generate synthesized speech having a more stable prosody. .

（実施例３） (Example 3)
続いて、本発明の第２の実施形態に係る実施例３について説明する。図１４は、本発明の実施例３に係る音声合成装置３００の概要を示す図である。図１４においては、各構成要素間でやり取りするデータ要素を各構成要素間に加えて図示した。 Next, Example 3 according to the second embodiment of the present invention will be described. FIG. 14 is a diagram showing an overview of the speech synthesizer 300 according to Example 3 of the present invention. In FIG. 14, the data elements exchanged between the constituent elements are shown between those elements.

実施例３に係る音声合成装置３００は、実施例１の構成における近似ピッチパタン生成部１１１の代わりに、近似ピッチパタン選択部１１２と、近似ピッチパタン記憶部１１３と、を備えた構成となっている。実施例３において、近似ピッチパタン選択部１１２及び近似ピッチパタン記憶部１１３以外の構成要素は実施例１と同様であるため、同じ符号を付して示す。また、実施例３では、近似ピッチパタン選択部１１２及び近似ピッチパタン記憶部１１３について、第２の実施形態と同じ符号を用いる。 The speech synthesizer 300 according to the third embodiment includes an approximate pitch pattern selection unit 112 and an approximate pitch pattern storage unit 113 instead of the approximate pitch pattern generation unit 111 in the configuration of the first embodiment. Yes. In the third embodiment, the constituent elements other than the approximate pitch pattern selection unit 112 and the approximate pitch pattern storage unit 113 are the same as those in the first embodiment, and thus are denoted by the same reference numerals. In the third embodiment, the same reference numerals as those in the second embodiment are used for the approximate pitch pattern selection unit 112 and the approximate pitch pattern storage unit 113.

近似ピッチパタン記憶部１１３は、あらかじめ作成しておいた近似ピッチパタンを複数記憶している。また、音声合成装置３００は、近似ピッチパタンを自ら生成させて、近似ピッチパタン記憶部１１３に格納してもよい。 The approximate pitch pattern storage unit 113 stores a plurality of approximate pitch patterns created in advance. The speech synthesizer 300 may generate an approximate pitch pattern by itself and store it in the approximate pitch pattern storage unit 113.

ここで、図１５を用いて、音声合成装置３００における近似ピッチパタンの生成方法について説明する。実施例３においては、図１５の近似ピッチパタン形成装置３１０（近似ピッチパタン形成手段）による近似ピッチパタンを生成の例を示す。なお、近似ピッチパタン記憶部１１３が近似ピッチパタンを外部から取得できさえすれば、近似ピッチパタン形成装置３１０を音声合成装置３００の構成に含めなくてもよい。 Here, a method of generating an approximate pitch pattern in the speech synthesizer 300 will be described with reference to FIG. In the third embodiment, an example of generating an approximate pitch pattern by the approximate pitch pattern forming apparatus 310 (approximate pitch pattern forming means) of FIG. 15 is shown. Note that the approximate pitch pattern forming device 310 may not be included in the configuration of the speech synthesizer 300 as long as the approximate pitch pattern storage unit 113 can acquire the approximate pitch pattern from the outside.

近似ピッチパタン形成装置３１０は、音声データ記憶部３１１と、ピッチパタン抽出部３１２と、近似ピッチパタン形成部３１３と、を含んでいる。 The approximate pitch pattern forming apparatus 310 includes an audio data storage unit 311, a pitch pattern extraction unit 312, and an approximate pitch pattern formation unit 313.

音声データ記憶部３１１は、複数の音声データを記憶している。 The audio data storage unit 311 stores a plurality of audio data.

ピッチパタン抽出部３１２は、基本周波数の抽出技術を用いて、音声データ記憶部３１１に記憶されている音声データからピッチパタンをアクセント句ごとに抽出する。 The pitch pattern extraction unit 312 extracts a pitch pattern for each accent phrase from the audio data stored in the audio data storage unit 311 using a fundamental frequency extraction technique.

近似ピッチパタン形成部３１３は、実施例１の近似ピッチパタン生成部１１１と同様の機能を有し、抽出されたピッチパタンに基づいて近似ピッチパタンを生成する。近似ピッチパタン形成部３１３は、生成させた近似ピッチパタンを近似ピッチパタン記憶部１１３に格納する。 The approximate pitch pattern forming unit 313 has the same function as the approximate pitch pattern generation unit 111 of the first embodiment, and generates an approximate pitch pattern based on the extracted pitch pattern. The approximate pitch pattern forming unit 313 stores the generated approximate pitch pattern in the approximate pitch pattern storage unit 113.

なお、近似ピッチパタンの選択肢を増やすためにも、音声データは大量である方が望ましい。大量の音声データを処理するためには、計算機によって自動で近似ピッチパタン生成などの処理がなされることが望ましい。 In order to increase the choices of the approximate pitch pattern, it is desirable that the audio data is large. In order to process a large amount of audio data, it is desirable that processing such as approximate pitch pattern generation be performed automatically by a computer.

また、音声データの数を絞ったうえで、人手によって正確な近似ピッチパタンを作成して近似ピッチパタン記憶部１１３に記憶させてもよい。音声データの数を絞れれば、近似ピッチパタン記憶部１１３の規模を調整することによって、処理時間の短縮が期待できる。 Further, after the number of audio data is reduced, an accurate approximate pitch pattern may be manually created and stored in the approximate pitch pattern storage unit 113. If the number of audio data can be reduced, the processing time can be shortened by adjusting the scale of the approximate pitch pattern storage unit 113.

近似ピッチパタン選択部１１２は、近似ピッチパタン記憶部１１３に記憶されている近似ピッチパタンの中から、ピッチパタン生成部１４０において生成されたピッチパタンに最も類似する形状を持つ近似ピッチパタンを選択する。近似ピッチパタン選択部１１２による近似ピッチパタンの選択方法としては、ユークリッド距離やマハラノビス距離等といった距離尺度の最小二乗誤差が最も小さいものを選択するといった方法等が考えられる。 From among the approximate pitch patterns stored in the approximate pitch pattern storage unit 113, the approximate pitch pattern selection unit 112 selects the one whose shape is most similar to the pitch pattern generated by the pitch pattern generation unit 140. One conceivable selection method is to choose the pattern with the smallest squared error under a distance measure such as the Euclidean distance or the Mahalanobis distance.

近似ピッチパタンの選択以降、音声合成装置３００は、実施例１と同様の処理を行うことによって合成音声を生成する。 After the approximate pitch pattern has been selected, the speech synthesizer 300 generates synthesized speech by performing the same processing as in Example 1.

実施例３によれば、実際の音声から抽出したピッチパタンから生成した近似ピッチパタンを使うため、より肉声に近い韻律を持つ合成音声を生成することが可能となる。また、近似曲線を逐次生成する必要がなくなるため、近似ピッチパタン記憶部１１３の規模を調整することによって、実施例１に比べて処理時間を短縮することが可能であるという利点もある。 According to the third embodiment, since the approximate pitch pattern generated from the pitch pattern extracted from the actual voice is used, it is possible to generate a synthesized voice having a prosody closer to the real voice. In addition, since it is not necessary to sequentially generate approximate curves, there is an advantage that the processing time can be shortened compared with the first embodiment by adjusting the scale of the approximate pitch pattern storage unit 113.

以上、実施形態及び実施例を参照して本発明を説明してきたが、本発明は上記実施形態及び実施例に限定されるものではない。本発明の構成や詳細には、例えば近似曲線の導出方法、韻律情報生成方式及び音声合成方式等に関して、本発明のスコープ内で当業者が理解し得る様々な変更をすることができる。 As mentioned above, although this invention has been described with reference to the embodiments and examples, the present invention is not limited to the above-described embodiments and examples. Various changes that can be understood by those skilled in the art within the scope of the present invention can be made to the configuration and details of the present invention with respect to, for example, the method of deriving an approximate curve, the prosody information generation method, and the speech synthesis method.

以上説明したように、本発明の音声合成装置は、自然で安定したイントネーションを表現する音声合成システムを構築する際に好適に適用可能である。 As described above, the speech synthesizer of the present invention can be suitably applied when constructing a speech synthesis system that expresses natural and stable intonation.

例えば、本発明の音声合成装置は、ニュース記事や自動応答文等といったテキスト全般の読み上げシステムに好適に適用される。 For example, the speech synthesizer of the present invention is suitably applied to a general text reading system such as news articles and automatic response sentences.

上記の実施形態の一部又は全部は、以下の付記のようにも記載されうるが、以下には限られない。
（付記１）
微細構造を含むピッチパタンを入力とし、前記ピッチパタンを近似的に表現する近似ピッチパタンを生成する近似ピッチパタン出力手段と、
登録されたピッチパタンに基づいた元発話ピッチパタンと前記近似ピッチパタンとを入力して前記近似ピッチパタンに基づいて前記微細構造を含むピッチパタンと前記元発話ピッチパタンとの接続点を決定し、前記接続点において前記微細構造を含むピッチパタンと前記元発話ピッチパタンとを接続して接続ピッチパタンを生成するピッチパタン接続手段と、を備えることを特徴とする音声合成装置。
（付記２）
前記ピッチパタン接続手段は、
前記元発話ピッチパタンと前記ピッチパタンとの接続境界線が前記近似ピッチパタンと交わる点を接続点とすることを特徴とする付記１に記載の音声合成装置。
（付記３）
前記ピッチパタン接続手段は、
前記接続点において、前記ピッチパタンと前記元発話ピッチパタンとをスムージング処理することを特徴とする付記１又は２に記載の音声合成装置。
（付記４）
前記ピッチパタン接続手段は、
前記ピッチパタンと前記元発話ピッチパタンとを接続後、アクセント句内における前記接続ピッチパタンのピーク値に基づいて、前記接続ピッチパタン全体を周波数方向に補正することを特徴とする
付記１乃至３のいずれか一項に記載の音声合成装置。
（付記５）
前記近似ピッチパタン出力手段は、
前記近似ピッチパタンを格納する近似ピッチパタン記憶手段と、
前記ピッチパタンに基づいて、前記近似ピッチパタン記憶手段に格納された前記近似ピッチパタンを選択する近似ピッチパタン選択手段と、を有することを特徴とする付記１乃至４のいずれか一項に記載の音声合成装置。
（付記６）
前記近似ピッチパタン出力手段は、
収録された音声から抽出されたピッチパタンに基づいて近似ピッチパタンを生成する近似ピッチパタン形成手段を備え、
前記近似ピッチパタン記憶手段は、前記近似ピッチパタン形成手段によって生成された複数の近似ピッチパタンを格納し、
前記近似ピッチパタン選択手段は、
前記近似ピッチパタン記憶手段に格納された前記近似ピッチパタンを選択することを特徴とする付記５に記載の音声合成装置。
（付記７）
前記近似ピッチパタン選択手段は、
少なくともピッチパタンを特徴量とする特徴量空間において前記微細構造を含むピッチパタンに距離が近い前記近似ピッチパタンを選択することを特徴とする付記５又は６に記載の音声合成装置。
（付記８）
入力テキストに対応する音素列情報と各音素に付随する情報とを含むコンテクスト情報を生成する言語解析手段と、
前記コンテクスト情報を入力し、予め格納されたピッチパタン生成モデルに基づいて前記微細構造を含むピッチパタンを生成し、生成した該ピッチパタンを前記近似ピッチパタン出力手段に出力するピッチパタン生成手段と、を備えることを特徴とする付記１乃至７のいずれか一項に記載の音声合成装置。
（付記９）
入力テキストに対応する音素列情報と各音素に付随する情報とを含むコンテクスト情報を生成する言語解析手段を備えることを特徴とする付記１乃至７のいずれか一項に記載の音声合成装置。
（付記１０）
前記コンテクスト情報を入力し、予め格納されたピッチパタン生成モデルに基づいて前記微細構造を含むピッチパタンを生成するピッチパタン生成手段を備えることを特徴とする付記９に記載の音声合成装置。
（付記１１）
前記コンテクスト情報を入力し、予め格納された音素継続長生成モデルを用いて音素継続長を生成する音素継続長生成手段を備えることを特徴とする付記９又は１０に記載の音声合成装置。
（付記１２）
前記近似ピッチパタン出力手段は、
前記入力テキストに含まれる有声音区間から無声音区間を推定することを特徴とする付記８乃至１１のいずれか一項に記載の音声合成装置。
（付記１３）
前記近似ピッチパタン出力手段は、
前記ピッチパタンを構成する無声音素を前記無声音素の特徴と近い有声音素に置換した仮ピッチパタンを生成し、前記無声音区間のピッチパタンを前記仮ピッチパタンで補間した後に前記近似ピッチパタンを生成することを特徴とする付記８乃至１２のいずれか一項に記載の音声合成装置。
（付記１４）
前記近似ピッチパタン出力手段は、
言語毎の韻律に関する特徴を制約条件に加えて前記近似ピッチパタンを生成することを特徴とする付記１乃至１３のいずれか一項に記載の音声合成装置。
（付記１５）
予め登録された音節列情報に基づいて前記元発話ピッチパタンを検索し、前記音節列情報の少なくとも一部と一致する音節列情報を含む元発話ピッチパタンデータを選択し、選択した前記元発話ピッチパタンデータから、当該／前方／後方の音節列及びアクセント位置が一致する部分を元発話ピッチパタンとして検索する元発話ピッチパタン選択手段を備えることを特徴とする付記８乃至１４のいずれか一項に記載の音声合成装置。
（付記１６）
前記接続ピッチパタンと前記音素継続長とを併せた韻律情報に基づいて、予め格納された単位波形データを選択する単位波形選択手段と、
生成された前記韻律情報を再現するように、前記選択された単位波形データを編集して合成音声波形を生成する音声波形生成手段と、を備えることを特徴とする付記１４又は１５に記載の合成音声装置。
（付記１７）
微細構造を含むピッチパタンを入力して前記ピッチパタンを近似的に表現する近似ピッチパタンを生成し、
登録されたピッチパタンに基づいた元発話ピッチパタン及び前記近似ピッチパタンを入力して前記近似ピッチパタンに基づいて前記微細構造を含むピッチパタンと前記元発話ピッチパタンとの接続点を決定し、前記接続点において前記微細構造を含むピッチパタンと前記元発話ピッチパタンとを接続して接続ピッチパタンを生成することを特徴とした音声合成方法。
（付記１８）
微細構造を含むピッチパタンを入力して前記ピッチパタンを近似的に表現する近似ピッチパタンを生成する処理と、
登録されたピッチパタンに基づいた元発話ピッチパタンと、前記近似ピッチパタンと、を入力して前記近似ピッチパタンに基づいて前記微細構造を含むピッチパタンと前記元発話ピッチパタンとの接続点を決定し、前記接続点において前記微細構造を含むピッチパタンと前記元発話ピッチパタンとを接続して接続ピッチパタンを生成する処理と、をコンピュータに実行させることを特徴とした音声合成プログラム。 A part or all of the above-described embodiment can be described as in the following supplementary notes, but is not limited thereto.
(Appendix 1)
A speech synthesizer comprising:
approximate pitch pattern output means for receiving a pitch pattern including a fine structure and generating an approximate pitch pattern that approximately represents the pitch pattern; and
pitch pattern connection means for receiving the approximate pitch pattern and an original utterance pitch pattern based on a registered pitch pattern, determining, based on the approximate pitch pattern, a connection point between the pitch pattern including the fine structure and the original utterance pitch pattern, and connecting the pitch pattern including the fine structure and the original utterance pitch pattern at the connection point to generate a connection pitch pattern.
(Appendix 2)
The speech synthesizer according to appendix 1, wherein the pitch pattern connection means sets, as the connection point, the point where the connection boundary line between the original utterance pitch pattern and the pitch pattern intersects the approximate pitch pattern.
(Appendix 3)
The speech synthesizer according to appendix 1 or 2, wherein the pitch pattern connection means performs smoothing processing on the pitch pattern and the original utterance pitch pattern at the connection point.
(Appendix 4)
The speech synthesizer according to any one of appendices 1 to 3, wherein the pitch pattern connection means, after connecting the pitch pattern and the original utterance pitch pattern, corrects the entire connection pitch pattern in the frequency direction based on a peak value of the connection pitch pattern within an accent phrase.
(Appendix 5)
The speech synthesizer according to any one of appendices 1 to 4, wherein the approximate pitch pattern output means includes:
approximate pitch pattern storage means for storing the approximate pitch pattern; and
approximate pitch pattern selection means for selecting, based on the pitch pattern, the approximate pitch pattern stored in the approximate pitch pattern storage means.
(Appendix 6)
The speech synthesizer according to appendix 5, wherein:
the approximate pitch pattern output means includes approximate pitch pattern forming means for generating an approximate pitch pattern based on a pitch pattern extracted from recorded speech;
the approximate pitch pattern storage means stores a plurality of approximate pitch patterns generated by the approximate pitch pattern forming means; and
the approximate pitch pattern selection means selects the approximate pitch pattern stored in the approximate pitch pattern storage means.
(Appendix 7)
The speech synthesizer according to appendix 5 or 6, wherein the approximate pitch pattern selection means selects the approximate pitch pattern that is close in distance to the pitch pattern including the fine structure in a feature space having at least the pitch pattern as a feature.
(Appendix 8)
The speech synthesizer according to any one of appendices 1 to 7, further comprising:
language analysis means for generating context information including phoneme string information corresponding to an input text and information associated with each phoneme; and
pitch pattern generation means for receiving the context information, generating the pitch pattern including the fine structure based on a pitch pattern generation model stored in advance, and outputting the generated pitch pattern to the approximate pitch pattern output means.
(Appendix 9)
The speech synthesizer according to any one of appendices 1 to 7, further comprising language analysis means for generating context information including phoneme string information corresponding to an input text and information associated with each phoneme.
(Appendix 10)
The speech synthesizer according to appendix 9, further comprising pitch pattern generation means for inputting the context information and generating a pitch pattern including the fine structure based on a pitch pattern generation model stored in advance.
(Appendix 11)
The speech synthesizer according to appendix 9 or 10, further comprising phoneme duration generation means for inputting the context information and generating a phoneme duration using a phoneme duration generation model stored in advance.
(Appendix 12)
The speech synthesizer according to any one of appendices 8 to 11, wherein the approximate pitch pattern output means estimates an unvoiced sound section from a voiced sound section included in the input text.
(Appendix 13)
The speech synthesizer according to any one of appendices 8 to 12, wherein the approximate pitch pattern output means generates a temporary pitch pattern by replacing each unvoiced phoneme constituting the pitch pattern with a voiced phoneme whose characteristics are close to those of the unvoiced phoneme, interpolates the pitch pattern of the unvoiced sound section with the temporary pitch pattern, and then generates the approximate pitch pattern.
(Appendix 14)
The speech synthesizer according to any one of appendices 1 to 13, wherein the approximate pitch pattern output means generates the approximate pitch pattern with language-specific prosodic features added to the constraint conditions.
(Appendix 15)
The speech synthesizer according to any one of appendices 8 to 14, further comprising original utterance pitch pattern selection means that searches for the original utterance pitch pattern based on pre-registered syllable string information, selects original utterance pitch pattern data containing syllable string information that matches at least a part of the syllable string information, and retrieves from the selected original utterance pitch pattern data, as the original utterance pitch pattern, a portion whose current, preceding, and following syllable strings and accent position match.
(Appendix 16)
The speech synthesizer according to appendix 14 or 15, further comprising:
unit waveform selection means for selecting unit waveform data stored in advance, based on prosodic information that combines the connection pitch pattern and the phoneme duration; and
speech waveform generation means for generating a synthesized speech waveform by editing the selected unit waveform data so as to reproduce the generated prosodic information.
(Appendix 17)
A speech synthesis method comprising:
receiving a pitch pattern including a fine structure and generating an approximate pitch pattern that approximately represents the pitch pattern; and
receiving the approximate pitch pattern and an original utterance pitch pattern based on a registered pitch pattern, determining, based on the approximate pitch pattern, a connection point between the pitch pattern including the fine structure and the original utterance pitch pattern, and connecting the pitch pattern including the fine structure and the original utterance pitch pattern at the connection point to generate a connection pitch pattern.
(Appendix 18)
A speech synthesis program that causes a computer to execute:
a process of receiving a pitch pattern including a fine structure and generating an approximate pitch pattern that approximately represents the pitch pattern; and
a process of receiving the approximate pitch pattern and an original utterance pitch pattern based on a registered pitch pattern, determining, based on the approximate pitch pattern, a connection point between the pitch pattern including the fine structure and the original utterance pitch pattern, and connecting the pitch pattern including the fine structure and the original utterance pitch pattern at the connection point to generate a connection pitch pattern.
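
The connection and smoothing described in appendices 1 to 3 can be illustrated with a minimal Python sketch. It is not the implementation disclosed in this publication. It assumes that all three pitch patterns are sampled on the same frame grid, that the original utterance pitch pattern supplies the frames before a boundary frame and the generated pitch pattern (with fine structure) supplies the frames from the boundary onward, and that smoothing is a linearly decaying offset over a few frames on each side. The function name `connect_pitch_patterns` and the parameters `boundary` and `window` are hypothetical names introduced only for this example.

```python
import numpy as np


def connect_pitch_patterns(generated, original, approximate, boundary, window=5):
    """Splice `original` (frames < boundary) onto `generated` (frames >= boundary).

    Assumption for illustration: the connection point is the value of the
    approximate pitch pattern at the boundary frame, and both sides are
    pulled toward that value with a linearly decaying offset.
    """
    generated = np.asarray(generated, dtype=float)
    original = np.asarray(original, dtype=float)
    approximate = np.asarray(approximate, dtype=float)

    if not (len(generated) == len(original) == len(approximate)):
        raise ValueError("all pitch patterns must have the same number of frames")
    if not (1 <= boundary < len(generated)):
        raise ValueError("boundary must lie strictly inside the pattern")

    # Connection point: where the boundary meets the approximate pitch pattern.
    anchor = approximate[boundary]

    # Plain concatenation; this usually leaves a step at the boundary.
    connected = np.concatenate([original[:boundary], generated[boundary:]])

    # Offsets that would make each side pass through the connection point.
    left_offset = anchor - original[boundary - 1]
    right_offset = anchor - generated[boundary]

    # Smoothing: apply the offsets with a weight that decays away from the boundary.
    for k in range(window):
        weight = 1.0 - k / window
        left_index = boundary - 1 - k
        right_index = boundary + k
        if left_index >= 0:
            connected[left_index] += weight * left_offset
        if right_index < len(connected):
            connected[right_index] += weight * right_offset

    return connected
```

In this sketch the approximate pitch pattern only fixes the meeting value at the boundary. The fine structure of the generated pattern is left untouched outside the smoothing window, which is the property the connection step is meant to preserve.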

１、２、１０、２０、１００、３００音声合成装置
１１近似ピッチパタン出力手段
１２ピッチパタン接続手段
１３言語解析手段
１４ピッチパタン生成手段
１５音素継続長生成手段
１６元発話ピッチパタン選択手段
１７単位波形選択手段
１８音声波形生成手段
２１ピッチパタン生成モデル記憶手段
２２音素継続長生成モデル記憶手段
２３元発話ピッチパタン記憶手段
２４単位波形記憶手段
１１１近似ピッチパタン生成部
１１２近似ピッチパタン選択部
１１３近似ピッチパタン記憶部
１２０ピッチパタン接続部
１３０言語解析部
１４０ピッチパタン生成部
１６０元発話ピッチパタン選択部
１５０音素継続長生成部
１７０単位波形選択部
１８０音声波形生成部
２１１ピッチパタン生成モデル記憶部
２１２音素継続長生成モデル記憶部
２１３元発話ピッチパタン記憶部
２１４単位波形記憶部
３１０近似ピッチパタン形成装置
３１１音声データ記憶部
３１２ピッチパタン抽出部
３１３近似ピッチパタン形成部

1, 2, 10, 20, 100, 300 Speech synthesizer
11 Approximate pitch pattern output means
12 Pitch pattern connection means
13 Language analysis means
14 Pitch pattern generation means
15 Phoneme duration generation means
16 Original utterance pitch pattern selection means
17 Unit waveform selection means
18 Speech waveform generation means
21 Pitch pattern generation model storage means
22 Phoneme duration generation model storage means
23 Original utterance pitch pattern storage means
24 Unit waveform storage means
111 Approximate pitch pattern generation unit
112 Approximate pitch pattern selection unit
113 Approximate pitch pattern storage unit
120 Pitch pattern connection unit
130 Language analysis unit
140 Pitch pattern generation unit
160 Original utterance pitch pattern selection unit
150 Phoneme duration generation unit
170 Unit waveform selection unit
180 Speech waveform generation unit
211 Pitch pattern generation model storage unit
212 Phoneme duration generation model storage unit
213 Original utterance pitch pattern storage unit
214 Unit waveform storage unit
310 Approximate pitch pattern forming device
311 Speech data storage unit
312 Pitch pattern extraction unit
313 Approximate pitch pattern forming unit

Claims

A speech synthesizer comprising:
approximate pitch pattern output means for receiving a pitch pattern including a fine structure and generating an approximate pitch pattern that approximately represents the pitch pattern; and
pitch pattern connection means for receiving the approximate pitch pattern and an original utterance pitch pattern based on a registered pitch pattern, determining, based on the approximate pitch pattern, a connection point between the pitch pattern including the fine structure and the original utterance pitch pattern, and connecting the pitch pattern including the fine structure and the original utterance pitch pattern at the connection point to generate a connection pitch pattern.

The speech synthesizer according to claim 1, wherein the pitch pattern connection means sets, as the connection point, the point where the connection boundary line between the original utterance pitch pattern and the pitch pattern intersects the approximate pitch pattern.

The speech synthesizer according to claim 1, wherein the approximate pitch pattern output means includes:
approximate pitch pattern storage means for storing the approximate pitch pattern; and
approximate pitch pattern selection means for selecting, based on the pitch pattern, the approximate pitch pattern stored in the approximate pitch pattern storage means.

The speech synthesizer according to claim 3, wherein the approximate pitch pattern selection means selects the approximate pitch pattern that is close in distance to the pitch pattern including the fine structure in a feature space having at least the pitch pattern as a feature.

The speech synthesizer according to any one of claims 1 to 4, further comprising:
language analysis means for generating context information including phoneme string information corresponding to an input text and information associated with each phoneme; and
pitch pattern generation means for receiving the context information, generating the pitch pattern including the fine structure based on a pitch pattern generation model stored in advance, and outputting the generated pitch pattern to the approximate pitch pattern output means.

The speech synthesizer according to claim 5, wherein the approximate pitch pattern output means generates a temporary pitch pattern by replacing each unvoiced phoneme constituting the pitch pattern with a voiced phoneme whose characteristics are close to those of the unvoiced phoneme, interpolates the pitch pattern of the unvoiced sound section with the temporary pitch pattern, and then generates the approximate pitch pattern.

The speech synthesizer according to claim 1, wherein the approximate pitch pattern output means generates the approximate pitch pattern with language-specific prosodic features added to the constraint conditions.

The speech synthesizer according to claim 1, further comprising original utterance pitch pattern selection means that searches for the original utterance pitch pattern based on pre-registered syllable string information, selects original utterance pitch pattern data containing syllable string information that matches at least a part of the syllable string information, and retrieves from the selected original utterance pitch pattern data, as the original utterance pitch pattern, a portion whose current, preceding, and following syllable strings and accent position match.

A speech synthesis method comprising:
receiving a pitch pattern including a fine structure and generating an approximate pitch pattern that approximately represents the pitch pattern; and
receiving the approximate pitch pattern and an original utterance pitch pattern based on a registered pitch pattern, determining, based on the approximate pitch pattern, a connection point between the pitch pattern including the fine structure and the original utterance pitch pattern, and connecting the pitch pattern including the fine structure and the original utterance pitch pattern at the connection point to generate a connection pitch pattern.

A speech synthesis program that causes a computer to execute:
a process of receiving a pitch pattern including a fine structure and generating an approximate pitch pattern that approximately represents the pitch pattern; and
a process of receiving the approximate pitch pattern and an original utterance pitch pattern based on a registered pitch pattern, determining, based on the approximate pitch pattern, a connection point between the pitch pattern including the fine structure and the original utterance pitch pattern, and connecting the pitch pattern including the fine structure and the original utterance pitch pattern at the connection point to generate a connection pitch pattern.
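
The selection of a stored approximate pitch pattern that is close to the pitch pattern including the fine structure, as recited for the approximate pitch pattern selection means, can be illustrated with a minimal Python sketch. It is not the implementation disclosed in this publication. It assumes that the feature space consists only of the pitch values themselves, that every stored pattern has been time-normalized to the same number of frames as the input, and that closeness is measured by Euclidean distance. The claims do not fix the distance measure, so that choice, like the function name `select_approximate_pattern`, is hypothetical.

```python
import numpy as np


def select_approximate_pattern(pitch_pattern, stored_patterns):
    """Return the index and values of the stored pattern nearest to the input.

    Assumptions for illustration: patterns are equal-length vectors of pitch
    values, and nearness is Euclidean distance in that vector space.
    """
    target = np.asarray(pitch_pattern, dtype=float)
    candidates = np.asarray(stored_patterns, dtype=float)

    if candidates.ndim != 2 or candidates.shape[1] != target.shape[0]:
        raise ValueError("stored patterns must all match the input length")

    # Distance from the input pattern to every stored approximate pattern.
    distances = np.linalg.norm(candidates - target, axis=1)

    best = int(np.argmin(distances))
    return best, candidates[best]
```

Further features, such as accent position or phrase length, could be appended to each vector before the distance is computed, since the claim only requires that the pitch pattern be at least one of the features.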