JP2008268477A - Rhythm adjustable speech synthesizer

Info

Publication number: JP2008268477A
Application number: JP2007110287A
Authority: JP
Inventors: Mitsuaki Sato; 光朗佐藤; Makoto Takao; 誠高尾; Naohiko Fujiyama; 直彦藤山
Original assignee: Hitachi Business Solutions Co Ltd
Current assignee: Hitachi Solutions Create Ltd
Priority date: 2007-04-19
Filing date: 2007-04-19
Publication date: 2008-11-06

Abstract

PROBLEM TO BE SOLVED: To provide a user-friendly speech synthesizer equipped with a graphical user interface (GUI) for adjusting rhythm (prosody).
SOLUTION: The speech synthesizer is equipped with: an intermediate language creation section 24 for creating, from input text data, an intermediate language to which rhythm parameters are attached; an editing image creating section 27 for creating an editable editing image by graphing the values of the rhythm parameters attached to the intermediate language; a rhythm parameter editing section 23 which rewrites the value of a rhythm parameter to the value indicated by displacing a display symbol shown on the graph of the editing image, the displacement being received via a setting receiving section 22; a speech synthesis section 25 for synthesizing a waveform based on the rewritten rhythm parameter; and a speech output section 26 for outputting the synthesized waveform via an output device.
COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to text-to-speech synthesis technology for synthesizing speech from text, and more particularly to technology for adjusting prosody such as fundamental frequency and duration.

In recent years, text-to-speech systems have been known to obtain higher-quality speech by using a speech corpus, in which speech segments (fragments of speech waveforms) are cut out from natural speech and their features are accumulated. These features include prosodic parameters such as the fundamental frequency, which indicates pitch, and the phoneme duration, which indicates the length of each phoneme. Natural-sounding speech is synthesized by extracting speech segments from such a corpus according to some criterion.

However, such a conventional method selects, based on the related phrase chosen by the user, the prosodic parameters presumed to be optimal from among preset ones and applies the correction automatically. Consequently, even when a related phrase is specified, the pronunciation may not be corrected to what the user intended. The same holds when the co-occurrence data contain no applicable related phrase.

Patent Document 1 proposes a speech synthesizer with editing means that lets the user adjust an intermediate language containing prosodic parameters, for example duration and fundamental frequency information. In that device, phrases related to the phrase to be corrected are stored in advance as co-occurrence data together with their readings and prosodic information, and a list of the relevant co-occurrence data can be displayed through a UI (User Interface). When the user selects the most suitable related phrase from the list, the reading and intonation of the phrase to be corrected are changed to more natural ones.

Patent Document 1: JP 2006-30326 A

However, the conventional method described above selects, based on the related phrase chosen by the user, the prosodic parameters presumed to be optimal from among preset ones and applies the correction automatically. Consequently, even when a related phrase is specified, the pronunciation may not be corrected to what the user intended. The same holds when the co-occurrence data contain no applicable related phrase.

To solve the above problems of the prior art, an object of the present invention is to provide a user-friendly GUI (Graphical User Interface) in which prosodic parameters can be viewed and manipulated on screen, so that even a user without specialized knowledge can edit them easily.

To solve the above problems, the present invention provides a speech synthesizer comprising: intermediate language generation means that acquires prosodic parameters for an input character string from a speech corpus, associates the prosodic parameters with the input character string for each prosody control unit, and generates an intermediate language; and prosodic parameter editing screen generation means that draws a graph in which, among the prosodic parameters contained in the generated intermediate language, a first prosodic parameter is assigned to the horizontal axis and a second prosodic parameter to the vertical axis, places pre-stored display symbols on the graph at the coordinates of each prosody control unit of the intermediate language, and displays the result on the screen of a connected display device, thereby generating a prosodic parameter editing screen.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
First, an outline of the embodiment will be described with reference to FIGS. 1 to 3. FIG. 1 is a block diagram showing the hardware system configuration of the speech synthesizer according to claim 1 of the present invention.

As shown in FIG. 1, the speech synthesizer 10 is a general-purpose computer on which programs run, for example a personal computer or a workstation. Specifically, the speech synthesizer 10 includes a CPU (Central Processing Unit) 1, which is the main part of the computer and centrally controls each device, and a main storage device 2, which stores various data in a rewritable manner.

The speech synthesizer 10 further includes an external storage device 3 that stores various programs and the data they generate, an input device 4 such as a keyboard or mouse for issuing operation instructions, a display device 5 that displays image data and the like, and an output device 6 that outputs audio data as sound. Each of these devices is connected to the CPU 1 via a signal line 7 such as a bus. A communication device for communicating with external devices may also be provided. The external storage device 3 includes, for example, an HDD (Hard Disk Drive).

The CPU 1 executes various processes by, for example, loading a program stored in the external storage device 3 into the main storage device 2 and executing it. The external storage device 3 is not limited to an HDD and may further include a drive such as a CD-ROM or DVD-ROM drive as a mechanism for reading distributed computer software. Alternatively, the program may be downloaded from a network to the external storage device 3 via a communication device, then loaded into the main storage device 2 and executed by the CPU 1.

The input device 4 includes a text input device for entering text and a pointing device for manipulating graphics that represent the intended operation on the GUI. The text input device may be any device capable of inputting a character string, such as a keyboard, a speech recognition device, or a character string reader. The pointing device may be, for example, a mouse or a touch panel operated by direct contact with the screen.

The display of the display device 5 is selected from a CRT (Cathode Ray Tube), an LCD (Liquid Crystal Display), and the like.

The output device 6 may be any device that converts the audio data sent from the CPU into sound and outputs it, and may be an external output device such as an external speaker.

FIG. 2 is a block diagram showing the functional configuration of the speech synthesizer 10 built on the above hardware. In this embodiment, the speech synthesizer 10 is assumed to have basic GUI editing functions.

As shown in the figure, a control unit 20 and a storage unit 30 are constructed on the speech synthesizer 10. The control unit 20 includes a setting reception unit 22, a prosodic parameter editing unit 23, an intermediate language generation unit 24, a speech synthesis unit 25, a speech output unit 26, and an editing screen generation unit 27. The storage unit 30 includes a dictionary data storage area 32, a speech corpus storage area 33, and a work data storage area 34.

These functions are realized, for example, by the CPU 1 loading a predetermined program stored in advance in the external storage device 3 into the main storage device 2 and executing it, by controlling hardware, or by a combination of these. The storage unit 30 is realized using the external storage device 3 when data are held persistently and the main storage device 2 when data are held temporarily.

The setting reception unit 22 receives user operations on the GUI, such as text data input and cursor or pointer movement, via the input device 4.

The prosodic parameter editing unit 23 receives operations related to prosodic parameter editing via the setting reception unit 22. It detects a change in the position of a display symbol associated with a prosodic parameter, calculates the new value of the prosodic parameter from the amount and direction of the displacement, and assigns the new prosodic parameter to the intermediate language.

The intermediate language generation unit 24 receives the input text data via the setting reception unit 22. Based on the dictionary data 700 and the like, it divides the received text data into words and generates morphological analysis data containing their reading and accent information and accent phrase information. It then searches the speech corpus 400 stored in the external storage device for data whose prosody is similar to that of the input text data and extracts them. Based on the extracted data, it calculates reference prosodic parameters and generates intermediate language data.

The speech synthesis unit 25 synthesizes an output speech waveform based on the intermediate language data generated by the intermediate language generation unit 24 and generates synthesized waveform data.
The speech output unit 26 outputs the generated synthesized waveform data as actual sound via the output device 6.

The editing screen generation unit 27 graphs the prosodic parameter information contained in the generated intermediate language, generates the screens on which the user performs speech synthesis work, such as the text input screen and the prosodic parameter editing screen, and displays them on the display device 5.

The dictionary data storage area 32 stores in advance a dictionary 700 that holds word readings, accent information, and the like.

As shown in FIG. 5, the speech corpus storage area 33 stores a speech corpus, a database built in advance that associates character strings in units of words, accent phrases, or clauses with prosodic parameters such as fundamental frequency and duration, speech data, and the like. Specifically, the corpus holds a plurality of data sets (4100, 4200, ..., n), each consisting of character string notation data 410, speech waveform data 420 (the utterance of the character string notation data 410), fundamental frequency data 430 of the speech waveform data 420, duration data 440 of the speech waveform data 420, morpheme division data 450 (the result of dividing the character string notation data 410 into morphemes), and phoneme division data 460 (the result of dividing the character string notation data 410 into phonemes). The constituent elements are not limited to these and may also include power data, cepstrum data, and the like.
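As a reading aid, one data set of the speech corpus can be pictured as a simple record. The sketch below is illustrative only: the class name, field names, and container types are assumptions, not part of the patent. The values for "M" and "E" reuse the figures quoted for 「メ」 elsewhere in the description, while the values for "A" are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CorpusDataSet:
    """One data set of the speech corpus 400 (hypothetical layout)."""
    text: str                  # character string notation data 410
    waveform: List[float]      # speech waveform data 420 (samples)
    f0: List[float]            # fundamental frequency data 430, one value per phoneme
    durations: List[float]     # duration data 440, one value per phoneme
    morphemes: List[str]       # morpheme division data 450
    phonemes: List[str]        # phoneme division data 460
    extras: Dict[str, list] = field(default_factory=dict)  # e.g. power, cepstrum


# Placeholder entry for the single word 「雨」; waveform samples are omitted.
example = CorpusDataSet(
    text="雨",
    waveform=[],
    f0=[250.0, 283.0, 252.0],        # "A" value is invented
    durations=[70.0, 51.0, 89.0],    # "A" value is invented
    morphemes=["雨"],
    phonemes=["A", "M", "E"],
)
```

Keeping the per-phoneme lists aligned by index is one simple way to associate each phoneme with its prosodic parameters; a real corpus may well use a different layout.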

The work data storage area 34 temporarily stores the input text data and the intermediate data generated by the prosodic parameter editing unit 23 and the intermediate language generation unit 24. Specifically, as shown in FIG. 3, it stores text data 610, morphological analysis data 620, phoneme analysis data 630, reference prosodic parameters 640, updated prosodic parameters 650, search result data 660, intermediate language data 670, and intermediate language update data 680.

Next, the operation of the speech synthesizer configured with the above functions will be described with reference to flowcharts.

FIG. 4 is a flowchart showing the flow of processing from receiving text data input to performing waveform synthesis. The text data 「雨が降る。」 ("It rains.") is used as an example.

First, the setting reception unit 22 receives the text data to be synthesized (S1).

Specifically, the setting reception unit 22 displays the text input screen shown in FIG. 8 on the display device 5.

The configuration of the text input screen (FIG. 8) is as follows. The text input screen has a prosody edit button 601 and an input text setting field 602. The prosody edit button 601 starts prosodic parameter editing. The input text setting field 602 is where the text data to be synthesized is entered.

After displaying the text input screen, the setting reception unit 22 receives the user's operations on the screen via the input device 4. When text data is entered in the input text setting field 602, the setting reception unit 22 stores it in the text data 610 in the work data storage area 34. In this example, the text data 「雨が降る。」 is stored in the text data 610 (S1).

The method of receiving text data is not limited to the above. For example, before the text input screen, the setting reception unit 22 may display a GUI screen that accepts text data consisting of multiple sentences. The entered text is displayed on that screen, the user selects the one sentence whose prosody is to be adjusted, and the text input screen is then displayed showing the selected sentence.

Next, once the input text data is set, the intermediate language generation unit 24 generates an intermediate language from the input text (S2).

The intermediate language generation process (S2) is described in detail below with reference to FIG. 6.

The intermediate language generation unit 24 reads the text data 「雨が降る。」 stored in the text data 610 in the work data storage area 34 (S101).

Next, the intermediate language generation unit 24 executes morphological analysis (S102). Specifically, it divides the read text data into morphemes, the smallest meaningful units. For each morpheme it generates morphological analysis data consisting of the notation, reading, accent information, and so on, and stores the data in the morphological analysis data 620 in the work data storage area 34. The reading and accent information of each word use the values registered in advance in the dictionary 700. To divide text data into words (morphemes), the processing described by Shimizu et al. in "Morphological analysis for text-to-speech conversion focusing on the connection relations between adjacent words" (Journal of the Acoustical Society of Japan, Vol. 51, No. 1, pp. 3-13, 1995) can be used. This method is only one example, and other processing methods may be used.

In this way, the intermediate language generation unit 24 generates morphological analysis data such as that shown in FIG. 13(B) from the text data 「雨が降る。」 of FIG. 13(A). That is, it divides the text into the word-level items 「雨」「が」「降」「る」「。」 and associates each with its reading and accent information 「ア´メ」「カ゜」「フ´」「ル」「．」, where 「´」 marks the accent and 「゜」 marks a nasalized "g" sound. It also adds the symbol 「／」 to mark accent phrase boundaries. The string 「ア´メカ゜／フ´ル．」 corresponds to a phonetic symbol string. The structure of the morpheme data is not limited to this.

Next, the intermediate language generation unit 24 executes phoneme division (S103). It reads the morphological analysis data 620, divides the reading information contained in it into phonemes, the smallest units of sound that distinguish meaning, and stores the resulting phoneme analysis data in the phoneme analysis data 630 in the work data storage area 34.

For phoneme division, the method of Miyazaki et al., "Language processing method for Japanese text speech output" (Transactions of the Information Processing Society of Japan, Vol. 27, No. 11, pp. 1053-1061, 1986), can be used, for example. This method is only one example, and other phoneme division methods may be used.

Through this phoneme analysis, the intermediate language generation unit 24 divides the text data 「雨が降る。」 into phonemes and generates phoneme data such as "A/ME/NG/A/H/U/R/U/." shown in FIG. 13(C). Here "A", "M", "E", and so on are symbols denoting phonemes; they are only examples, and other phoneme notations may be used.

Next, the intermediate language generation unit 24 searches the speech corpus 400 for data sets whose accent type, part of speech, and other information are similar to those of the morphological analysis data (S104 to S106).

The intermediate language generation unit 24 first reads the data set 4100 from the speech corpus 400 (S104), and then reads the morpheme division data 450 from that data set.

The intermediate language generation unit 24 then reads the morphological analysis data 620 (S105), compares it with the reading and accent information, accent type, part of speech, and other data contained in the morpheme division data 450, and calculates a similarity according to a predetermined criterion (S106).

The intermediate language generation unit 24 then repeats this similarity calculation for all remaining data sets (4200 to n) (S106). From the data sets that satisfy a preset threshold (reference similarity), it selects the one most similar to the morphological analysis data (hereinafter, the selected data set).
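The search in S104 to S106 amounts to scoring every data set and keeping the best one that meets the threshold. The following is a minimal sketch under stated assumptions: the patent only says the similarity follows "a predetermined criterion", so `toy_similarity` (the fraction of positions where reading and part of speech agree) is a stand-in, and the function names and sample entries are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each morpheme is represented as (reading with accent mark, part of speech).
Morphemes = Sequence[Tuple[str, str]]


def toy_similarity(query: Morphemes, candidate: Morphemes) -> float:
    """Stand-in criterion: share of positions where both fields agree."""
    if not query and not candidate:
        return 1.0
    hits = sum(1 for q, c in zip(query, candidate) if q == c)
    return hits / max(len(query), len(candidate))


def select_data_set(
    query: Morphemes,
    data_sets: Sequence[Morphemes],
    threshold: float,
    similarity: Callable[[Morphemes, Morphemes], float] = toy_similarity,
) -> Optional[int]:
    """Index of the most similar data set that meets the threshold, else None."""
    best_index, best_score = None, threshold
    for index, candidate in enumerate(data_sets):
        score = similarity(query, candidate)
        # Accept the first candidate at or above the threshold, then only
        # strictly better ones, so ties keep the earliest data set.
        if score >= best_score and (best_index is None or score > best_score):
            best_index, best_score = index, score
    return best_index


query = [("ア´メ", "noun"), ("カ゜", "particle"), ("フ´ル", "verb")]
data_sets = [
    [("ユキ", "noun"), ("カ゜", "particle"), ("フ´ル", "verb")],
    [("ア´メ", "noun"), ("カ゜", "particle"), ("ヤム", "verb")],
    [("ア´メ", "noun"), ("カ゜", "particle"), ("フ´ル", "verb")],
]
selected_index = select_data_set(query, data_sets, threshold=0.5)
```

Returning `None` when nothing reaches the threshold leaves it to the caller to decide how to proceed, for example by computing all prosodic parameters by rule.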

Next, the intermediate language generation unit 24 calculates prosodic parameters for the morphological analysis data 620 (S107).

Specifically, the intermediate language generation unit 24 compares the morphological analysis data 620 with the morpheme division data 450 of the selected data set and separates the morphemes into matching and mismatching parts. Morphemes in the matching part are given the prosodic parameters of the selected data set (fundamental frequency data 430 and duration data 440). The fundamental frequency data for morphemes in the mismatching part are obtained by lookup in a word fundamental frequency pattern table, which stores one set of fundamental frequency data for each combination of mora count, accent type, and so on. Their durations can be calculated using the method of Sagisaka et al., "Phoneme duration control for speech synthesis by rule" (Transactions of the IEICE, Vol. J67-A, No. 7, pp. 629-636, 1984). The intermediate language generation unit 24 then deforms and integrates the mismatching part so that the prosodic parameters of the matching and mismatching parts connect smoothly.
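The split into matching and mismatching parts can be sketched as a lookup with a fallback. This is a simplified illustration, not the patented procedure: the selected data set is reduced to a dictionary keyed by morpheme, `fallback` stands in for the pattern-table lookup and rule-based duration calculation, the smoothing step is omitted, and all names and numeric values are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Prosody = Tuple[float, float]  # (fundamental frequency, duration), assumed Hz and ms


def assign_prosody(
    query_morphemes: Sequence[str],
    selected: Dict[str, Prosody],
    fallback: Callable[[str], Prosody],
) -> List[Tuple[str, Prosody, bool]]:
    """Return (morpheme, prosody, matched) for every morpheme of the query."""
    assigned = []
    for morpheme in query_morphemes:
        if morpheme in selected:
            # Matching part: reuse the prosody of the selected data set.
            assigned.append((morpheme, selected[morpheme], True))
        else:
            # Mismatching part: compute the prosody by rule instead.
            assigned.append((morpheme, fallback(morpheme), False))
    return assigned


# Invented values for illustration only.
selected_prosody = {"雨": (260.0, 140.0), "が": (240.0, 90.0)}
result = assign_prosody(
    ["雨", "が", "降"],
    selected_prosody,
    fallback=lambda morpheme: (200.0, 100.0),
)
```

The `matched` flag records which parts came from the corpus, which is the information a later smoothing pass over the mismatching parts would need.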

The intermediate language generation unit 24 stores the prosodic parameters obtained in this way in the reference prosodic parameters 640 in the work data storage area 34. The method of calculating prosodic parameters is not limited to the above. The duration of each phoneme can be obtained, for example, by referring to a table, held in advance as a database, that records the duration of each phoneme, or to a table of durations that take into account the context of up to one phoneme before and after the target phoneme. The fundamental frequency of each phoneme can be obtained by modeling it with exponential-function curves, known as the critically damped second-order model, or by modeling it with rectangles.

Next, the intermediate language generation unit 24 generates intermediate language data from the morphological analysis data 620, the phoneme analysis data 630, and the reference prosodic parameters 640, and stores it in the intermediate language data 670 in the work data storage area 34 (S108).

Specifically, the intermediate language generation unit 24 generates intermediate language data such as that shown in FIG. 13(D). It splits the phonetic symbol string contained in the morphological analysis data into a data sequence of the phonological notations 「ア」「メ」「カ゜」「／」「フ」「ル」「．」, and assigns fundamental frequency and duration data to each phoneme of each character. For example, 「メ」 consists of the phoneme "M" with fundamental frequency 283 and duration 51, and the phoneme "E" with fundamental frequency 252 and duration 89.
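The 「メ」 example can be written down as a small nested structure. The class and field names below are assumptions made for illustration; only the four numbers come from the description, and the units (Hz and ms) are inferred from the axes of the editing screen.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phoneme:
    symbol: str
    f0: float        # fundamental frequency (assumed Hz)
    duration: float  # duration (assumed ms)


@dataclass
class NotationUnit:
    """One character of the phonological notation with its phonemes."""
    notation: str
    phonemes: List[Phoneme]

    @property
    def duration(self) -> float:
        # The duration of the character is the sum over its phonemes.
        return sum(p.duration for p in self.phonemes)


me = NotationUnit("メ", [Phoneme("M", 283, 51), Phoneme("E", 252, 89)])
```

With this layout, the total duration of 「メ」 is `me.duration`, that is 51 + 89 = 140.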

The intermediate language generation unit 24 thus completes the intermediate language generation process (S2).

Next, the prosodic parameter editing process (S3) will be described with reference to FIG. 7.

First, the setting reception unit 22 receives a click on the prosody edit button 601 on the text input screen (FIG. 8) (S310). The editing screen generation unit 27 then starts the editing screen generation process.

The editing screen generation unit 27 generates a prosodic parameter editing screen 800, such as that shown in FIG. 9, with display symbols that the user can move (S312). It first reads the intermediate language data 670 from the work data storage area 34 (S311) and extracts the phonological notation 910 and the phoneme notation 920 (see FIG. 13(D)). It lays out the phonological notation 910 along the horizontal axis as the phonological notation string 830 (FIG. 9), and lays out the corresponding phoneme notation 920 as the phoneme notation string 840.

Next, the editing screen generation unit 27 extracts the duration parameters 930 and the fundamental frequency parameters 940 from the intermediate language data 670.

It then graphs the prosodic parameters corresponding to the phonological notation string 830 and the phoneme notation string 840, with the duration parameter 930 on the horizontal axis and the fundamental frequency parameter 940 on the vertical axis.

Next, the editing screen generation unit 27 places duration display symbols 850, which extend in the vertical direction, at the horizontal-axis (duration parameter 930) coordinates that mark the start of each character of the phoneme notation string 840, so that they separate the characters of the phoneme notation string 840, which is the unit of duration adjustment. (For convenience, only one symbol of each kind is labeled with a reference numeral; the same applies below.)
On the duration display symbols 850 located at the horizontal-axis coordinates that mark the start of each character of the phonological notation string 830, the editing screen generation unit 27 places fundamental frequency display symbols 860 according to the values of the fundamental frequency parameter. It also generates lines connecting adjacent fundamental frequency display symbols 860 as prosody connection symbols 870.
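Placing the symbols comes down to a running sum of durations. The sketch below is one possible reading of the layout step, with hypothetical function names; it puts one fundamental frequency point at the start of every phoneme for simplicity, whereas the description places those symbols at the start of each character of the phonological notation string.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def duration_symbol_positions(durations: Sequence[float]) -> List[float]:
    """Start coordinate of each phoneme on the horizontal axis, plus the end point."""
    return [0] + list(accumulate(durations))


def f0_symbol_points(
    durations: Sequence[float], f0_values: Sequence[float]
) -> List[Tuple[float, float]]:
    """(x, f0) points sitting on the duration symbols; consecutive points
    would be joined by the connection lines."""
    starts = duration_symbol_positions(durations)[:-1]
    return list(zip(starts, f0_values))


# The 「メ」 example from the description: M (283, 51) and E (252, 89).
positions = duration_symbol_positions([51, 89])
points = f0_symbol_points([51, 89], [283, 252])
```

For 「メ」 this yields boundaries at 0, 51, and 140 on the horizontal axis, with frequency points at (0, 283) and (51, 252).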

The duration display symbols 850 can be slid via the input device 4 in the horizontal direction indicated by arrow X in FIG. 9, and the fundamental frequency display symbols 860 in the vertical direction indicated by arrow Y, each within a preset reference range. Sliding a duration display symbol 850 changes its distance from the adjacent symbol 850 on its left, which increases or decreases the duration of the corresponding character. A fundamental frequency display symbol 860 slides along its duration display symbol 850, which raises or lowers the fundamental frequency of the corresponding character. The arrows X and Y in the figure indicate only the direction of operation, not its range.

The editing screen generation unit 27 generates the prosodic parameter editing screen 800 with the horizontal axis representing duration in milliseconds (1 pixel = 1 ms) and the vertical axis representing frequency in Hz (ranging from the minimum frequency of the target data × 0.8 to the maximum frequency ÷ 0.75). These units are used here, but the graph may of course be generated using other units.
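The axis scaling described above can be sketched in Python as follows. This is a minimal illustration, not the implementation disclosed in the patent: the function names and the `plot_height_px` parameter (the height of the drawing area in pixels) are assumptions introduced here, and a linear vertical scale is assumed.

```python
def y_axis_range(f0_values_hz):
    """Return (lower, upper) bounds of the vertical axis in Hz.

    Follows the text: minimum frequency x 0.8 up to maximum frequency / 0.75.
    """
    return min(f0_values_hz) * 0.8, max(f0_values_hz) / 0.75


def hz_per_pixel(f0_values_hz, plot_height_px):
    """Frequency represented by one vertical pixel (assumed linear scale)."""
    lower, upper = y_axis_range(f0_values_hz)
    return (upper - lower) / plot_height_px


if __name__ == "__main__":
    f0 = [100.0, 120.0, 150.0]  # hypothetical fundamental frequencies in Hz
    print(y_axis_range(f0))
    print(hz_per_pixel(f0, 240))
```

The horizontal axis needs no such coefficient in this embodiment because one pixel is fixed at one millisecond.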

If the prosodic parameter editing screen 800 does not fit within the screen of the display device, a scroll bar is provided so that the screen can be slid left and right. The screen switching means is not limited to a scroll bar; page switching or a function that compresses the graph to display it in full may be provided instead.

The editing screen generation unit 27 displays the prosodic parameter editing screen 800 generated as described above on the display device 5 using the GUI.

The setting reception unit 22 accepts an operation that moves a display symbol (S321).

First, the setting reception unit 22 detects that a display symbol has been moved via the input device 4.

When a click with the mouse, the pointing device of the input device 4, is detected on a duration display symbol 850, the setting reception unit 22 starts accepting a drag that slides the symbol 850 in the direction of arrow X. When a slide in the direction of arrow X is then detected, the prosodic parameter editing unit 23 acquires the displacement of the duration display symbol 850 after the slide and the direction of the slide. It then calculates the rewrite value determined by the slide direction and the displacement, based on the unit of the horizontal axis.

The fundamental frequency is changed in the same way. When a click with the mouse is detected on a fundamental frequency display symbol 860, the prosodic parameter editing unit 23 starts accepting a drag that slides the symbol 860 in the direction of arrow Y. When a slide in the direction of arrow Y is then detected, it acquires the displacement of the fundamental frequency display symbol 860 after the slide and the direction of the slide. It then calculates the rewrite value determined by the slide direction and the displacement, based on the unit of the vertical axis.

Although a mouse is used as the pointing device here, a touch action on a touch panel or the like may of course be used instead.

The calculation of the rewrite value of a prosodic parameter is described concretely with reference to FIG. 10. The arrow A→A′ indicates the displacement of a duration display symbol 850. First, the setting reception unit 22 accepts a slide in the direction A→A′ made with the pointing device. The prosodic parameter editing unit 23 then redraws the duration display symbol 850 being operated, together with every phonological notation, phoneme notation, and display symbol (duration display symbols 850, fundamental frequency display symbols 860, and prosodic connection symbols 870) at later horizontal-axis coordinates, shifted by the same displacement in the slide direction A→A′.
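The redraw rule above, in which the dragged symbol and everything to its right move by the same amount, can be sketched as follows. This is only an illustration under assumed names: `boundaries_px` is a hypothetical list of the horizontal pixel coordinates of the duration display symbols, ordered left to right.

```python
def shift_from(boundaries_px, index, delta_px):
    """Return a new list in which the boundary at `index` and every later
    boundary are moved by `delta_px`; earlier boundaries are unchanged."""
    return [x + delta_px if i >= index else x
            for i, x in enumerate(boundaries_px)]


if __name__ == "__main__":
    boundaries = [0, 80, 150, 230]  # hypothetical symbol positions in pixels
    # Drag the third symbol 15 pixels to the left.
    print(shift_from(boundaries, 2, -15))
```

Because later symbols move with the dragged one, only the duration of the unit immediately to the left of the dragged symbol changes; the durations of the following units are preserved.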

In the slide A→A′, the slide direction along the X axis is negative, so the displacement is expressed as −(A′−A) (in pixels along the X axis). Because 1 pixel = 1 ms in this case, the prosodic parameter editing unit 23 subtracts the number of milliseconds equal to the pixel displacement along the X axis from the duration (A−A0) of the corresponding phoneme notation to obtain the rewrite value of the duration parameter.
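As a worked illustration of this calculation, the sketch below uses a signed displacement, so that a leftward drag (A′ < A) shortens the duration and a rightward drag lengthens it. The function and argument names are assumptions made for this example, with `a0_px` standing for the coordinate A0 of the adjacent symbol on the left.

```python
MS_PER_PIXEL = 1.0  # the embodiment sets 1 pixel = 1 ms


def rewrite_duration_ms(a0_px, a_px, a_new_px, ms_per_pixel=MS_PER_PIXEL):
    """Return the rewritten duration after sliding a symbol from A to A'.

    a0_px    -- X coordinate of the symbol on the left (A0)
    a_px     -- X coordinate of the dragged symbol before the slide (A)
    a_new_px -- X coordinate of the dragged symbol after the slide (A')
    """
    current_ms = (a_px - a0_px) * ms_per_pixel   # duration before the slide
    delta_ms = (a_new_px - a_px) * ms_per_pixel  # signed change; negative = left
    return current_ms + delta_ms


if __name__ == "__main__":
    # Hypothetical coordinates: A0 = 100, A = 180, A' = 165.
    print(rewrite_duration_ms(100, 180, 165))
```

With a leftward slide, adding the negative signed change is the same as subtracting the magnitude −(A′−A) described in the text, and the result reduces to (A′−A0) milliseconds.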

The arrow B→B′ indicates the displacement of a fundamental frequency display symbol 860. First, the setting reception unit 22 accepts a slide in the direction B→B′ made with the pointing device. The prosodic parameter editing unit 23 then redraws the fundamental frequency display symbol 860 being operated at the coordinate B′. The prosodic connection symbols 870 move together with the symbol 860.

In the slide B→B′, the slide direction along the Y axis is positive, so the displacement is expressed as (B′−B) (in pixels along the Y axis). The prosodic parameter editing unit 23 multiplies this displacement by a coefficient (Hz/pixel) giving the frequency that corresponds to one pixel, which is determined by the Y-axis range of the graph (minimum frequency of the target data × 0.8 to maximum frequency ÷ 0.75), to obtain a change in fundamental frequency. It adds this change to the fundamental frequency B of the corresponding phonological notation to obtain the rewrite value of the fundamental frequency parameter.
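The same calculation for the fundamental frequency might look like the sketch below. It assumes that the displacement is measured in graph coordinates, with upward movement positive, and that the Hz-per-pixel coefficient has already been derived from the Y-axis range; all names are hypothetical.

```python
def rewrite_f0_hz(current_f0_hz, b_px, b_new_px, hz_per_px):
    """Return the rewritten fundamental frequency after sliding B to B'.

    current_f0_hz -- fundamental frequency before the slide, in Hz
    b_px, b_new_px -- Y coordinates before and after the slide
                      (graph coordinates, upward positive: an assumption)
    hz_per_px     -- frequency represented by one vertical pixel
    """
    displacement_px = b_new_px - b_px
    return current_f0_hz + displacement_px * hz_per_px


if __name__ == "__main__":
    # Hypothetical values: 120 Hz, dragged up by 20 pixels at 0.5 Hz/pixel.
    print(rewrite_f0_hz(120.0, 50, 70, 0.5))
```

On a display whose pixel rows are numbered from the top, the sign of the displacement would need to be inverted before calling this function.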

The prosodic parameter editing unit 23 checks that the rewrite values calculated by the above processing for the duration, the fundamental frequency, and the duration of phrase breaks are within the predetermined range of reference values.

If a rewrite value is outside the predetermined range, the prosodic parameter editing unit 23 displays an error screen (S145). The error screen shows an error message such as "Set the frequency within ~ Hz." or "Set the duration to ~ ms or more."

For the duration, at the same time as it displays the error message, the prosodic parameter editing unit 23 moves the duration display symbol 850 to the coordinate of the upper limit if the rewrite value exceeds the upper limit, or to the coordinate of the lower limit if the rewrite value falls below the lower limit. For the fundamental frequency, it moves the fundamental frequency display symbol 860 back to the coordinate it occupied immediately before the slide that exceeded the reference value and triggered the error message.
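The two out-of-range behaviors differ: the duration is clamped to the nearest limit, while the fundamental frequency reverts to its value before the slide. A minimal sketch of that distinction follows. The function names, the limit arguments, and the wording of the messages are assumptions for illustration; the patent does not give concrete limit values.

```python
def settle_duration(new_ms, min_ms, max_ms):
    """Clamp an out-of-range duration to the nearest limit.

    Returns (value_to_display, error_message_or_None).
    """
    if new_ms > max_ms:
        return max_ms, f"Set the duration to {max_ms} ms or less."
    if new_ms < min_ms:
        return min_ms, f"Set the duration to {min_ms} ms or more."
    return new_ms, None


def settle_f0(new_hz, previous_hz, min_hz, max_hz):
    """Revert an out-of-range fundamental frequency to its previous value.

    Returns (value_to_display, error_message_or_None).
    """
    if not (min_hz <= new_hz <= max_hz):
        return previous_hz, f"Set the frequency between {min_hz} and {max_hz} Hz."
    return new_hz, None


if __name__ == "__main__":
    print(settle_duration(500, 20, 400))       # clamped to the upper limit
    print(settle_f0(900.0, 120.0, 50.0, 500.0))  # reverted to 120.0
```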

If the rewrite value is within the reference range, the prosodic parameter editing unit 23 stores it in the work data storage area 34 as an updated prosodic parameter 650. It then performs rewrite processing that replaces the prosodic parameter in the intermediate language data 670 with the value of the updated prosodic parameter 650 (S324), and stores the updated intermediate language data in the work data storage area 34 as intermediate language update data 680 (S325).
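The store-and-rewrite step can be sketched as below. The data layout is purely an assumption for illustration: the intermediate language data is modeled as a list of dictionaries, one per prosodic control unit, which is not the format used in the patent. The sketch keeps the original data intact and returns an updated copy, mirroring the separate storage of data 670 and update data 680.

```python
import copy


def apply_update(intermediate, unit_index, name, value):
    """Return (updated_copy, update_record); the input is left untouched.

    intermediate -- hypothetical list of per-unit parameter dictionaries
    unit_index   -- index of the prosodic control unit being edited
    name         -- parameter key, e.g. "duration_ms" or "f0_hz" (assumed keys)
    value        -- validated rewrite value
    """
    updated = copy.deepcopy(intermediate)
    updated[unit_index][name] = value
    record = {"unit": unit_index, "parameter": name, "value": value}
    return updated, record


if __name__ == "__main__":
    data = [{"duration_ms": 80, "f0_hz": 120.0},
            {"duration_ms": 70, "f0_hz": 125.0}]
    new_data, rec = apply_update(data, 1, "f0_hz", 130.0)
    print(new_data)
    print(rec)
```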

The speech synthesis unit 25 performs waveform synthesis from the intermediate language update data 680 generated by the prosodic parameter editing processing described above (S4). The speech output unit 26 then outputs the synthesized waveform via the output device 6.

In this embodiment, the fundamental frequency can be changed only per phonological notation, that is, only at each vowel start frequency, and the consonant start frequency is set automatically according to the value of the vowel start frequency. However, an editing screen on which the fundamental frequency can be set for each phoneme notation may also be configured (see FIG. 11).

An editing screen may also be configured in which fundamental frequency editing points 880, at which the fundamental frequency can be edited, are additionally provided within a phoneme notation (see FIG. 12). A function may also be provided that lets the user freely specify coordinates within a phoneme notation and place a fundamental frequency editing point 880 at any position.

The first embodiment has been described above. The first embodiment provides a prosodic parameter editing screen on which prosodic parameters can be viewed and edited on a graph. As a result, even a user with little specialized knowledge can adjust prosody visually and intuitively with simple operations. A user with specialized knowledge, on the other hand, can specify concrete values for the prosodic parameters and is not limited to predetermined prosodic patterns. In this way, the present invention improves usability.

The present invention has been described above in connection with exemplary embodiments. Many alternatives, modifications, and variations will be apparent to those skilled in the art. Accordingly, the embodiments described above are intended to illustrate the spirit and scope of the present invention, not to limit them.

FIG. 1 is a block diagram showing the configuration of the speech synthesizer of the first embodiment.
FIG. 2 is a block diagram showing the functional configuration of the speech synthesizer of the first embodiment.
FIG. 3 is a schematic diagram showing the configuration of the work data storage area.
FIG. 4 is a flowchart showing the overall processing performed by the speech synthesizer of the first embodiment.
FIG. 5 is a schematic diagram showing the configuration of the speech corpus.
FIG. 6 is a flowchart explaining the intermediate language generation processing.
FIG. 7 is a flowchart explaining the prosodic parameter editing processing.
FIG. 8 is a schematic diagram showing a display example of the text input screen.
FIG. 9 is a schematic diagram showing a display example of the prosodic parameter editing screen.
FIG. 10 is a schematic diagram showing a display example of the prosodic parameter editing screen while it accepts an editing operation.
FIG. 11 is a schematic diagram showing a display example of a prosodic parameter editing screen on which the fundamental frequency can be specified for each phoneme.
FIG. 12 is a schematic diagram showing a display example of a prosodic parameter editing screen further provided with points within a phoneme at which the fundamental frequency can be specified.
FIG. 13 is an explanatory diagram showing an example of morphological analysis data, phoneme analysis data, and intermediate language data.

Explanation of symbols

10 ... speech synthesizer, 1 ... CPU, 2 ... main storage device, 3 ... external storage device, 4 ... input device, 5 ... display device, 6 ... output device, 7 ... bus
20 ... control unit, 22 ... setting reception unit, 23 ... prosodic parameter editing unit, 24 ... intermediate language generation unit, 25 ... speech synthesis unit, 26 ... speech output unit, 27 ... editing screen generation unit
30 ... storage unit, 32 ... dictionary data storage area, 33 ... speech corpus storage area, 34 ... work data storage area, 700 ... dictionary
610 ... text data, 620 ... morphological analysis data, 630 ... phoneme analysis data, 640 ... reference prosodic parameter, 650 ... updated prosodic parameter, 660 ... search result data, 670 ... intermediate language data, 680 ... intermediate language update data
400 ... speech corpus, 4100, 4200 ... data sets, 410 ... character string notation data, 420 ... speech waveform data, 430 ... fundamental frequency data, 440 ... duration data, 450 ... morpheme division data, 460 ... phoneme division data
601 ... prosody edit button, 602 ... input text setting field
800 ... prosodic parameter editing screen, 830 ... phonological notation character string, 840 ... phoneme notation character string, 850 ... duration display symbol, 860 ... fundamental frequency display symbol, 870 ... prosodic connection symbol, 880 ... fundamental frequency editing point
910 ... phonological notation, 920 ... phoneme notation, 930 ... duration parameter, 940 ... fundamental frequency parameter

Claims

A speech synthesizer that synthesizes speech corresponding to an input character string, the speech synthesizer comprising:
storage means for storing a speech corpus in which prosodic parameters, including at least information specifying accent, duration, and fundamental frequency, and speech data are accumulated for each prosodic control unit, the prosodic control unit corresponding to at least one of a phonological notation character and a phoneme notation character;
intermediate language generation means for dividing the input character string into prosodic control units and generating an intermediate language in which prosodic parameters are associated with each of the divided prosodic control units; and
prosodic parameter editing screen generation means for forming a graph whose horizontal and vertical axes carry the values of a first parameter and a second parameter, respectively, among the parameters included in the prosodic parameters, generating a prosodic parameter editing screen in which a predetermined display symbol is displayed on the graph, for each of the divided prosodic control units, at the coordinate position specified by the first parameter and the second parameter, and displaying the screen on display means.

The speech synthesizer according to claim 1, further comprising:
prosodic parameter rewriting means for receiving, via input means, a displacement of the coordinate position of a display symbol on the prosodic parameter editing screen displayed on the display means, and changing the value of the prosodic parameter of the intermediate language corresponding to the prosodic control unit whose display symbol was displaced to the prosodic parameter value specified by the coordinate position of the display symbol after the displacement.

The speech synthesizer according to claim 1,
wherein the first parameter is duration and the second parameter is fundamental frequency.

The speech synthesizer according to claim 1 or 3,
wherein the prosodic parameter editing screen generation means generates a prosodic parameter editing screen that further displays, laid out along the horizontal axis, a phonological notation character string, a phoneme notation character string, or both, as the prosodic control units.

The speech synthesizer according to claim 3 or 4,
wherein the prosodic parameter editing screen generation means generates a prosodic parameter editing screen that displays, in correspondence with the prosodic control units, a first display symbol representing the duration parameter on the horizontal axis, a second display symbol representing the fundamental frequency parameter on the vertical axis, and connecting lines that join adjacent second display symbols,
the first display symbol being placed at the coordinate corresponding to the duration parameter value, and the second display symbol being placed on the first display symbol at the coordinate corresponding to the fundamental frequency parameter value.

A program that causes a computer to function as a speech synthesizer that synthesizes speech corresponding to an input character string, the program causing the computer to function as:
storage means for storing a speech corpus in which prosodic parameters, including at least information specifying accent, duration, and fundamental frequency, and speech data are accumulated for each prosodic control unit, the prosodic control unit corresponding to at least one of a phonological notation character and a phoneme notation character;
intermediate language generation means for dividing the input character string into prosodic control units and generating an intermediate language in which prosodic parameters are associated with each of the divided prosodic control units; and
prosodic parameter editing screen generation means for forming a graph whose horizontal and vertical axes carry the values of a first parameter and a second parameter, respectively, among the parameters included in the prosodic parameters, generating a prosodic parameter editing screen in which a predetermined display symbol is displayed on the graph, for each of the divided prosodic control units, at the coordinate position specified by the first parameter and the second parameter, and causing display means to display the screen.

The program according to claim 6, further causing the computer to function as:
prosodic parameter rewriting means for receiving, via input means, a displacement of the coordinate position of a display symbol on the prosodic parameter editing screen displayed on the display means, and changing the value of the prosodic parameter of the intermediate language corresponding to the prosodic control unit whose display symbol was displaced to the prosodic parameter value specified by the coordinate position of the display symbol after the displacement.

A speech synthesis method in a speech synthesizer that synthesizes speech corresponding to an input character string and that comprises storage means for storing a speech corpus in which prosodic parameters, including at least information specifying accent, duration, and fundamental frequency, and speech data are accumulated for each prosodic control unit, the prosodic control unit corresponding to at least one of a phonological notation character and a phoneme notation character, the method comprising:
a step in which intermediate language generation means of the speech synthesizer divides the input character string into prosodic control units and generates an intermediate language in which prosodic parameters are associated with each of the divided prosodic control units; and
a step in which prosodic parameter editing screen generation means of the speech synthesizer forms a graph whose horizontal and vertical axes carry the values of a first parameter and a second parameter, respectively, among the parameters included in the prosodic parameters, generates a prosodic parameter editing screen in which a predetermined display symbol is displayed on the graph, for each of the divided prosodic control units, at the coordinate position specified by the first parameter and the second parameter, and displays the screen on display means.

The speech synthesis method according to claim 8, further comprising:
a step in which prosodic parameter rewriting means of the speech synthesizer receives, via input means, a displacement of the coordinate position of a display symbol on the prosodic parameter editing screen displayed on the display means, and changes the value of the prosodic parameter of the intermediate language corresponding to the prosodic control unit whose display symbol was displaced to the prosodic parameter value specified by the coordinate position of the display symbol after the displacement.