JP4411017B2

JP4411017B2 - SPEECH SPEED CONVERTER, SPEECH SPEED CONVERSION METHOD, AND PROGRAM

Info

Publication number: JP4411017B2
Application number: JP2003161588A
Authority: JP
Inventors: 寧佐藤
Original assignee: Kenwood KK
Current assignee: Kenwood KK
Priority date: 2003-06-06
Filing date: 2003-06-06
Publication date: 2010-02-10
Anticipated expiration: 2023-06-06
Also published as: JP2004361766A

Abstract

PROBLEM TO BE SOLVED: To provide a speech speed conversion device and the like for obtaining synthesized speech that is easy to hear even when the environment changes.
SOLUTION: When supplied with data representing a fixed-form message, a sound piece editing unit 8 retrieves from a sound piece database 10 the sound piece data of sound pieces whose readings match those of the sound pieces in the fixed-form message, determines an utterance speed, and has a speech speed conversion unit 11 convert the sound piece data to match the determined speed. The utterance speed is determined based on the speed of a moving body detected by a speed detection unit 12. The sound piece editing unit 8 also performs prosody prediction on the fixed-form message and selects, one per sound piece, the retrieved sound piece data that best matches each sound piece in the message. The data representing the synthesized speech is generated by joining the selected sound piece data with waveform data that an acoustic processing unit 4 supplies for any sound piece for which no sound piece data could be selected.
COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は、話速変換装置、話速変換方法及びプログラムに関する。
【０００２】
【従来の技術】
音声を合成する手法として、録音編集方式と呼ばれる手法がある。録音編集方式は、駅の音声案内システムや、車載用のナビゲーション装置などに用いられている。
録音編集方式は、単語と、この単語を読み上げる音声を表す音声データとを対応付けておき、音声合成する対象の文章を単語に区切ってから、これらの単語に対応付けられた音声データを取得してつなぎ合わせる、という手法である（例えば、特許文献１参照）。
【０００３】
【特許文献１】
特開平１０−４９１９３号公報
【０００４】
【発明が解決しようとする課題】
しかし、音声データを単につなぎ合わせた場合、合成音声の発声スピード（発声する時間の長さ）は、あらかじめ用意されている音声データの発声スピードにより決まる値になる。一方、人の聴覚の特性は、音声を聞く人がいる場所や、移動速度、周囲の状況、また車内などにおいても、車のいる場所、車の速度、車の周囲の状況、車内の状況などの諸環境の条件に大きく影響されるので、これらの要因が変化すれば、同一の音声でも聞こえ方が大きく変化する。
【０００５】
この発明は、上記実状に鑑みてなされたものであり、環境が変化しても聴き取りやすい合成音声を得るための話速変換装置、話速変換方法及びプログラムを提供することを目的とする。
【０００６】
【課題を解決するための手段】
上記目的を達成すべく、この発明の第１の観点にかかる話速変換装置は、
音声の波形を表す音声データを取得する音声データ取得手段と、
移動体の速度又は加速度を検出して、当該検出された速度又は加速度に基づき、当該音声のスピードを指定する話速設定データを生成する話速設定データ生成手段と、
前記取得された音声データのスピードを、前記生成された話速設定データにより表されるスピードに基づいて変換する音声データ変換手段と、
を備え、
前記話速設定データ生成手段は、前記移動体の加速度のピークを検出し、当該検出された最新のピークを所定の複数のランクのいずれかに分類して記憶し、過去に検出されたピークが、最新のピークが分類されたランクと同一のランクへと過去どのような頻度で分類されたかを特定して、当該特定された結果に基づいて当該話速設定データを生成する、
ことを特徴とする。
【０００７】
前記音声データ変換手段は、前記取得された音声データをサンプリングする変換を行うことにより、変換後の音声データが表す音声のスピードが、当該話速設定データにより表されるスピードとなるようにするものであってもよい。
【０００８】
前記音声データ変換手段は、前記取得された音声データが表す波形のうち実質的に無音状態を表している部分を特定し、当該特定された部分の時間長を変化させる変換を行うことにより、当該音声データが表す音声のスピードを、当該話速設定データにより表されるスピードに基づいたスピードとなるようにするものであってもよい。
【００１０】
また、この発明の第２の観点にかかる話速変換装置は、
同一の読みの語句を互いに異なるスピードで発声する複数の音声の波形を表す複数の音声データを記憶する音声データ記憶手段と、
移動体の速度又は加速度を検出し、当該検出された速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成する話速設定データ生成手段と、
前記音声データ変換手段に記憶されている音声データのうち、前記生成された話速設定データにより表されるスピードに最も近いスピードである音声データを選択する音声データ選択手段と、
を備え、
前記話速設定データ生成手段は、前記移動体の加速度のピークを検出し、当該検出された最新のピークを所定の複数のランクのいずれかに分類して前記音声データ記憶手段に記憶し、過去に検出されたピークが、最新のピークが分類されたランクと同一のランクへと過去どのような頻度で分類されたかを特定して、当該特定された結果に基づいて当該話速設定データを生成する、
ことを特徴とする。
【００１３】
また、この発明の第３の観点にかかる話速変換方法は、
記憶部を有する話速変換装置にて実行される話速変換方法であって、
音声の波形を表す音声データを取得する音声データ取得ステップと、
移動体の速度又は加速度を検出して、当該検出された速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成する話速設定データ生成ステップと、
前記取得された音声データのスピードを、前記生成された話速設定データにより表されるスピードに基づいて変換する音声データ変換ステップと、
を備え、
前記話速設定データ生成ステップは、前記移動体の加速度のピークを検出し、当該検出された最新のピークを所定の複数のランクのいずれかに分類して前記記憶部に記憶し、過去に検出されたピークが、最新のピークが分類されたランクと同一のランクへと過去どのような頻度で分類されたかを特定して、当該特定された結果に基づいて当該話速設定データを生成する、
ことを特徴とする。
【００１４】
また、この発明の第４の観点にかかる話速変換方法は、
記憶部を有する話速変換装置にて実行される話速変換方法であって、
前記記憶部には、同一の読みの語句を互いに異なるスピードで発声する複数の音声の波形を表す複数の音声データが記憶され、
移動体の速度又は加速度を検出して、当該検出された速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成する話速設定データ生成ステップと、
前記記憶されている音声データのうち、前記生成された話速設定データにより表されるスピードに最も近いスピードである音声データを選択する音声データ選択ステップと、
を備え、
前記話速設定データ生成ステップは、前記移動体の加速度のピークを検出し、当該検出された最新のピークを所定の複数のランクのいずれかに分類して前記記憶部に記憶し、過去に検出されたピークが、最新のピークが分類されたランクと同一のランクへと過去どのような頻度で分類されたかを特定して、当該特定された結果に基づいて話速設定データを生成する、
ことを特徴とする。
【００１６】
また、この発明の第５の観点にかかるプログラムは、
移動体の速度又は加速度を検出する装置を備えたコンピュータを、
音声の波形を表す音声データを取得する音声データ取得手段と、
前記移動体の速度又は加速度を検出して、当該検出された速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成する話速設定データ生成手段と、
前記取得された音声データのスピードを、前記生成された話速設定データにより表されるスピードに基づいて変換する音声データ変換手段と、
して機能させ、
前記話速設定データ生成手段は、前記移動体の加速度のピークを検出し、当該検出された最新のピークを所定の複数のランクのいずれかに分類して記憶し、過去に検出されたピークが、最新のピークが分類されたランクと同一のランクへと過去どのような頻度で分類されたかを特定して、当該特定された結果に基づいて話速設定データを生成する、
ことを特徴とする。
【００１７】
また、この発明の第６の観点にかかるプログラムは、
移動体の速度又は加速度を検出する装置を備えたコンピュータを、
同一の読みの語句を互いに異なるスピードで発声する複数の音声の波形を表す複数の音声データを記憶する音声データ記憶手段と、
前記移動体の速度又は加速度を検出して、当該検出された速度又は加速度に基づき、音声のスピードを指定する話速設定データを生成する話速設定データ生成手段と、
前記音声データ変換手段に記憶されている音声データのうち、前記生成された話速設定データにより表されるスピードに最も近いスピードである音声データを選択する音声データ選択手段と、
して機能させ、
前記話速設定データ生成手段は、前記移動体の加速度のピークを検出し、当該検出された最新のピークを所定の複数のランクのいずれかに分類して前記音声データ記憶手段に記憶し、過去に検出されたピークが、最新のピークが分類されたランクと同一のランクへと過去どのような頻度で分類されたかを特定して、当該特定された結果に基づいて話速設定データを生成する、
ことを特徴とする。
【００１９】
【発明の実施の形態】
以下、この発明の実施の形態を、車両などの移動体に搭載されて利用される音声合成システムを例とし、図面を参照して説明する。
図１は、この発明の実施の形態に係る音声合成システムの構成を示す図である。図示するように、この音声合成システムは、本体ユニットＭと、音片登録ユニットＲとにより構成されている。
【００２０】
本体ユニットＭは、言語処理部１と、一般単語辞書２と、ユーザ単語辞書３と、音響処理部４と、検索部５と、伸長部６と、波形データベース７と、音片編集部８と、検索部９と、音片データベース１０と、話速変換部１１と、速度検出部１２とにより構成されている。
【００２１】
言語処理部１、音響処理部４、検索部５、伸長部６、音片編集部８、検索部９及び話速変換部１１は、いずれも、ＣＰＵ（Central Processing Unit）やＤＳＰ（Digital Signal Processor）等のプロセッサや、このプロセッサが実行するためのプログラムを記憶するメモリなどより構成されており、それぞれ後述する処理を行う。
なお、言語処理部１、音響処理部４、検索部５、伸長部６、音片編集部８、検索部９及び話速変換部１１の一部又は全部の機能を単一のプロセッサが行うようにしてもよい。
【００２２】
一般単語辞書２は、ＰＲＯＭ（Programmable Read Only Memory）やハードディスク装置等の不揮発性メモリより構成されている。一般単語辞書２には、表意文字（例えば、漢字など）を含む単語等と、この単語等の読みを表す表音文字（例えば、カナや発音記号など）とが、この音声合成システムの製造者等によって、あらかじめ互いに対応付けて記憶されている。
【００２３】
ユーザ単語辞書３は、ＥＥＰＲＯＭ（Electrically Erasable/Programmable Read Only Memory）やハードディスク装置等のデータ書き換え可能な不揮発性メモリと、この不揮発性メモリへのデータの書き込みを制御する制御回路とにより構成されている。なお、プロセッサがこの制御回路の機能を行ってもよく、言語処理部１、音響処理部４、検索部５、伸長部６、音片編集部８、検索部９及び話速変換部１１の一部又は全部の機能を行うプロセッサがユーザ単語辞書３の制御回路の機能を行うようにしてもよい。
ユーザ単語辞書３は、表意文字を含む単語等と、この単語等の読みを表す表音文字とを、ユーザの操作に従って外部より取得し、互いに対応付けて記憶する。ユーザ単語辞書３には、一般単語辞書２に記憶されていない単語等とその読みを表す表音文字とが格納されていれば十分である。
【００２４】
波形データベース７は、ＰＲＯＭやハードディスク装置等の不揮発性メモリより構成されている。波形データベース７には、表音文字と、この表音文字が表す単位音声の波形を表す波形データをエントロピー符号化して得られる圧縮波形データとが、この音声合成システムの製造者等によって、あらかじめ互いに対応付けて記憶されている。単位音声は、規則合成方式の手法で用いられる程度の短い音声であり、具体的には、音素や、ＶＣＶ（Vowel-Consonant-Vowel）音節などの単位で区切られる音声である。なお、エントロピー符号化される前の波形データは、例えば、ＰＣＭ（Pulse Code Modulation）されたデジタル形式のデータからなっていればよい。
【００２５】
音片データベース１０は、ＰＲＯＭやハードディスク装置等の不揮発性メモリより構成されている。
音片データベース１０には、例えば、図２に示すデータ構造を有するデータが記憶されている。すなわち、図示するように、音片データベース１０に格納されているデータは、ヘッダ部ＨＤＲ、インデックス部ＩＤＸ、ディレクトリ部ＤＩＲ及びデータ部ＤＡＴの４種に分かれている。
【００２６】
なお、音片データベース１０へのデータの格納は、例えば、この音声合成システムの製造者によりあらかじめ行われ、及び／又は、音片登録ユニットＲが後述する動作を行うことにより行われる。
【００２７】
ヘッダ部ＨＤＲには、音片データベース１０を識別するデータや、インデックス部ＩＤＸ、ディレクトリ部ＤＩＲ及びデータ部ＤＡＴのデータ量、データの形式、著作権等の帰属などを示すデータが格納される。
【００２８】
データ部ＤＡＴには、音片の波形を表す音片データをエントロピー符号化して得られる圧縮音片データが格納されている。
なお、音片とは、音声のうち音素１個以上を含む連続した１区間をいい、通常は単語１個分又は複数個分の区間からなる。
また、エントロピー符号化される前の音片データは、上述の圧縮波形データの生成のためエントロピー符号化される前の波形データと同じ形式のデータ（例えば、ＰＣＭされたデジタル形式のデータ）からなっていればよい。
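[0029]
The directory section DIR stores, for each item of compressed sound piece data,
(A) data representing the phonograms that indicate the reading of the sound piece represented by this compressed sound piece data (sound piece reading data),
(B) data representing the start address of the storage location where this compressed sound piece data is stored,
(C) data representing the data length of this compressed sound piece data,
(D) data representing the utterance speed (the time length when played back) of the sound piece represented by this compressed sound piece data (speed initial value data), and
(E) data representing the change over time of the frequency of the pitch component of this sound piece (pitch component data),
in a form in which they are associated with one another. (The storage area of the sound piece database 10 is assumed to be addressed.)
[0030]
FIG. 2 illustrates a case in which, as data included in the data section DAT, compressed sound piece data of 1410h bytes representing the waveform of a sound piece whose reading is "サイタマ" (saitama) is stored at a logical location starting at address 001A36A6h. (In this specification and the drawings, a number with a trailing "h" is hexadecimal.)
[0031]
Of the set of data (A) to (E) above, at least the data (A) (that is, the sound piece reading data) is stored in the storage area of the sound piece database 10 sorted according to an order determined from the phonograms it represents (for example, if the phonograms are kana, arranged in descending address order following the gojūon order).
[0032]
The index section IDX stores data for identifying, from the sound piece reading data, the approximate logical location of data in the directory section DIR. Specifically, for example, assuming that the sound piece reading data represents kana, each kana character is stored in association with data indicating the range of addresses in which the sound piece reading data whose first character is that kana character is located.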
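The two-level lookup described for the index section IDX and the directory section DIR can be illustrated with a minimal Python sketch. It is not the patented implementation: the `DirEntry` class, the field names, the millisecond unit, the pitch layout, and all sample entries except the "サイタマ" address and length taken from FIG. 2 are assumptions, and sorting by Unicode code point is only a stand-in for gojūon order.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass(frozen=True)
class DirEntry:
    reading: str        # (A) sound piece reading data
    address: int        # (B) start address of the compressed data in DAT
    length: int         # (C) data length in bytes
    duration_ms: int    # (D) speed initial value; milliseconds is an assumption
    pitch: tuple        # (E) pitch component data; this layout is an assumption


def build_index(directory):
    """Map each first character to the [start, end) range of entries in the sorted directory."""
    index = {}
    for i, entry in enumerate(directory):
        start, _ = index.get(entry.reading[0], (i, i))
        index[entry.reading[0]] = (start, i + 1)
    return index


def find_entries(directory, index, reading):
    """Return all entries whose reading equals `reading` (several candidates may exist)."""
    if not reading or reading[0] not in index:
        return []
    start, end = index[reading[0]]
    keys = [entry.reading for entry in directory[start:end]]
    pos = bisect_left(keys, reading)
    matches = []
    while pos < len(keys) and keys[pos] == reading:
        matches.append(directory[start + pos])
        pos += 1
    return matches


# Only the first entry's reading, address and length come from FIG. 2; the rest is made up.
DIRECTORY = sorted(
    [
        DirEntry("サイタマ", 0x001A36A6, 0x1410, 600, (120.0, 118.0)),
        DirEntry("サイタマ", 0x001A4AB6, 0x1200, 480, (125.0, 121.0)),
        DirEntry("サカ", 0x001A5CB6, 0x0800, 300, (115.0, 110.0)),
        DirEntry("トウキョウ", 0x001A64B6, 0x1800, 700, (110.0, 112.0)),
    ],
    key=lambda entry: entry.reading,
)
INDEX = build_index(DIRECTORY)
```

Calling `find_entries(DIRECTORY, INDEX, "サイタマ")` returns both candidate entries, which mirrors the statement that every matching item is retrieved as a candidate for synthesis.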
【００３３】
なお、一般単語辞書２、ユーザ単語辞書３、波形データベース７及び音片データベース１０の一部又は全部の機能を単一の不揮発性メモリが行うようにしてもよい。
【００３４】
速度検出部１２は、例えば、速度センサより構成される。速度検出部１２は、この音声合成システムが搭載されている移動体の移動速度を検出し、検出した移動速度を示すデータを生成して音片編集部８へと供給する。
【００３５】
音片登録ユニットＲは、図１に示すように、収録音片データセット記憶部１３と、音片データベース作成部１４と、圧縮部１５とにより構成されている。なお、音片登録ユニットＲは音片データベース１０とは着脱可能に接続されていてもよく、この場合は、音片データベース１０に新たにデータを書き込むときを除いては、音片登録ユニットＲを本体ユニットＭから切り離した状態で本体ユニットＭに後述の動作を行わせてよい。
【００３６】
収録音片データセット記憶部１３は、ハードディスク装置等のデータ書き換え可能な不揮発性メモリより構成されている。
収録音片データセット記憶部１３には、音片の読みを表す表音文字と、この音片を人が実際に発声したものを集音して得た波形を表す音片データとが、この音声合成システムの製造者等によって、あらかじめ互いに対応付けて記憶されている。なお、この音片データは、例えば、ＰＣＭされたデジタル形式のデータからなっていればよい。
【００３７】
音片データベース作成部１４及び圧縮部１５は、ＣＰＵ等のプロセッサや、このプロセッサが実行するためのプログラムを記憶するメモリなどより構成されており、このプログラムに従って後述する処理を行う。
【００３８】
なお、音片データベース作成部１４及び圧縮部１５の一部又は全部の機能を単一のプロセッサが行うようにしてもよく、また、言語処理部１、音響処理部４、検索部５、伸長部６、音片編集部８、検索部９及び話速変換部１１の一部又は全部の機能を行うプロセッサが音片データベース作成部１４や圧縮部１５の機能を更に行ってもよい。また、音片データベース作成部１４や圧縮部１５の機能を行うプロセッサが、収録音片データセット記憶部１３の制御回路の機能を兼ねてもよい。
【００３９】
音片データベース作成部１４は、収録音片データセット記憶部１３より、互いに対応付けられている表音文字及び音片データを読み出し、この音片データが表す音声のピッチ成分の周波数の時間変化と、発声スピードとを特定する。
発声スピードの特定は、例えば、この音片データのサンプル数を数えることにより特定すればよい。
【００４０】
一方、ピッチ成分の周波数の時間変化は、例えば、この音片データにケプストラム解析を施すことにより特定すればよい。具体的には、例えば、音片データが表す波形を時間軸上で多数の小部分へと区切り、得られたそれぞれの小部分の強度を、元の値の対数（対数の底は任意）に実質的に等しい値へと変換し、値が変換されたこの小部分のスペクトル（すなわち、ケプストラム）を、高速フーリエ変換の手法（あるいは、離散的変数をフーリエ変換した結果を表すデータを生成する他の任意の手法）により求める。そして、このケプストラムのピークを与える周波数のうちの最小値を、この小部分におけるピッチ成分の周波数として特定する。
【００４１】
なお、ピッチ成分の周波数の時間変化は、例えば、特開２００３−１０８１７２号公報に開示された手法に従って音片データをピッチ波形データへと変換してから、このピッチ波形データに基づいて特定するようにすると良好な結果が期待できる。具体的には、音片データをフィルタリングしてピッチ信号を抽出し、抽出されたピッチ信号に基づいて、音片データが表す波形を単位ピッチ長の区間へと区切り、各区間について、ピッチ信号との相関関係に基づいて位相のずれを特定して各区間の位相を揃えることにより、音片データをピッチ波形信号へと変換すればよい。そして、得られたピッチ波形信号を音片データとして扱い、ケプストラム解析を行う等することにより、ピッチ成分の周波数の時間変化を特定すればよい。
【００４２】
一方、音片データベース作成部１４は、収録音片データセット記憶部１３より読み出した音片データを圧縮部１５に供給する。
圧縮部１５は、音片データベース作成部１４より供給された音片データをエントロピー符号化して圧縮音片データを作成し、音片データベース作成部１４に返送する。
【００４３】
音片データの発声スピード及びピッチ成分の周波数の時間変化を特定し、この音片データがエントロピー符号化され圧縮音片データとなって圧縮部１５より返送されると、音片データベース作成部１４は、この圧縮音片データを、データ部ＤＡＴを構成するデータとして、音片データベース１０の記憶領域に書き込む。
【００４４】
また、音片データベース作成部１４は、書き込んだ圧縮音片データが表す音片の読みを示すものとして収録音片データセット記憶部１３より読み出した表音文字を、音片読みデータとして音片データベース１０の記憶領域に書き込む。
また、書き込んだ圧縮音片データの、音片データベース１０の記憶領域内での先頭のアドレスを特定し、このアドレスを上述の（Ｂ）のデータとして音片データベース１０の記憶領域に書き込む。
また、この圧縮音片データのデータ長を特定し、特定したデータ長を、（Ｃ）のデータとして音片データベース１０の記憶領域に書き込む。
また、この圧縮音片データが表す音片の発声スピード及びピッチ成分の周波数の時間変化を特定した結果を示すデータを生成し、スピード初期値データ及びピッチ成分データとして音片データベース１０の記憶領域に書き込む。
【００４５】
次に、この音声合成システムの動作を説明する。
まず、言語処理部１が、この音声合成システムに音声を合成させる対象としてユーザが用意した、表意文字を含む文章（フリーテキスト）を記述したフリーテキストデータを外部から取得したとして説明する。
【００４６】
なお、言語処理部１がフリーテキストデータを取得する手法は任意であり、例えば、図示しないインターフェース回路を介して外部の装置やネットワークから取得してもよいし、図示しない記録媒体ドライブ装置にセットされた記録媒体（例えば、フロッピー（登録商標）ディスクやＣＤ−ＲＯＭなど）から、この記録媒体ドライブ装置を介して読み取ってもよい。また、言語処理部１の機能を行っているプロセッサが、自ら実行している他の処理で用いたテキストデータを、フリーテキストデータとして、言語処理部１の処理へと引き渡すようにしてもよい。
【００４７】
フリーテキストデータを取得すると、言語処理部１は、このフリーテキストに含まれるそれぞれの表意文字について、その読みを表す表音文字を、一般単語辞書２やユーザ単語辞書３を検索することにより特定する。そして、この表意文字を、特定した表音文字へと置換する。そして、言語処理部１は、フリーテキスト内の表意文字がすべて表音文字へと置換した結果得られる表音文字列を、音響処理部４へと供給する。
【００４８】
音響処理部４は、言語処理部１より表音文字列を供給されると、この表音文字列に含まれるそれぞれの表音文字について、当該表音文字が表す単位音声の波形を検索するよう、検索部５に指示する。
【００４９】
検索部５は、この指示に応答して波形データベース７を検索し、表音文字列に含まれるそれぞれの表音文字が表す単位音声の波形を表す圧縮波形データを索出する。そして、索出された圧縮波形データを伸長部６へと供給する。
【００５０】
伸長部６は、検索部５より供給された圧縮波形データを、圧縮される前の波形データへと復元し、検索部５へと返送する。検索部５は、伸長部６より返送された波形データを、検索結果として音響処理部４へと供給する。
音響処理部４は、検索部５より供給された波形データを、言語処理部１より供給された表音文字列内での各表音文字の並びに従った順序で、音片編集部８へと供給する。
【００５１】
音片編集部８は、音響処理部４より波形データを供給されると、この波形データを、供給された順序で互いに結合し、合成音声を表すデータ（合成音声データ）として出力する。フリーテキストデータに基づいて合成されたこの合成音声は、規則合成方式の手法により合成された音声に相当する。
【００５２】
なお、音片編集部８が合成音声データを出力する手法は任意であり、例えば、図示しないＤ／Ａ（Digital-to-Analog）変換器やスピーカを介して、この合成音声データが表す合成音声を再生するようにしてもよい。また、図示しないインターフェース回路を介して外部の装置やネットワークに送出してもよいし、図示しない記録媒体ドライブ装置にセットされた記録媒体へ、この記録媒体ドライブ装置を介して書き込んでもよい。また、音片編集部８の機能を行っているプロセッサが、自ら実行している他の処理へと、合成音声データを引き渡すようにしてもよい。
【００５３】
次に、音響処理部４が、外部より配信された、表音文字列を表すデータ（配信文字列データ）を取得したとする。（なお、音響処理部４が配信文字列データを取得する手法も任意であり、例えば、言語処理部１がフリーテキストデータを取得する手法と同様の手法で配信文字列データを取得すればよい。）
【００５４】
この場合、音響処理部４は、配信文字列データが表す表音文字列を、言語処理部１より供給された表音文字列と同様に扱う。この結果、配信文字列データが表す表音文字列に含まれる表音文字に対応する圧縮波形データが検索部５により索出され、圧縮される前の波形データが伸長部６により復元される。復元された各波形データは音響処理部４を介して音片編集部８へと供給され、音片編集部８が、この波形データを、配信文字列データが表す表音文字列内での各表音文字の並びに従った順序で互いに結合し、合成音声データとして出力する。配信文字列データに基づいて合成されたこの合成音声データも、規則合成方式の手法により合成された音声を表す。
【００５５】
次に、音片編集部８が、定型メッセージデータを取得したとする。なお、定型メッセージデータは、定型メッセージを表音文字列として表すデータである。
音片編集部８が定型メッセージデータを取得する手法は任意であり、例えば、言語処理部１がフリーテキストデータを取得する手法と同様の手法で定型メッセージデータを取得すればよい。
【００５６】
定型メッセージデータが音片編集部８に供給されると、音片編集部８は、定型メッセージに含まれる音片の読みを表す表音文字に合致する表音文字が対応付けられている圧縮音片データをすべて索出するよう、検索部９に指示する。
【００５７】
検索部９は、音片編集部８の指示に応答して音片データベース１０を検索し、該当する圧縮音片データと、該当する圧縮音片データに対応付けられている上述の音片読みデータ、スピード初期値データ及びピッチ成分データとを索出し、索出された圧縮音片データを伸長部６へと供給する。１個の音片につき複数の圧縮音片データが該当する場合も、該当する圧縮音片データすべてが、音声合成に用いられるデータの候補として索出される。一方、圧縮音片データを索出できなかった音片があった場合、検索部９は、該当する音片を識別するデータ（以下、欠落部分識別データと呼ぶ）を生成する。
【００５８】
伸長部６は、検索部９より供給された圧縮音片データを、圧縮される前の音片データへと復元し、検索部９へと返送する。検索部９は、伸長部６より返送された音片データと、索出された音片読みデータ、スピード初期値データ及びピッチ成分データとを、検索結果として話速変換部１１へと供給する。また、欠落部分識別データを生成した場合は、この欠落部分識別データも話速変換部１１へと供給する。
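[0059]
Meanwhile, when the sound piece editing unit 8 is supplied by the speed detection unit 12 with data representing the moving speed of the moving body, it determines the utterance speed of the fixed-form message (the length of time taken to utter the message) based on the moving speed that the data represents. The sound piece editing unit 8 then instructs the speech speed conversion unit 11 to convert the sound piece data supplied to it so that the time length of the sound piece each item represents matches the determined utterance speed (or is within a certain degree of closeness to it).
[0060]
The correspondence between the moving speed of the moving body and the utterance speed is arbitrary; for example, the sound piece editing unit 8 may determine the utterance speed so that it becomes faster as the moving speed of the moving body increases.
[0061]
In response to the instruction from the sound piece editing unit 8, the speech speed conversion unit 11 converts the sound piece data supplied by the search unit 9 so that it conforms to the instruction, and supplies the result to the sound piece editing unit 8. Specifically, for example, it may identify the original time length of the sound piece data supplied by the search unit 9 from the retrieved speed initial value data, and then resample the sound piece data so that its number of samples corresponds to a time length matching the speed instructed by the sound piece editing unit 8.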
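As a rough illustration of paragraphs 0059 to 0061, the following Python sketch maps a vehicle speed to an utterance rate and then resamples a sound piece to the corresponding number of samples. The mapping in `utterance_rate` (its constants and the km/h unit) is purely hypothetical, since the text leaves the correspondence open, and linear interpolation is just one simple way to resample.

```python
def utterance_rate(speed_kmh, base=1.0, gain=0.005, max_rate=1.5):
    """Hypothetical mapping: a higher vehicle speed gives a faster utterance rate."""
    return min(max_rate, base + gain * max(0.0, speed_kmh))


def resample_to_length(samples, target_len):
    """Linearly interpolate `samples` onto `target_len` evenly spaced points."""
    if target_len <= 0 or not samples:
        return []
    if target_len == 1 or len(samples) == 1:
        return [samples[0]] * target_len
    step = (len(samples) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), len(samples) - 2)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out


def convert_speed(samples, rate):
    """Shorten (rate > 1) or lengthen (rate < 1) a sound piece by resampling."""
    target_len = max(1, round(len(samples) / rate))
    return resample_to_length(samples, target_len)
```

Note that plain resampling, when the result is played back at the original sampling rate, also shifts the pitch; the sketch makes no attempt to compensate for that.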
【００６２】
また、話速変換部１１は、検索部９より供給された音片読みデータ及びピッチ成分データも音片編集部８に供給し、欠落部分識別データを検索部９より供給された場合は、更にこの欠落部分識別データも音片編集部８に供給する。
【００６３】
なお、発声スピードデータが音片編集部８に供給されていない場合、音片編集部８は、話速変換部１１に対し、話速変換部１１に供給された音片データを変換せずに音片編集部８に供給するよう指示すればよく、話速変換部１１は、この指示に応答し、検索部９より供給された音片データをそのまま音片編集部８に供給すればよい。
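[0064]
When the sound piece editing unit 8 is supplied with the sound piece data, sound piece reading data, and pitch component data by the speech speed conversion unit 11, it selects from the supplied sound piece data, one item per sound piece, the sound piece data representing the waveform that best approximates the waveform of each sound piece making up the fixed-form message.
[0065]
The criterion for selecting sound piece data is arbitrary. For example, the sound piece editing unit 8 may perform prosody prediction on the fixed-form message and then, for each sound piece in the message, select from the sound piece data supplied by the speech speed conversion unit 11 the one item whose pitch component's change over time shows the highest correlation with the prosody prediction result.
[0066]
Specifically, the sound piece editing unit 8 first analyzes the fixed-form message represented by the fixed-form message data using a prosody prediction method such as the Fujisaki model or ToBI (Tone and Break Indices), thereby predicting the change over time of the pitch-component frequency of each sound piece in the message, and identifies a function representing the prediction result. The sound piece editing unit 8 also identifies, from the pitch component data supplied by the speech speed conversion unit 11, a function representing the change over time of the pitch-component frequency of the sound piece data supplied by the speech speed conversion unit 11.
[0067]
Then, for each sound piece in the fixed-form message, the sound piece editing unit 8 computes the correlation coefficient between the function representing the predicted change over time of the pitch-component frequency of that sound piece and the function representing the change over time of the pitch-component frequency of each item of sound piece data representing the waveform of a sound piece with a matching reading, and selects the sound piece data giving the highest correlation coefficient.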
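The correlation-based selection can be sketched as follows. This is a minimal illustration that assumes each pitch contour has already been sampled at the same number of time points and that the Pearson coefficient is the intended "correlation coefficient"; the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("contours must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat contour carries no shape information
    return cov / sqrt(var_x * var_y)


def select_best(predicted, candidates):
    """Return the index of the candidate contour that correlates best with the prediction."""
    return max(range(len(candidates)), key=lambda i: pearson(predicted, candidates[i]))
```

For example, with a predicted contour `[100, 110, 120, 115]`, a candidate that rises and then dips in the same way is preferred over a falling or flat one, regardless of its absolute pitch.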
【００６８】
一方、音片編集部８は、話速変換部１１より欠落部分識別データも供給されている場合には、欠落部分識別データが示す音片の読みを表す表音文字列を定型メッセージデータより抽出して音響処理部４に供給し、この音片の波形を合成するよう指示する。
【００６９】
指示を受けた音響処理部４は、音片編集部８より供給された表音文字列を、配信文字列データが表す表音文字列と同様に扱う。この結果、この表音文字列に含まれる表音文字が示す音声の波形を表す圧縮波形データが検索部５により索出され、この圧縮波形データが伸長部６により元の波形データへと復元され、検索部５を介して音響処理部４へと供給される。音響処理部４は、この波形データを音片編集部８へと供給する。
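[0070]
When the waveform data is returned from the acoustic processing unit 4, the sound piece editing unit 8 joins this waveform data and the sound piece data it selected from among the data supplied by the speech speed conversion unit 11, in the order in which the sound pieces appear in the fixed-form message indicated by the fixed-form message data, and outputs the result as data representing the synthesized speech.
[0071]
If the data supplied by the speech speed conversion unit 11 contains no missing-part identification data, the sound piece editing unit 8 may, without instructing the acoustic processing unit 4 to synthesize any waveform, immediately join the selected sound piece data in the order in which the sound pieces appear in the fixed-form message indicated by the fixed-form message data and output the result as data representing the synthesized speech.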
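The joining step, including the fallback to rule-based synthesis for missing sound pieces, might look like the sketch below. The `synthesize` callback stands in for the acoustic processing unit 4, and the dictionary layout is an assumption made for illustration.

```python
def assemble_message(readings, selected, synthesize):
    """Join sound pieces in message order.

    readings   -- readings of the sound pieces, in the order they appear in the message
    selected   -- dict mapping a reading to its selected sample list (missing pieces absent)
    synthesize -- hypothetical fallback returning rule-synthesized samples for a reading
    """
    output = []
    for reading in readings:
        samples = selected.get(reading)
        if samples is None:
            # Missing part: no recorded sound piece was found, so synthesize it instead.
            samples = synthesize(reading)
        output.extend(samples)
    return output
```

When every reading is present in `selected`, the fallback is never called, which corresponds to the case with no missing-part identification data.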
【００７２】
以上説明した、この音声合成システムでは、合成音声の発声スピードは、この音声合成システムが搭載されている移動体の移動速度に応じて変化する。従って、例えばこの音声合成システムをカーナビゲーション装置におけるナビゲーション音声の発生に用いた場合に、車両の移動速度が大きいほど音片編集部８が音片データの発声スピードを速くするよう決定することによって、車両の走行状況に応じた適正な速さでのナビゲーション音声が得られる。例えば、車両が高速で交差点に近づいている場合、車両のスピードに応じた話速で必要な情報を発声することにより、搭乗者は、交差点に進入する前に必要な情報を聴取することができる。また、その他、移動体の速度が変化しても聴き取りやすい合成音声を容易に得ることができる。
【００７３】
なお、この音声合成システムの構成は上述のものに限られない。
例えば、波形データや音片データはＰＣＭ形式のデータである必要はなく、データ形式は任意である。
また、波形データベース７や音片データベース１０は波形データや音片データを必ずしもデータ圧縮された状態で記憶している必要はない。波形データベース７や音片データベース１０が波形データや音片データをデータ圧縮されていない状態で記憶している場合、本体ユニットＭは伸長部６を備えている必要はない。
【００７４】
また、音片データベース作成部１４は、図示しない記録媒体ドライブ装置にセットされた記録媒体から、この記録媒体ドライブ装置を介して、音片データベース１０に追加する新たな圧縮音片データの材料となる音片データや表音文字列を読み取ってもよい。
また、音片登録ユニットＲは、必ずしも収録音片データセット記憶部１３を備えている必要はない。
【００７５】
また、音片データベース作成部１４は、マイクロフォン、増幅器、サンプリング回路、Ａ／Ｄ（Analog-to-Digital）コンバータ及びＰＣＭエンコーダなどを備えていてもよい。この場合、音片データベース作成部１４は、収録音片データセット記憶部１３より音片データを取得する代わりに、自己のマイクロフォンが集音した音声を表す音声信号を増幅し、サンプリングしてＡ／Ｄ変換した後、サンプリングされた音声信号にＰＣＭ変調を施すことにより、音片データを作成してもよい。
【００７６】
また、ピッチ成分データは音片データが表す音片のピッチ長の時間変化を表すデータであってもよい。この場合、音片編集部８は、音片のピッチ長の時間変化を韻律予測を行うことにより予測し、予測結果と、この音片と読みが合致する音片の波形を表す音片データのピッチ長の時間変化を表すピッチ成分データとの相関を求めるようにすればよい。
【００７７】
また、音片編集部８は、例えば、言語処理部１と共にフリーテキストデータを取得し、このフリーテキストデータが表すフリーテキストに含まれる音片の波形に近い波形を表す音片データを、定型メッセージに含まれる音片の波形に近い波形を表す音片データを選択する処理と実質的に同一の処理を行うことによって選択して、音声の合成に用いてもよい。
この場合、音響処理部４は、音片編集部８が選択した音片データが表す音片については、この音片の波形を表す波形データを検索部５に索出させなくてもよい。なお、音片編集部８は、音響処理部４が合成しなくてよい音片を音響処理部４に通知し、音響処理部４はこの通知に応答して、この音片を構成する単位音声の波形の検索を中止するようにすればよい。
【００７８】
また、音片編集部８は、例えば、音響処理部４と共に配信文字列データを取得し、この配信文字列データが表す配信文字列に含まれる音片の波形に近い波形を表す音片データを、定型メッセージに含まれる音片の波形に近い波形を表す音片データを選択する処理と実質的に同一の処理を行うことによって選択して、音声の合成に用いてもよい。この場合、音響処理部４は、音片編集部８が選択した音片データが表す音片については、この音片の波形を表す波形データを検索部５に索出させなくてもよい。
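[0079]
The sound piece editing unit 8 may also supply the waveform data returned from the acoustic processing unit 4 to the speech speed conversion unit 11 so that the time length of the waveform that the waveform data represents matches the speed indicated by the utterance speed data (or is within a certain degree of closeness to that speed). In this way, the utterance speed of the speech that the acoustic processing unit 4 synthesizes by the rule-based synthesis method also changes according to the moving speed of the moving body.
[0080]
This speech synthesis system may also include a plurality of sound piece databases 10, in which case the sound piece databases 10 may be associated with different, mutually non-overlapping ranges of utterance speed. In this case, for example, the sound piece databases 10 share a common set of readings for the compressed sound piece data they store, while the compressed sound piece data for a given reading in a sound piece database 10 associated with one range of utterance speeds represents speech read aloud at a lower (slower) utterance speed than the compressed sound piece data for the same reading in a sound piece database 10 associated with a higher (faster) range.
[0081]
When the speech synthesis system includes a plurality of sound piece databases 10 in this way, the sound piece editing unit 8 may, for example, decide which sound piece database 10 to use based on the determined utterance speed and supply data indicating that database to the search unit 9. The search unit 9 may then retrieve compressed sound piece data from the sound piece database 10 indicated by this data.
[0082]
Instead of resampling the sound piece data, the speech speed conversion unit 11 may identify the portions of the sound piece data that represent a substantially silent state and adjust the time length of those portions so that the time length of the sound piece represented by the sound piece data matches the speed indicated by the utterance speed data.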
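A minimal sketch of the silence-based alternative follows. The amplitude threshold, the minimum run length, and the choice to replace each near-silent run with exact zeros of the new length are all assumptions; the text only says that substantially silent portions are identified and their time length adjusted.

```python
def silent_runs(samples, threshold, min_len):
    """Return (start, end) index pairs of runs where |sample| <= threshold."""
    runs, start = [], None
    for i, sample in enumerate(samples):
        if abs(sample) <= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and len(samples) - start >= min_len:
        runs.append((start, len(samples)))
    return runs


def scale_silence(samples, factor, threshold=0.01, min_len=4):
    """Shorten (factor < 1) or lengthen (factor > 1) near-silent runs only."""
    out, pos = [], 0
    for start, end in silent_runs(samples, threshold, min_len):
        out.extend(samples[pos:start])                       # voiced part is kept unchanged
        out.extend([0.0] * round((end - start) * factor))    # silence with adjusted length
        pos = end
    out.extend(samples[pos:])
    return out
```

Unlike resampling, this approach leaves the voiced portions untouched, so their pitch is not altered; the trade-off is that the achievable change in total duration is limited by how much silence the sound piece contains.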
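[0083]
The physical quantity that this speech synthesis system detects about external conditions need not be the speed of the moving body; it may be any other physical quantity.
[0084]
Accordingly, the speed detection unit 12 may, for example, comprise an acceleration sensor or the like; it may detect the acceleration of the moving body on which the speech synthesis system is mounted, generate data indicating the detected acceleration, and supply it to the sound piece editing unit 8. It may further include an integrating circuit or the like for integrating the detected acceleration, in which case it may generate data indicating the result of integrating the detected acceleration and supply this data to the sound piece editing unit 8.
[0085]
The speed detection unit 12 may also detect peaks of the acceleration of the moving body on which the speech synthesis system is mounted and, each time a peak is detected, generate data indicating the value of the latest detected peak and supply it to the sound piece editing unit 8.
Meanwhile, each time it is supplied by the speed detection unit 12 with data indicating the value of an acceleration peak, the sound piece editing unit 8 may classify the peak value indicated by the data into one of a plurality of predetermined ranks, identify how frequently peak values have been classified in the past into the same rank as the one into which the latest peak value was classified, and determine the utterance speed according to the identified result.
[0086]
In this case, the speed detection unit 12 may include, for example, an A/D (analog-to-digital) converter and logic circuits for converting the signal generated by the acceleration sensor into digital data and detecting peaks.
[0087]
Also in this case, the sound piece editing unit 8 may further include a nonvolatile memory such as a PROM, store in advance a table whose data structure is shown in FIG. 3(a), and determine the utterance speed by referring to this table. As illustrated, the table may store utterance speeds in association with the rank to which a detected acceleration peak belongs and the frequency with which peaks belonging to that rank have been detected.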
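The rank-and-frequency lookup can be sketched as below. The contents of the FIG. 3(a) table are not reproduced in the text, so the rank thresholds, the frequency bins, and every value in `SPEED_TABLE` are hypothetical placeholders, as is the choice to express the utterance speed as a rate multiplier.

```python
from bisect import bisect_right
from collections import Counter

RANK_BOUNDS = [1.0, 2.5, 4.0]   # hypothetical thresholds (m/s^2) separating ranks 0..3
FREQ_BOUNDS = [0.1, 0.5]        # hypothetical bins: 0 = rare, 1 = occasional, 2 = frequent
# Hypothetical stand-in for the FIG. 3(a) table: SPEED_TABLE[rank][frequency_bin] -> rate
SPEED_TABLE = [
    [1.00, 1.00, 1.00],
    [1.10, 1.05, 1.00],
    [1.20, 1.10, 1.05],
    [1.30, 1.20, 1.10],
]


class PeakHistory:
    """Keeps per-rank counts of past acceleration peaks."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, peak):
        """Classify the latest peak, look up a rate from past frequency, then record the peak."""
        rank = bisect_right(RANK_BOUNDS, abs(peak))
        freq = self.counts[rank] / self.total if self.total else 0.0
        rate = SPEED_TABLE[rank][bisect_right(FREQ_BOUNDS, freq)]
        self.counts[rank] += 1
        self.total += 1
        return rank, freq, rate
```

In this sketch the frequency is the share of all earlier peaks that fell into the same rank; the text does not say how the frequency is measured, so a per-time-window count would be an equally valid reading.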
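[0088]
When this speech synthesis system is mounted and used on an automobile, for example, it may further include a brake sensor that detects peaks of the amount by which the automobile's brake is depressed and supplies data representing the detection result to the sound piece editing unit 8. It may also further include a steering sensor that detects peaks of the angular velocity of the automobile's steering wheel and supplies data representing the detection result to the sound piece editing unit 8.
[0089]
In this case, each time the sound piece editing unit 8 is supplied with data indicating the peak value of the brake depression amount or data indicating the peak value of the steering-wheel angular velocity, it classifies the peak value indicated by the data into one of a plurality of ranks defined separately for the brake depression amount and for the steering-wheel angular velocity, and identifies how frequently peak values have been classified in the past into the same rank as the one into which the latest peak value was classified.
[0090]
The sound piece editing unit 8 then obtains the value α given by the right-hand side of Formula 1 from the peak value pa of the automobile's acceleration, the peak value pb of the brake depression amount, and the peak value pω of the steering-wheel angular velocity.
[0091]
[Formula 1]
α = (W_A1 · pa) + (W_A2 · pb) + (W_A3 · pω)
(where W_A1, W_A2, and W_A3 are predetermined coefficients)
[0092]
The sound piece editing unit 8 also obtains the value β given by the right-hand side of Formula 2 from a value fa indicating how frequently peak values have been classified in the past into the same rank as the one into which the latest peak value of the automobile's acceleration was classified, a value fb indicating how frequently peak values have been classified in the past into the same rank as the one into which the latest peak value of the brake depression amount was classified, and a value fω indicating how frequently peak values have been classified in the past into the same rank as the one into which the peak value of the steering-wheel angular velocity was classified.
[0093]
[Formula 2]
β = (W_B1 · fa) + (W_B2 · fb) + (W_B3 · fω)
(where W_B1, W_B2, and W_B3 are predetermined coefficients)
[0094]
In this case, the sound piece editing unit 8 may, for example, store in advance a table such as the one whose data structure is shown in FIG. 3(b), which stores utterance speeds in association with values of α and β, and determine the utterance speed by referring to this table.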
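Formulas 1 and 2 and the subsequent table lookup can be combined in a short sketch. The coefficient values, the bin boundaries for α and β, and the table contents are all hypothetical, because the text specifies only that the coefficients are predetermined and that a FIG. 3(b)-style table maps α and β to an utterance speed.

```python
from bisect import bisect_right

W_A = (0.5, 0.3, 0.2)       # hypothetical W_A1, W_A2, W_A3
W_B = (0.4, 0.4, 0.2)       # hypothetical W_B1, W_B2, W_B3
ALPHA_BOUNDS = [1.0, 2.0]   # hypothetical bin edges for alpha
BETA_BOUNDS = [0.3, 0.6]    # hypothetical bin edges for beta
# Hypothetical stand-in for the FIG. 3(b) table: RATE_TABLE[alpha_bin][beta_bin] -> rate
RATE_TABLE = [
    [1.00, 1.00, 1.00],
    [1.15, 1.10, 1.05],
    [1.30, 1.20, 1.10],
]


def weighted_sum(weights, values):
    """Right-hand side of Formula 1 or Formula 2."""
    return sum(w * v for w, v in zip(weights, values))


def rate_from_driving_state(peaks, freqs):
    """peaks = (pa, pb, p_omega), freqs = (fa, fb, f_omega); returns (alpha, beta, rate)."""
    alpha = weighted_sum(W_A, peaks)
    beta = weighted_sum(W_B, freqs)
    row = bisect_right(ALPHA_BOUNDS, alpha)
    col = bisect_right(BETA_BOUNDS, beta)
    return alpha, beta, RATE_TABLE[row][col]
```

With peaks `(2.0, 1.0, 0.5)` the weighted sum gives α of about 1.4, and with all three frequencies at 0.5 it gives β of about 0.5, so the lookup lands in the middle cell of the placeholder table.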
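[0095]
This speech synthesis system may also vary the utterance speed of the speech based on the current time. In this case, for example, the sound piece editing unit 8 may include a timer comprising a crystal oscillator or the like, continuously obtain data indicating the current date and time from this timer, and determine the utterance speed of the sound piece data based on the obtained data.
[0096]
What this speech synthesis system varies in response to the detected physical quantity concerning external conditions need not be the utterance speed of the speech; it may be any other element that characterizes the speech.
[0097]
Accordingly, in this speech synthesis system, the sound piece editing unit 8 may, for example, vary the amplitude of the sound piece data supplied by the search unit 9 or of the waveform data supplied from the acoustic processing unit 4.
[0098]
This speech synthesis system may also include, for example, a microphone and a level detection circuit in order to detect the noise level inside or outside the moving body and supply data representing the detection result to the sound piece editing unit 8. In this case, the sound piece editing unit 8 may, for example, determine the amplitude of the synthesized speech based on the noise level represented by this data and convert the sound piece data and waveform data to match the determined amplitude. With this configuration, the speech synthesis system can keep the synthesized speech easy to hear even when the surrounding noise is loud, for example by increasing the amplitude of the synthesized speech as the noise level rises.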
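The noise-dependent amplitude control could be sketched as follows. The gain rule (half a decibel of gain per decibel of noise above a 40 dB reference, capped at 12 dB) is an invented example, as is the assumption that samples are floats in the range -1.0 to 1.0.

```python
def gain_for_noise(noise_db, quiet_db=40.0, db_per_db=0.5, max_gain_db=12.0):
    """Hypothetical rule: add gain in proportion to the noise above quiet_db, up to a cap."""
    gain_db = min(max_gain_db, max(0.0, (noise_db - quiet_db) * db_per_db))
    return 10.0 ** (gain_db / 20.0)   # convert decibels to a linear amplitude factor


def apply_gain(samples, gain, limit=1.0):
    """Scale the samples and clip them to [-limit, limit] to avoid overflow."""
    return [max(-limit, min(limit, sample * gain)) for sample in samples]
```

Clipping is the simplest way to keep the output in range; a real system would more likely use a limiter or compressor to avoid audible distortion at high gain.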
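[0099]
This speech synthesis system may also include, for example, a microphone and a Fourier transform device in order to detect the band occupied by noise inside or outside the moving body and supply data representing the detection result to the sound piece editing unit 8. In this case, the sound piece editing unit 8 may, for example, set the noise's occupied band represented by this data as the attenuation band of the synthesized speech (or otherwise determine the attenuation band of the synthesized speech based on the noise's occupied band) and remove the spectral components within the determined attenuation band from the sound piece data and waveform data. With this configuration, the speech synthesis system can keep the synthesized speech easy to hear even when the surrounding noise is loud, for example by avoiding overlap between the band occupied by the noise and the band occupied by the synthesized speech.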
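Removing the spectral components inside an attenuation band can be illustrated with a direct discrete Fourier transform. This sketch uses a naive O(n²) transform to stay dependency-free and zeroes whole bins, which is an assumption about how "removal" is carried out; a practical system would use an FFT and probably a smoother filter.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n) for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def remove_band(samples, sample_rate, f_lo, f_hi):
    """Zero every spectral bin whose frequency lies in [f_lo, f_hi] and resynthesize."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n   # bins above n/2 mirror the lower ones
        if f_lo <= freq <= f_hi:
            spectrum[k] = 0
    return [value.real for value in dft(spectrum, inverse=True)]
```

For a 32-sample signal at a sampling rate of 32 Hz that contains a 3 Hz and a 10 Hz sine, `remove_band(signal, 32, 8, 12)` leaves essentially only the 3 Hz component.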
【０１００】
また、この音声合成システムでは、例えば、音片編集部８が音片データや波形データの声質を変化させるようにしてもよい。
具体的には、例えば音片編集部８は、音片データを、この音片データが表す音片のピッチ成分（基本周波数成分）及び高調波成分の時間変化を表すサブバンドデータへと変換し、得られたサブバンドデータを更に変換し、音片の波形を表すデータへと戻す。ただし、サブバンドデータを波形を表すデータへと変換する際、音片編集部８は、このサブバンドデータが表わすそれぞれの成分を、元来表している周波数と異なる周波数（例えば、元来の周波数の２倍の周波数）の成分の時間変化を表すものとして解釈して変換を行う。
【０１０１】
このような構成を有していれば、この音声合成システムは、例えばカーナビゲーション用の合成音声として、夜間には音声のピッチを昼間より高めにして眠気を誘いにくい合成音声を生成する、等することにより自動車を運転する時間帯に適した合成音声を得ることもできる。
【０１０２】
以上、この発明の実施の形態を説明したが、この発明にかかる話速変換装置は、専用のシステムによらず、通常のコンピュータシステムを用いて実現可能である。
例えば、速度センサ（又はその他任意の物理量を検出するためのセンサ）が接続されたパーソナルコンピュータに上述の言語処理部１、一般単語辞書２、ユーザ単語辞書３、音響処理部４、検索部５、伸長部６、波形データベース７、音片編集部８、検索部９、音片データベース１０及び話速変換部１１の動作を実行させるためのプログラムを格納した媒体（ＣＤ−ＲＯＭ、ＭＯ、フロッピー（登録商標）ディスク等）から該プログラムをインストールすることにより、上述の処理を実行する本体ユニットＭを構成することができる。
また、パーソナルコンピュータに上述の収録音片データセット記憶部１３、音片データベース作成部１４及び圧縮部１５の動作を実行させるためのプログラムを格納した媒体から該プログラムをインストールすることにより、上述の処理を実行する音片登録ユニットＲを構成することができる。
【０１０３】
そして、これらのプログラムを実行し本体ユニットＭや音片登録ユニットＲとして機能するパーソナルコンピュータが、図１の音声合成システムの動作に相当する処理として、例えば、図４〜図６に示す処理を行うものとする。
図４は、このパーソナルコンピュータがフリーテキストデータを取得した場合の処理を示すフローチャートである。
図５は、このパーソナルコンピュータが配信文字列データを取得した場合の処理を示すフローチャートである。
図６は、このパーソナルコンピュータが定型メッセージデータ及び発声スピードデータを取得した場合の処理を示すフローチャートである。
【０１０４】
すなわち、このパーソナルコンピュータが、外部より、上述のフリーテキストデータを取得すると（図４、ステップＳ１０１）、このフリーテキストデータが表すフリーテキストに含まれるそれぞれの表意文字について、その読みを表す表音文字を、一般単語辞書２やユーザ単語辞書３を検索することにより特定し、この表意文字を、特定した表音文字へと置換する（ステップＳ１０２）。なお、このパーソナルコンピュータがフリーテキストデータを取得する手法は任意である。
【０１０５】
そして、このパーソナルコンピュータは、フリーテキスト内の表意文字をすべて表音文字へと置換した結果を表す表音文字列が得られると、この表音文字列に含まれるそれぞれの表音文字について、当該表音文字が表す単位音声の波形を波形データベース７より検索し、表音文字列に含まれるそれぞれの表音文字が表す単位音声の波形を表す圧縮波形データを索出する（ステップＳ１０３）。
【０１０６】
次に、このパーソナルコンピュータは、索出された圧縮波形データを、圧縮される前の波形データへと復元し（ステップＳ１０４）、復元された波形データを、表音文字列内での各表音文字の並びに従った順序で互いに結合し、合成音声データとして出力する（ステップＳ１０５）。なお、このパーソナルコンピュータが合成音声データを出力する手法は任意である。
【０１０７】
また、このパーソナルコンピュータが、外部より、上述の配信文字列データを任意の手法で取得すると（図５、ステップＳ２０１）、この配信文字列データが表す表音文字列に含まれるそれぞれの表音文字について、当該表音文字が表す単位音声の波形を波形データベース７より検索し、表音文字列に含まれるそれぞれの表音文字が表す単位音声の波形を表す圧縮波形データを索出する（ステップＳ２０２）。
【０１０８】
次に、このパーソナルコンピュータは、索出された圧縮波形データを、圧縮される前の波形データへと復元し（ステップＳ２０３）、復元された波形データを、表音文字列内での各表音文字の並びに従った順序で互いに結合し、合成音声データとしてステップＳ１０５の処理と同様の処理により出力する（ステップＳ２０４）。
【０１０９】
一方、このパーソナルコンピュータが、外部より、上述の定型メッセージデータ及び発声スピードデータを任意の手法により取得すると（図６、ステップＳ３０１）、まず、この定型メッセージデータが表す定型メッセージに含まれる音片の読みを表す表音文字に合致する表音文字が対応付けられている圧縮音片データをすべて索出する（ステップＳ３０２）。
【０１１０】
また、ステップＳ３０２では、該当する圧縮音片データに対応付けられている上述の音片読みデータ、スピード初期値データ及びピッチ成分データも索出する。なお、１個の音片につき複数の圧縮音片データが該当する場合は、該当する圧縮音片データすべてを索出する。一方、圧縮音片データを索出できなかった音片があった場合は、上述の欠落部分識別データを生成する。そして、このパーソナルコンピュータは、索出された圧縮音片データを、圧縮される前の音片データへと復元する（ステップＳ３０３）。
【０１１１】
一方、このパーソナルコンピュータは、移動体の移動速度（又はその他任意の物理量）を表すデータを速度センサ（例えば、車両の車速を表す車速パルスを供給する装置：図示せず）等より供給されると、このデータが表す移動速度等に基づいて、定型メッセージの発声スピードを決定する（ステップＳ３０４）。なお、移動体の移動速度等と発声スピードとの対応関係は任意である。
【０１１２】
次に、このパーソナルコンピュータは、ステップＳ３０３で復元された音片データを、上述の音片編集部８が行う処理と同様の処理により変換して、当該音片データが表す音片の時間長を、ステップＳ３０４で決定した発声スピード（又は、決定した発声スピードに一定程度以上近いスピード）にする（ステップＳ３０５）。なお、発声スピードデータが供給されていない場合は、復元された音片データを変換しなくてもよい。
【０１１３】
次に、このパーソナルコンピュータは、音片の時間長が変換された音片データのうちから、定型メッセージを構成する音片の波形に最も近い波形を表す音片データを、上述の音片編集部８が行う処理と同様の処理を行うことにより、音片１個につき１個ずつ選択する（ステップＳ３０６）。
【０１１４】
ステップＳ３０６でこのパーソナルコンピュータは、例えば、定型メッセージデータが表す定型メッセージに韻律予測の手法に基づいた解析を加えることにより、この定型メッセージの韻律を予測し、一方で、索出された音片データのピッチ成分の周波数の時間変化を表す関数をピッチ成分データに基づいて特定して、定型メッセージ内のそれぞれの音片について、この音片のピッチ成分の周波数の時間変化の予測結果を表す関数と、この音片と読みが合致する音片の波形を表す各音片データのピッチ成分の周波数の時間変化を表す関数との相関係数を求め、最も高い相関係数を与えた音片データを選択するようにすればよい。
【０１１５】
一方、このパーソナルコンピュータは、欠落部分識別データを生成した場合、欠落部分識別データが示す音片の読みを表す表音文字列を定型メッセージデータより抽出し、この表音文字列につき、音素毎に、配信文字列データが表す表音文字列と同様に扱って上述のステップＳ２０２〜Ｓ２０３の処理を行うことにより、この表音文字列内の各表音文字が示す音声の波形を表す波形データを復元する（ステップＳ３０７）。
【０１１６】
そして、このパーソナルコンピュータは、復元した波形データと、ステップＳ３０６で選択した音片データとを、定型メッセージデータが示す定型メッセージ内での各音片の並びに従った順序で互いに結合し、合成音声を表すデータとして出力する（ステップＳ３０８）。
【０１１７】
なお、パーソナルコンピュータに本体ユニットＭの機能を行わせるプログラムは、パーソナルコンピュータに複数の音片データベース１０の機能を行わせてもよい。この場合、このパーソナルコンピュータは、ステップＳ３０２の処理を開始するまでにステップＳ３０４の処理を完了するものとし、一方でステップＳ３０２においては、例えば、決定した音声スピードに基づいて、どの音片データベース１０を用いるかを決定し、決定した音片データベース１０より圧縮音片データの索出を行えばよい。そして、ステップＳ３０５の処理は省略するようにすればよい。
【０１１８】
パーソナルコンピュータに複数の音片データベース１０の機能を行わせる場合、各音片データベース１０は、例えば、互い重複しない異なった範囲の発声スピードに対応付けられているものとし、各音片データベース１０が記憶している圧縮音片データの読みの組み合わせは共通しており、一方で、ある範囲の発声スピードに対応付けられている音片データベース１０内のある読みの圧縮音片データは、より高い（速い）発声スピードに対応付けられている音片データベース１０内の同一の読みの圧縮音片データより低い（遅い）発声スピードで読み上げられた音声を表しているものになっているものとする。
【０１１９】
また、パーソナルコンピュータに本体ユニットＭや音片登録ユニットＲの機能を行わせるプログラムは、例えば、通信回線の掲示板（ＢＢＳ）にアップロードし、これを通信回線を介して配信してもよく、また、これらのプログラムを表す信号により搬送波を変調し、得られた変調波を伝送し、この変調波を受信した装置が変調波を復調してこれらのプログラムを復元するようにしてもよい。
そして、これらのプログラムを起動し、ＯＳの制御下に、他のアプリケーションプログラムと同様に実行することにより、上述の処理を実行することができる。
【０１２０】
なお、ＯＳが処理の一部を分担する場合、あるいは、ＯＳが本願発明の１つの構成要素の一部を構成するような場合には、記録媒体には、その部分を除いたプログラムを格納してもよい。この場合も、この発明では、その記録媒体には、コンピュータが実行する各機能又はステップを実行するためのプログラムが格納されているものとする。
【０１２１】
【発明の効果】
以上説明したように、この発明によれば、環境が変化しても聴き取りやすい合成音声を得るための話速変換装置、話速変換方法及びプログラムが実現される。
【図面の簡単な説明】
【図１】この発明の実施の形態に係る音声合成システムの構成を示すブロック図である。
【図２】音片データベースのデータ構造を模式的に示す図である。
【図３】（ａ）は、車両の加速度に基づいて発声スピードを決定するために用いるテーブルのデータ構造を示す図であり、（ｂ）は、自動車の加速度、ブレーキの踏み込みの量、及びハンドルの角速度に基づいて発声スピードを決定するために用いるテーブルのデータ構造を示す図である。
【図４】この発明の実施の形態に係る音声合成システムの機能を行うパーソナルコンピュータがフリーテキストデータを取得した場合の処理を示すフローチャートである。
【図５】この発明の実施の形態に係る音声合成システムの機能を行うパーソナルコンピュータが配信文字列データを取得した場合の処理を示すフローチャートである。
【図６】この発明の実施の形態に係る音声合成システムの機能を行うパーソナルコンピュータが定型メッセージデータ及び発声スピードデータを取得した場合の処理を示すフローチャートである。
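[Explanation of symbols]
M main unit
1 language processing unit
2 general word dictionary
3 user word dictionary
4 acoustic processing unit
5 search unit
6 decompression unit
7 waveform database
8 sound piece editing unit
9 search unit
10 sound piece database
11 speech speed conversion unit
12 speed detection unit
R sound piece registration unit
13 recorded sound piece data set storage unit
14 sound piece database creation unit
15 compression unit
HDR header section
IDX index section
DIR directory section
DAT data section
[0001]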
[Technical Field of the Invention]
The present invention relates to a speech speed conversion device, a speech speed conversion method, and a program.
[0002]
[Prior art]
One technique for synthesizing speech is the recording-and-editing method. It is used in station voice guidance systems, in-vehicle navigation devices, and the like.
The recording-and-editing method associates each word with voice data representing a voice reading that word aloud, divides the sentence to be synthesized into words, and then obtains the voice data associated with those words and joins them together (see, for example, Patent Document 1).
[0003]
[Patent Document 1]
JP 10-49193 A
[0004]
[Problems to be solved by the invention]
However, when voice data is simply joined together, the utterance speed of the synthesized speech (the length of time taken to utter it) is fixed by the utterance speed of the voice data prepared in advance. On the other hand, human hearing is strongly affected by environmental conditions such as where the listener is, how fast the listener is moving, and the surroundings, and, inside a vehicle, by the vehicle's location, its speed, the conditions around it, and the conditions inside it. When these factors change, the same speech can therefore sound very different.
[0005]
The present invention has been made in view of the above circumstances, and an object thereof is to provide a speech speed conversion device, a speech speed conversion method, and a program for obtaining a synthesized speech that can be easily heard even when the environment changes.
[0006]
[Means for Solving the Problems]
  To achieve the above object, a speech speed conversion device according to a first aspect of the present invention comprises:
  voice data acquisition means for acquiring voice data representing the waveform of a voice;
  speech speed setting data generation means for detecting the speed or acceleration of a moving body and generating, based on the detected speed or acceleration, speech speed setting data that specifies the speed of the voice; and
  voice data conversion means for converting the speed of the acquired voice data based on the speed represented by the generated speech speed setting data,
  wherein the speech speed setting data generation means detects peaks of the acceleration of the moving body, classifies the latest detected peak into one of a plurality of predetermined ranks and stores it, identifies how frequently previously detected peaks were classified into the same rank as the one into which the latest peak was classified, and generates the speech speed setting data based on the identified result.
[0007]
  The voice data conversion means may perform a conversion that resamples the acquired voice data so that the speed of the voice represented by the converted voice data becomes the speed represented by the speech speed setting data.
[0008]
  The voice data conversion means may identify the portions of the waveform represented by the acquired voice data that represent a substantially silent state and perform a conversion that changes the time length of the identified portions, so that the speed of the voice represented by the voice data becomes a speed based on the speed represented by the speech speed setting data.
[0010]
  A speech speed conversion device according to a second aspect of the present invention comprises:
  voice data storage means for storing a plurality of items of voice data representing the waveforms of a plurality of voices that utter a phrase with the same reading at mutually different speeds;
  speech speed setting data generation means for detecting the speed or acceleration of a moving body and generating, based on the detected speed or acceleration, speech speed setting data that specifies the speed of a voice; and
  voice data selection means for selecting, from the voice data stored in the voice data storage means, the voice data whose speed is closest to the speed represented by the generated speech speed setting data,
  wherein the speech speed setting data generation means detects peaks of the acceleration of the moving body, classifies the latest detected peak into one of a plurality of predetermined ranks and stores it in the voice data storage means, identifies how frequently previously detected peaks were classified into the same rank as the one into which the latest peak was classified, and generates the speech speed setting data based on the identified result.
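The selection step of the second aspect, choosing the stored voice data whose speed is closest to the requested one, reduces to a nearest-value search. The sketch below assumes each stored variant is a `(speed, voice_data)` pair with the speed expressed in the same unit as the target; both the pair layout and the function name are illustrative.

```python
def select_nearest(variants, target_speed):
    """Return the (speed, voice_data) pair whose speed is closest to target_speed."""
    return min(variants, key=lambda variant: abs(variant[0] - target_speed))
```

For instance, given variants recorded at relative speeds 0.8, 1.0, and 1.3, a target of 1.2 selects the 1.3 recording. Because no conversion is applied afterward, the output speed only approximates the target, which is the trade-off for avoiding any signal processing of the recorded voice.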
[0013]
  A speech speed conversion method according to a third aspect of the present invention is a speech speed conversion method executed by a speech speed conversion device having a storage unit, the method comprising:
  a voice data acquisition step of acquiring voice data representing the waveform of a voice;
  a speech speed setting data generation step of detecting the speed or acceleration of a moving body and generating, based on the detected speed or acceleration, speech speed setting data that specifies the speed of a voice; and
  a voice data conversion step of converting the speed of the acquired voice data based on the speed represented by the generated speech speed setting data,
  wherein the speech speed setting data generation step detects peaks of the acceleration of the moving body, classifies the latest detected peak into one of a plurality of predetermined ranks and stores it in the storage unit, identifies how frequently previously detected peaks were classified into the same rank as the one into which the latest peak was classified, and generates the speech speed setting data based on the identified result.
[0014]
  In addition, the speech speed conversion method according to a fourth aspect of the present invention is
a speech speed conversion method executed by a speech speed conversion device having a storage unit,
  wherein the storage unit stores a plurality of audio data representing the waveforms of a plurality of voices that utter words of the same reading at mutually different speeds, the method comprising:
  a speech speed setting data generation step of detecting the speed or acceleration of a moving body and generating, based on the detected speed or acceleration, speech speed setting data that specifies the speed of speech; and
  an audio data selection step of selecting, from among the stored audio data, the audio data whose speed is closest to the speed represented by the generated speech speed setting data,
wherein the speech speed setting data generation step detects peaks in the acceleration of the moving body, classifies the most recently detected peak into one of a plurality of predetermined ranks and stores it in the storage unit, identifies how frequently previously detected peaks were classified into the same rank as the most recent peak, and generates the speech speed setting data based on the identified result.
  The method is characterized by the above.
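The audio data selection step of this aspect amounts to a nearest-value lookup. A minimal sketch follows, assuming each stored recording is tagged with a relative speed value; the `(speed, audio)` tuple layout and the name `select_closest` are illustrative and not taken from the source.

```python
def select_closest(candidates, target_speed):
    """Pick the stored recording whose speed is nearest the target.

    candidates: list of (speed, audio) tuples for the same reading,
    recorded at different speeds (assumed layout).
    """
    if not candidates:
        raise ValueError("no stored audio data for this reading")
    return min(candidates, key=lambda item: abs(item[0] - target_speed))


# Three recordings of the same word at different relative speeds.
stored = [(0.8, "slow take"), (1.0, "normal take"), (1.3, "fast take")]
chosen = select_closest(stored, 1.2)
```

Here `chosen` is the fast take, because 1.3 is closer to the requested 1.2 than 1.0 is.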
[0016]
  In addition, the program according to a fifth aspect of the present invention
  causes a computer equipped with a device for detecting the speed or acceleration of a moving body to function as:
  audio data acquisition means for acquiring audio data representing the waveform of speech;
  speech speed setting data generation means for detecting the speed or acceleration of the moving body and generating, based on the detected speed or acceleration, speech speed setting data that specifies the speed of speech; and
  audio data conversion means for converting the speed of the acquired audio data based on the speed represented by the generated speech speed setting data,
wherein the speech speed setting data generation means detects peaks in the acceleration of the moving body, classifies the most recently detected peak into one of a plurality of predetermined ranks and stores it, identifies how frequently previously detected peaks were classified into the same rank as the most recent peak, and generates the speech speed setting data based on the identified result.
  The program is characterized by the above.
[0017]
  In addition, the program according to a sixth aspect of the present invention
  causes a computer equipped with a device for detecting the speed or acceleration of a moving body to function as:
  voice data storage means for storing a plurality of voice data representing the waveforms of a plurality of voices that utter words of the same reading at mutually different speeds;
  speech speed setting data generation means for detecting the speed or acceleration of the moving body and generating, based on the detected speed or acceleration, speech speed setting data that specifies the speed of speech; and
  voice data selection means for selecting, from among the voice data stored in the voice data storage means, the voice data whose speed is closest to the speed represented by the generated speech speed setting data,
wherein the speech speed setting data generation means detects peaks in the acceleration of the moving body, classifies the most recently detected peak into one of a plurality of predetermined ranks and stores it in the voice data storage means, identifies how frequently previously detected peaks were classified into the same rank as the most recent peak, and generates the speech speed setting data based on the identified result.
  The program is characterized by the above.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings, taking as an example a speech synthesis system that is mounted and used in a moving body such as a vehicle.
FIG. 1 is a diagram showing a configuration of a speech synthesis system according to an embodiment of the present invention. As shown in the figure, this speech synthesis system is composed of a main unit M and a sound piece registration unit R.
[0020]
The main unit M is composed of a language processing unit 1, a general word dictionary 2, a user word dictionary 3, an acoustic processing unit 4, a search unit 5, a decompression unit 6, a waveform database 7, a sound piece editing unit 8, a search unit 9, a sound piece database 10, a speech speed conversion unit 11, and a speed detection unit 12.
[0021]
The language processing unit 1, the acoustic processing unit 4, the search unit 5, the decompression unit 6, the sound piece editing unit 8, the search unit 9, and the speech speed conversion unit 11 are each composed of a processor such as a CPU (Central Processing Unit) or DSP (Digital Signal Processor) and a memory that stores the program to be executed by that processor, and each performs the processing described later.
A single processor may perform some or all of the functions of the language processing unit 1, the acoustic processing unit 4, the search unit 5, the decompression unit 6, the sound piece editing unit 8, the search unit 9, and the speech speed conversion unit 11.
[0022]
The general word dictionary 2 is composed of a nonvolatile memory such as a PROM (Programmable Read Only Memory) or a hard disk device. In the general word dictionary 2, words and the like that include ideographic characters (for example, kanji) and phonograms (for example, kana or phonetic symbols) representing the readings of these words are stored in advance, in association with each other, by the manufacturer of this speech synthesis system or the like.
[0023]
The user word dictionary 3 includes a rewritable nonvolatile memory, such as an EEPROM (Electrically Erasable/Programmable Read Only Memory) or a hard disk device, and a control circuit that controls writing of data to the nonvolatile memory. A processor may perform the function of this control circuit; in particular, a processor that performs some or all of the functions of the language processing unit 1, the acoustic processing unit 4, the search unit 5, the decompression unit 6, the sound piece editing unit 8, the search unit 9, and the speech speed conversion unit 11 may also perform the function of the control circuit of the user word dictionary 3.
The user word dictionary 3 obtains words including ideograms and phonograms representing readings of these words from the outside according to user operations, and stores them in association with each other. It is sufficient that the user word dictionary 3 stores words and the like that are not stored in the general word dictionary 2 and phonograms representing the readings.
[0024]
The waveform database 7 is composed of a nonvolatile memory such as a PROM or a hard disk device. In the waveform database 7, phonetic characters and compressed waveform data, obtained by entropy-encoding the waveform data that represents the waveform of the unit speech each phonetic character represents, are stored in advance in association with each other by the manufacturer of the speech synthesis system. A unit speech is speech short enough to be used in the rule synthesis method; specifically, it is speech divided into units such as phonemes or VCV (Vowel-Consonant-Vowel) syllables. Note that the waveform data before entropy encoding may be, for example, PCM (Pulse Code Modulation) digital format data.
[0025]
The sound piece database 10 is composed of a nonvolatile memory such as a PROM or a hard disk device.
The sound piece database 10 stores, for example, data having a data structure shown in FIG. That is, as shown in the figure, the data stored in the sound piece database 10 is divided into four types: a header part HDR, an index part IDX, a directory part DIR, and a data part DAT.
[0026]
The data storage in the sound piece database 10 is performed in advance by, for example, the manufacturer of the speech synthesis system and / or by the sound piece registration unit R performing an operation described later.
[0027]
The header part HDR stores data for identifying the sound piece database 10 and data indicating the data amounts of the index part IDX, the directory part DIR, and the data part DAT, the data format, the attribution of copyrights, and the like.
[0028]
The data portion DAT stores compressed sound piece data obtained by entropy encoding sound piece data representing a sound piece waveform.
Note that a sound piece refers to a continuous section including one or more phonemes in speech, and usually includes a section for one word or a plurality of words.
The sound piece data before entropy encoding need only be data in the same format as the waveform data before entropy encoding that is used to generate the compressed waveform data described above (for example, PCM digital format data).
[0029]
In the directory part DIR, the following data are stored in association with each other for each piece of compressed sound piece data:
(A) data representing the phonetic characters that indicate the reading of the sound piece represented by this compressed sound piece data (sound piece reading data);
(B) data representing the head address of the storage location where this compressed sound piece data is stored;
(C) data representing the data length of this compressed sound piece data;
(D) data representing the utterance speed (the time length when played back) of the sound piece represented by this compressed sound piece data (speed initial value data); and
(E) data representing the time change of the frequency of the pitch component of this sound piece (pitch component data).
(It is assumed that addresses are assigned to the storage area of the sound piece database 10.)
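The directory record (A) to (E) can be pictured as a small data structure. The following is a minimal sketch: the field names, types, and units, and the modelling of the data part DAT as a byte string, are assumptions made for illustration, since the source specifies only what each item represents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DirectoryEntry:
    reading: str                # (A) sound piece reading data
    head_address: int           # (B) head address of the compressed data
    data_length: int            # (C) length of the compressed data in bytes
    initial_duration: float     # (D) speed initial value: playback time in seconds (assumed unit)
    pitch_contour: List[float]  # (E) pitch component frequency over time in Hz (assumed unit)


def read_compressed(dat: bytes, entry: DirectoryEntry) -> bytes:
    """Slice one compressed sound piece out of the data part using (B) and (C)."""
    return dat[entry.head_address:entry.head_address + entry.data_length]


# Toy data part of 16 bytes holding a 3-byte sound piece at address 4.
dat = bytes(range(16))
entry = DirectoryEntry("サイタマ", 4, 3, 0.5, [120.0, 118.0, 115.0])
payload = read_compressed(dat, entry)
```

Only items (B) and (C) are needed to locate the compressed data; items (D) and (E) are metadata used later for speed conversion and for selecting among candidates.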
[0030]
FIG. 2 illustrates a case in which, as data included in the data part DAT, compressed sound piece data of 1410h bytes representing the waveform of a sound piece whose reading is “Saitama” is stored at a logical position starting at address 001A36A6h. (In this specification and the drawings, a number with “h” appended represents a hexadecimal value.)
[0031]
Of the data sets (A) to (E), at least the data (A) (that is, the sound piece reading data) is stored in the storage area of the sound piece database 10 in a state sorted according to an order determined from the phonetic characters it represents (for example, if the phonetic characters are kana, arranged in descending address order following the order of the Japanese syllabary).
[0032]
The index part IDX stores data for locating the approximate logical position of data in the directory part DIR based on the sound piece reading data. Specifically, for example, if the sound piece reading data represents kana, the index part stores each kana character in association with data indicating the address range in which the sound piece reading data whose first character is that kana character is located.
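The two-stage lookup implied here (index part first, then a search within the indicated range of the sorted directory) can be sketched as below. This is a simplification under assumptions: list positions stand in for addresses, Python's default string ordering stands in for the order of the Japanese syllabary, and the names `build_index` and `find_entry` are hypothetical.

```python
from bisect import bisect_left


def build_index(readings):
    """Map each first character to the half-open range [lo, hi) of
    positions holding readings that start with it.
    `readings` must already be sorted."""
    index = {}
    for pos, reading in enumerate(readings):
        lo, _ = index.get(reading[0], (pos, pos))
        index[reading[0]] = (lo, pos + 1)
    return index


def find_entry(readings, index, target):
    """Return the position of `target` in the directory, or -1 if absent."""
    if not target or target[0] not in index:
        return -1
    lo, hi = index[target[0]]
    pos = bisect_left(readings, target, lo, hi)  # search only the indexed range
    return pos if pos < hi and readings[pos] == target else -1


readings = sorted(["アオモリ", "アキタ", "サイタマ", "サガ", "トウキョウ"])
index = build_index(readings)
position = find_entry(readings, index, "サガ")
```

The index narrows the search to the entries sharing the first character, so only that sub-range is searched.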
[0033]
Note that a single nonvolatile memory may perform some or all of the functions of the general word dictionary 2, the user word dictionary 3, the waveform database 7, and the sound piece database 10.
[0034]
The speed detection unit 12 is configured by a speed sensor, for example. The speed detection unit 12 detects the moving speed of the moving body on which the speech synthesis system is mounted, generates data indicating the detected moving speed, and supplies the data to the sound piece editing unit 8.
[0035]
As shown in FIG. 1, the sound piece registration unit R includes a recorded sound piece data set storage unit 13, a sound piece database creation unit 14, and a compression unit 15. The sound piece registration unit R may be detachably connected to the sound piece database 10. In this case, except when new data is being written to the sound piece database 10, the main unit M may perform the operations described later with the sound piece registration unit R detached from the main unit M.
[0036]
The recorded sound piece data set storage unit 13 is composed of a rewritable nonvolatile memory such as a hard disk device.
In the recorded sound piece data set storage unit 13, phonetic characters representing the readings of sound pieces and sound piece data representing waveforms obtained by collecting the sound of a person actually uttering those sound pieces are stored in advance, in association with each other, by the manufacturer of the speech synthesis system or the like. The sound piece data may be composed of, for example, PCM digital data.
[0037]
The sound piece database creation unit 14 and the compression unit 15 are each composed of a processor such as a CPU and a memory that stores the program to be executed by that processor, and perform the processing described later according to the program.
[0038]
A single processor may perform some or all of the functions of the sound piece database creation unit 14 and the compression unit 15, and a processor that performs some or all of the functions of the language processing unit 1, the acoustic processing unit 4, the search unit 5, the decompression unit 6, the sound piece editing unit 8, the search unit 9, and the speech speed conversion unit 11 may further perform the functions of the sound piece database creation unit 14 and the compression unit 15. Further, the processor that performs the functions of the sound piece database creation unit 14 and the compression unit 15 may also function as the control circuit of the recorded sound piece data set storage unit 13.
[0039]
The sound piece database creation unit 14 reads out the phonetic characters and sound piece data associated with each other from the recorded sound piece data set storage unit 13, and specifies the time change of the frequency of the pitch component and the utterance speed of the speech represented by the sound piece data.
The utterance speed may be specified by, for example, counting the number of samples of the sound piece data.
[0040]
On the other hand, the time change of the frequency of the pitch component may be specified by, for example, performing cepstrum analysis on the sound piece data. Specifically, for example, the waveform represented by the sound piece data is divided into many small segments on the time axis, the intensity of each segment is converted to a value substantially equal to the logarithm of its original value (the base of the logarithm is arbitrary), and the spectrum of the converted segment (that is, the cepstrum) is obtained by a fast Fourier transform (or by any other method that generates data representing the result of Fourier-transforming a discrete variable). The minimum of the frequencies that give a peak of this cepstrum is then specified as the frequency of the pitch component in that segment.
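As an illustration of cepstrum-based pitch estimation for a single segment, the sketch below uses the conventional real cepstrum (a Fourier transform of the log-magnitude spectrum) and takes the largest cepstral peak within an assumed pitch range. This is a common textbook simplification and may differ in detail from the procedure described above; the search range of 120 to 400 Hz, the naive transform, and all function names are assumptions.

```python
import cmath
import math


def dft(values):
    """Naive O(n^2) discrete Fourier transform, adequate for one short frame."""
    n = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(values))
            for k in range(n)]


def real_cepstrum(frame):
    """Transform of the log-magnitude spectrum of one frame."""
    log_spectrum = [math.log(abs(c) + 1e-9) for c in dft(frame)]
    # For a real frame the log-magnitude spectrum is even, so a forward
    # transform differs from the inverse only by the 1/n scale factor.
    n = len(frame)
    return [c.real / n for c in dft(log_spectrum)]


def pitch_frequency(frame, sample_rate, f_min=120.0, f_max=400.0):
    """Estimate the pitch of one frame from its strongest cepstral peak."""
    cep = real_cepstrum(frame)
    q_lo = int(sample_rate / f_max)  # shortest pitch period, in samples
    q_hi = int(sample_rate / f_min)  # longest pitch period, in samples
    best = max(range(q_lo, q_hi + 1), key=lambda q: cep[q])
    return sample_rate / best


# Synthetic frame: an impulse train with a 40-sample period at 8 kHz (200 Hz).
frame = [1.0 if t % 40 == 0 else 0.0 for t in range(400)]
estimate = pitch_frequency(frame, 8000)
```

Running this over successive segments of a sound piece would give one frequency per segment, that is, the kind of time series that the pitch component data records.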
[0041]
A good result can be expected if the sound piece data is first converted into pitch waveform data according to, for example, the method disclosed in Japanese Patent Laid-Open No. 2003-108172, and the time change of the frequency of the pitch component is then specified based on that pitch waveform data. Specifically, a pitch signal is extracted by filtering the sound piece data, the waveform represented by the sound piece data is divided into sections of unit pitch length based on the extracted pitch signal, and, for each section, the phase shift is identified based on the correlation with the pitch signal and the phases of the sections are aligned, thereby converting the sound piece data into a pitch waveform signal. The obtained pitch waveform signal is then handled as sound piece data and subjected to, for example, cepstrum analysis to specify the time change of the frequency of the pitch component.
[0042]
On the other hand, the sound piece database creation unit 14 supplies the sound piece data read from the recorded sound piece data set storage unit 13 to the compression unit 15.
The compression unit 15 entropy-encodes the speech piece data supplied from the speech piece database creation unit 14 to create compressed speech piece data, and returns it to the speech piece database creation unit 14.
[0043]
When the utterance speed and the time change of the pitch component frequency of the sound piece data have been specified, and the sound piece data has been entropy-encoded and returned from the compression unit 15 as compressed sound piece data, the sound piece database creation unit 14 writes the compressed sound piece data into the storage area of the sound piece database 10 as data constituting the data part DAT.
[0044]
Further, the sound piece database creation unit 14 writes the phonetic characters read from the recorded sound piece data set storage unit 13 into the storage area of the sound piece database 10 as sound piece reading data, that is, as data indicating the reading of the sound piece represented by the written compressed sound piece data.
It also specifies the head address of the written compressed sound piece data in the storage area of the sound piece database 10, and writes this address into the storage area of the sound piece database 10 as the data (B) described above.
It also specifies the data length of the compressed sound piece data, and writes the specified data length into the storage area of the sound piece database 10 as the data (C).
In addition, it generates data indicating the results of specifying the utterance speed of the sound piece represented by the compressed sound piece data and the time change of the frequency of its pitch component, and writes these into the storage area of the sound piece database 10 as the speed initial value data and the pitch component data.
[0045]
Next, the operation of this speech synthesis system will be described.
First, it is assumed that the language processing unit 1 has acquired free text data describing a sentence (free text) including an ideogram prepared by the user as a target for synthesizing speech in the speech synthesis system.
[0046]
The method by which the language processing unit 1 acquires the free text data is arbitrary. For example, it may acquire the data from an external device or a network via an interface circuit (not shown), or read it from a recording medium (for example, a floppy (registered trademark) disk or a CD-ROM) set in a recording medium drive device (not shown), via that recording medium drive device. Alternatively, the processor performing the function of the language processing unit 1 may hand over text data used in other processing that it executes to the processing of the language processing unit 1 as free text data.
[0047]
When the free text data is acquired, the language processing unit 1 specifies the phonetic characters representing the reading of each ideographic character included in the free text by searching the general word dictionary 2 and the user word dictionary 3, and replaces the ideographic characters with the specified phonetic characters. The language processing unit 1 then supplies the acoustic processing unit 4 with the phonetic character string obtained by replacing all ideographic characters in the free text with phonetic characters.
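The replacement of ideographic characters by phonetic characters can be pictured as a dictionary lookup over the text. The sketch below is one possible reading, not the source's procedure: it assumes greedy longest-match segmentation and assumes that entries in the user word dictionary take precedence over the general word dictionary, neither of which is stated in the source.

```python
def to_phonograms(text, general_dict, user_dict):
    """Replace dictionary words in `text` with their readings.

    Uses greedy longest-match; characters with no dictionary entry
    (for example, kana already in the text) are passed through unchanged.
    """
    merged = {**general_dict, **user_dict}  # assumption: user entries win
    longest = max((len(word) for word in merged), default=0)
    out = []
    i = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in merged:
                out.append(merged[word])
                i += size
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


general = {"埼玉": "サイタマ", "県": "ケン"}
user = {"埼玉県": "サイタマケン"}
phonetic = to_phonograms("埼玉県へ", general, user)
```

A practical system would also need word segmentation and disambiguation of readings, which this sketch leaves out.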
[0048]
When supplied with the phonetic character string from the language processing unit 1, the acoustic processing unit 4 instructs the search unit 5 to search, for each phonetic character included in the string, for the waveform of the unit speech that the character represents.
[0049]
In response to this instruction, the search unit 5 searches the waveform database 7 and searches for compressed waveform data representing the waveform of the unit speech represented by each phonogram included in the phonogram string. Then, the retrieved compressed waveform data is supplied to the decompression unit 6.
[0050]
The decompression unit 6 restores the compressed waveform data supplied from the search unit 5 to the waveform data before being compressed, and returns it to the search unit 5. The search unit 5 supplies the waveform data returned from the decompression unit 6 to the sound processing unit 4 as a search result.
The acoustic processing unit 4 supplies the waveform data supplied from the search unit 5 to the sound piece editing unit 8 in an order that follows the order of the phonetic characters in the phonetic character string supplied from the language processing unit 1.
[0051]
When the sound piece editing unit 8 is supplied with waveform data from the acoustic processing unit 4, it combines the waveform data in the order supplied and outputs the result as data representing synthesized speech (synthesized speech data). The synthesized speech generated in this way from the free text data corresponds to speech synthesized by the rule synthesis method.
[0052]
Note that the method by which the sound piece editing unit 8 outputs the synthesized speech data is arbitrary. For example, the synthesized speech represented by the synthesized speech data may be played back via a D/A (Digital-to-Analog) converter and a speaker (not shown). The data may also be sent to an external device or a network via an interface circuit (not shown), or written to a recording medium set in a recording medium drive device (not shown) via that recording medium drive device. Alternatively, the processor that performs the function of the sound piece editing unit 8 may hand over the synthesized speech data to other processing that it is executing.
[0053]
Next, it is assumed that the acoustic processing unit 4 has acquired data representing a phonetic character string distributed from the outside (distribution character string data). (The method by which the acoustic processing unit 4 acquires the distribution character string data is also arbitrary. For example, it may be acquired by a method similar to that by which the language processing unit 1 acquires the free text data.)
[0054]
In this case, the acoustic processing unit 4 handles the phonetic character string represented by the distribution character string data in the same manner as a phonetic character string supplied from the language processing unit 1. As a result, the compressed waveform data corresponding to the phonetic characters included in the phonetic character string represented by the distribution character string data is retrieved by the search unit 5 and restored by the decompression unit 6 to the waveform data before compression. Each piece of restored waveform data is supplied to the sound piece editing unit 8 via the acoustic processing unit 4, and the sound piece editing unit 8 combines the waveform data in an order that follows the order of the phonetic characters in the phonetic character string represented by the distribution character string data, and outputs the result as synthesized speech data. The synthesized speech data generated from the distribution character string data also represents speech synthesized by the rule synthesis method.
[0055]
Next, it is assumed that the sound piece editing unit 8 has acquired standard message data. The standard message data is data representing a standard message as a phonetic character string.
The method by which the sound piece editing unit 8 acquires the standard message data is arbitrary, and for example, the standard message data may be acquired by a method similar to the method by which the language processing unit 1 acquires the free text data.
[0056]
When the standard message data is supplied, the sound piece editing unit 8 instructs the search unit 9 to retrieve all compressed sound piece data associated with phonetic characters that match the phonetic characters representing the reading of a sound piece included in the standard message.
[0057]
In response to the instruction from the sound piece editing unit 8, the search unit 9 searches the sound piece database 10, retrieves the matching compressed sound piece data together with the sound piece reading data, speed initial value data, and pitch component data associated with it, and supplies the retrieved compressed sound piece data to the decompression unit 6. When a plurality of compressed sound piece data correspond to one sound piece, all of them are retrieved as candidates for the data used in speech synthesis. On the other hand, when there is a sound piece for which no compressed sound piece data could be found, the search unit 9 generates data identifying that sound piece (hereinafter referred to as missing part identification data).
[0058]
The decompression unit 6 restores the compressed sound piece data supplied from the search unit 9 to the sound piece data before being compressed, and returns it to the search unit 9. The retrieval unit 9 supplies the speech piece data returned from the decompression unit 6 and the retrieved speech piece reading data, speed initial value data, and pitch component data to the speech speed conversion unit 11 as retrieval results. Further, when missing part identification data is generated, this missing part identification data is also supplied to the speech speed conversion unit 11.
[0059]
On the other hand, when the sound piece editing unit 8 is supplied with data representing the moving speed of the moving body from the speed detection unit 12, it determines the utterance speed of the standard message (the length of time taken to utter the standard message) based on the moving speed represented by that data. The sound piece editing unit 8 then instructs the speech speed conversion unit 11 to convert the sound piece data supplied to the speech speed conversion unit 11 so that the time length of the sound piece represented by the sound piece data matches the determined utterance speed (or comes within a certain margin of it).
[0060]
Note that the correspondence between the moving speed of the moving body and the utterance speed is arbitrary, and for example, the sound piece editing unit 8 may determine that the utterance speed increases as the moving speed of the moving body increases.
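Because the correspondence is left open, any monotonic mapping would satisfy the example given. The sketch below uses a clamped linear mapping purely as an illustration; the function name, the rate limits of 1.0 and 1.5, and the reference speed of 100 km/h are all assumed values and do not come from the source.

```python
def utterance_rate(moving_speed_kmh, base_rate=1.0, max_rate=1.5,
                   full_speed_kmh=100.0):
    """Map the moving speed to a relative utterance rate.

    The rate rises linearly from base_rate at standstill to max_rate at
    full_speed_kmh, and is clamped outside that range.
    """
    ratio = min(max(moving_speed_kmh / full_speed_kmh, 0.0), 1.0)
    return base_rate + (max_rate - base_rate) * ratio


rates = [utterance_rate(v) for v in (0.0, 50.0, 100.0, 200.0)]
```

The target time length of a message would then be its original time length divided by this rate.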
[0061]
In response to the instruction from the sound piece editing unit 8, the speech speed conversion unit 11 converts the sound piece data supplied from the search unit 9 so as to comply with the instruction, and supplies the converted sound piece data to the sound piece editing unit 8. Specifically, for example, it may specify the original time length of the sound piece data supplied from the search unit 9 based on the retrieved speed initial value data, and then resample the sound piece data so that its number of samples corresponds to a time length matching the speed designated by the sound piece editing unit 8.
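The resampling step could look like the following sketch, which changes the number of samples by linear interpolation. Linear interpolation is chosen here only for brevity and is an assumption, as are the function names; the source does not name an interpolation method.

```python
def resample_linear(samples, target_len):
    """Stretch or shrink a sample sequence to target_len samples."""
    n = len(samples)
    if target_len <= 0 or n == 0:
        return []
    if target_len == 1 or n == 1:
        return [samples[0]] * target_len
    step = (n - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * step
        lo = min(int(pos), n - 1)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def convert_duration(samples, original_seconds, target_seconds):
    """Resample so that playback at the unchanged sample rate lasts
    target_seconds instead of original_seconds."""
    target_len = round(len(samples) * target_seconds / original_seconds)
    return resample_linear(samples, target_len)


stretched = resample_linear([0.0, 1.0, 2.0, 3.0], 7)
# stretched == [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
```

Note that resampled data played back at the original sample rate also shifts in pitch by the same factor.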
[0062]
In addition, the speech speed conversion unit 11 also supplies the sound piece reading data and pitch component data supplied from the search unit 9 to the sound piece editing unit 8, and if the missing part identification data is supplied from the search unit 9, The missing part identification data is also supplied to the sound piece editing unit 8.
[0063]
If the speech speed data is not supplied to the sound piece editing unit 8, the sound piece editing unit 8 need only instruct the speech speed conversion unit 11 to supply the sound piece data it has received to the sound piece editing unit 8 without conversion, and the speech speed conversion unit 11 need only respond to this instruction by supplying the sound piece data supplied from the search unit 9 to the sound piece editing unit 8 as it is.
[0064]
When the sound piece editing unit 8 is supplied with the sound piece data, the sound piece reading data, and the pitch component data from the speech speed conversion unit 11, it selects from the supplied sound piece data, one for each sound piece, the sound piece data representing the waveform that most closely approximates the waveform of each sound piece constituting the standard message.
[0065]
The criterion for selecting sound piece data is arbitrary. For example, the sound piece editing unit 8 may perform prosody prediction on the standard message and then, for each sound piece in the standard message, select from the sound piece data supplied by the speech speed conversion unit 11 the one whose pitch component time change shows the highest correlation with the prosody prediction result.
[0066]
Specifically, the sound piece editing unit 8 first analyzes the standard message represented by the standard message data using a prosody prediction method such as the “Fujisaki model” or “ToBI (Tone and Break Indices)”, thereby predicting the time change of the frequency of the pitch component of each sound piece in the standard message, and specifies a function representing the prediction result. In addition, based on the pitch component data supplied from the speech speed conversion unit 11, the sound piece editing unit 8 specifies a function representing the time change of the frequency of the pitch component of each piece of sound piece data supplied from the speech speed conversion unit 11.
[0067]
Then, for each sound piece in the standard message, the sound piece editing unit 8 obtains the correlation coefficient between the function representing the predicted time change of the frequency of the pitch component of that sound piece and the function representing the time change of the frequency of the pitch component of each piece of sound piece data representing the waveform of a sound piece with a matching reading, and selects the sound piece data that gives the highest correlation coefficient.
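The selection by correlation coefficient can be sketched as follows. The sketch assumes that the predicted contour and each candidate's contour have been sampled at the same number of time points and uses the Pearson correlation coefficient; the candidate names and contour values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0.0 or dev_y == 0.0:
        return 0.0  # a flat contour carries no shape information
    return cov / (dev_x * dev_y)


def select_best(predicted, candidates):
    """Return the key of the candidate contour that best matches the prediction."""
    return max(candidates, key=lambda name: pearson(predicted, candidates[name]))


predicted = [100.0, 110.0, 120.0, 130.0]  # predicted pitch contour (Hz)
candidates = {
    "take_rising": [90.0, 100.0, 115.0, 125.0],
    "take_falling": [130.0, 120.0, 110.0, 100.0],
    "take_flat": [100.0, 100.0, 100.0, 100.0],
}
best = select_best(predicted, candidates)
```

Because the correlation coefficient ignores offset and scale, a candidate recorded at a different overall pitch can still be chosen if its contour has the same shape as the prediction.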
[0068]
On the other hand, when missing part identification data is also supplied from the speech speed conversion unit 11, the sound piece editing unit 8 extracts from the standard message data the phonetic character string representing the reading of the sound piece indicated by the missing part identification data, supplies it to the acoustic processing unit 4, and instructs the acoustic processing unit 4 to synthesize the waveform of that sound piece.
[0069]
Upon receiving the instruction, the acoustic processing unit 4 handles the phonetic character string supplied from the sound piece editing unit 8 in the same manner as the phonetic character string represented by the distribution character string data. As a result, the compressed waveform data representing the speech waveform indicated by the phonogram included in the phonogram string is retrieved by the search unit 5, and the compressed waveform data is restored to the original waveform data by the decompression unit 6. , And supplied to the acoustic processing unit 4 via the search unit 5. The sound processing unit 4 supplies the waveform data to the sound piece editing unit 8.
[0070]
When the waveform data is returned from the acoustic processing unit 4, the sound piece editing unit 8 combines this waveform data and the sound piece data it has selected from among the sound piece data supplied from the speech speed conversion unit 11, in an order that follows the arrangement of the sound pieces in the standard message indicated by the standard message data, and outputs the result as data representing synthesized speech.
[0071]
If the data supplied from the speech speed conversion unit 11 does not include missing part identification data, the sound piece editing unit 8 need not instruct the acoustic processing unit 4 to synthesize a waveform; it may immediately combine the sound piece data it has selected in an order that follows the arrangement of the sound pieces in the standard message indicated by the standard message data, and output the result as data representing synthesized speech.
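The assembly logic of the preceding paragraphs (use the selected sound piece where one exists, fall back to rule synthesis for missing parts, and concatenate in message order) could be sketched as below. Waveforms are modelled as plain lists of samples, and `synthesize_by_rule` is a hypothetical stand-in for the path through the acoustic processing unit 4.

```python
def assemble(piece_readings, selected_pieces, synthesize_by_rule):
    """Concatenate waveforms in the order of the sound pieces in the message.

    piece_readings: readings of the sound pieces, in message order.
    selected_pieces: reading -> waveform chosen from the sound piece database.
    synthesize_by_rule: callable producing a waveform for a reading that
        has no recorded sound piece (the missing parts).
    """
    output = []
    for reading in piece_readings:
        waveform = selected_pieces.get(reading)
        if waveform is None:
            waveform = synthesize_by_rule(reading)  # fallback for a missing part
        output.extend(waveform)
    return output


selected = {"a": [1, 2], "c": [5]}
speech = assemble(["a", "bb", "c"], selected, lambda reading: [0] * len(reading))
# speech == [1, 2, 0, 0, 5]
```

When no part is missing, the fallback is never called, which corresponds to the case where the acoustic processing unit 4 is not instructed to synthesize a waveform.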
[0072]
In the speech synthesis system described above, the utterance speed of the synthesized speech changes in accordance with the moving speed of the moving body on which the speech synthesis system is mounted. Therefore, for example, when this speech synthesis system is used to generate navigation speech in a car navigation device, having the sound piece editing unit 8 increase the utterance speed of the sound piece data as the moving speed of the vehicle increases makes it possible to obtain navigation speech at a speed appropriate to the traveling state of the vehicle. For example, when the vehicle is approaching an intersection at high speed, uttering the necessary information at an utterance speed corresponding to the speed of the vehicle allows the passenger to hear that information before entering the intersection. In addition, synthesized speech that is easy to hear can be obtained even if the speed of the moving body changes.
[0073]
Note that the configuration of this speech synthesis system is not limited to that described above.
For example, waveform data and sound piece data need not be data in PCM format, and the data format is arbitrary.
Further, the waveform database 7 and the sound piece database 10 do not necessarily need to store the waveform data and sound piece data in a compressed state. When the waveform database 7 or the sound piece database 10 stores the waveform data or sound piece data in a state where the data is not compressed, the main body unit M does not need to include the expansion unit 6.
[0074]
Further, the sound piece database creation unit 14 may read, from a recording medium set in a recording medium drive device (not shown) and via that recording medium drive device, the sound piece data and phonetic character strings that serve as material for new compressed sound piece data to be added to the sound piece database 10.
The sound piece registration unit R does not necessarily need to include the recorded sound piece data set storage unit 13.
[0075]
The sound piece database creation unit 14 may include a microphone, an amplifier, a sampling circuit, an A/D (Analog-to-Digital) converter, a PCM encoder, and the like. In this case, instead of acquiring the sound piece data from the recorded sound piece data set storage unit 13, the sound piece database creation unit 14 may create the sound piece data by amplifying a sound signal representing the sound collected by its own microphone, sampling and A/D converting it, and then applying PCM modulation to the sampled sound signal.
[0076]
Further, the pitch component data may be data representing the time change of the pitch length of the sound piece represented by the sound piece data. In this case, the sound piece editing unit 8 need only predict the time change of the pitch length of the sound piece by performing prosody prediction, and obtain the correlation between the prediction result and the pitch component data representing the time change of the pitch length of the sound piece data that represents the waveform of a sound piece whose reading matches that sound piece.
[0077]
Further, for example, the sound piece editing unit 8 may acquire the free text data together with the language processing unit 1, select sound piece data representing a waveform close to the waveform of a sound piece included in the free text represented by the free text data by performing substantially the same process as the process of selecting sound piece data representing a waveform close to the waveform of a sound piece included in the standard message, and use the selected sound piece data for speech synthesis.
In this case, for a sound piece represented by sound piece data selected by the sound piece editing unit 8, the acoustic processing unit 4 need not cause the search unit 5 to search for waveform data representing the waveform of that sound piece. The sound piece editing unit 8 may notify the acoustic processing unit 4 of the sound pieces that the acoustic processing unit 4 does not need to synthesize, and the acoustic processing unit 4 may, in response to this notification, stop searching for the waveforms of the unit speech constituting those sound pieces.
[0078]
Further, for example, the sound piece editing unit 8 may acquire the distribution character string data together with the acoustic processing unit 4, select sound piece data representing a waveform close to the waveform of a sound piece included in the distribution character string represented by the distribution character string data by performing substantially the same process as the process of selecting sound piece data representing a waveform close to the waveform of a sound piece included in the standard message, and use the selected sound piece data for speech synthesis. In this case, for a sound piece represented by sound piece data selected by the sound piece editing unit 8, the acoustic processing unit 4 need not cause the search unit 5 to search for waveform data representing the waveform of that sound piece.
[0079]
In addition, the sound piece editing unit 8 may supply the waveform data returned from the acoustic processing unit 4 to the speech speed conversion unit 11 so that the time length of the waveform represented by the waveform data is matched to the speed indicated by the utterance speed data (or to a speed that is close to it by at least a certain degree). By doing so, the utterance speed of the speech synthesized by the acoustic processing unit 4 using the rule-based synthesis method also changes according to the moving speed of the moving body.
[0080]
In addition, the speech synthesis system may include a plurality of sound piece databases 10, in which case each sound piece database 10 may be associated with a different, mutually non-overlapping range of utterance speeds. In this case, for example, it suffices that the set of readings of the compressed sound piece data stored in each sound piece database 10 is common to all of them, and that the compressed sound piece data of a given reading in a sound piece database 10 associated with a certain range of utterance speeds represents speech read out at a lower (slower) utterance speed than the compressed sound piece data of the same reading in a sound piece database 10 associated with a higher (faster) range of utterance speeds.
[0081]
When the speech synthesis system includes a plurality of sound piece databases 10 as described above, the sound piece editing unit 8 need only determine, for example, which sound piece database 10 to use based on the determined utterance speed, and supply the search unit 9 with data indicating the sound piece database 10 so determined. The search unit 9 then need only retrieve the compressed sound piece data from the sound piece database 10 indicated by this data.
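The choice among several databases covering non-overlapping speed ranges can be sketched as a sorted-boundary lookup. The boundaries, the database names, and the unit of speed below are hypothetical and serve only to illustrate the idea.

```python
import bisect

# Hypothetical layout: database i covers utterance speeds from
# RANGE_LOWER_BOUNDS[i] up to (but not including) the next lower bound.
RANGE_LOWER_BOUNDS = [0.0, 6.0, 8.0]
DATABASE_NAMES = ["slow", "normal", "fast"]

def choose_database(speed):
    """Return the name of the database whose speed range contains `speed`."""
    # bisect_right counts the lower bounds that are <= speed.
    index = bisect.bisect_right(RANGE_LOWER_BOUNDS, speed) - 1
    # Speeds below the first bound fall back to the slowest database.
    return DATABASE_NAMES[max(index, 0)]
```

Because the ranges do not overlap, every speed maps to exactly one database, which is why the later speed conversion of the retrieved data can be skipped in this variant.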
[0082]
Further, instead of resampling the sound piece data, the speech speed conversion unit 11 may specify a portion of the sound piece data that substantially represents a silent state and adjust the time length of the specified portion, thereby matching the time length of the sound piece represented by the sound piece data to the speed indicated by the utterance speed data.
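The silence-based adjustment can be illustrated roughly as follows, assuming that "substantially silent" means a run of samples whose absolute value stays below a threshold. The threshold, the minimum run length, and the choice to emit exact zeros for the adjusted run are assumptions of this sketch.

```python
def stretch_silence(samples, factor, threshold=0.01, min_run=4):
    """Scale the length of silent runs by `factor`; voiced samples are untouched.

    A run counts as silent when at least `min_run` consecutive samples have an
    absolute value below `threshold` (both values are hypothetical).
    """
    out = []
    i, n = 0, len(samples)
    while i < n:
        j = i
        if abs(samples[i]) < threshold:
            # Extend j to the end of the quiet run.
            while j < n and abs(samples[j]) < threshold:
                j += 1
            run = j - i
            if run >= min_run:
                # Replace the run with a silence of the scaled length.
                out.extend([0.0] * max(1, round(run * factor)))
            else:
                # Too short to be treated as silence; copy unchanged.
                out.extend(samples[i:j])
        else:
            # Extend j to the end of the voiced run and copy it unchanged.
            while j < n and abs(samples[j]) >= threshold:
                j += 1
            out.extend(samples[i:j])
        i = j
    return out
```

A factor below 1 shortens pauses and so raises the overall utterance speed without altering the voiced portions.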
[0083]
Further, the physical quantity relating to the external situation that the speech synthesis system detects does not necessarily have to be the speed of the moving body, and may be any other physical quantity.
[0084]
Therefore, the speed detection unit 12 may be configured by, for example, an acceleration sensor, and may detect the acceleration of the moving body on which the speech synthesis system is mounted, generate data indicating the detected acceleration, and supply it to the sound piece editing unit 8. Further, an integrating circuit for integrating the detected acceleration may be provided; in this case, data indicating the result of integrating the detected acceleration may be generated and supplied to the sound piece editing unit 8.
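In software, the integration of the detected acceleration could be approximated as below. This is only a sketch using the trapezoidal rule on uniformly spaced samples; the function name, the sampling interval `dt`, and the initial speed `v0` are assumptions.

```python
def integrate_acceleration(accels, dt, v0=0.0):
    """Estimate speed at each sample time from uniformly sampled acceleration.

    Uses the trapezoidal rule; `dt` is the sampling interval in seconds.
    """
    speeds = [v0]
    for prev, cur in zip(accels, accels[1:]):
        # Area of the trapezoid between two neighbouring samples.
        speeds.append(speeds[-1] + 0.5 * (prev + cur) * dt)
    return speeds
```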
[0085]
Further, the speed detection unit 12 may detect peaks of the acceleration of the moving body on which the speech synthesis system is mounted and, each time a peak is detected, generate data indicating the latest detected peak value and supply it to the sound piece editing unit 8.
On the other hand, each time data indicating a peak value of the acceleration is supplied from the speed detection unit 12, the sound piece editing unit 8 may classify the peak value indicated by this data into one of a plurality of predetermined ranks, specify how frequently peak values were classified in the past into the same rank as the rank into which this peak value has been classified, and determine the utterance speed according to the specified result.
[0086]
In this case, the speed detection unit 12 need only include, for example, an A/D (Analog-to-Digital) converter, a logic circuit, and the like in order to detect peaks by converting the signal generated by the acceleration sensor into digital data.
[0087]
In this case, the sound piece editing unit 8 need only further include a non-volatile memory such as a PROM, store in advance a table having the data structure shown in FIG. 3A, and determine the utterance speed by referring to this table. As shown in the figure, this table only needs to store the utterance speed in association with the rank to which the detected acceleration peak belongs and the frequency at which peaks belonging to this rank are detected.
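The rank classification and table lookup described above might be sketched as follows. The rank boundaries, the two frequency classes, the use of a simple count as the measure of frequency, and all table values are assumptions made for illustration; the actual table contents are those of FIG. 3A.

```python
import bisect
from collections import Counter

# Hypothetical rank boundaries in m/s^2: rank 0 is "up to 1.0", rank 3 is "above 4.0".
RANK_UPPER_BOUNDS = [1.0, 2.0, 4.0]

# Hypothetical table contents: (rank, frequency class) -> utterance speed multiplier.
SPEED_TABLE = {
    (0, "rare"): 1.0, (0, "frequent"): 1.0,
    (1, "rare"): 1.0, (1, "frequent"): 1.1,
    (2, "rare"): 1.1, (2, "frequent"): 1.2,
    (3, "rare"): 1.2, (3, "frequent"): 1.3,
}

class PeakRankTracker:
    """Remembers how many past peaks fell into each rank."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, peak):
        """Classify the latest peak; return (rank, number of past peaks in that rank)."""
        rank = bisect.bisect_left(RANK_UPPER_BOUNDS, peak)
        past = self.counts[rank]
        self.counts[rank] += 1
        return rank, past

def lookup_speed(rank, past_count, frequent_after=3):
    """Look up the utterance speed for a rank and its past frequency."""
    frequency_class = "frequent" if past_count >= frequent_after else "rare"
    return SPEED_TABLE[(rank, frequency_class)]
```

Each call to `observe` both reports the past frequency for the rank of the new peak and records the new peak for later calls.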
[0088]
In addition, when this speech synthesis system is used in, for example, an automobile, a brake sensor may further be provided that detects peaks of the brake depression amount of the automobile and supplies data representing the detection result to the sound piece editing unit 8. A steering wheel sensor that detects peaks of the angular velocity of the steering wheel of the automobile and supplies data representing the detection result to the sound piece editing unit 8 may also be provided.
[0089]
In this case, for example, each time the sound piece editing unit 8 is supplied with data indicating a peak value of the brake depression amount or data indicating a peak value of the angular velocity of the steering wheel, it classifies the peak value indicated by the data into one of a plurality of ranks defined respectively for the brake depression amount and for the angular velocity of the steering wheel, and specifies how frequently peak values were classified in the past into the same rank as the rank into which this peak value has been classified.
[0090]
Then, the sound piece editing unit 8 obtains the value α given by Expression 1 from the peak value pa of the acceleration of the automobile, the peak value pb of the brake depression amount, and the peak value pω of the angular velocity of the steering wheel.
[0091]
[Expression 1]
$$\alpha = (W_{A1} \cdot \mathit{pa}) + (W_{A2} \cdot \mathit{pb}) + (W_{A3} \cdot \mathit{p\omega})$$
(where $W_{A1}$, $W_{A2}$, and $W_{A3}$ are predetermined coefficients)
[0092]
The sound piece editing unit 8 also obtains the value β given by Expression 2 from a value fa indicating how frequently peak values were classified in the past into the same rank as the rank into which the latest peak value of the acceleration of the automobile has been classified, a value fb indicating how frequently peak values were classified in the past into the same rank as the rank into which the latest peak value of the brake depression amount has been classified, and a value fω indicating how frequently peak values were classified in the past into the same rank as the rank into which the latest peak value of the angular velocity of the steering wheel has been classified.
[0093]
[Expression 2]
$$\beta = (W_{B1} \cdot \mathit{fa}) + (W_{B2} \cdot \mathit{fb}) + (W_{B3} \cdot \mathit{f\omega})$$
(where $W_{B1}$, $W_{B2}$, and $W_{B3}$ are predetermined coefficients)
[0094]
On the other hand, in this case, the sound piece editing unit 8 may store in advance a table that holds utterance speeds in association with the values of α and β, with the data structure shown in the figure, and determine the utterance speed by referring to this table.
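The weighted sums of Expression 1 and Expression 2 and the subsequent table lookup can be sketched as follows. The coefficient values, the bin edges for α and β, and the table entries are all hypothetical; only the form of the two sums comes from the expressions themselves.

```python
import bisect

# Hypothetical coefficients W_A1..W_A3 and W_B1..W_B3.
W_A = (0.5, 0.3, 0.2)
W_B = (0.4, 0.4, 0.2)

def weighted_sum(weights, values):
    """Return the sum of weight * value over corresponding pairs."""
    return sum(w * v for w, v in zip(weights, values))

def alpha(pa, pb, p_omega):
    """Expression 1: weighted sum of the three peak values."""
    return weighted_sum(W_A, (pa, pb, p_omega))

def beta(fa, fb, f_omega):
    """Expression 2: weighted sum of the three past-frequency values."""
    return weighted_sum(W_B, (fa, fb, f_omega))

def speed_from_table(a, b,
                     a_edges=(1.0, 2.0),
                     b_edges=(5.0,),
                     table=((1.0, 1.1), (1.1, 1.2), (1.2, 1.3))):
    """Pick the utterance speed from a 2-D table indexed by bins of alpha and beta."""
    row = bisect.bisect_right(a_edges, a)  # bin index of alpha
    col = bisect.bisect_right(b_edges, b)  # bin index of beta
    return table[row][col]
```

For example, `speed_from_table(alpha(2.0, 1.0, 0.0), beta(10, 5, 0))` reads the entry for the middle α bin and the upper β bin of the assumed table.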
[0095]
In addition, this speech synthesis system may change the utterance speed based on the current time. In this case, for example, the sound piece editing unit 8 may include a timer composed of a crystal oscillator or the like, continuously obtain data indicating the current date and time from this timer, and determine the utterance speed of the sound piece data based on the obtained data.
[0096]
In addition, what the speech synthesis system changes according to the detection result of the physical quantity relating to the external situation is not necessarily the utterance speed, and may be any other element that characterizes the speech.
[0097]
Therefore, in this speech synthesis system, for example, the sound piece editing unit 8 may change the amplitude of the sound piece data supplied from the search unit 9 or of the waveform data supplied from the acoustic processing unit 4.
[0098]
In addition, this speech synthesis system may include, for example, a microphone and a level detection circuit in order to detect the noise level inside or outside the moving body and supply data representing the detection result to the sound piece editing unit 8. In this case, the sound piece editing unit 8 may determine the amplitude of the synthesized speech based on, for example, the noise level represented by this data, and convert the sound piece data and the waveform data so as to match the determined amplitude. With such a configuration, this speech synthesis system can keep the synthesized speech easy to hear even when the surrounding noise is loud, for example by increasing the amplitude of the synthesized speech as the noise level increases.
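One possible way to raise the amplitude with the noise level is sketched below. The decibel thresholds, the maximum gain, and the clipping limit are illustrative assumptions, not values given in the embodiment.

```python
def gain_for_noise(noise_db, quiet_db=50.0, loud_db=80.0, max_gain=4.0):
    """Return an amplitude gain that rises linearly between two noise levels."""
    if noise_db <= quiet_db:
        return 1.0
    if noise_db >= loud_db:
        return max_gain
    t = (noise_db - quiet_db) / (loud_db - quiet_db)
    return 1.0 + (max_gain - 1.0) * t

def apply_gain(samples, gain, limit=1.0):
    """Scale each sample by `gain` and clip the result to [-limit, limit]."""
    return [max(-limit, min(limit, s * gain)) for s in samples]
```

Clipping is included here only so that the sketch never produces samples outside the assumed full-scale range.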
[0099]
In addition, this speech synthesis system may include, for example, a microphone and a Fourier transform device in order to detect the band occupied by noise inside or outside the moving body and supply data representing the detection result to the sound piece editing unit 8. In this case, for example, the sound piece editing unit 8 may determine the noise occupation band represented by this data to be the attenuation band of the synthesized speech (or otherwise determine the attenuation band of the synthesized speech based on the noise occupation band), and remove the spectral components within the determined attenuation band from the sound piece data and the waveform data. With such a configuration, this speech synthesis system can keep the synthesized speech easy to hear by avoiding overlap between the band occupied by the noise and the band occupied by the synthesized speech.
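Removal of the spectral components inside an attenuation band can be illustrated with a discrete Fourier transform. The sketch below uses a naive O(n²) transform so that it needs no external library; the function name and parameters are assumptions, and a practical system would more likely use an FFT or a filter.

```python
import cmath
import math

def remove_band(samples, sample_rate, low_hz, high_hz):
    """Zero every spectral component between low_hz and high_hz (inclusive)."""
    n = len(samples)
    # Forward DFT (naive, suitable only for short illustrative signals).
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        # Bin k and its mirror n - k both represent frequency min(k, n - k) * rate / n.
        freq = min(k, n - k) * sample_rate / n
        if low_hz <= freq <= high_hz:
            spectrum[k] = 0
    # Inverse DFT; the imaginary parts cancel because the spectrum stays symmetric.
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]
```

Zeroing both a bin and its mirror keeps the output real-valued, which is why only the real part is returned.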
[0100]
In this speech synthesis system, for example, the sound piece editing unit 8 may change the voice quality of sound piece data or waveform data.
Specifically, for example, the sound piece editing unit 8 converts the sound piece data into subband data representing the time change of the pitch component (fundamental frequency component) and the harmonic components of the sound piece represented by the sound piece data, and then converts the obtained subband data back into data representing the waveform of the sound piece. However, when converting the subband data into data representing the waveform, the sound piece editing unit 8 performs the conversion by interpreting each component represented by the subband data as representing the time change of a component at a frequency different from the frequency it originally represented (for example, at twice the original frequency).
[0101]
With such a configuration, this speech synthesis system can, for example, generate synthesized speech for car navigation that is less likely to induce drowsiness by raising the pitch of the speech during the daytime, and can thus provide synthesized speech suited to the time of day at which the automobile is driven.
[0102]
Although the embodiment of the present invention has been described above, the speech speed conversion device according to the present invention can be realized using a normal computer system, not a dedicated system.
For example, the main unit M that executes the above-described processing can be configured by installing, on a personal computer to which a speed sensor (or another sensor for detecting an arbitrary physical quantity) is connected, a program for executing the operations of the language processing unit 1, the general word dictionary 2, the user word dictionary 3, the acoustic processing unit 4, the search unit 5, the decompression unit 6, the waveform database 7, the sound piece editing unit 8, the search unit 9, the sound piece database 10, and the speech speed conversion unit 11 described above, from a medium (a CD-ROM, an MO, a floppy (registered trademark) disk, or the like) storing the program.
Further, the sound piece registration unit R that executes the above-described processing can be configured by installing, on a personal computer, a program for causing the personal computer to execute the operations of the recorded sound piece data set storage unit 13, the sound piece database creation unit 14, and the compression unit 15, from a medium storing the program.
[0103]
A personal computer that executes these programs and functions as the main unit M and the sound piece registration unit R is assumed to perform, for example, the processes shown in FIGS. 4 to 6 as the processes corresponding to the operation of the speech synthesis system described above.
FIG. 4 is a flowchart showing processing when the personal computer acquires free text data.
FIG. 5 is a flowchart showing processing when the personal computer acquires distribution character string data.
FIG. 6 is a flowchart showing processing when the personal computer acquires the standard message data and the utterance speed data.
[0104]
That is, when the personal computer acquires the above-mentioned free text data from the outside (FIG. 4, step S101), it specifies, for each ideogram included in the free text represented by the free text data, the phonogram representing its reading by searching the general word dictionary 2 and the user word dictionary 3, and replaces the ideogram with the specified phonogram (step S102). The method by which the personal computer acquires the free text data is arbitrary.
[0105]
When the personal computer obtains the phonetic character string representing the result of replacing all ideograms in the free text with phonograms, it searches the waveform database 7, for each phonogram included in this phonetic character string, for the waveform of the unit speech represented by that phonogram, and retrieves the compressed waveform data representing the waveform of the unit speech represented by each phonogram included in the phonetic character string (step S103).
[0106]
Next, the personal computer restores the retrieved compressed waveform data to the waveform data before compression (step S104), combines the restored waveform data in the order in which the phonograms are arranged in the phonetic character string, and outputs the result as synthesized speech data (step S105). The method by which the personal computer outputs the synthesized speech data is arbitrary.
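Steps S103 to S105 can be sketched as a lookup, a decompression, and a concatenation. The sketch assumes 16-bit PCM samples and uses zlib as a stand-in for the compression scheme, which the embodiment does not specify; the database contents and the phonograms "ka" and "na" are hypothetical.

```python
import struct
import zlib

def pack(samples):
    """Compress 16-bit samples (zlib is only a stand-in for the actual scheme)."""
    return zlib.compress(struct.pack(f"<{len(samples)}h", *samples))

def unpack(blob):
    """Restore the 16-bit samples from a compressed blob."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))

# Hypothetical waveform database: phonogram -> compressed unit-speech waveform.
WAVEFORM_DB = {
    "ka": pack([100, 200]),
    "na": pack([-50, 0, 50]),
}

def synthesize(phonograms):
    """Concatenate the restored unit waveforms in phonogram order."""
    out = []
    for phonogram in phonograms:
        compressed = WAVEFORM_DB[phonogram]   # step S103: retrieve
        out.extend(unpack(compressed))        # step S104: restore
    return out                                # step S105: combined output
```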
[0107]
When the personal computer acquires the above-mentioned distribution character string data from the outside by an arbitrary method (FIG. 5, step S201), it searches the waveform database 7, for each phonogram included in the phonetic character string represented by this distribution character string data, for the waveform of the unit speech represented by that phonogram, and retrieves the compressed waveform data representing the waveform of the unit speech represented by each phonogram included in the phonetic character string (step S202).
[0108]
Next, the personal computer restores the retrieved compressed waveform data to the waveform data before compression (step S203), combines the restored waveform data in the order in which the phonograms are arranged in the phonetic character string, and outputs the result as synthesized speech data by the same processing as in step S105 (step S204).
[0109]
On the other hand, when the personal computer acquires the above-mentioned standard message data and utterance speed data from the outside by an arbitrary method (FIG. 6, step S301), it first retrieves all the compressed sound piece data associated with phonograms that match the phonograms representing the readings of the sound pieces included in the standard message represented by the standard message data (step S302).
[0110]
In step S302, the above-described sound piece reading data, speed initial value data, and pitch component data associated with the corresponding compressed sound piece data are also retrieved. In addition, when a plurality of compressed sound piece data corresponds to one sound piece, all the corresponding compressed sound piece data are searched. On the other hand, if there is a sound piece for which compressed sound piece data could not be found, the above-described missing portion identification data is generated. Then, the personal computer restores the retrieved compressed sound piece data to the sound piece data before being compressed (step S303).
[0111]
On the other hand, when the personal computer is supplied with data representing the moving speed (or any other physical quantity) of the moving body from a speed sensor (for example, a device, not shown, that supplies vehicle speed pulses representing the vehicle speed of an automobile), it determines the utterance speed of the standard message based on the moving speed represented by this data (step S304). The correspondence between the moving speed of the moving body and the utterance speed is arbitrary.
[0112]
Next, the personal computer converts the sound piece data restored in step S303, by a process similar to that performed by the sound piece editing unit 8 described above, so that the time length of the sound piece represented by the sound piece data matches the utterance speed determined in step S304 (or a speed that is close to the determined utterance speed by at least a certain degree) (step S305). When the utterance speed data is not supplied, the restored sound piece data need not be converted.
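A time-length conversion by resampling, as in step S305, might look like the following sketch, which uses linear interpolation. The function name and the `rate` parameter are assumptions, and the interpolation method is chosen for brevity rather than taken from the embodiment.

```python
def change_speed(samples, rate):
    """Resample so that the piece plays `rate` times faster at the same sample rate.

    Linear interpolation is used for simplicity. Note that plain resampling
    also shifts the pitch by the same factor.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate          # fractional read position in the input
        left = int(pos)
        if left >= len(samples) - 1:
            out.append(samples[-1])
            continue
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[left + 1] * frac)
    return out
```

A rate of 2.0 halves the number of samples, and a rate of 0.5 doubles it.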
[0113]
Next, from among the sound piece data whose sound piece time lengths have been converted, the personal computer selects, one for each sound piece, the sound piece data representing the waveform closest to the waveform of the sound piece constituting the standard message, by performing the same process as that performed by the sound piece editing unit 8 described above (step S306).
[0114]
In step S306, the personal computer may, for example, predict the prosody of the standard message by applying an analysis based on a prosody prediction method to the standard message represented by the standard message data, while specifying, based on the retrieved pitch component data, a function representing the time change of the frequency of the pitch component of each retrieved sound piece data. Then, for each sound piece in the standard message, it may obtain the correlation coefficient between the prediction result for that sound piece and the function representing the time change of the frequency of the pitch component of each sound piece data representing the waveform of a sound piece whose reading matches that sound piece, and select the sound piece data giving the highest correlation coefficient.
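The selection by highest correlation coefficient can be sketched as follows, assuming that the predicted pitch contour and each candidate contour are sampled at the same number of points and that the Pearson coefficient is the correlation measure. The function names and the data layout are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat contour carries no shape information
    return cov / (spread_x * spread_y)

def select_piece(predicted, candidates):
    """Return the key of the candidate contour that best matches the prediction.

    `candidates` maps an identifier to a pitch contour (frequency over time).
    """
    return max(candidates, key=lambda key: pearson(predicted, candidates[key]))
```

Because the coefficient ignores offset and scale, a candidate spoken at a different average pitch can still be chosen when its contour has the predicted shape.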
[0115]
On the other hand, when the personal computer has generated the missing part identification data, it extracts from the standard message data the phonetic character string representing the reading of the sound piece indicated by the missing part identification data, and handles this phonetic character string in the same manner as the phonetic character string represented by the distribution character string data by performing the above-mentioned steps S202 to S203, thereby restoring the waveform data representing the waveform of the speech indicated by each phonogram in the phonetic character string (step S307).
[0116]
Then, the personal computer combines the restored waveform data and the sound piece data selected in step S306 in the order in which the sound pieces are arranged in the standard message indicated by the standard message data, and outputs the result as data representing the synthesized speech (step S308).
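The final combination of step S308 can be pictured as below: selected sound piece data is used where available, and rule-synthesized waveform data fills in for the missing pieces. The readings, the sample values, and the `fallback` callable are hypothetical stand-ins.

```python
def assemble(message_pieces, selected, fallback):
    """Concatenate waveforms in message order.

    `selected` maps a reading to the sound piece data chosen in step S306;
    `fallback(reading)` stands in for the rule-based synthesis of step S307.
    """
    out = []
    for reading in message_pieces:
        if reading in selected:
            out.extend(selected[reading])
        else:
            # No sound piece was found for this reading.
            out.extend(fallback(reading))
    return out
```

For example, with `selected = {"tsugi": [1, 2], "desu": [9]}`, the call `assemble(["tsugi", "eki", "desu"], selected, lambda r: [0] * len(r))` takes the middle piece from the fallback.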
[0117]
The program that causes the personal computer to perform the function of the main unit M may cause the personal computer to perform the functions of a plurality of sound piece databases 10. In this case, the personal computer need only complete the process of step S304 before starting the process of step S302, determine in step S302, for example, which sound piece database 10 to use based on the determined utterance speed, and retrieve the compressed sound piece data from the sound piece database 10 so determined. The process of step S305 may then be omitted.
[0118]
When the personal computer is caused to perform the functions of a plurality of sound piece databases 10, it is assumed, for example, that each sound piece database 10 is associated with a different, mutually non-overlapping range of utterance speeds, that the set of readings of the compressed sound piece data stored in each sound piece database 10 is common to all of them, and that the compressed sound piece data of a given reading in a sound piece database 10 associated with a certain range of utterance speeds represents speech read out at a lower (slower) utterance speed than the compressed sound piece data of the same reading in a sound piece database 10 associated with a higher (faster) range of utterance speeds.
[0119]
Further, a program for causing the personal computer to perform the functions of the main unit M and the sound piece registration unit R may be uploaded to a bulletin board (BBS) of a communication line and distributed via the communication line. The carrier wave may be modulated with a signal representing these programs, the obtained modulated wave may be transmitted, and a device that receives the modulated wave may demodulate the modulated wave to restore these programs.
The above-described processing can be executed by starting up these programs and executing them under the control of the OS in the same manner as other application programs.
[0120]
When the OS takes over a part of the processing, or when the OS constitutes a part of one component of the present invention, a program excluding that part may be stored in the recording medium. Also in this case, in the present invention, it is assumed that the recording medium stores a program for executing each function or step executed by the computer.
[0121]
[Effects of the Invention]
As described above, according to the present invention, a speech speed conversion device, a speech speed conversion method, and a program for obtaining synthesized speech that is easy to hear even when the environment changes are realized.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a speech synthesis system according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing a data structure of a sound piece database.
FIG. 3A is a diagram showing a data structure of a table used for determining the speaking speed based on the acceleration of the vehicle, and FIG. It is a figure which shows the data structure of the table used in order to determine an utterance speed based on the angular velocity of.
FIG. 4 is a flowchart showing processing when a personal computer that performs the function of the speech synthesis system according to the embodiment of the present invention acquires free text data.
FIG. 5 is a flowchart showing processing when a personal computer that performs the function of the speech synthesis system according to the embodiment of the present invention acquires distribution character string data.
FIG. 6 is a flowchart showing processing when a personal computer that performs the function of the speech synthesis system according to the embodiment of the present invention acquires standard message data and utterance speed data.
[Explanation of symbols]
M Main unit
1 Language processing unit
2 General word dictionary
3 User word dictionary
4 Acoustic processing unit
5 Search unit
6 Decompression unit
7 Waveform database
8 Sound piece editing unit
9 Search unit
10 Sound piece database
11 Speech speed conversion unit
12 Speed detection unit
R Sound piece registration unit
13 Recorded sound piece data set storage unit
14 Sound piece database creation unit
15 Compression unit
HDR header
IDX index part
DIR directory section
DAT data section

Claims

A speech speed conversion device comprising:
voice data acquisition means for acquiring voice data representing a voice waveform;
speech speed setting data generating means for detecting the speed or acceleration of a moving body and generating speech speed setting data for designating the speed of the voice based on the detected speed or acceleration; and
voice data conversion means for converting the speed of the acquired voice data based on the speed represented by the generated speech speed setting data,
wherein the speech speed setting data generating means detects peaks of the acceleration of the moving body, classifies the latest detected peak into one of a plurality of predetermined ranks and stores it, specifies how frequently peaks detected in the past were classified into the same rank as the rank into which the latest peak has been classified, and generates the speech speed setting data based on the specified result.

The speech speed conversion device according to claim 1, wherein the voice data conversion means performs a conversion that resamples the acquired voice data so that the speed of the voice represented by the converted voice data becomes the speed represented by the speech speed setting data.

The speech speed conversion device according to claim 1, wherein the voice data conversion means specifies a portion of the waveform represented by the acquired voice data that substantially represents a silent state and performs a conversion that changes the time length of the specified portion, thereby setting the speed of the voice represented by the voice data to a speed based on the speed represented by the speech speed setting data.

A speech speed conversion device comprising:
voice data storage means for storing a plurality of voice data representing waveforms of a plurality of voices uttering words of the same reading at different speeds;
speech speed setting data generating means for detecting the speed or acceleration of a moving body and generating speech speed setting data for designating the speed of the voice based on the detected speed or acceleration; and
voice data selection means for selecting, from among the voice data stored in the voice data storage means, voice data having a speed closest to the speed represented by the generated speech speed setting data,
wherein the speech speed setting data generating means detects peaks of the acceleration of the moving body, classifies the latest detected peak into one of a plurality of predetermined ranks and stores it in the voice data storage means, specifies how frequently peaks detected in the past were classified into the same rank as the rank into which the latest peak has been classified, and generates the speech speed setting data based on the specified result.

A speech speed conversion method executed by a speech speed conversion device having a storage unit, the method comprising:
a voice data acquisition step of acquiring voice data representing a voice waveform;
a speech speed setting data generation step of detecting the speed or acceleration of a moving body and generating speech speed setting data for designating the speed of the voice based on the detected speed or acceleration; and
a voice data conversion step of converting the speed of the acquired voice data based on the speed represented by the generated speech speed setting data,
wherein the speech speed setting data generation step detects peaks of the acceleration of the moving body, classifies the latest detected peak into one of a plurality of predetermined ranks and stores it in the storage unit, specifies how frequently peaks detected in the past were classified into the same rank as the rank into which the latest peak has been classified, and generates the speech speed setting data based on the specified result.

A speech speed conversion method executed by a speech speed conversion device having a storage unit,
the storage unit storing a plurality of voice data representing waveforms of a plurality of voices uttering words of the same reading at different speeds,
the method comprising:
a speech speed setting data generation step of detecting the speed or acceleration of a moving body and generating speech speed setting data that designates the speed of the voice based on the detected speed or acceleration; and
a voice data selection step of selecting, from among the stored voice data, the voice data having a speed closest to the speed represented by the generated speech speed setting data,
wherein the speech speed setting data generation step detects a peak of the acceleration of the moving body, classifies the latest detected peak into one of a plurality of predetermined ranks, stores it in the storage unit, identifies the frequency at which peaks detected in the past were classified into the same rank as the rank into which the latest peak was classified, and generates the speech speed setting data based on the identified result.
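The voice data selection step amounts to a nearest-value lookup over recordings of the same reading. The sketch below assumes each stored item is a `(speed, waveform)` pair and that speed is a single number in some consistent unit. The function name `select_voice_data`, the pair layout, and the sample values are hypothetical and do not come from the claim.

```python
def select_voice_data(candidates, target_speed):
    """Pick the stored item whose speed is closest to the designated speed.

    candidates: non-empty list of (speed, waveform) pairs for one reading
                (assumed storage layout).
    target_speed: speed represented by the speech speed setting data.
    """
    # On a tie, min() keeps the first candidate in list order.
    return min(candidates, key=lambda item: abs(item[0] - target_speed))


# Hypothetical stored recordings of the same word at three speeds.
stored = [(4.0, "slow.wav"), (6.0, "normal.wav"), (8.0, "fast.wav")]
chosen = select_voice_data(stored, 6.8)  # picks the 6.0 entry
```

Because the selection only picks among pre-recorded variants, no waveform processing is needed at run time. The trade-off is that the achievable speeds are limited to those stored.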

A program for causing a computer equipped with a device for detecting the speed or acceleration of a moving body to function as:
voice data acquisition means for acquiring voice data representing a voice waveform;
speech speed setting data generating means for detecting the speed or acceleration of the moving body and generating speech speed setting data that designates the speed of the voice based on the detected speed or acceleration; and
voice data conversion means for converting the speed of the acquired voice data based on the speed represented by the generated speech speed setting data,
wherein the speech speed setting data generating means detects a peak of the acceleration of the moving body, classifies the latest detected peak into one of a plurality of predetermined ranks and stores it, identifies the frequency at which peaks detected in the past were classified into the same rank as the rank into which the latest peak was classified, and generates the speech speed setting data based on the identified result.

A program for causing a computer equipped with a device for detecting the speed or acceleration of a moving body to function as:
voice data storage means for storing a plurality of voice data representing waveforms of a plurality of voices uttering words of the same reading at different speeds;
speech speed setting data generating means for detecting the speed or acceleration of the moving body and generating speech speed setting data that designates the speed of the voice based on the detected speed or acceleration; and
voice data selection means for selecting, from among the voice data stored in the voice data storage means, the voice data having a speed closest to the speed represented by the generated speech speed setting data,
wherein the speech speed setting data generating means detects a peak of the acceleration of the moving body, classifies the latest detected peak into one of a plurality of predetermined ranks, stores it in the voice data storage means, identifies the frequency at which peaks detected in the past were classified into the same rank as the rank into which the latest peak was classified, and generates the speech speed setting data based on the identified result.