JP4304934B2 - CHORAL SYNTHESIS DEVICE, CHORAL SYNTHESIS METHOD, AND PROGRAM

Info

Publication number: JP4304934B2
Application number: JP2002235039A
Authority: JP
Inventors: 秀紀劔持; ボナダジョルディ
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2002-08-12
Filing date: 2002-08-12
Publication date: 2009-07-29
Anticipated expiration: 2022-08-12
Also published as: JP2004077608A

Abstract

PROBLEM TO BE SOLVED: To provide a chorus synthesis apparatus, a chorus synthesis method, and a program that can synthesize a chorus sound that gives listeners a more natural impression.

SOLUTION: The chorus synthesizing apparatus 100 has a speech sample database 110, which stores three speech sample data groups 110a, 110b, and 110c generated from different speech, and three singing generators 120, 121, and 122. When a chorus sound signal for a musical piece consisting of three parts is synthesized, the singing generators 120, 121, and 122 generate singing sound signals for the respective parts according to lyrics information and melody information under the control of a chorus control part 140, and the generated signals are then added together. For this generation, the singing generators 120, 121, and 122 use phoneme sample data taken from the different speech sample data groups 110a, 110b, and 110c.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、合唱音信号を合成する合唱合成装置、合唱合成方法、および合唱音を合成するためのプログラムに関する。
【０００２】
【従来の技術】
従来より、歌詞情報やメロディ情報に基づいて、歌唱音信号を合成して歌声を発音する合唱合成装置が提案されている。このように歌唱音信号を合成する装置としては、規則音声合成技術を応用した装置等の種々の装置が提案されている。規則合成技術を応用した歌唱合成装置では、予め発声者が発した音声から、音素や複数の音素を含む音素連鎖を単位とする音声試料データを作成してデータベースに記憶しておく。そして、歌詞情報にしたがって必要となる音素等の音声試料データを読み出して接続することにより歌唱音信号を合成している。
【０００３】
ところで、上記のような歌唱音を合成する歌唱音合成装置では、文章読み上げ装置等の音声合成装置と異なり、斉唱や重唱といった合唱時の歌唱音を電子的に出力するといった利用形態も考えられる。したがって、合唱時の歌唱音（合唱音）を合成する機能を備えた合唱合成装置の開発も行われている。
【０００４】
このような合唱時の合唱音信号を合成する機能を備えた合唱合成装置は、複数のパートの各々に基づいて、音声試料データを読み出して接続することにより合唱音信号を生成する。そして、各々のパートについて生成した歌唱音信号を重ね合わせて出力することにより、合唱音を電子的に出力することができるようになっている。
【０００５】
【発明が解決しようとする課題】
しかし、従来の合唱音信号を合成する機能を備えた合唱合成装置では、各パート毎に歌詞情報やメロディ情報にしたがって歌唱音信号を生成する際に、同一の音声試料データを用いているため、各パート毎に生成された歌唱音はメロディが異なっているものの、生成された各パート毎の音声波形の微細な特徴（ピッチのゆらぎ等）は基本的に同一となってしまう。したがって、これらを重ね合わせた合唱音は、聴取者にとって不自然な合唱音に聴こえてしまう。これは、各パート間の相関関係（微細な特徴が一致する）を聴取者が聴き取ってしまい、不自然な印象を与えているものと考えられる。
【０００６】
また、斉唱時の合唱音信号を合成する場合には、上記のように各パート毎に単純に歌唱音信号を生成して重ね合わせる手法では、全く同じ歌唱音が重ねられて出力されてしまい、この結果聴取者に不自然な印象を与えてしまうことになる。そこで、従来の合唱音合成装置において、斉唱時の合唱音信号を合成する場合には、各パート（内容は同一）毎に生成した歌唱音の発音タイミングを若干ずらしたり、各パート毎に生成した歌唱音のピッチを若干ずらしたりすることにより、全く同一の歌唱音が重ねられて発音されてしまうことを防止していた。しかしながら、発音タイミングやピッチを若干ずらした場合にも、上記のように各パート毎に生成された音声波形の微細な特徴（ゆらぎ等）は基本的に同一となってしまう。したがって、これらを重ね合わせた合唱音は、上記と同様、聴取者にとって不自然な合唱音に聴こえてしまう。
【０００７】
また、特開平７−１４６６９５号公報には、合唱音信号を生成する装置が開示されており、この装置では、各パート毎に歌唱音信号を生成する際に、各パート毎に異なるピッチのゆらぎ成分を付与した歌唱音信号を生成している。このように各パート毎に異なるピッチのゆらぎ成分を付与した歌唱音信号を重ねて出力することにより、各パート間の相関関係を小さくすることができる。しかしながら、この公報に記載された装置において、各パート毎の歌唱音信号に付与されるピッチ成分は、人の音声を基にしたものではなく、人工的に作られたものであるため、各パート間の相関関係は小さくなるものの、合成された合唱音が不自然に聴こえてしまうことがある。
【０００８】
本発明は、上記の事情を考慮してなされたものであり、より自然な印象を聴取者に与えることが可能な合唱音を合成することができる合唱合成装置、合唱合成方法およびプログラムを提供することを目的とする。
【０００９】
【課題を解決するための手段】
上記課題を解決するため、本発明に係る合唱合成装置は、楽曲データに基づいて合唱音信号を合成する合唱合成装置であって、複数の音声試料データからなる音声試料データ群であって複数の異なる音声に基づいて各々作成された前記音声試料データ群を音域毎に記憶するデータベースと、前記楽曲データにしたがって歌唱音信号を生成する手段であって、必要となる前記音声試料データを前記データベースから読み出して当該歌唱音信号の生成に用いる複数の歌唱生成手段と、前記複数の歌唱生成手段で生成された歌唱音信号から合唱音信号を合成する歌唱合成手段とを具備し、前記楽曲データが複数のパートからなり、前記複数の歌唱生成手段の各々が各前記パートに対応する歌唱音信号を生成する際に、少なくとも２つの前記歌唱生成手段の各々は、前記歌唱生成手段の各々のパートに対応する音域に応じた音声試料データ群に含まれる前記音声試料データを前記データベースから読み出して前記歌唱音信号の生成に用いることを特徴としている。
【００１０】
この構成によれば、各歌唱生成手段が対応するパートの歌唱音信号を生成する際に、少なくとも２つの歌唱生成手段が異なる音声に基づいて作成した音声試料データを用いることになる。ここで、異なる音声に基づいて作成した音声試料データは、微細な特徴等が異なっているため、上記少なくとも２つの歌唱生成手段から出力される歌唱音信号は微細な特徴が異なったものとなる。したがって、各パートに応じた歌唱音として、固有の特徴を有する歌唱音が放音されるので、聴取者に対してより自然な印象を与えることができる。
【００１２】
この構成によれば、各歌唱生成手段が対応するパートの歌唱音信号を生成する際に、少なくとも２つの歌唱生成手段が音声試料データの異なる時間に対応する部分から使用を開始して生成を行うことになる。ここで、音声に基づいて作成されたある時間長を有する音声試料データは、その時間長の間微細な特徴（音声波形のゆらぎ）が一定ではなく、時間によって微細な特徴等が異なっている。このため、上記少なくとも２つの歌唱生成手段から出力される歌唱音信号は微細な特徴が異なったものとなる。したがって、各パートに応じた歌唱音として、固有の特徴を有する歌唱音が放音されるので、聴取者に対してより自然な印象を与えることができる。
【００１３】
また、本発明に係る合唱合成方法は、楽曲データに基づいて生成された複数の歌唱音信号から合唱音信号を合成する合唱合成方法であって、複数のパートからなる前記楽曲データにしたがって前記複数のパートに対応する歌唱音信号を生成する際には、複数の音声試料データからなる音声試料データ群であって複数の異なる音声に基づいて各々作成された音声試料データ群を音域毎に記憶するデータベースから必要となる前記音声試料データを読み出し、少なくとも２つの前記パートに対応する歌唱音信号の生成には、該パート毎に、各々のパートに対応する音域に応じた音声試料データ群に含まれる前記音声試料データを前記データベースから読み出して前記歌唱音信号の生成に用いることを特徴としている。
【００１４】
また、本発明の別の態様の合唱合成方法は、楽曲データに基づいて生成された複数の歌唱音信号から合唱音信号を合成する合唱合成方法であって、複数のパートからなる前記楽曲データにしたがって前記複数のパートに対応する歌唱音信号を生成する際には、音声に基づいて作成された所定の時間長を有する音声試料データを記憶するデータベースから必要となる前記音声試料データを読み出し、少なくとも２つの前記パートに対応する歌唱音信号の生成には、前記データベースから読み出した前記音声試料データの異なる時間に対応する部分から使用を開始して前記歌唱音信号を生成することを特徴としている。
【００１５】
また、本発明に係るプログラムは、コンピュータを、楽曲データにしたがって、複数の音声試料データからなる音声試料データ群であって複数の異なる音声に基づいて各々作成された音声試料データ群を音域毎に記憶するデータベースから必要となる前記音声試料データを読み出して歌唱音信号を生成する手段であって、前記楽曲データが複数のパートからなり、前記複数のパートに対応する歌唱音信号を生成する場合には、少なくとも２つの前記パートに対応する歌唱音信号の生成の際に、該パート毎に各々のパートに対応する音域に応じた音声試料データ群に含まれる前記音声試料データを前記データベースから読み出して前記歌唱音信号の生成に用いる歌唱音生成手段と、前記生成された歌唱音信号から合唱音信号を合成する歌唱合成手段として機能させることを特徴としている。
【００１６】
また、本発明の別の態様のプログラムは、コンピュータを、楽曲データにしたがって、音声に基づいて作成された所定の時間長を有する音声試料データを記憶するデータベースから必要となる前記音声試料データを読み出して歌唱音信号を生成する手段であって、前記楽曲データが複数のパートからなり、前記複数のパートに対応する歌唱音信号を生成する場合には、少なくとも２つの前記パートに対応する歌唱音信号の生成する際に、前記データベースから読み出した前記音声試料データの異なる時間に対応する部分から使用を開始して前記歌唱音信号を生成する歌唱音生成手段と、前記生成された歌唱音信号から合唱音信号を合成する歌唱合成手段として機能させることを特徴としている。
【００１７】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態について説明する。
Ａ．第１実施形態
Ａ−１．第１実施形態の基本構成
まず、図１は本発明の第１実施形態に係る合唱合成装置の基本構成を示すブロック図である。同図に示すように、この合唱合成装置１００は、音声試料データベース１１０と、複数（図示の例では３つ）の歌唱生成器１２０，１２１，１２２と、合唱制御部１４０と、歌唱生成器１２０，１２１，１２２の各々が出力する歌唱音信号を加算して合成し、出力する加算器１３０とを備えている。
【００１８】
音声試料データベース１１０には、人が発声した自然の音声に基づいて作成された音声試料データが記憶されている。この音声試料データベース１１０には、単一の音素または複数の音素で構成される音素連鎖を１つの単位とする音声試料データ（以下、音声素片試料データという）が記憶されている。
【００１９】
多数の短時間長の音声試料データをデータベースに蓄積しておいて、歌詞等に応じてこれらの音声試料データを接続して音声合成処理技術では、合成単位として音素が用いられるのが基本である。このため、この合唱合成装置１００における音声試料データベース１１０に、音素（３０〜５０種類程度）単位のみの音声素片試料データを蓄積するようにしてもよいが、音素間の結合規則は複雑であるため、音素単位のみの音声試料データを蓄積した場合には、良好な品質を得ることが難しい。したがって、音声試料データベース１１０には、音素単位のみの音声素片試料データに加え、音素よりもやや大きい単位（音素連鎖）の音声素片試料データも蓄積しておくことが好ましい。音素よりも大きい単位としては、ＣＶ（子音→母音）、ＶＣ（母音→子音）、ＶＣＶ（母音→子音→母音）、ＣＶＣ（子音→母音→子音）といった単位がある。これらの単位の音声素片試料データを全て蓄積しておくことも考えられるが、合唱音を合成する合唱合成装置１００においては、歌唱において使用頻度の高い母音など長く発音する伸ばし音を１単位とした音声素片試料データ、子音から母音（ＣＶ）および母音から子音（ＶＣ）を１単位とした音声素片試料データ、子音から子音を１単位としたの音声素片試料データ、および母音から母音を１単位とした音声素片試料データを蓄積しておくようにすればよい。
【００２０】
音声試料データベース１１０には、上述したような音素あるいは音素連鎖を１単位とした音声素片試料データが格納されているが、この音声試料データベース１１０では、同一種類の音素（例えば「a」）あるいは音素連鎖（例えば、「ai」）について３つの音声素片試料データを記憶している。すなわち、音声試料データベース１１０には、音素あるいは音素連鎖を１単位とした所定数の単位音声素片試料データからなる３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃが記憶されているのである。
【００２１】
音声試料データベース１１０に記憶されている３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃは、各々異なる音声に基づいて作成されたデータである。ここで、異なる音声とは、発声者が異なることのみを意味するわけではなく、同じ発声者であっても別の機会に発した音声や異なる発声部分を用いたものであってもよい。このように音声試料データ群１１０ａ，１１０ｂ，１１０ｃは、別の発声者または同じ発声者であっても別の機会に発した音声や別の発声部分に基づいて作成されているのである。このように各音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれる同一の音素（あるいは音素連鎖）についてのデータは、各々のデータを作成するために使用した基となる音声が異なっているため、微細な特徴（ピッチのゆらぎ等）が異なったものとなっている。
【００２２】
音声試料データベース１１０には、上述したような３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃが記憶されており、各歌唱生成器１２０，１２１，１２２は、歌唱音信号を生成する際にこの音声試料データベース１１０から音声素片試料データを読み出して用いることになる。
【００２３】
歌唱生成器１２０，１２１，１２２の各々は、歌詞情報およびメロディ情報を有する楽曲情報にしたがって、音声試料データベース１１０から必要となる音声素片試料データを読み出し、読み出した音声素片試料データを用いて歌唱音信号を生成する。
【００２４】
より具体的には、歌唱生成器１２０，１２１，１２２の各々は、歌詞情報にしたがって音素列を求め、その音素列を構成するために必要な音声素片試料データを決定し、音声試料データベース１１０から読み出す。そして、読み出した音声素片試料データを時系列に接続し、接続した音声素片試料データをメロディ情報にしたがったピッチに応じて適宜調整し、歌唱音信号を生成するのである。
【００２５】
本実施形態に係る合唱合成装置１００は、歌詞情報およびメロディ情報にしたがって歌唱音信号を生成することができる３つの歌唱生成器１２０，１２１，１２２を備えており、これにより３つのパートからなる合唱曲の楽曲情報（歌詞情報およびメロディ情報）にしたがって、この合唱曲に対応した合唱音信号を合成することができるようになっている。
【００２６】
合唱制御部１４０は、当該合唱合成装置１００において、合唱曲の楽曲情報に基づいて合唱音に対応した合唱音信号を合成する際に、楽曲情報を各パート毎に分割して各歌唱生成器１２０，１２１，１２２に出力する。これにより、３つのパートからなる楽曲情報にしたがって合唱音信号を合成する場合には、各歌唱生成器１２０，１２１，１２２が合唱制御部１４０から供給される各々のパートの歌詞情報およびメロディ情報にしたがって歌唱音信号を生成し、各歌唱生成器１２０，１２１，１２２の各々が生成した歌唱音信号が加算器１３０に出力される。これにより、加算器１３０からは３つのパートからなる合唱曲の楽曲情報にしたがって、この合唱曲に対応した合唱音信号を合成することができるのである。
【００２７】
また、この合唱合成装置１００において、上記のように合唱音信号を合成する際には、合唱制御部１４０は、各歌唱生成器１２０，１２１，１２２の各々が、音声試料データベース１１０に記憶されている音声試料データ群１１０ａ，１１０ｂ，１１０ｃのうち、どの音声試料データ群から音声素片試料データを読み出して用いるかを指定する指定情報を歌唱生成器１２０，１２１，１２２に出力する。ここで、合唱制御部１４０は、各歌唱生成器１２０，１２１，１２２が互いに異なる音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれる音声素片試料データを用いて歌唱音信号を生成するように、各歌唱生成器１２０，１２１，１２２に異なるデータ群を指定する指定情報を出力する。
【００２８】
具体的に例示すると、歌唱生成器１２０に対しては音声試料データ群１１０ａを指定する指定情報を出力し、歌唱生成器１２１に対しては音声試料データ群１１０ｂを指定する指定情報を出力し、歌唱生成器１２２に対しては音声試料データ群１１０ｃを指定する指定情報を出力するといった具合である。このような指定情報が合唱制御部１４０から供給されると、歌唱生成器１２０は音声試料データ群１１０ａに含まれる音声素片試料データを読み出して歌唱音信号の生成に用い、歌唱生成器１２１は音声試料データ群１１０ｂに含まれる音声素片試料データを読み出して歌唱音信号の生成に用い、歌唱生成器１２２は音声試料データ群１１０ｃに含まれる音声素片試料データを用いて歌唱音信号を生成することになる。
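The assignment described in paragraphs [0027] and [0028] (each singing generator is told to use a different voice sample data group) can be pictured with the minimal Python sketch below. The function name `assign_sample_groups` and the use of the reference numerals as dictionary keys are illustrative assumptions, not part of the disclosure.

```python
def assign_sample_groups(generator_ids, group_ids):
    """Pair each singing generator with its own voice sample data group.

    Hypothetical helper: the patent only states that the chorus control
    part outputs designation information naming a different group for
    each generator; how that information is encoded is not specified.
    """
    if len(group_ids) < len(generator_ids):
        raise ValueError("need at least one data group per generator")
    return dict(zip(generator_ids, group_ids))


# Generators 120, 121, 122 are directed to groups 110a, 110b, 110c.
designation = assign_sample_groups([120, 121, 122], ["110a", "110b", "110c"])
```

Because every generator receives a distinct group, the fine features of the three generated signals come from three different source voices.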
【００２９】
合唱合成装置１００において、３つのパートからなる合唱曲の歌唱音信号を合成する場合に、上述したように各歌唱生成器１２０，１２１，１２２が互いに異なる音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれる音声素片試料データを用いることにより、より自然な印象を聴取者に与えることが可能な合唱音信号を合成することができる。すなわち、音声試料データベース１１０に記憶されている音声試料データ群１１０ａ，１１０ｂ，１１０ｃは、各々異なる音声に基づいて作成されたものであり、同一種類の音素や音素連鎖についてのデータであっても、各音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれるデータに示される音声の微細な特徴（ピッチのゆらぎ等）は異なっている。このように微細な特徴が異なっている音声素片試料データが含まれる音声試料データ群１１０ａ，１１０ｂ，１１０ｃのうち、各歌唱生成器１２０，１２１，１２２が異なる音声試料データ群に含まれる音声素片試料データを用いて歌唱音信号を生成することにより、各歌唱生成器１２０，１２１，１２２によって生成される歌唱音信号は、互いに微細な特徴が異なるものとなっている。したがって、これらを重ね合わせた合唱音は、各パート間の相関関係がほとんどない固有の特徴を有するものとなり、聴取者に不自然な印象を与えてしまうことを低減することができる。
【００３０】
また、音声試料データベース１１０に記憶されている音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれる音声素片試料データは、各々人が発声した音声に基づいて作成されたデータである。したがって、各音声試料データ群に含まれる音声素片試料データに示される音声の微細な特徴の相違は、予め用意されたピッチのゆらぎを付与するといった人工的に作り出されたものではない。したがって、合成された合唱音が不自然なものとなってしまうことを低減することができる。
【００３１】
Ａ−２．合唱合成装置の具体的な構成
以上説明したのが本実施形態に係る合唱合成装置１００の基本的な構成である。この合唱合成装置１００においては、歌唱生成器１２０，１２１，１２２として、歌詞情報にしたがって音声試料データベース１１０から音声素片試料データを読み出して接続し、メロディ情報にしたがったピッチに応じて接続した音声素片試料データを調整して歌唱音信号を出力するといった歌唱生成器であれば、規則音声合成技術等を応用した歌唱生成器等の公知の種々の歌唱生成器を用いることができ、音声試料データベース１１０には採用する歌唱生成器に対応した音声素片試料データを記憶させておけばよい。以下においては、歌唱生成器１２０，１２１，１２２として、米国特許第５０２９５０９号や特許第２９０６９７０号において提案されているスペクトルモデリング合成（ＳＭＳ：Spectral Modeling Synthesis）技術を利用した歌唱生成器を適用した場合を例に挙げて、合唱合成装置１００について具体的に説明する。
【００３２】
まず、ＳＭＳ技術を利用した歌唱生成器１２０，１２１，１２２を備えた歌唱合成装置１００における音声試料データベース１１０の作成手法について説明する。
【００３３】
上述したように、この合唱合成装置１００における音声試料データベース１１０には、発声者の発した音声に基づいて作成された音声素片試料データが記憶されている。ＳＭＳ技術は、オリジナルの音を２つの成分、すなわち調和成分（deterministic component）と、非調和成分（stochastic component）で表すモデルを使用して楽音の分析および合成を行う技術であり、ＳＭＳ技術を利用した音声合成においては、音素あるいは音素連鎖といった１単位の音声素片試料データとして、上記調和成分および非調和成分からなるデータが音声合成に用いられる。したがって、ＳＭＳ技術を利用した合唱合成装置１００においては、音声試料データベース１１０に、発声者の発した音声をＳＭＳ分析することにより得られた調和成分および非調和成分を示すデータが１つの音声素片試料データとして記憶される。以下、図２を参照しながら、音声試料データベース１１０の作成手法について説明する。
【００３４】
同図に示すように、音声試料データベース１１０の作成のために発声者が発した音声は、ＳＭＳ分析部２００に入力され、ＳＭＳ分析部２００においてＳＭＳ分析される。ここで、音声試料データベース１１０には、異なる音声に基づいて作成した音声試料データ群１１０ａ，１１０ｂ，１１０ｃを記憶しておく必要があるため、３つの異なる音声がＳＭＳ分析部２００に入力されることになる。なお、図示では、３つの異なる音声が並列にＳＭＳ分析部２００に入力されるように表されているが、各音声についてのＳＭＳ分析は同時に並列して行う必要はなく、個別に行うようにしてもよい。
【００３５】
ＳＭＳ分析部２００は、入力される音声に対してＳＭＳ分析を行い、各フレーム毎のＳＭＳ分析データを出力する。より具体的には、以下の手法により各フレーム毎のＳＭＳ分析データを出力する。
【００３６】
まず、入力される音声を一連のフレームに分ける。ここで、ＳＭＳ分析に用いるフレーム周期としては、一定の固定長であってもよいし、入力音声のピッチ等に応じてその周期を変更する可変長の周期であってもよい。
【００３７】
次に、フレームに分けた音声に対して高速フーリエ変換（ＦＦＴ：Fast Fourier Transform）等の周波数分析を行う。この周波数分析によって得られた周波数スペクトル（複素スペクトル）から振幅スペクトルと位相スペクトルを求め、振幅スペクトルのピークに対応する特定の周波数のスペクトルを線スペクトルとして抽出する。このとき、基本周波数およびその整数倍の周波数近傍の周波数を持つスペクトルを線スペクトルとする。このようにして抽出した線スペクトルが上述した調和成分に対応している。
【００３８】
次に、上記のように入力音声から線スペクトルを抽出するとともに、抽出した線スペクトルをそのフレームの入力音声（ＦＦＴ後の波形）から減算することにより、残差スペクトルを得る。あるいは、抽出した線スペクトルから合成した調和成分の時間波形データをそのフレームの入力音声波形データから減算して残差成分の時間波形データを取得した後、これに対してＦＦＴ等の周波数分析を行うことにより残差スペクトルを得るようにしてもよい。このようにして得られた残差スペクトルが上述した非調和成分に対応している。
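The analysis steps in paragraphs [0036] to [0038] (frame the voice, take a spectrum, keep the peaks near multiples of the fundamental as the harmonic component, and treat what remains as the residual) can be sketched as follows. This is a simplified illustration only: it uses a naive DFT instead of an FFT, assumes the fundamental is already known as a bin index (`f0_bin`), and removes whole bins rather than fitted spectral lines. All names are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive DFT of a real frame, returning bins 0..N/2 (stand-in for an FFT)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def split_harmonic_residual(frame, f0_bin, tolerance=1):
    """Split one frame's spectrum into line (harmonic) and residual spectra.

    For each integer multiple of f0_bin, the strongest bin within
    +/- tolerance is taken as a spectral line and moved from the
    residual spectrum to the harmonic spectrum.
    """
    spectrum = dft(frame)
    harmonic = [0j] * len(spectrum)
    residual = list(spectrum)
    multiple = f0_bin
    while multiple < len(spectrum):
        lo = max(multiple - tolerance, 1)
        hi = min(multiple + tolerance, len(spectrum) - 1)
        peak = max(range(lo, hi + 1), key=lambda k: abs(spectrum[k]))
        harmonic[peak] = spectrum[peak]
        residual[peak] = 0j  # subtracting the line leaves nothing in that bin
        multiple += f0_bin
    return harmonic, residual


# Test frame: two harmonics (bins 4 and 8) plus a weak inharmonic partial (bin 6).
n = 32
frame = [
    math.cos(2 * math.pi * 4 * t / n)
    + 0.5 * math.cos(2 * math.pi * 8 * t / n)
    + 0.1 * math.cos(2 * math.pi * 6 * t / n)
    for t in range(n)
]
harmonic, residual = split_harmonic_residual(frame, f0_bin=4)
```

In this toy frame the partials at bins 4 and 8 end up in `harmonic`, while the component at bin 6 stays in `residual`.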
【００３９】
ＳＭＳ分析部２００は、上記のようにして取得した線スペクトル（調和成分）および残差スペクトル（非調和成分）からなる各フレーム毎のＳＭＳ分析データを区間切り出し部２０１に出力する。
【００４０】
区間切り出し部２０１は、ＳＭＳ分析部２００から供給される各フレーム毎のＳＭＳ分析データを、音声試料データベース１１０に記憶すべき音声素片試料データの１単位（音素あるいは音素連鎖）の長さに対応するように切り出す。区間切り出し部２０１は、各素片の単位長さに対応するようにＳＭＳ分析データを切り出し、音声試料データベース１１０に記憶させる。
【００４１】
ここで、音声試料データベース１１０に記憶される音声素片試料データは、音素あるいは音素連鎖毎に切り出されたＳＭＳデータであり、調和成分については、その音素あるいは音素連鎖に含まれるフレーム全てのスペクトル包絡（線スペクトル（倍音系列）の強度（振幅）および位相のスペクトル）が記憶される。なお、このようなスペクトル包絡そのものを調和成分として記憶させるようにしてもよいが、該スペクトル包絡を何らかの関数で表現したものを記憶させるようにしてもよいし、調和成分を逆ＦＦＴ等して得た時間波形として記憶させるようにしてもよい。本実施形態では、非調和成分についても調和成分と同様に、強度スペクトルと位相スペクトルとして記憶させることとするが、上記調和成分と同様、関数や時間波形として記憶させるようにしてもよい。
【００４２】
このような音声に対するＳＭＳ分析および区間切り出しが３つの異なる入力音声の各々について行われ、この結果、音声試料データ群１１０ａ，１１０ｂ，１１０ｃといった３つの異なる音声に基づいて作成された音声素片試料データ（音素あるいは音素連鎖毎のＳＭＳ分析データ）の群が音声試料データベース１１０に記憶される。
【００４３】
以上が本実施形態に係る合唱合成装置１００の音声試料データベース１１０の作成手法の詳細である。
【００４４】
次に、上述したように異なる音声に基づいて作成された３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃを記憶する音声試料データベース１１０を用いて歌唱音信号を生成する各歌唱生成器１２０，１２１，１２２について説明する。なお、歌唱生成器１２０，１２１，１２２は、各々同様の構成であるため、以下においては歌唱生成器１２０の構成について図３を参照しながら説明し、他の歌唱生成器１２１，１２２についての説明を割愛する。
【００４５】
同図に示すように、この歌唱生成器１２０は、音声素片選択部３０１と、ピッチ決定部３０２と、継続時間長調整部３０３と、音声素片接続部３０４と、調和成分生成部３０５と、加算部３０６と、逆ＦＦＴ（高速フーリエ変換）部３０７と、窓掛け部３０８と、オーバーラップ部３０９とを備えている。
【００４６】
音声素片選択部３０１は、合唱制御部１４０（図１参照）から供給される歌詞情報および指定情報にしたがって、必要となる音声素片試料データを音声試料データベース１１０から読み出す。より具体的には、供給される歌詞情報を音声記号（音素あるいは音素連鎖）列に変換し、変換した音声記号列にしたがって音声試料データベース１１０から音声素片試料データを読み出す。例えば、「サイタ」（saita）といった歌詞情報にしたがって歌唱音信号を生成する場合には、該歌詞情報が「#s」、「s」、「sa」、「a」、「ai」、「i」、「it」、「t」、「ta」、「a」、「a#」といった音声記号列に変換され、これらの各音声記号に対応する音声素片試料データが音声試料データベース１１０から読み出されることになる。
【００４７】
音声素片選択部３０１は、上記のように歌詞情報にしたがって読み出すべき音声素片試料データを決定し、合唱制御部１４０から供給される指定情報に指定される音声試料データ群の中から決定した音声素片試料データを読み出す。例えば、指定情報が音声試料データ群１１０ａを指定している場合には、音声試料データベース１１０の音声試料データ群１１０ａに含まれる「#s」、「s」、「sa」、「a」、「ai」、「i」、「it」、「t」、「ta」、「a」、「a#」に対応した音声素片試料データを読み出す。
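The conversion of the lyric "saita" into the phonetic symbol sequence listed in paragraphs [0046] and [0047] follows a regular pattern: a leading boundary unit, then each phoneme followed by its transition to the next phoneme, then a trailing boundary unit. A minimal sketch of that pattern is shown below. The function name and the assumption that the lyric has already been split into phonemes are illustrative, not taken from the patent.

```python
def to_unit_sequence(phonemes, boundary="#"):
    """Expand a phoneme list into phoneme and phoneme-chain unit names.

    "#" marks silence at the start and end, as in the "#s" and "a#"
    units of the patent's example.
    """
    if not phonemes:
        return []
    units = [boundary + phonemes[0]]
    for current, following in zip(phonemes, phonemes[1:]):
        units.append(current)              # sustained single phoneme
        units.append(current + following)  # transition (phoneme chain)
    units.append(phonemes[-1])
    units.append(phonemes[-1] + boundary)
    return units


units = to_unit_sequence(["s", "a", "i", "t", "a"])
```

Each name in `units` would then be looked up in whichever voice sample data group the designation information selects.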
【００４８】
ピッチ決定部３０２は、合唱制御部１４０（図１参照）から供給されるメロディ情報に応じて歌唱音のピッチを決定し、決定したピッチを示すピッチ情報を調和成分生成部３０５に出力する。
【００４９】
継続時間長調整部３０３には、音声素片選択部３０１によって読み出された音声素片試料データ（調和成分および非調和成分）が供給される。ここで、音声素片選択部３０１は、読み出した音声素片試料データをそのまま継続時間長調整部３０３に供給するようにしてもよいが、メロディ情報に示されるピッチ等に応じて適当な補正処理を施してから継続時間長調整部３０３に供給するようにしてもよい。
【００５０】
継続時間長調整部３０３は、メロディ情報等によって決定される音素あるいは音素連鎖毎の発音時間長に応じて音声素片選択部３０１から供給された各音声素片試料データの時間長を変更する処理を行う。より具体的には、ある音声素片試料データを、その時間長より短い時間として使用する場合には、該音声素片試料データからフレームを間引く処理を行う。一方、ある音声素片試料データを、その時間長よりも長い時間継続して使用する場合には、その音声素片試料データを使用する時間長の間繰り返して時間を長くするループ処理を行う。このループ処理において、ある音声素片試料データを繰り返す場合には、当該音声素片試料データの最初から最後（０〜ｔ）までのデータの後に、当該音声素片試料データｊを最初（０）からデータを接続して繰り返すようにしてもよいし、最初から最後（０〜ｔ）までのデータの後に、当該音声素片試料データの時間的に最後（ｔ）の部分から最初の部分に向かってデータを接続して繰り返すようにしてもよい。
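The duration adjustment in paragraph [0050] (thin out frames to shorten a unit, loop it to lengthen it, either restarting from the first frame or running back from the last frame) might look like the sketch below. The even-spacing rule for thinning and the choice not to repeat the end frames when reversing are assumptions; the patent does not fix either detail.

```python
def adjust_duration(frames, target_len, ping_pong=False):
    """Return target_len frames taken from `frames`.

    Shortening: keep evenly spaced frames (frame thinning).
    Lengthening: loop the unit, either restarting at frame 0 or,
    with ping_pong=True, walking back from the end toward the start.
    """
    n = len(frames)
    if n == 0 or target_len <= 0:
        return []
    if target_len == 1:
        return [frames[0]]
    if target_len <= n:
        # Integer arithmetic keeps the first and last frame and thins the rest.
        return [frames[(i * (n - 1)) // (target_len - 1)] for i in range(target_len)]
    if ping_pong and n > 1:
        cycle = list(range(n)) + list(range(n - 2, 0, -1))
    else:
        cycle = list(range(n))
    return [frames[cycle[i % len(cycle)]] for i in range(target_len)]


shortened = adjust_duration([0, 1, 2, 3], 2)
looped = adjust_duration([0, 1, 2, 3], 6)
reversed_loop = adjust_duration([0, 1, 2, 3], 8, ping_pong=True)
```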
【００５１】
継続時間長調整部３０３は、上記のように各音声素片の発音時間長に応じて音声素片試料データ（調和成分および非調和成分）の継続時間長を調整した後、時間調整後の音声素片試料データを音声素片接続部３０４に出力する。
【００５２】
音声素片接続部３０４は、継続時間長調整部３０３から供給された音声素片試料データの調和成分のデータを時系列に接続するとともに、非調和成分のデータを時系列に接続する。このような接続に際し、接続する２つの調和成分のスペクトル包絡の形状の差が大きい場合には、スムージング処理等を施すようにすればよい。音声素片接続部３０４は、接続した調和成分のデータを調和成分生成部３０５に出力するとともに、接続した非調和成分のデータを加算部３０６に出力する。
【００５３】
調和成分生成部３０５には、音声素片接続部３０４から調和成分のデータ（スペクトル包絡情報）が供給されるとともに、ピッチ決定部３０２からメロディ情報にしたがったピッチ情報が供給される。調和成分生成部３０５は、音声素片接続部３０４からのスペクトル包絡情報に示されるスペクトル包絡形状を維持しつつ、ピッチ決定部３０２からのピッチ情報に対応する倍音成分を生成する。
【００５４】
加算部３０６には、音声素片接続部３０４からの非調和成分のデータと、調和成分生成部３０５からの調和成分のデータが供給され、加算部３０６は両者を合成して逆ＦＦＴ部３０７に出力する。逆ＦＦＴ部３０７は、加算部３０６から供給される加算された周波数領域の信号に対し、逆ＦＦＴを施すことにより時間領域の波形信号に変換し、変換後の波形信号を窓掛け部３０８に出力する。窓掛け部３０８では、時間領域の波形信号に対してフレーム長に対応した窓関数が乗算され、オーバーラップ部３０９が乗算後の波形信号をオーバーラップさせながら歌唱音信号を生成する。このようにして歌唱生成器１２０では、合唱制御部１４０（図１参照）から供給された楽曲情報のあるパートの歌詞情報およびメロディ情報にしたがった歌唱音信号が生成され、生成された歌唱音信号が加算器１３０（図１参照）に出力される。
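The last two stages in paragraph [0054], windowing each time-domain frame and overlapping the windowed frames, correspond to ordinary overlap-add synthesis. The sketch below assumes a periodic Hann window and a hop of half the frame length; the patent names neither the window function nor the hop size, so both are illustrative choices.

```python
import math


def hann(n):
    """Periodic Hann window; copies shifted by n/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def overlap_add(frames, hop):
    """Window each equal-length frame and sum the frames at `hop` spacing."""
    frame_len = len(frames[0])
    window = hann(frame_len)
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for index, frame in enumerate(frames):
        start = index * hop
        for i, sample in enumerate(frame):
            out[start + i] += window[i] * sample
    return out


# Four constant frames of length 8 with 50 % overlap.
signal = overlap_add([[1.0] * 8 for _ in range(4)], hop=4)
```

With this window and hop, a constant input is reconstructed as a constant away from the two edges, which is why the combination is a common default.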
【００５５】
以上が歌唱生成器１２０の詳細な構成であり、図１に示す他の歌唱生成器１２１，１２２（歌唱生成器１２０と同様の構成）からも上記のように合唱制御部１４０から供給された楽曲情報のあるパートの歌詞情報およびメロディ情報にしたがって生成された歌唱音信号が出力される。ここで、上述したように各歌唱生成器１２０，１２１，１２２は、合唱制御部１４０から振り分けられたパートに対応する歌唱音信号を生成する際に、各々異なる音声試料データ群１１０ａ，１１０ｂ，１１０ｃから音声素片試料データを読み出して生成に用いているので、各々が生成する歌唱音信号の微細な特徴（ピッチのゆらぎ等）は異なったものとなる。
【００５６】
加算器１３０は、このように合唱曲の楽曲情報の各パートにしたがって歌唱生成器１２０，１２１，１２２が生成した歌唱音信号を合成して出力する。加算器１３０から出力された３つのパートの歌唱音信号が合成された合唱音信号は、図示せぬＤ／Ａ（Digital to Analog）変換器によってアナログの音声波形信号に変換された後、アンプ等を介してスピーカから放音される。これにより、聴取者は、複数パートからなる合唱曲の楽曲情報にしたがった合唱音を聴くことができる。この合唱合成装置１００から放音される合唱音は、各パートの歌唱音の微細な特徴（ピッチのゆらぎ等の相違に起因する声質等）が相違しており、聴取者により自然な印象を与えることが可能な合唱音を発音することができるのである。
【００５７】
Ｂ．第２実施形態
次に、本発明の第２実施形態に係る合唱合成装置について、図４を参照しながら説明する。同図に示すように、上記第１実施形態における合唱合成装置１００の音声試料データベース１１０が３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃを記憶していたのに対し、第２実施形態に係る合唱合成装置４００における音声試料データベース１１０には、同一の音素または音素連鎖については１種類の音声素片試料データしか記憶されていない点で相違している。第２実施形態に係る合唱合成装置４００は、このように１つの音素または音素連鎖について１つの音声素片試料データのみを記憶する音声試料データベース１１０を用いて、上記第１実施形態と同様により自然な印象を与えることが可能な合唱音信号を合成することができるようになっている。以下、合唱合成装置４００の構成について、上記第１実施形態に係る合唱合成装置１００との相違点を中心に説明する。
【００５８】
合唱合成装置４００における歌唱生成器１２０，１２１，１２２の各々は、上記第１実施形態と同様であり、歌詞情報およびメロディ情報を有する楽曲情報にしたがって音声試料データベース１１０から必要となる音声素片試料データを読み出し、読み出した音声素片試料データを用いて歌唱音信号を生成する。第２実施形態においては、音声試料データベース１１０には１つの音素あるいは音素連鎖については１つの音声素片試料データしか記憶されていないため、合唱曲の楽曲情報にしたがって歌唱音信号を生成する際には、各歌唱生成器１２０，１２１，１２２が同一の音声素片試料データを用いることもあり得る。上述したように複数のパートの歌唱音信号を同一の音声素片試料データを用いて生成した場合、微細な特徴（ピッチのゆらぎ等）が基本的に同一になるため、聴取者に不自然な印象を与えてしまう。
【００５９】
そこで、この合唱合成装置４００では、合唱制御部１４０が合唱曲の楽曲情報の歌詞情報およびメロディ情報を各パート毎に分割して各歌唱生成器１２０，１２１，１２２に出力するとともに、音声試料データベース１１０に記憶されている音声素片試料データをどの時間に対応する部分から使用を開始するかを指定する指定情報を各歌唱生成器１２０，１２１，１２２に出力するようになっている。
【００６０】
上述した第１実施形態で説明したように、音声試料データベース１１０に記憶される音声素片試料データは、発声者の発した音声に基づいて作成されたものであり、所定の時間長（１フレーム〜数フレーム等）の音声波形に基づいて作成されたデータである。すなわち、前記所定の時間内における時間と振幅との関係で表される音声波形に基づいて作成されたデータである。したがって、上記第１実施形態のように音声素片試料データが周波数領域のデータとして記憶されている場合にも、そのデータは時間領域の音声波形にＦＦＴ等を施して得られたものである。合唱制御部１４０は、このように時間に伴って変化する情報である音声素片試料データをどの時間に対応する部分から使用するかを指定する指定情報を各歌唱生成器１２０，１２１，１２２に供給するのである。
【００６１】
ここで、合唱制御部１４０は、各歌唱生成器１２０，１２１，１２２が音声素片試料データを、互いに異なる時間に対応する部分から使用を開始して歌唱音信号を生成するように、各歌唱生成器１２０，１２１，１２２に異なる使用開始時間を指定する指定情報を出力する。
【００６２】
各歌唱生成器１２０，１２１，１２２は、合唱制御部１４０から供給される各パートの歌詞情報に基づいて必要となる音声素片試料データを音声試料データベース１１０から読み出すと共に、読み出した音声素片試料データを、合唱制御部１４０から指定情報に指定される時間に対応する部分から使用を開始して歌唱音信号の生成を行う。
【００６３】
以下、３つのパートの歌詞情報にしたがって読み出される音声素片試料データが母音の「a」であり、音声素片試料データ「a」がＦ０〜Ｆ１３といった１３のフレーム（時間０〜Ｔ）からなり、該音声素片試料データを１３フレーム分の長さを使用して各歌唱生成器１２０，１２１，１２２が歌唱音信号を生成する場合について、図５および図６を参照しながら具体的に例示して説明する。
【００６４】
図５に示す例では、歌唱生成器１２０に対しては最初のフレームＦ０から使用を開始するように指定する指定情報が供給されており、歌唱生成器１２１に対してはフレームＦ３から使用を開始するように指定する指定情報が供給されており、歌唱生成器１２２に対してはフレームＦ６から使用を開始するように指定する指定情報が供給されている。なお、図示では説明の便宜上、音声素片試料データが時間領域の音声波形として示されているが、音声試料データベース１１０に記憶しておくデータは、上記第１実施形態のように周波数領域で表現される調和成分（線スペクトル）および非調和成分（残差スペクトル）といった形態であってもよい。
【００６５】
このような指定情報が供給されている場合には、図６に示すように、歌唱生成器１２０は、フレームＦ０，Ｆ１，Ｆ２……Ｆ１３といった順序、つまり音声素片試料データ「a」をそのまま使用して歌唱音信号の生成に用いる。また、歌唱生成器１２１は、フレームＦ３，Ｆ４，Ｆ５……Ｆ１３，Ｆ０，Ｆ１，Ｆ２，Ｆ３といった順次で音声素片試料データ「a」を使用して歌唱音信号の生成を行う。さらに、歌唱生成器１２２は、フレームＦ６，Ｆ７……Ｆ１３，Ｆ０，Ｆ１……Ｆ５といった順序の音声素片試料データ「a」を使用して歌唱音信号の生成を行う。
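The frame orders in FIG. 6 amount to reading the same unit circularly from different start frames. A minimal sketch follows. It treats F0 to F13 as fourteen frame indices, although the text calls them thirteen frames and lists one extra trailing F3 for generator 121, so the exact length is an assumption here.

```python
def circular_order(num_frames, start):
    """Frame indices starting at `start` and wrapping around to frame 0."""
    return [(start + i) % num_frames for i in range(num_frames)]


# Start frames designated for generators 120, 121 and 122 in FIG. 5.
orders = {gen: circular_order(14, start) for gen, start in [(120, 0), (121, 3), (122, 6)]}
```

All three generators use every frame exactly once, but at any instant each is reading a different frame of the unit.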
【００６６】
このように合唱制御部１４０が互いに異なる時間に対応する部分から使用を開始して歌唱音信号を生成するように指定情報を出力することにより、同じ音素「a」を同じ時間長（０〜Ｔまで）だけ用いて歌唱音信号を生成する際にも、各歌唱生成器１２０，１２１，１２２が実際に用いるデータは異なるものとなっている。すなわち、各歌唱生成器１２０，１２１，１２２が実際に用いる音声素片試料データに示される微細な特徴（ピッチのゆらぎ等）は異なったものとなり、１つの音素あるいは音素連鎖について１種類の音声素片試料データを用いて、各歌唱生成器１２０，１２１，１２２が微細な特徴の異なる歌唱音信号を生成することができるのである。
【００６７】
ところで、音素「a」のように単一の音素についての音声素片試料データを用いる場合には、上記のように単純にデータ中の使用開始時間をずらすといった手法により、各パートについて生成される歌唱音の微細な特徴を変えてより自然な合唱歌唱音を合成することができるが、複数の音素が連なる音素連鎖についての音声素片試料データの場合には、単純にデータ中の使用開始時間をずらすだけでは不都合が生じることもある。例えば、「ai」といった音素連鎖についての音声試料データの場合、時間領域における前半部分は「a」の音素をより強く反映したデータであり、後半部分は「i」の音素をより強く反映したデータである。したがって、音素連鎖「ai」の歌唱音信号を生成するために、音素「i」の影響の強い後半部分から使用を開始した場合には、音素連鎖「ia」に類似した傾向を持つデータを用いることになってしまう虞があり、この場合、生成すべき音素連鎖「ai」についての信号が正確に生成できなくなってしまう。
【００６８】
そこで、本実施形態では、複数の音素連鎖に対応する音声素片試料データを用いる場合には、合唱制御部１４０は、図７に示すような指定情報を各歌唱生成器１２０，１２１，１２２に出力するようにしている。同図に示す例では、歌唱生成器１２０に対しては最初のフレームＦ０から使用を開始するように指定する指定情報が供給されており、歌唱生成器１２１に対してはフレームＦ２から使用を開始するように指定する指定情報が供給されており、歌唱生成器１２２に対してはフレームＦ４から使用を開始するように指定する指定情報が供給されている。すなわち、上記単一の音素についての指定情報と比較すると、各歌唱生成器１２０，１２１，１２２に対して指定する使用開始時間が前半部分（「a」の影響の強い）に集中している。このように各パートの使用開始時間をデータの前半部分に集中させることで、上記のように実際に使用するデータが音素連鎖「ia」に類似してしまうことを抑制している。
【００６９】
また、上記のように指定情報が供給されている場合に、歌唱生成器１２１がフレームＦ２，Ｆ３，Ｆ４……Ｆ１３，Ｆ０，Ｆ１，Ｆ２といった順序、すなわち一方向に順番に音声素片試料データを使用して歌唱音信号の生成を行うと、音素「a」の影響の強いフレームＦ０〜Ｆ２が本来「i」の影響を強くすべき後半部分のデータとして用いられてしまうことになる。そこで、本実施形態では、複数の音素からなる音素連鎖についての音声素片試料データを用いる場合には、最後のフレーム（Ｆ１３）の後にフレームＦ１に戻るのではなく、フレームＦ１２，Ｆ１１……といったように逆方向に戻る順序でフレームを用いるようにしている。したがって、図７に示すように使用開始フレームが指定されている場合には、図８に示すように、歌唱生成器１２０は、フレームＦ０，Ｆ１，Ｆ２……Ｆ１３といった順序、つまり音声試料データベース１１０に記憶されている音声素片試料データをそのまま使用して歌唱音信号の生成に用いる。また、歌唱生成器１２１は、フレームＦ２，Ｆ３，Ｆ４……Ｆ１３，Ｆ１２，Ｆ１１といった順次で音声素片試料データ」を使用して歌唱音信号の生成を行う。さらに、歌唱生成器１２２は、フレームＦ４，Ｆ５，Ｆ６……Ｆ１３，Ｆ１２，Ｆ１１，Ｆ１０，Ｆ９といった順序の音声素片試料データを使用して歌唱音信号の生成を行う。なお、フレームＦ１３からフレームＦ１２といったように逆方向に戻る順序でフレームを使用する際には両者の接続部分に雑音等が生じる虞があるため、各フレームの接続部分において振幅調整処理やクロスフェード処理等を施すようにすればよい。
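For phoneme chains, paragraph [0069] replaces the wrap-around with a reversal at the last frame, so that frames dominated by the first phoneme are never used at the end. The sketch below reproduces the orders given for FIG. 8, again assuming fourteen frame indices (F0 to F13). The amplitude adjustment or crossfade that the text recommends at the reversal point is omitted.

```python
def reflected_order(num_frames, start, length=None):
    """Frame indices from `start`, reversing direction at either end."""
    if num_frames < 1 or not 0 <= start < num_frames:
        raise ValueError("start must be a valid frame index")
    length = num_frames if length is None else length
    if num_frames == 1:
        return [0] * length
    order, index, step = [], start, 1
    for _ in range(length):
        order.append(index)
        if not 0 <= index + step < num_frames:
            step = -step  # bounce back instead of wrapping to frame 0
        index += step
    return order


order_121 = reflected_order(14, 2)
order_122 = reflected_order(14, 4)
```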
【００７０】
複数の音素からなる音素連鎖の音声素片試料データを各歌唱生成器１２０，１２１，１２２で用いる場合には、以上のようにすることでより正確に音素連鎖を生成することができ、また各歌唱生成器１２０，１２１，１２２から出力される歌唱音信号の微細な特徴等が異なるものとなる。
【００７１】
以上説明したように、第２実施形態に係る合唱合成装置４００では、１つの音素あるいは音素連鎖について１つの音声素片試料データしか記憶されていなくても、１つの音声素片試料データを用いて、上記第１実施形態と同様により自然な印象を与えることが可能な合唱音信号を合成することができる。すなわち、音声試料データベース１１０に記憶させておくデータ量を抑制しつつ、より自然な印象を与えることが可能な合唱音信号を合成することができるのである。
【００７２】
Ｃ．変形例
なお、本発明は、上述した第１および第２実施形態に限定されるものではなく、以下に例示するような種々の変形が可能である。
【００７３】
（変形例１）
上述した各実施形態においては、音素あるいは音素連鎖といった単位の音声素片試料データを接続して歌唱音信号を生成するようにしているが、ビブラートといわれる歌唱表現法があり、上記各実施形態における合唱合成装置にこのビブラートによる歌唱表現を加える機能を付加するようにしてもよい。
【００７４】
従来より、ビブラートによる歌唱音を電子的に発音するための歌唱音信号を生成する手法としては、上記各実施形態のように音素あるいは音素連鎖単位の音声素片試料データを接続するとともに、該接続した音声素片試料データによって表現される波形に約６Ｈｚ程度の周波数変調を付与する方法が知られている。このような方法を実施するための構成を上記各実施形態における合唱合成装置に加えるようにしてもよいが、聴取者により自然な印象を与えることが可能なビブラート歌唱音信号の生成方法として、発声者がビブラート歌唱法で歌唱した時の音声に基づいて作成したビブラート音声試料データを用いる方法があり、この方法を実施するための構成を上記各実施形態に係る合唱合成装置に付加することが好ましい。
【００７５】
以下、図９を参照しながら、発声者のビブラート歌唱音声に基づいて作成したビブラート音声試料データを用いて歌唱音信号を生成する機能を上記第１実施形態における合唱合成装置に付加した場合を例に挙げて説明する。
【００７６】
同図に示すように、この合唱合成装置１００’における音声試料データベース１１０には、上記音声試料データ群１１０ａ，１１０ｂ，１１０ｃといった音素あるいは音素連鎖を単位とした音声素片試料データに加え、ビブラート歌唱時の歌唱音声に基づいて作成されたビブラート音声試料データが記憶されている。ここで、音声試料データベース１１０には、各々異なる音声に基づいて作成された３つのビブラート音声試料データＢＤａ，ＢＤｂ，ＢＤｃが記憶されている。
【００７７】
この構成の下、合唱制御部１４０は、上述した第１実施形態と同様、各歌唱生成器１２０，１２１，１２２に各パートの歌詞情報およびメロディ情報と、使用する音声試料データ群を指定する指定情報とに加え、３つのビブラート音声試料データＢＤａ，ＢＤｂ，ＢＤｃのいずれを使用するかを指定する第２の指定情報を供給するようになっている。ここで、各歌唱生成器１２０，１２１，１２２に供給される第２の指定情報は、異なるビブラート音声試料データを使用するように指定する情報である。このような第２の指定情報を各歌唱生成器１２０，１２１，１２２に供給することによって、各歌唱生成器１２０，１２１，１２２はビブラート歌唱音信号を生成する際に各々異なるビブラート音声試料データを読み出し、上記実施形態と同様に接続した音声素片試料データによって表現される音声波形に、読み出したビブラート音声試料データによって表現される波形を重ね、重ね合わせた波形信号を歌唱音信号として出力する。
【００７８】
このようにビブラート歌唱音信号を生成する際に、各歌唱生成器１２０，１２１，１２２が異なる音声に基づいて作成された３つのビブラート音声試料データＢＤａ，ＢＤｂ，ＢＤｃを各パート毎に使い分けることにより、生成されるビブラート歌唱音信号の微細な特徴（ビブラート時の周波数の変動具合等）も各パート毎に異なったものとなる。このようにビブラート歌唱音の各パート毎の相関関係がほとんどなく、各々のパートが固有の特徴を持つことになるため、当該合唱合成装置１００’によって合成された合唱音信号に基づいた歌唱音のビブラート部分を聴いた聴取者に対して、より自然な印象を与えることが可能となる。
【００７９】
ところで、合唱音において各パートのビブラート部分の特徴が基本的に同一であることは、聴取者にとって他の部分の特徴が同一である場合よりも不自然な印象を与えるものである。したがって、ビブラート部分だけでも各パート毎に固有の特徴を付与した装置が要望されることもあり得る。このような場合には、上記各実施形態のように音素あるいは音素連鎖についての音声試料データは、各パートで同一のものをそのまま使用して歌唱音信号を生成し、生成した歌唱音信号に各パート毎に異なるビブラート音声試料データによって表現される波形を加算してビブラート効果を付与するようにしてもよい。
【００８０】
（変形例２）
また、図９に示すように、各歌唱生成器１２０，１２１，１２２の数に対応して３つのビブラート音声試料データを用いるようにしてもよいが、図１０に示す合唱歌合成装置４００’のように、歌唱生成器１２０，１２１，１２２が同一のビブラート音声試料データを用いてビブラート部分の歌唱音信号を生成するようにしてもよい。
【００８１】
上述した実施形態で説明したように歌唱生成器１２０，１２１，１２２は、音声試料データ群１１０ａ，１１０ｂ，１１０ｃを使い分けることにより各々異なる固有の特徴を有する歌唱音信号を生成することができるので、このように生成した歌唱音信号に同一のビブラート音声試料データによって表現される波形を加算しても、各々の歌唱生成器１２０，１２１，１２２から出力されるビブラート部分の歌唱音信号は固有の特徴を有したものとなる。したがって、単純に１つのビブラート音声試料データを各歌唱生成器１２０，１２１，１２２が用いるようにしてもよいが、ビブラート音声試料データについても、上記第２実施形態において各歌唱生成器１２０，１２１，１２２による音声素片試料データの使用方法として説明したように、各々の歌唱生成器１２０，１２１，１２２が同一のビブラート音声試料データの異なる時間に対応する部分から使用を開始するようにしてもよい。この場合、合唱制御部１４０がどの時間に対応する部分から使用を開始するかを指定する指定情報を各歌唱生成器１２０，１２１，１２２に供給するようにすればよい。このようにすることで、各歌唱生成器１２０，１２１，１２２がビブラート付与のために用いる実際の音声試料データは異なる特徴を有するものとなる。したがって、各々のパートのビブラート部分の歌唱音信号が固有の特徴を有するものとなり、当該合唱合成装置４００’によって合成された合唱音信号に基づいた歌唱音のビブラート部分を聴いた聴取者に対して、より自然な印象を与えることが可能となる。
【００８２】
（変形例３）
また、上記変形例においては、生成する歌唱音信号にビブラート効果を付与するために音声試料データベース１１０にビブラート音声試料データを記憶させておくようにしていたが、ビブラート以外のトレモロ、ポルタメント等の種々の歌唱法による歌唱音を電子的に放音するために、発声者によるトレモロ部分の歌唱音声や、ポルタメント部分の歌唱音声に基づいて作成した音声試料データを音声試料データベース１１０に記憶させておくようにしてもよい。この場合にも、上述した変形例におけるビブラート音声試料データと同様、各パート毎に音声試料データを用意しておいたり、同じ音声試料データであっても異なる時間に対応した部分から使用を開始したりすることにより、各パートのトレモロやポルタメント部分の歌唱音信号に固有の特徴を持たせることができる。
【００８３】
（変形例４）
また、上述した第１実施形態では、３つの音声試料データ群１１０ａ，１１０ｂ，１１０ｃを音声試料データベース１１０に記憶させるようにしていたが、高音、中音、低音といったように異なるピッチの音声に基づいて、各音声試料データ群１１０ａ，１１０ｂ，１１０ｃに含まれる音声素片試料データを作成するようにしてもよい。例えば、高音の音声に基づいて作成した音声素片試料データは音声試料データ群１１０ａに含ませるようにし、中音の音声に基づいて作成した音声素片試料データは音声試料データ群１１０ｂに含ませるようにし、低音の音声に基づいて作成した音声素片試料データを音声試料データ群１１０ｃに含ませるようにしてもよい。
【００８４】
このように各音域毎に作成された音声試料データ群１１０ａ，１１０ｂ，１１０ｃを記憶した音声試料データベース１１０を用いる場合、合唱制御部１４０は、楽曲情報に含まれる複数のパートのうち、高音域のメロディからなるパートの歌唱音信号の生成を担当する歌唱生成器に対し、高音の音声に基づいて作成した音声試料データ群１１０ａを用いるように指定する指定情報を出力する。また、中音域のメロディからなるパートの歌唱音信号の生成を担当する歌唱生成器に対し、中音の音声に基づいて作成した音声試料データ群１１０ｂを用いるように指定する指定情報を出力し、さらに低音域のメロディからなるパートの歌唱音信号の生成を担当する歌唱生成器に対し、低音の音声に基づいて作成した音声試料データ群１１０ｃを用いるように指定する指定情報を出力する。これにより各歌唱生成器１２０，１２１，１２２は、各々が担当するパートの歌唱音信号の生成により好適な音声素片試料データを用いることができ、より高品位の歌唱音信号を生成することができる。
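The register-based designation in paragraph [0084] can be illustrated with a small selection rule. The MIDI note thresholds (55 and 67) and the use of a part's mean pitch are purely hypothetical; the patent says only that high, middle, and low parts are matched to groups 110a, 110b, and 110c.

```python
def group_for_pitch(midi_note, low_max=55, mid_max=67):
    """Map a pitch to the data group recorded in the matching register.

    Thresholds are illustrative assumptions, not values from the patent.
    """
    if midi_note > mid_max:
        return "110a"  # group built from high-pitched voice
    if midi_note > low_max:
        return "110b"  # group built from mid-pitched voice
    return "110c"      # group built from low-pitched voice


def group_for_part(melody_notes):
    """Pick one group for a whole part from the mean pitch of its melody."""
    return group_for_pitch(sum(melody_notes) / len(melody_notes))


soprano = group_for_part([72, 74, 76])
alto = group_for_part([60, 62, 64])
bass = group_for_part([43, 45, 48])
```

Calling `group_for_pitch` note by note instead of `group_for_part` would correspond to the variant in paragraph [0085], where the designated group changes during the piece.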
【００８５】
なお、上記のようにある楽曲に対応する歌唱音信号の生成時には各歌唱生成器１２０，１２１，１２２が使用する音声試料データ群１１０ａ，１１０ｂ，１１０ｃを固定するようにしてもよいが、各パート毎にメロディ情報によって決定される各パート毎のピッチの高低が時間毎に変化することも考えられる。この場合には、ある１つの楽曲の歌唱音信号を生成する際に、各パート毎のメロディ情報によって決定される各パート毎のピッチに高低に応じて合唱制御部１４０が各歌唱生成器１２０，１２１，１２２に対して指定する音声試料データ群を楽曲の途中で逐次変更するような指定情報を出力するようにしてもよい。
【００８６】
（変形例５）
また、上述した変形例では、異なるピッチ毎の音声に基づいて作成した音声試料データ群１１０ａ，１１０ｂ，１１０ｃを音声試料データベース１１０に記憶させるようにしていたが、歌唱時には同じ音韻を発声している間にピッチが大きく変動することもある。したがって、音声試料データベース１１０に、同じ音韻、例えば「a」を発声している間にピッチ（音高）を変動させて発した音声に基づいて音声素片試料データを作成し、該音声素片試料データを音声試料データベース１１０に記憶させるようにしてもよい。このように音声試料データベース１１０には、上述した各実施形態において説明した同一ピッチの音素あるいは音素連鎖だけではなく、歌唱時に起こりうる様々なピッチ変動等を考慮して音声素片試料データを作成しておくようにしてもよい。
【００８７】
（変形例６）
また、上述した第１実施形態では、音声試料データベース１１０に記憶されている音声試料データ群１１０ａ，１１０ｂ，１１０ｃを各歌唱生成器１２０，１２１，１２２が使い分けて用い、第２実施形態では、同一の音声素片試料データを異なる時間に対応する部分から使用を開始することにより、より自然な印象を与えることが可能な合唱音信号を合成していた。このような第１および第２実施形態に係る合唱合成装置に、音声試料データベース１１０から読み出した音声素片試料データに示される何らかの値（すなわち音を決定付けるパラメータ）を各歌唱生成器１２０，１２１，１２２毎に変更してから供給するパラメータ変更手段を設けるようにしてもよい。このようなパラメータ変更手段を上記第１実施形態に係る合唱合成装置に付加した場合の構成を図１１に示す。
【００８８】
同図に示すように、この合唱合成装置１００”は、上記第１実施形態における合唱合成装置１００の構成に加え、各歌唱生成器１２０，１２１，１２２に対応して設けられるパラメータ変更部２２０，２２１，２２２を備えている。この構成の下、合唱制御部１４０は、上述した第１実施形態と同様、歌唱生成器１２０，１２１，１２２に各パートの歌詞情報およびメロディ情報と、どの音声試料データ群を使用するかを指定する指定情報を出力するとともに、パラメータ変更部２２０，２２１，２２２の各々に対してパラメータの変更内容を示す変更情報を出力する。ここで、合唱制御部１４０は、音声試料データベース１１０から読み出した音声素片試料データに対して各々異なる内容の変更が施されるような変更情報を各パラメータ変更部２２０，２２１，２２２に出力する。
【００８９】
パラメータ変更部２２０，２２１，２２２は、対応する歌唱生成器１２０，１２１，１２２が必要とする音声素片試料データを指定情報に示される音声試料データ群の中から読み出し、合唱制御部１４０から供給される変更情報にしたがって読み出した音声素片試料データを変更する。そして、変更後の音声素片試料データが対応する歌唱生成器１２０，１２１，１２２に供給する。そして、各歌唱生成器１２０，１２１，１２２が変更後の音声素片試料データを用いて歌唱音信号を生成する。
【００９０】
ここで、パラメータ変更部２２０，２２１，２２２が音声試料データベース１１０から読み出した音声素片試料データに対して行う変更処理の内容としては、音韻性を損なわない程度に音色等を変更する処理であれば種々の変更処理を適用することができる。例えば、音声試料データベース１１０から読み出したある音声素片試料データによって表現される音声のフォルマント構造をモデル化し、フォルマントのバンド幅を数％変更したり、バンドの中心周波数を１０Ｈｚ程度シフトする等によって音色を微妙に変更する方法がある。この場合、変更するフォルマントのバンド幅の割合や、バンドの中心周波数のシフトする量を各パラメータ変更部２２０，２２１，２２２毎に異なる値とすることにより、各パラメータ変更部２２０，２２１，２２２によって読み出された音声素片試料データに示される音声の音色が微妙に異なるものとなる。
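The parameter change in paragraph [0090] (widen or narrow each formant's bandwidth by a few percent and shift its center frequency by about 10 Hz, with different amounts per part) can be sketched as below. Representing a formant as a `(center_hz, bandwidth_hz)` pair and the specific per-part values are assumptions made for illustration.

```python
def perturb_formants(formants, bandwidth_scale, center_shift_hz):
    """Return formants with scaled bandwidths and shifted center frequencies.

    `formants` is a list of (center_hz, bandwidth_hz) pairs, a hypothetical
    stand-in for the modeled formant structure of one voice unit.
    """
    return [
        (center + center_shift_hz, bandwidth * bandwidth_scale)
        for center, bandwidth in formants
    ]


# Example formants for a vowel, and a different change per parameter changing part.
vowel = [(800.0, 80.0), (1200.0, 100.0)]
changes = {220: (1.00, 0.0), 221: (1.03, 10.0), 222: (0.97, -10.0)}
changed = {
    part: perturb_formants(vowel, scale, shift)
    for part, (scale, shift) in changes.items()
}
```

Keeping the changes this small is what the text means by altering timbre without damaging the phonemic identity of the unit.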
【００９１】
（変形例７）
また、上述した各実施形態において、各歌唱生成器１２０，１２１，１２２によって生成された合唱音がより自然な印象を聴取者に与えるために、各パート毎に生成された歌唱音信号による歌唱音の発音タイミングをずらすようにしてもよい。この場合、合唱制御部１４０が各パートに対して発音タイミングをどの程度ずらすかを指定するタイミング指定情報を供給する。この際、合唱制御部１４０は、各歌唱生成器１２０，１２１，１２２での発音タイミングが微妙にずれるようなタイミング指定情報を各歌唱生成器１２０，１２１，１２２に供給する。例えば、歌唱生成器１２０に対しては、合唱制御部１４０から供給される歌詞情報およびメロディ情報にしたがって生成した歌唱音信号を遅延させることなく加算器１３０に出力させ、歌唱生成器１２１に対しては、１０msec遅延させて歌唱音信号を加算器１３０に出力させ、歌唱生成器１２２に対しては２０msec遅延させて歌唱音信号を加算器１３０に出力させるようにすれば、各パートの歌唱音が微妙にずれて発音され、聴取者に対してより自然な印象を与えることができる。
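The timing offsets in paragraph [0091] (no delay for generator 120, 10 msec for 121, 20 msec for 122) followed by the addition in adder 130 can be sketched as a delayed mix. The sample rate and the list-of-floats signal representation are assumptions.

```python
def delay_in_samples(delay_ms, sample_rate):
    """Convert a delay in milliseconds to a whole number of samples."""
    return round(delay_ms * sample_rate / 1000)


def mix_with_delays(signals, delays_ms, sample_rate):
    """Delay each part by its own offset, then sum the parts sample by sample."""
    offsets = [delay_in_samples(d, sample_rate) for d in delays_ms]
    total = max(len(s) + o for s, o in zip(signals, offsets))
    mixed = [0.0] * total
    for signal, offset in zip(signals, offsets):
        for i, value in enumerate(signal):
            mixed[offset + i] += value
    return mixed


# Three short parts at an assumed 1 kHz sample rate, delayed by 0, 10 and 20 msec.
parts = [[1.0, 1.0, 1.0] for _ in range(3)]
chorus = mix_with_delays(parts, [0, 10, 20], sample_rate=1000)
```

Swapping the order of the delay list partway through a piece would correspond to the variant in paragraph [0092], where the sounding order of the parts changes.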
【００９２】
なお、上記のようにある１つの楽曲の歌唱音信号を生成している際に、各歌唱生成器１２０，１２１，１２２の発音タイミングの相関関係を固定するようにしてもよいが、ある楽曲の途中であっても歌唱生成器１２０，１２１，１２２の発音タイミングの相関関係を変動させるようにしてもよい。例えば楽曲の前半部分では、上記例のように歌唱生成器１２０，１２１，１２２といった順序で発音するようにし、楽曲の後半部分では歌唱生成器１２２，１２１，１２０といった順序で発音するようにしてもよい。
【００９３】
（変形例８）
また、上述した第１実施形態では、歌唱生成器１２０，１２１，１２２の数（３つ）に応じた種類の音声試料データ群を音声試料データベース１１０に記憶させるようにしていたが、歌唱生成器の数よりも多い種類の音声試料データ群を記憶させるようにしてもよい。
【００９４】
また、歌唱生成器１２０，１２１，１２２といった３つの歌唱生成器を備えている場合に、音声試料データベース１１０に２つの音声試料データ群１１０ａ，１１０ｂしか記憶されていない場合には、少なくとも２つの歌唱生成器が異なる音声試料データ群１１０ａ，１１０ｂを用いて歌唱音信号を生成するようにすればよい。この場合には、歌唱生成器１２０が音声試料データ群１１０ａを用い、歌唱生成器１２１が音声試料データ群１１０ｂを用い、歌唱生成器１２２が音声試料データ群１１０ａ，１１０ｂのいずれかを歌唱生成器１２０，１２１と異なる時間に対応する部分から使用を開始すれば、３つの歌唱生成器１２０，１２１，１２２が実際には異なる音声素片試料データを用いて歌唱音信号を生成することになり、上記各実施形態と同様、自然な印象を与えることが可能な合唱音信号を合成することができる。
【００９５】
（変形例９）
上述した各実施形態および変形例における合唱合成装置は、専用のハードウェア回路で構成するようにしてもよいが、図１２に示すようなコンピュータシステムによるソフトウェアによって構成するようにしてもよい。同図に示すように、このコンピュータシステムは、装置全体を制御するＣＰＵ（Central Processing Unit）３２０、各種制御データやプログラム群を記憶するＲＯＭ（Read Only Memory）３２１、ワークエリアとして使用されるＲＡＭ（Random Access Memory）３２２、楽曲情報やプログラム群を記憶するハードディスクやＣＤ−ＲＯＭ（Compact Disc Read Only Memory）ドライブ等の外部記憶装置３２３、キーボードやマウス等の操作部３２４、各種情報をユーザに表示する表示部３２５、Ｄ／Ａ変換器３２６、アンプ３２７、スピーカ３２８を備えている。
【００９６】
ＣＰＵ３２０は、ＲＯＭ３２１もしくはハードディスク等の外部記憶装置３２３に記憶されているプログラム群にしたがって、音声試料データベース１１０をＲＡＭ３２２もしくは外部記憶装置３２３に構築し、音声試料データベース１１０を用いて上記各実施形態や変形例と同様に各パート毎の歌唱音信号合成処理を行う。そして、ＣＰＵ３２０は、生成した各パート毎の歌唱音信号を加算した後、加算後の合唱音信号をＤ／Ａ変換器３２６に出力する。Ｄ／Ａ変換器３２６では合唱音信号がアナログ信号に変換され、該合唱音のアナログ信号がアンプ３２７によって増幅された後、スピーカ３２８から放音される。
【００９７】
このように上記各実施形態および変形例における合唱合成装置は、コンピュータシステムによるソフトウェアによって構成することが可能であり、上記各実施形態等と同様の合唱音合成処理をコンピュータシステムに実行させるためのプログラムの形態でユーザに提供するようにしてもよい。このようなプログラムの提供方法としては、ＣＤ−ＲＯＭやフロッピーディスク等の各種記録媒体に記憶して提供する方法や、インターネット等の通信回線を介して提供する方法等がある。
【００９８】
【発明の効果】
以上説明したように、本発明によれば、より自然な印象を聴取者に与えることが可能な合唱音を合成することができる。
【図面の簡単な説明】
【図１】本発明の第１実施形態に係る合唱合成装置の基本構成を示すブロック図である。
【図２】前記合唱合成装置の構成要素である音声試料データベースの作成手法を説明するための図である。
【図３】前記合唱合成装置の構成要素である歌唱生成器の機能構成を示すブロック図である。
【図４】本発明の第２実施形態に係る合唱合成装置の基本構成を示すブロック図である。
【図５】第２実施形態に係る前記合唱合成装置による歌唱音信号生成方法を説明するための図である。
【図６】第２実施形態に係る前記合唱合成装置による歌唱音信号生成方法を説明するための図である。
【図７】第２実施形態に係る前記合唱合成装置による歌唱音信号生成方法を説明するための図である。
【図８】第２実施形態に係る前記合唱合成装置による歌唱音信号生成方法を説明するための図である。
【図９】第１実施形態に係る前記合唱合成装置の変形例の基本構成を示すブロック図である。
【図１０】第２実施形態に係る前記合唱合成装置の変形例の基本構成を示すブロック図である。
【図１１】第１実施形態に係る前記合唱合成装置の他の変形例の基本構成を示すブロック図である。
【図１２】前記合唱合成装置による機能をソフトウェアによって実現するためのコンピュータシステムの構成を示すブロック図である。
【符号の説明】
１００、１００’、１００”……合唱合成装置、１１０……音声試料データベース、１１０ａ，１１０ｂ，１１０ｃ……音声試料データ群、１２０……歌唱生成器、１２１……歌唱生成器、１２２……歌唱生成器、１３０……加算器、１４０……合唱制御部、２００……ＳＭＳ分析部、２０１……区間切り出し部、２２０、２２１，２２２……パラメータ変更部、３０１……音声素片選択部、３０２……ピッチ決定部、３０３……継続時間長調整部、３０４……音声素片接続部、３０５……調和成分生成部、３０６……加算部、３０７……逆ＦＦＴ部、３０８……窓掛け部、３０９……オーバーラップ部、４００、４００’……合唱合成装置。
[0001]
[Technical field to which the invention belongs]
The present invention relates to a chorus synthesizer that synthesizes a chorus sound signal, a chorus synthesis method, and a program for synthesizing a chorus sound.
[0002]
[Prior art]
Conventionally, chorus synthesizers have been proposed that synthesize a singing sound signal from lyrics information and melody information and produce a singing voice. Various devices have been proposed for synthesizing singing sound signals in this way, such as devices that apply rule-based speech synthesis. In a singing synthesizer that applies rule-based synthesis, voice sample data in units of phonemes, or of phoneme chains containing several phonemes, is created in advance from a speaker's voice and stored in a database. The voice sample data for the phonemes and phoneme chains required by the lyrics information is then read out and concatenated to synthesize a singing sound signal.
[0003]
Unlike a speech synthesizer such as a text-to-speech device, a singing synthesizer of this kind may also be used to output electronically the singing sound of a chorus, such as unison singing or part singing. Chorus synthesizers with a function for synthesizing the singing sound of a chorus (chorus sound) are therefore also being developed.
[0004]
A chorus synthesizer with this function generates a chorus sound signal by reading out and concatenating voice sample data on the basis of each of a plurality of parts. By superimposing and outputting the singing sound signals generated for the respective parts, it can output the chorus sound electronically.
[0005]
[Problems to be solved by the invention]
However, a conventional chorus synthesizer with a function for synthesizing chorus sound signals uses the same voice sample data when it generates the singing sound signal for each part from the lyrics information and melody information. As a result, although the singing sounds generated for the parts differ in melody, the fine features of their voice waveforms (pitch fluctuation and the like) are basically identical. The chorus sound obtained by superimposing them therefore sounds unnatural to the listener. This is thought to be because the listener perceives the correlation between the parts (their fine features coincide), which creates the unnatural impression.
[0006]
When a chorus sound signal for unison singing is synthesized, simply generating a singing sound signal for each part and superimposing them, as described above, outputs exactly the same singing sound layered on itself, which gives the listener an unnatural impression. Conventional chorus synthesizers therefore avoided layering identical singing sounds in unison singing by slightly shifting the sounding timing of the singing sound generated for each part (the parts have identical content) or by slightly shifting its pitch. Even with slightly shifted timing or pitch, however, the fine features (fluctuations and the like) of the voice waveform generated for each part remain basically identical, as described above. The superimposed chorus sound therefore still sounds unnatural to the listener.
[0007]
Japanese Patent Application Laid-Open No. 7-146695 discloses an apparatus for generating a chorus sound signal. When generating the singing sound signal for each part, this apparatus adds a pitch fluctuation component that differs from part to part. Superimposing and outputting singing sound signals that carry a different pitch fluctuation component for each part reduces the correlation between the parts. However, in the apparatus described in this publication, the pitch fluctuation component added to the singing sound signal of each part is not based on a human voice but is created artificially. Consequently, although the correlation between the parts becomes small, the synthesized chorus may still sound unnatural.
[0008]
The present invention has been made in consideration of the above circumstances, and its object is to provide a chorus synthesizer, a chorus synthesis method, and a program capable of synthesizing a chorus sound that gives the listener a more natural impression.
[0009]
[Means for Solving the Problems]
In order to solve the above problems, a chorus synthesizer according to the present invention is a chorus synthesizer that synthesizes a chorus sound signal based on music data, comprising: a database that stores, for each range, voice sample data groups each composed of a plurality of voice sample data, the groups being created based on different voices; a plurality of singing generation means, each of which generates a singing sound signal according to the music data by reading the required voice sample data from the database and using it to generate the singing sound signal; and singing synthesis means for synthesizing a chorus sound signal from the singing sound signals generated by the plurality of singing generation means, wherein the music data is composed of a plurality of parts, and, when each of the plurality of singing generation means generates the singing sound signal corresponding to one of the parts, at least two of the singing generation means each read from the database the voice sample data included in the voice sample data group for the range corresponding to the part handled by that singing generation means, and use it to generate the singing sound signal.
[0010]
According to this configuration, when each singing generation means generates the singing sound signal of the part assigned to it, at least two of the singing generation means use voice sample data created based on different voices. Because voice sample data created based on different voices differ in their fine features, the singing sound signals output from the at least two singing generation means also differ in their fine features. A singing sound with its own characteristics is thus emitted for each part, which gives the listener a more natural impression.
[0012]
According to this configuration, when each singing generation means generates the singing sound signal of the part assigned to it, at least two of the singing generation means start using the voice sample data from portions corresponding to different times. In voice sample data of a given time length created based on a voice, the fine features (fluctuations of the voice waveform) are not constant over that length but vary with time. The singing sound signals output from the at least two singing generation means therefore differ in their fine features. A singing sound with its own characteristics is thus emitted for each part, which gives the listener a more natural impression.
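The idea of starting from different times can be pictured with a small Python sketch. It assumes that a voice sample is a list of per-frame values and that reading wraps around at the end of the sample; the function name `read_from_offset`, the wrap-around behavior, and the example values are illustrative assumptions, not details given in the text.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def read_from_offset(frames: Sequence[Frame], start: int, count: int) -> List[Frame]:
    """Read `count` frames beginning at frame index `start`.

    Wrapping around at the end of the sample is an assumption of this sketch.
    """
    n = len(frames)
    if n == 0:
        return []
    return [frames[(start + i) % n] for i in range(count)]


# One stored sample whose fine features (here: toy pitch deviations) vary over time.
sample = [0.0, 0.3, -0.2, 0.1, -0.4]

# Two parts use the same sample but start at different times.
part_1 = read_from_offset(sample, start=0, count=4)
part_2 = read_from_offset(sample, start=3, count=4)
```

Because the two parts read different stretches of the same sample, their frame-by-frame fluctuations no longer coincide even though the stored data is shared.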
[0013]
Moreover, the chorus synthesis method according to the present invention is a chorus synthesis method for synthesizing a chorus sound signal from a plurality of singing sound signals generated based on music data, wherein, when singing sound signals corresponding to a plurality of parts are generated according to music data composed of the plurality of parts, the required voice sample data is read from a database that stores, for each range, voice sample data groups each composed of a plurality of voice sample data and each created based on one of a plurality of different voices, and, in generating the singing sound signals corresponding to at least two of the parts, the voice sample data included in the voice sample data group for the range corresponding to each part is read from the database and used to generate the singing sound signal of that part.
[0014]
Moreover, a chorus synthesis method according to another aspect of the present invention is a chorus synthesis method for synthesizing a chorus sound signal from a plurality of singing sound signals generated based on music data, wherein, when singing sound signals corresponding to a plurality of parts are generated according to the music data, the required voice sample data is read from a database that stores voice sample data of a predetermined time length created based on a voice, and, in generating the singing sound signals corresponding to at least two of the parts, each singing sound signal is generated by starting to use the voice sample data read from the database from a portion corresponding to a different time.
[0015]
Further, the program according to the present invention causes a computer to function as: singing sound generating means that generates singing sound signals according to music data by reading the required voice sample data from a database that stores, for each range, voice sample data groups each composed of a plurality of voice sample data and each created based on one of a plurality of different voices, wherein, when the music data is composed of a plurality of parts and singing sound signals corresponding to the plurality of parts are generated, the singing sound generating means, in generating the singing sound signals corresponding to at least two of the parts, reads from the database the voice sample data included in the voice sample data group for the range corresponding to each part and uses it to generate the singing sound signal of that part; and singing synthesis means that synthesizes a chorus sound signal from the generated singing sound signals.
[0016]
According to another aspect of the present invention, a program causes a computer to function as: singing sound generating means that generates singing sound signals according to music data by reading the required voice sample data from a database that stores voice sample data of a predetermined time length created based on a voice, wherein, when the music data is composed of a plurality of parts and singing sound signals corresponding to the plurality of parts are generated, the singing sound generating means generates the singing sound signals corresponding to at least two of the parts by starting to use the voice sample data read from the database from portions corresponding to different times; and singing synthesis means that synthesizes a chorus sound signal from the generated singing sound signals.
[0017]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
A. First embodiment
A-1. Basic configuration of the first embodiment
FIG. 1 is a block diagram showing the basic configuration of a chorus synthesizer according to the first embodiment of the present invention. As shown in the figure, the chorus synthesizer 100 includes a voice sample database 110, a plurality of (three in the illustrated example) singing generators 120, 121, and 122, a chorus control unit 140, and an adder 130 that adds together the singing sound signals output from the singing generators 120, 121, and 122 and outputs the result.
[0018]
The voice sample database 110 stores voice sample data created based on natural voices uttered by a person. Specifically, it stores voice sample data whose unit is a single phoneme or a phoneme chain composed of a plurality of phonemes (hereinafter referred to as speech segment sample data).
[0019]
In speech synthesis techniques that accumulate a large number of short voice sample data in a database and connect them according to lyrics or the like, the phoneme is basically used as the synthesis unit. The voice sample database 110 of the chorus synthesizer 100 could therefore store speech segment sample data in phoneme units only (about 30 to 50 types). However, the rules for joining phonemes are complex, so it is difficult to obtain good quality when only phoneme-unit data is stored. It is therefore preferable that the voice sample database 110 store, in addition to phoneme-unit data, speech segment sample data in units slightly larger than a phoneme (phoneme chains). Units larger than a phoneme include CV (consonant → vowel), VC (vowel → consonant), VCV (vowel → consonant → vowel), and CVC (consonant → vowel → consonant). Speech segment sample data for all of these units could be stored, but in the chorus synthesizer 100, which synthesizes chorus sounds, it is sufficient to store speech segment sample data whose unit is a sustained sound pronounced for a long time, such as a vowel, which is frequently used in singing; speech segment sample data whose unit is a consonant-to-vowel (CV) or vowel-to-consonant (VC) transition; speech segment sample data whose unit is a consonant-to-consonant transition; and speech segment sample data whose unit is a vowel-to-vowel transition.
[0020]
As described above, the voice sample database 110 stores speech segment sample data in units of phonemes or phoneme chains. For each phoneme of the same type (for example, “a”) or phoneme chain (for example, “ai”), it stores three speech segment sample data. In other words, the voice sample database 110 stores three voice sample data groups 110a, 110b, and 110c, each composed of a predetermined number of speech segment sample data whose unit is a phoneme or phoneme chain.
[0021]
The three voice sample data groups 110a, 110b, and 110c stored in the voice sample database 110 are created based on different voices. Here, “different voices” does not only mean voices of different speakers; voices uttered by the same speaker on different occasions, or different portions of an utterance by the same speaker, may also be used. That is, the voice sample data groups 110a, 110b, and 110c are created either from the voices of different speakers or from voices that the same speaker uttered on different occasions or in different portions of an utterance. Because the voices from which they were created differ, the data for the same phoneme (or phoneme chain) in the voice sample data groups 110a, 110b, and 110c differ in their fine features (pitch fluctuations and the like).
[0022]
The voice sample database 110 stores the three voice sample data groups 110a, 110b, and 110c as described above, and each of the singing generators 120, 121, and 122 reads speech segment sample data from the voice sample database 110 and uses it when generating a singing sound signal.
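As a rough illustration of this layout, the following Python sketch models the voice sample database 110 as a mapping from a data group name to a table of speech segment sample data keyed by phoneme or phoneme chain. The class names `Segment` and `VoiceSampleDatabase`, the methods `store` and `read`, and the placeholder frame values are assumptions made for illustration; the text does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    """One unit of speech segment sample data (a phoneme or phoneme chain)."""
    symbol: str                 # e.g. "a" or "ai"
    frames: List[List[float]]   # per-frame analysis data (placeholder values here)


@dataclass
class VoiceSampleDatabase:
    """Hypothetical model of database 110 holding groups 110a, 110b, 110c."""
    groups: Dict[str, Dict[str, Segment]] = field(default_factory=dict)

    def store(self, group: str, segment: Segment) -> None:
        self.groups.setdefault(group, {})[segment.symbol] = segment

    def read(self, group: str, symbol: str) -> Segment:
        # The group would be chosen by designation information,
        # the symbol by the lyrics.
        return self.groups[group][symbol]


db = VoiceSampleDatabase()
# The same symbols are stored once per group, each derived from a different voice.
for group, offset in (("110a", 0.0), ("110b", 0.1), ("110c", 0.2)):
    for symbol in ("a", "ai"):
        db.store(group, Segment(symbol, [[1.0 + offset, 0.5 + offset]]))

seg_a = db.read("110a", "a")
seg_b = db.read("110b", "a")
```

Here `seg_a` and `seg_b` describe the same phoneme but hold different frame data, which mirrors the point that the groups share symbols while differing in fine features.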
[0023]
Each of the singing generators 120, 121, and 122 reads out the required speech segment sample data from the speech sample database 110 according to the music information having the lyric information and the melody information, and uses the read speech segment sample data. A singing sound signal is generated.
[0024]
More specifically, each of the singing generators 120, 121, and 122 obtains a phoneme string according to the lyric information, determines the speech segment sample data needed to construct that phoneme string, and reads it from the voice sample database 110. The read speech segment sample data are then connected in time series, and the connected data are adjusted as appropriate to the pitch given by the melody information to generate a singing sound signal.
[0025]
Because the chorus synthesizer 100 according to the present embodiment includes three singing generators 120, 121, and 122, each capable of generating a singing sound signal according to lyric information and melody information, it can synthesize a chorus sound signal corresponding to a choral piece from the music information (lyric information and melody information) of a choral piece consisting of three parts.
[0026]
When the chorus synthesizer 100 synthesizes a chorus sound signal based on the music information of a choral piece, the chorus control unit 140 divides the music information by part and distributes it to the singing generators 120, 121, and 122. Thus, when a chorus sound signal is synthesized according to music information consisting of three parts, each of the singing generators 120, 121, and 122 generates a singing sound signal according to the lyric information and melody information of the part supplied to it from the chorus control unit 140, and outputs the generated singing sound signal to the adder 130. The adder 130 can thereby synthesize, according to the music information of a choral piece consisting of three parts, the chorus sound signal corresponding to that piece.
[0027]
In addition, when the chorus synthesizer 100 synthesizes a chorus sound signal as described above, the chorus control unit 140 outputs to each of the singing generators 120, 121, and 122 designation information that designates which of the voice sample data groups 110a, 110b, and 110c stored in the voice sample database 110 that generator should read from and use. The chorus control unit 140 outputs designation information designating a different data group to each of the singing generators 120, 121, and 122, so that each generates its singing sound signal using speech segment sample data from a different one of the voice sample data groups 110a, 110b, and 110c.
[0028]
Specifically, the chorus control unit 140 outputs, for example, designation information designating the voice sample data group 110a to the singing generator 120, designation information designating the voice sample data group 110b to the singing generator 121, and designation information designating the voice sample data group 110c to the singing generator 122. When such designation information is supplied from the chorus control unit 140, the singing generator 120 reads the speech segment sample data included in the voice sample data group 110a and uses it to generate a singing sound signal, the singing generator 121 does the same with the voice sample data group 110b, and the singing generator 122 does the same with the voice sample data group 110c.
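The control flow described in the last few paragraphs can be sketched in Python as follows. This is only a minimal sketch: the function `chorus_control`, the stand-in `toy_generator`, the dictionary layout of a part, and the gain values are all hypothetical, and a real singing generator would produce audio rather than one number per note.

```python
from typing import Callable, Dict, List, Sequence

Part = Dict[str, List]                       # {"lyrics": [...], "melody": [...]}
Generator = Callable[[Part, str], List[float]]


def chorus_control(parts: List[Part], generate: Generator,
                   group_names: Sequence[str] = ("110a", "110b", "110c")
                   ) -> List[float]:
    """Give each part its own data group, generate, then sum (role of adder 130)."""
    if not parts:
        return []
    if len(parts) > len(group_names):
        raise ValueError("more parts than voice sample data groups")
    # Designation information: part i is told to use group i.
    signals = [generate(part, group) for part, group in zip(parts, group_names)]
    mixed = [0.0] * max(len(s) for s in signals)
    for signal in signals:
        for i, value in enumerate(signal):
            mixed[i] += value
    return mixed


def toy_generator(part: Part, group: str) -> List[float]:
    # Stand-in for singing generators 120 to 122: one value per note,
    # scaled by a group-dependent factor so the groups are distinguishable.
    gain = {"110a": 1.0, "110b": 2.0, "110c": 3.0}[group]
    return [gain * pitch for pitch in part["melody"]]


parts = [
    {"lyrics": ["sa", "i"], "melody": [1.0, 1.0]},
    {"lyrics": ["sa", "i"], "melody": [1.0, 1.0]},
    {"lyrics": ["sa"], "melody": [1.0]},
]
chorus = chorus_control(parts, toy_generator)
```

Passing the generator in as a function keeps the sketch independent of how a singing sound signal is actually produced.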
[0029]
When the chorus synthesizer 100 synthesizes the chorus sound signal of a choral piece consisting of three parts, the singing generators 120, 121, and 122 use speech segment sample data from mutually different voice sample data groups 110a, 110b, and 110c as described above, which makes it possible to synthesize a chorus sound signal that gives the listener a more natural impression. That is, the voice sample data groups 110a, 110b, and 110c stored in the voice sample database 110 are created based on different voices, so even data for the same type of phoneme or phoneme chain differ in the fine features (pitch fluctuations and the like) of the voice they represent. Because the singing generators 120, 121, and 122 each generate a singing sound signal using speech segment sample data from a different one of these groups, the singing sound signals they generate differ in their fine features. The chorus sound obtained by superimposing them therefore has little correlation between the parts, each part having its own characteristics, and is less likely to give the listener an unnatural impression.
[0030]
The speech segment sample data included in the voice sample data groups 110a, 110b, and 110c stored in the voice sample database 110 are created based on voices uttered by a person. The differences in fine features among the speech segment sample data of the respective groups are therefore not produced artificially, for example by adding a pitch fluctuation prepared in advance. This makes the synthesized chorus sound less likely to be unnatural.
[0031]
A-2. Specific configuration of the choral synthesizer
The above is the basic configuration of the chorus synthesizer 100 according to the present embodiment. Any of various known singing generators, such as one applying rule-based speech synthesis technology, can be used as the singing generators 120, 121, and 122, provided that it reads speech segment sample data from the voice sample database 110 according to the lyric information, connects the data, adjusts the connected data to the pitch given by the melody information, and outputs a singing sound signal. The voice sample database 110 then stores speech segment sample data suited to the singing generator employed. In the following, the chorus synthesizer 100 is described specifically, taking as an example the case where singing generators using the spectral modeling synthesis (SMS) technique proposed in US Pat. Nos. 5,029,509 and 2,906,970 are used as the singing generators 120, 121, and 122.
[0032]
First, a method for creating the voice sample database 110 in the chorus synthesizer 100 including the singing generators 120, 121, and 122 using the SMS technique will be described.
[0033]
As described above, the voice sample database 110 in the chorus synthesizer 100 stores speech segment sample data created based on voices uttered by a speaker. The SMS technique analyzes and synthesizes musical sounds using a model that represents the original sound with two components: a harmonic component (deterministic component) and an inharmonic component (stochastic component). In speech synthesis, data consisting of the harmonic component and the inharmonic component is used as the speech segment sample data of one unit, such as a phoneme or phoneme chain. Therefore, in the chorus synthesizer 100 using the SMS technique, the voice sample database 110 stores, as speech segment sample data, data indicating the harmonic and inharmonic components obtained by performing SMS analysis on voices uttered by a speaker. Hereinafter, a method for creating the voice sample database 110 will be described with reference to FIG.
[0034]
As shown in the figure, the voice uttered by a speaker to create the voice sample database 110 is input to the SMS analysis unit 200, where it undergoes SMS analysis. Because the voice sample database 110 must store the voice sample data groups 110a, 110b, and 110c created based on different voices, three different voices are input to the SMS analysis unit 200. Although the figure shows the three voices being input to the SMS analysis unit 200 in parallel, the SMS analysis of the individual voices need not be performed in parallel at the same time and may be performed separately.
[0035]
The SMS analysis unit 200 performs SMS analysis on the input voice and outputs SMS analysis data for each frame. More specifically, SMS analysis data for each frame is output by the following method.
[0036]
First, the input voice is divided into a series of frames. Here, the frame period used for the SMS analysis may be a fixed length, or may be a variable length period in which the period is changed according to the pitch of the input speech.
[0037]
Next, frequency analysis such as a fast Fourier transform (FFT) is performed on the voice divided into frames. An amplitude spectrum and a phase spectrum are obtained from the resulting frequency spectrum (complex spectrum), and the spectra at the specific frequencies corresponding to peaks of the amplitude spectrum are extracted as line spectra. Here, spectra whose frequencies lie in the vicinity of the fundamental frequency and its integer multiples are taken as the line spectra. The line spectra extracted in this way correspond to the harmonic component described above.
[0038]
Next, a residual spectrum is obtained by subtracting the line spectra extracted as described above from the spectrum of the input voice of the frame (after the FFT). Alternatively, time waveform data of the harmonic component synthesized from the extracted line spectra may be subtracted from the input voice waveform data of the frame to obtain time waveform data of the residual component, which is then subjected to frequency analysis such as an FFT to obtain the residual spectrum. The residual spectrum obtained in this way corresponds to the inharmonic component described above.
[0039]
The SMS analysis unit 200 outputs SMS analysis data for each frame including the line spectrum (harmonic component) and the residual spectrum (non-harmonic component) acquired as described above to the section cutout unit 201.
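The per-frame analysis steps above can be sketched in Python under strong simplifying assumptions: the fundamental frequency is already known, a naive DFT stands in for the FFT, each line is taken as a single spectral bin, and the residual is formed by zeroing those bins. The function names `dft` and `analyze_frame` and the parameters `search_bins` and `floor` are hypothetical; practical SMS analysis handles windowing and peak shapes far more carefully.

```python
import cmath
import math
from typing import Dict, List, Tuple


def dft(frame: List[float]) -> List[complex]:
    """Naive DFT (an FFT would be used in practice); bins 0..N/2 only."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))
            for k in range(n // 2 + 1)]


def analyze_frame(frame: List[float], sample_rate: float, f0: float,
                  search_bins: int = 1, floor: float = 1e-6
                  ) -> Tuple[Dict[int, complex], List[complex]]:
    """Split one frame into line spectra (harmonic) and a residual (inharmonic)."""
    spectrum = dft(frame)
    bin_hz = sample_rate / len(frame)
    lines: Dict[int, complex] = {}
    harmonic = 1
    while harmonic * f0 < sample_rate / 2:
        center = round(harmonic * f0 / bin_hz)
        lo = max(center - search_bins, 0)
        hi = min(center + search_bins, len(spectrum) - 1)
        # Take the amplitude peak near this multiple of the fundamental.
        peak = max(range(lo, hi + 1), key=lambda k: abs(spectrum[k]))
        if abs(spectrum[peak]) > floor:
            lines[peak] = spectrum[peak]   # complex value keeps amplitude and phase
        harmonic += 1
    # Residual spectrum: the frame spectrum with the extracted lines removed.
    residual = [0j if k in lines else value for k, value in enumerate(spectrum)]
    return lines, residual


# Toy frame: harmonics at 400 Hz and 800 Hz plus a weak 1000 Hz component.
N, FS, F0 = 64, 6400.0, 400.0
frame = [math.sin(2 * math.pi * 400 * i / FS)
         + 0.5 * math.sin(2 * math.pi * 800 * i / FS)
         + 0.1 * math.sin(2 * math.pi * 1000 * i / FS) for i in range(N)]
lines, residual = analyze_frame(frame, FS, F0)
```

In this toy frame the 400 Hz and 800 Hz components end up in `lines`, while the 1000 Hz component, which is not near a multiple of the fundamental, remains in `residual`.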
[0040]
The section cutout unit 201 cuts out the SMS analysis data for each frame supplied from the SMS analysis unit 200 so that it corresponds to the length of one unit (a phoneme or phoneme chain) of the speech segment sample data to be stored in the voice sample database 110, and stores the cut-out SMS analysis data in the voice sample database 110.
[0041]
Here, the speech segment sample data stored in the voice sample database 110 are SMS analysis data cut out for each phoneme or phoneme chain. For the harmonic component, the spectral envelopes (the intensity (amplitude) and phase spectra of the line spectra (overtone series)) of all frames included in the phoneme or phoneme chain are stored. The spectral envelope itself may be stored as the harmonic component, or the spectral envelope may be stored expressed as some function, or the harmonic component may be stored as a time waveform obtained by an inverse FFT or the like. In the present embodiment, the inharmonic component is likewise stored as an intensity spectrum and a phase spectrum, but it too may be stored as a function or a time waveform.
[0042]
This SMS analysis and section cutout are performed on each of the three different input voices. As a result, three groups of speech segment sample data (SMS analysis data for each phoneme or phoneme chain) created based on three different voices are stored in the voice sample database 110 as the voice sample data groups 110a, 110b, and 110c.
[0043]
The above is the detail of the production method of the audio sample database 110 of the choral synthesizer 100 according to the present embodiment.
[0044]
Next, the singing generators 120, 121, and 122, which generate singing sound signals using the voice sample database 110 storing the three voice sample data groups 110a, 110b, and 110c created based on different voices as described above, will be described. Because the singing generators 120, 121, and 122 have the same configuration, the configuration of the singing generator 120 is described below with reference to FIG. 3, and description of the other singing generators 121 and 122 is omitted.
[0045]
As shown in the figure, the singing generator 120 includes a speech unit selection unit 301, a pitch determination unit 302, a duration length adjustment unit 303, a speech unit connection unit 304, a harmonic component generation unit 305, an addition unit 306, an inverse FFT (fast Fourier transform) unit 307, a windowing unit 308, and an overlap unit 309.
[0046]
The speech unit selection unit 301 reads the necessary speech segment sample data from the voice sample database 110 in accordance with the lyric information and designation information supplied from the chorus control unit 140 (see FIG. 1). More specifically, it converts the supplied lyric information into a sequence of phonetic symbols (phonemes or phoneme chains) and reads speech segment sample data from the voice sample database 110 according to that sequence. For example, when a singing sound signal is generated according to the lyric information “saita”, the lyric information is converted into phonetic symbols such as “#s”, “s”, “sa”, “a”, “ai”, “i”, “it”, “t”, “ta”, “a”, and “a#”, and the speech segment sample data corresponding to each of these phonetic symbols is read from the voice sample database 110.
[0047]
The speech unit selection unit 301 determines the speech segment sample data to be read according to the lyric information as described above, and reads the determined data from the voice sample data group designated by the designation information supplied from the chorus control unit 140. For example, when the designation information designates the voice sample data group 110a, the speech segment sample data corresponding to “#s”, “s”, “sa”, “a”, “ai”, “i”, “it”, “t”, “ta”, “a”, and “a#” are read from among the speech segment sample data included in the voice sample data group 110a.
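The “saita” example suggests a simple interleaving of phonemes and the phoneme chains that join them. The Python sketch below reproduces that one example; the rule itself is inferred from the example rather than stated in the text, the lyric is assumed to be already split into phonemes, and the names `to_phonetic_symbols` and `select_segments` are hypothetical.

```python
from typing import Dict, List


def to_phonetic_symbols(phonemes: List[str], boundary: str = "#") -> List[str]:
    """Interleave phonemes with the phoneme chains that join them."""
    if not phonemes:
        return []
    symbols = [boundary + phonemes[0]]            # silence -> first phoneme
    for current, following in zip(phonemes, phonemes[1:]):
        symbols.append(current)                   # the phoneme itself
        symbols.append(current + following)       # transition to the next phoneme
    symbols.append(phonemes[-1])
    symbols.append(phonemes[-1] + boundary)       # last phoneme -> silence
    return symbols


def select_segments(symbols: List[str], group: Dict[str, object]) -> List[object]:
    """Look up each symbol in the designated voice sample data group."""
    return [group[symbol] for symbol in symbols]


symbols = to_phonetic_symbols(["s", "a", "i", "t", "a"])

# Toy stand-in for group 110a: each symbol maps to a label instead of real data.
group_110a = {symbol: "110a:" + symbol for symbol in set(symbols)}
segments = select_segments(symbols, group_110a)
```

Switching the designated group only changes the dictionary passed to `select_segments`; the symbol sequence derived from the lyrics stays the same.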
[0048]
The pitch determination unit 302 determines the pitch of the singing sound according to the melody information supplied from the chorus control unit 140 (see FIG. 1), and outputs the pitch information indicating the determined pitch to the harmonic component generation unit 305.
[0049]
The speech segment sample data (harmonic and inharmonic components) read by the speech unit selection unit 301 are supplied to the duration length adjustment unit 303. The speech unit selection unit 301 may supply the read data to the duration length adjustment unit 303 as is, or may supply it after applying an appropriate correction according to the pitch or the like indicated by the melody information.
[0050]
The duration length adjustment unit 303 changes the time length of each speech segment sample data supplied from the speech unit selection unit 301 according to the duration of the phoneme or phoneme chain determined from the melody information or the like. More specifically, when given speech segment sample data is to be used for a time shorter than its own time length, frames are thinned out of the data. Conversely, when given speech segment sample data is to be used continuously for a time longer than its own time length, loop processing is performed in which the data is repeated for the length of time it is to be used. In this loop processing, the data from the beginning to the end (0 to t) of the speech segment sample data may be followed by the data from the beginning to the end (0 to t) again, or the data from the beginning to the end (0 to t) may be followed by the data traversed backward in time from the end (t) toward the beginning, and the data connected in this way may be repeated.
[0051]
After adjusting the time length of the speech segment sample data (harmonic and inharmonic components) according to the duration of each speech segment as described above, the duration length adjustment unit 303 outputs the adjusted speech segment sample data to the speech unit connection unit 304.
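The thinning and loop processing can be sketched in Python as below. The sketch assumes that durations are counted in frames and that thinning keeps evenly spaced frames; the function name `adjust_duration`, the `ping_pong` flag, and the choice not to repeat the end frames when reversing direction are assumptions of this example.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def adjust_duration(frames: Sequence[Frame], target: int,
                    ping_pong: bool = False) -> List[Frame]:
    """Return exactly `target` frames by thinning out or looping `frames`."""
    n = len(frames)
    if n == 0 or target <= 0:
        return []
    if target <= n:
        # Thinning: keep `target` frames spread evenly over the original.
        return [frames[i * n // target] for i in range(target)]
    if ping_pong and n > 1:
        # Forward 0..t, then backward from t toward the start, repeated.
        # The end frames are not doubled at the turning points (a choice
        # made for this sketch).
        cycle = list(range(n)) + list(range(n - 2, 0, -1))
    else:
        # Forward 0..t, then forward 0..t again, repeated.
        cycle = list(range(n))
    return [frames[cycle[i % len(cycle)]] for i in range(target)]


shortened = adjust_duration("abcd", 2)
looped = adjust_duration("abc", 7)
bounced = adjust_duration("abc", 7, ping_pong=True)
```

The backward traversal avoids the jump from the last frame to the first that plain forward looping introduces, which may matter when the first and last frames differ noticeably.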
[0052]
The speech unit connection unit 304 connects the harmonic component data of the speech unit sample data supplied from the duration length adjustment unit 303 in time series, and connects the inharmonic component data in time series. In such connection, if the difference in the shape of the spectrum envelope between the two harmonic components to be connected is large, smoothing processing or the like may be performed. The speech element connection unit 304 outputs the connected harmonic component data to the harmonic component generation unit 305 and outputs the connected anharmonic component data to the addition unit 306.
[0053]
The harmonic component generation unit 305 is supplied with harmonic component data (spectrum envelope information) from the speech unit connection unit 304 and is supplied with pitch information according to melody information from the pitch determination unit 302. The harmonic component generation unit 305 generates a harmonic component corresponding to the pitch information from the pitch determination unit 302 while maintaining the spectrum envelope shape indicated by the spectrum envelope information from the speech unit connection unit 304.
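One way to picture generating a harmonic component at a new pitch while keeping the spectral envelope shape is to sample the envelope at multiples of the target pitch. The Python sketch below does this with linear interpolation; the envelope representation as (frequency, amplitude) points with strictly increasing frequencies, the function names `envelope_at` and `generate_harmonics`, and the use of linear interpolation are all assumptions of this example, and phase is ignored.

```python
from typing import List, Tuple

Envelope = List[Tuple[float, float]]   # (frequency in Hz, amplitude), increasing


def envelope_at(envelope: Envelope, freq: float) -> float:
    """Linearly interpolate the envelope amplitude at `freq`."""
    if freq <= envelope[0][0]:
        return envelope[0][1]
    for (f_lo, a_lo), (f_hi, a_hi) in zip(envelope, envelope[1:]):
        if freq <= f_hi:
            return a_lo + (a_hi - a_lo) * (freq - f_lo) / (f_hi - f_lo)
    return envelope[-1][1]


def generate_harmonics(envelope: Envelope, pitch_hz: float,
                       max_hz: float) -> List[Tuple[float, float]]:
    """Place harmonics at multiples of the new pitch under the same envelope."""
    harmonics = []
    k = 1
    while k * pitch_hz <= max_hz:
        harmonics.append((k * pitch_hz, envelope_at(envelope, k * pitch_hz)))
        k += 1
    return harmonics


envelope = [(0.0, 1.0), (1000.0, 0.5), (2000.0, 0.0)]
low = generate_harmonics(envelope, pitch_hz=250.0, max_hz=1000.0)
high = generate_harmonics(envelope, pitch_hz=500.0, max_hz=1000.0)
```

Both `low` and `high` follow the same envelope; only the spacing of the harmonics changes with the pitch.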
[0054]
The addition unit 306 is supplied with the inharmonic component data from the speech unit connection unit 304 and the harmonic component data from the harmonic component generation unit 305, combines the two, and outputs the result to the inverse FFT unit 307. The inverse FFT unit 307 converts the summed frequency-domain signal supplied from the addition unit 306 into a time-domain waveform signal by an inverse FFT and outputs the converted waveform signal to the windowing unit 308. The windowing unit 308 multiplies the time-domain waveform signal by a window function corresponding to the frame length, and the overlap unit 309 generates the singing sound signal by overlapping the windowed waveform signals. In this way, the singing generator 120 generates a singing sound signal according to the lyric information and melody information of one part of the music information supplied from the chorus control unit 140 (see FIG. 1), and the generated singing sound signal is output to the adder 130 (see FIG. 1).
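The windowing and overlapping stages can be sketched in Python as a standard overlap-add. The sketch starts from frames that are already in the time domain (the inverse FFT step is omitted), and the Hann window and 50% hop are assumptions, since the text only specifies a window function corresponding to the frame length. The names `hann` and `overlap_add` are hypothetical.

```python
import math
from typing import List


def hann(n: int) -> List[float]:
    """Periodic Hann window; copies overlapped by n/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def overlap_add(frames: List[List[float]], hop: int) -> List[float]:
    """Window each time-domain frame and add it at successive hop offsets."""
    if not frames:
        return []
    n = len(frames[0])
    window = hann(n)
    out = [0.0] * (hop * (len(frames) - 1) + n)
    for index, frame in enumerate(frames):
        start = index * hop
        for i, sample in enumerate(frame):
            out[start + i] += sample * window[i]
    return out


# Four constant frames of length 8, overlapped by half a frame.
frames = [[1.0] * 8 for _ in range(4)]
signal = overlap_add(frames, hop=4)
```

With this window and hop, the overlapped windows sum to one away from the edges, so a constant input is reconstructed as a constant; the first and last half frames fade in and out.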
[0055]
The above is the detailed configuration of the singing generator 120. The other singing generators 121 and 122 shown in FIG. 1 (which have the same configuration as the singing generator 120) likewise output singing sound signals generated according to the lyric information and melody information of their respective parts of the music information supplied from the chorus control unit 140. As described above, when the singing generators 120, 121, and 122 generate the singing sound signals for the parts distributed to them by the chorus control unit 140, they read speech segment sample data from the different voice sample data groups 110a, 110b, and 110c, respectively, so the singing sound signals they generate differ in their fine features (pitch fluctuations and the like).
[0056]
The adder 130 combines and outputs the singing sound signals that the singing generators 120, 121, and 122 have generated for the respective parts of the music information of the choral piece. The chorus sound signal output from the adder 130, in which the singing sound signals of the three parts are combined, is converted into an analog voice waveform signal by a D/A (digital-to-analog) converter (not shown) and then emitted from a speaker via an amplifier or the like. The listener can thereby hear a chorus sound according to the music information of a choral piece consisting of a plurality of parts. Because the singing sounds of the parts in the chorus sound emitted from the chorus synthesizer 100 differ in their fine features (voice quality resulting from differences in pitch fluctuation and the like), a chorus sound that gives the listener a natural impression can be produced.
[0057]
B. Second embodiment
Next, a chorus synthesizer according to a second embodiment of the present invention will be described with reference to the corresponding figure. As shown in the figure, whereas the speech sample database 110 of the chorus synthesizer 100 in the first embodiment stores three speech sample data groups 110a, 110b, and 110c, the speech sample database 110 of the chorus synthesizer 400 according to the second embodiment differs in that it stores only one kind of speech segment sample data for the same phoneme or phoneme chain. Even though the chorus synthesizer 400 uses a speech sample database 110 that stores only one item of speech segment sample data per phoneme or phoneme chain, it can, like the first embodiment, synthesize a chorus sound signal that gives a more natural impression. The configuration of the chorus synthesizer 400 is described below, focusing on the differences from the chorus synthesizer 100 according to the first embodiment.
[0058]
Each of the singing generators 120, 121, and 122 in the chorus synthesizer 400 is the same as in the first embodiment: it reads the required speech segment sample data from the speech sample database 110 according to music information having lyrics information and melody information, and generates a singing sound signal using the read data. In the second embodiment, because the speech sample database 110 stores only one item of speech segment sample data for each phoneme or phoneme chain, the singing generators 120, 121, and 122 may end up using the same speech segment sample data when generating singing sound signals according to the music information of a choral piece. When the singing sound signals of a plurality of parts are generated from the same speech segment sample data, their fine features (pitch fluctuation and the like) are essentially identical, which gives the listener an unnatural impression.
[0059]
Therefore, in this chorus synthesizer 400, the chorus control unit 140 divides the lyrics information and melody information of the music information of the choral piece into parts and outputs them to the singing generators 120, 121, and 122, and also outputs to each singing generator designation information that designates from which time position of the speech segment sample data stored in the speech sample database 110 use is to begin.
[0060]
As described in the first embodiment, the speech segment sample data stored in the speech sample database 110 is created from speech uttered by a speaker, specifically from a speech waveform of a predetermined time length (one to several frames). In other words, it is created from a speech waveform expressed as the relationship between time and amplitude within that predetermined time. Therefore, even when the speech segment sample data is stored as frequency-domain data as in the first embodiment, that data was obtained by applying an FFT or the like to a time-domain speech waveform. The chorus control unit 140 supplies the singing generators 120, 121, and 122 with designation information that designates from which portion of the speech segment sample data, which is information that changes over time, use is to begin.
[0061]
Here, the chorus control unit 140 outputs to the singing generators 120, 121, and 122 designation information that designates different use start times, so that each singing generator starts using the speech segment sample data from a portion corresponding to a different time when generating its singing sound signal.
[0062]
Each of the singing generators 120, 121, and 122 reads from the speech sample database 110 the speech segment sample data required by the lyrics information of its part supplied from the chorus control unit 140, starts using the read data from the portion corresponding to the time specified in the designation information from the chorus control unit 140, and generates a singing sound signal.
[0063]
The following describes a specific example with reference to the figures, in which the speech segment sample data read according to the lyrics information of the three parts is that of the vowel “a”, the speech segment sample data “a” consists of 13 frames (time 0 to T), F0 to F13, and each of the singing generators 120, 121, and 122 generates a singing sound signal using the 13-frame length of that data.
[0064]
In the example shown in FIG. 5, the singing generator 120 is supplied with designation information designating that use start from the first frame F0, the singing generator 121 with designation information designating that use start from frame F3, and the singing generator 122 with designation information designating that use start from frame F6. For convenience of explanation, the figure shows the speech segment sample data as a time-domain speech waveform, but the data stored in the speech sample database 110 may, as in the first embodiment, take the form of a harmonic component (line spectrum) and an anharmonic component (residual spectrum) expressed in the frequency domain.
[0065]
When such designation information is supplied, as shown in FIG. 6, the singing generator 120 generates a singing sound signal using the speech segment sample data “a” in the order of frames F0, F1, F2, ... F13, that is, as it is. The singing generator 121 generates a singing sound signal using the speech segment sample data “a” in the order of frames F3, F4, F5, ... F13, F0, F1, F2, F3. The singing generator 122 generates a singing sound signal using the speech segment sample data “a” in the order of frames F6, F7, ... F13, F0, F1.
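The wraparound frame ordering for a single phoneme can be sketched as follows. This is only an illustration: the function name `circular_order` is hypothetical, the frame labels F0 to F13 are treated as 14 indices, and the sketch assumes each frame is used exactly once per pass, so the exact end point of each sequence is an assumption rather than something the text fixes.

```python
def circular_order(num_frames, start):
    """Frame indices beginning at `start`, wrapping to frame 0 after the last frame."""
    return [(start + i) % num_frames for i in range(num_frames)]


NUM_FRAMES = 14  # assumption: one index per label F0..F13

# Use start frames designated by the chorus control unit 140 in the FIG. 5 example.
START_FRAMES = {120: 0, 121: 3, 122: 6}

ORDERS = {gen: circular_order(NUM_FRAMES, start) for gen, start in START_FRAMES.items()}
```

Each generator then reads the same stored data but encounters its frame-to-frame fluctuations at different moments, which is the effect the designation information is meant to achieve.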
[0066]
Thus, because the chorus control unit 140 outputs designation information such that the singing generators start use from portions corresponding to different times, the data actually used by the singing generators 120, 121, and 122 differ even when each generates a singing sound signal for the same phoneme “a” over the same time length (0 to T). That is, the fine features (pitch fluctuations and the like) present in the speech segment sample data actually used by each singing generator differ, so the singing generators 120, 121, and 122 can generate singing sound signals with different fine features from a single item of speech segment sample data per phoneme or phoneme chain.
[0067]
When speech segment sample data for a single phoneme such as “a” is used, simply shifting the use start time within the data as described above varies the fine features of the singing sound generated for each part and allows a more natural chorus sound to be synthesized. For speech segment sample data of a phoneme chain consisting of a plurality of phonemes, however, simply shifting the use start time within the data can be problematic. For example, in speech segment sample data for the phoneme chain “ai”, the first half in the time domain more strongly reflects the phoneme “a”, and the second half more strongly reflects the phoneme “i”. If, in generating the singing sound signal of the phoneme chain “ai”, use starts from the second half where the influence of the phoneme “i” is strong, the data used tends to resemble the phoneme chain “ia”, and the intended signal for the phoneme chain “ai” cannot be generated accurately.
[0068]
Therefore, in this embodiment, when speech segment sample data corresponding to a phoneme chain of a plurality of phonemes is used, the chorus control unit 140 outputs designation information such as that shown in FIG. 7 to the singing generators 120, 121, and 122. In the example shown in the figure, the singing generator 120 is supplied with designation information designating that use start from the first frame F0, the singing generator 121 with designation information designating that use start from frame F2, and the singing generator 122 with designation information designating that use start from frame F4. That is, compared with the designation information for a single phoneme, the use start times designated for the singing generators 120, 121, and 122 are concentrated in the first half of the data (where the influence of “a” is strong). Concentrating the use start times of the parts in the first half of the data prevents the data actually used from resembling the phoneme chain “ia” as described above.
[0069]
Furthermore, if, with the designation information supplied as described above, the singing generator 121 generated its singing sound signal using the speech segment sample data in the order of frames F2, F3, F4, ... F13, F0, F1, F2, that is, in one direction only, frames F0 to F2, which are strongly influenced by the phoneme “a”, would be used in the latter part of the data, which should be strongly influenced by “i”. Therefore, in this embodiment, when speech segment sample data for a phoneme chain composed of a plurality of phonemes is used, the frames after the last frame (F13) are not taken by returning to the beginning of the data but in the reverse direction, in the order F12, F11, and so on. Accordingly, when the use start frames are designated as shown in FIG. 7, the singing generator 120 generates a singing sound signal using the frames in the order F0, F1, F2, ... F13, that is, using the speech segment sample data stored in the speech sample database 110 as it is, as shown in FIG. 8. The singing generator 121 generates a singing sound signal using the speech segment sample data in the order of frames F2, F3, F4, ... F13, F12, F11. The singing generator 122 generates a singing sound signal using the speech segment sample data in the order of frames F4, F5, F6, ... F13, F12, F11, F10, F9. Note that when frames are used in reverse order, as from frame F13 to frame F12, noise or the like may occur at the junction between them, so amplitude adjustment processing, crossfade processing, or the like may be applied at the junction between frames.
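The reverse-direction ordering used for phoneme chains can be sketched in the same way. Again this is an illustration under assumptions: `reflected_order` is a hypothetical name, F0 to F13 are treated as 14 indices, and the total number of frames used is assumed to equal the number of stored frames, which matches the sequences listed for FIG. 8. The amplitude adjustment or crossfade at the turning point is not modeled.

```python
def reflected_order(num_frames, start):
    """Frames from `start` up to the last frame, then backward from the second-to-last.

    The result has `num_frames` entries, so every generator covers the same duration.
    Valid for 0 <= start < num_frames.
    """
    forward = list(range(start, num_frames))
    # After the last frame, step backward instead of wrapping to the beginning.
    backward = list(range(num_frames - 2, num_frames - 2 - start, -1))
    return forward + backward


NUM_FRAMES = 14  # assumption: one index per label F0..F13

# Use start frames concentrated in the first half of the data, as in FIG. 7.
START_FRAMES = {120: 0, 121: 2, 122: 4}

ORDERS = {gen: reflected_order(NUM_FRAMES, start) for gen, start in START_FRAMES.items()}
```

Because the tail of each sequence comes from the end of the data, the latter part of every generated signal still reflects the second phoneme of the chain.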
[0070]
Thus, even when the singing generators 120, 121, and 122 use speech segment sample data of a phoneme chain composed of a plurality of phonemes, the phoneme chain can be generated accurately as described above, while the fine features of the singing sound signals output from the singing generators 120, 121, and 122 differ from one another.
[0071]
As explained above, in the chorus synthesizer 400 according to the second embodiment, even though only one item of speech segment sample data is stored for each phoneme or phoneme chain, a chorus sound signal that gives a more natural impression can be synthesized from that single item, as in the first embodiment. That is, a chorus sound signal that gives a more natural impression can be synthesized while suppressing the amount of data stored in the speech sample database 110.
[0072]
C. Modified example
The present invention is not limited to the first and second embodiments described above, and various modifications as exemplified below are possible.
[0073]
(Modification 1)
In each of the embodiments described above, a singing sound signal is generated by connecting speech segment sample data in units of phonemes or phoneme chains. There is, however, a singing expression technique called vibrato, and a function for adding vibrato expression may be added to the chorus synthesizer.
[0074]
A conventionally known method of generating a singing sound signal for electronically producing a vibrato singing sound is to connect speech segment sample data in units of phonemes or phoneme chains, as in each of the above embodiments, and apply frequency modulation of about 6 Hz to the waveform represented by the connected speech segment sample data. A configuration for carrying out this method may be added to the chorus synthesizer of each of the above embodiments. However, as a method of generating a vibrato singing sound signal that gives the listener a natural impression, there is a method that uses vibrato speech sample data created from the voice of a person singing with vibrato, and it is preferable to add a configuration for carrying out this method to the chorus synthesizer of each of the above embodiments.
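The conventional approach, frequency modulation at about 6 Hz, can be illustrated with a short sketch. It models only the modulation itself on a plain sine tone rather than on connected speech segment sample data, and the modulation depth, sample rate, and function names are assumptions chosen for illustration; the text specifies only the approximate 6 Hz rate.

```python
import math


def vibrato_pitch(base_hz, t, rate_hz=6.0, depth_hz=5.0):
    """Instantaneous pitch at time t (seconds) under sinusoidal frequency modulation."""
    return base_hz + depth_hz * math.sin(2.0 * math.pi * rate_hz * t)


def vibrato_tone(base_hz, seconds, sample_rate=8000, rate_hz=6.0, depth_hz=5.0):
    """Sine tone whose pitch follows vibrato_pitch, built by accumulating phase."""
    samples = []
    phase = 0.0
    for n in range(int(seconds * sample_rate)):
        freq = vibrato_pitch(base_hz, n / sample_rate, rate_hz, depth_hz)
        phase += 2.0 * math.pi * freq / sample_rate
        samples.append(math.sin(phase))
    return samples
```

A purely sinusoidal modulation like this is identical for every part unless its parameters are varied, which is one reason the recorded vibrato sample data described in this modification can sound more natural.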
[0075]
The following describes, with reference to FIG. 9, an example in which a function of generating a singing sound signal using vibrato speech sample data created from a speaker's vibrato singing voice is added to the chorus synthesizer of the first embodiment.
[0076]
As shown in the figure, the speech sample database 110 in the chorus synthesizer 100′ stores, in addition to the speech segment sample data in units of phonemes or phoneme chains contained in the speech sample data groups 110a, 110b, and 110c, vibrato speech sample data created from a singing voice sung with vibrato. Here, the speech sample database 110 stores three items of vibrato speech sample data BDa, BDb, and BDc created from different voices.
[0077]
Under this configuration, the chorus control unit 140 supplies each of the singing generators 120, 121, and 122 with the lyrics information and melody information of its part and the designation information designating the speech sample data group to be used, as in the first embodiment, and in addition with second designation information designating which of the three items of vibrato speech sample data BDa, BDb, and BDc is to be used. The second designation information supplied to the singing generators 120, 121, and 122 designates the use of different vibrato speech sample data for each. When generating a vibrato singing sound signal, each singing generator therefore reads out different vibrato speech sample data, superimposes the waveform represented by the read vibrato speech sample data on the speech waveform represented by the speech segment sample data read out and connected in the same manner as in the above embodiment, and outputs the superimposed waveform signal as a singing sound signal.
[0078]
In this way, when generating vibrato singing sound signals, the singing generators 120, 121, and 122 use, part by part, the three items of vibrato speech sample data BDa, BDb, and BDc created from different voices, so the fine features of the generated vibrato singing sound signals (such as frequency fluctuation during vibrato) also differ from part to part. As a result, the vibrato singing sounds of the parts are almost uncorrelated, each part has its own characteristics, and a listener who hears the vibrato portion of the singing sound based on the chorus sound signal synthesized by the chorus synthesizer 100′ receives a more natural impression.
[0079]
Incidentally, when the characteristics of the vibrato portions of the parts in a chorus sound are essentially identical, the listener receives a more unnatural impression than when the characteristics of other portions are identical. There may therefore be demand for an apparatus in which only the vibrato portion is given characteristics unique to each part. In such a case, each part may generate its singing sound signal using the same speech segment sample data for phonemes or phoneme chains as it is, and a vibrato effect may be given by adding to the generated singing sound signal a waveform represented by vibrato speech sample data that differs for each part.
[0080]
(Modification 2)
Although three items of vibrato speech sample data corresponding to the number of singing generators 120, 121, and 122 may be used as shown in FIG. 9, the singing generators 120, 121, and 122 may instead generate the singing sound signal of the vibrato portion using the same vibrato speech sample data, as in the chorus synthesizer 400′ shown in the corresponding figure.
[0081]
As described in the above embodiment, the singing generators 120, 121, and 122 can generate singing sound signals with different, unique characteristics by using the speech sample data groups 110a, 110b, and 110c selectively. Even if a waveform represented by the same vibrato speech sample data is added to singing sound signals generated in this way, the singing sound signal of the vibrato portion output from each of the singing generators 120, 121, and 122 still has unique characteristics. Each singing generator may therefore simply use the one item of vibrato speech sample data. Alternatively, the method described in the second embodiment for using the speech segment sample data may also be applied to the vibrato speech sample data, so that the singing generators 120, 121, and 122 start using the same vibrato speech sample data from portions corresponding to different times. In this case, the chorus control unit 140 need only supply each of the singing generators 120, 121, and 122 with designation information designating from which portion use is to start. The speech sample data that each singing generator actually uses to add vibrato then has different characteristics. The singing sound signal of the vibrato portion of each part therefore has unique characteristics, and a listener who hears the vibrato portion of the singing sound based on the chorus sound signal synthesized by the chorus synthesizer 400′ receives a more natural impression.
[0082]
(Modification 3)
In the above modifications, vibrato speech sample data is stored in order to give a vibrato effect to the generated singing sound signal. However, in order to electronically produce singing sounds of various singing techniques other than vibrato, such as tremolo and portamento, speech sample data created from a speaker's singing voice in tremolo portions and portamento portions may be stored in the speech sample database 110. In this case too, as with the vibrato speech sample data in the above modifications, unique characteristics can be given to the singing sound signals of the tremolo and portamento portions of each part by preparing speech sample data for each part, or by using the same speech sample data from portions corresponding to different times.
[0083]
(Modification 4)
In the first embodiment described above, the three speech sample data groups 110a, 110b, and 110c are stored in the speech sample database 110. The speech segment sample data contained in each of these groups may be created from voices of different pitch ranges, such as high, middle, and low. For example, speech segment sample data created from a high-pitched voice may be included in the speech sample data group 110a, speech segment sample data created from a middle-pitched voice in the speech sample data group 110b, and speech segment sample data created from a low-pitched voice in the speech sample data group 110c.
[0084]
When a speech sample database 110 storing speech sample data groups 110a, 110b, and 110c created for each pitch range is used in this way, the chorus control unit 140 outputs, to the singing generator responsible for generating the singing sound signal of the part whose melody lies in the high range among the plurality of parts included in the music information, designation information designating the use of the speech sample data group 110a created from a high-pitched voice. To the singing generator responsible for the part whose melody lies in the middle range, it outputs designation information designating the use of the speech sample data group 110b created from a middle-pitched voice. To the singing generator responsible for the part whose melody lies in the low range, it outputs designation information designating the use of the speech sample data group 110c created from a low-pitched voice. Each of the singing generators 120, 121, and 122 can thereby use speech segment sample data better suited to generating the singing sound signal of the part it is responsible for, and can generate a higher-quality singing sound signal.
[0085]
The speech sample data group used by each of the singing generators 120, 121, and 122 may be fixed while the singing sound signal for a given piece of music is generated, as described above. However, the pitch range of each part determined by the melody information may change from moment to moment. In that case, when generating the singing sound signal of a piece of music, the chorus control unit 140 may output designation information such that the speech sample data group designated for each of the singing generators 120, 121, and 122 is changed successively during the piece, according to the pitch of each part determined by its melody information.
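Selecting a speech sample data group by pitch, note by note, might look like the sketch below. Everything specific here is an assumption: the melody is represented as MIDI note numbers, the range boundaries `low_max` and `mid_max` are arbitrary, and the function names are hypothetical. The text states only that the group is chosen according to the pitch of each part.

```python
def select_group(midi_note, low_max=55, mid_max=67):
    """Return the label of the speech sample data group suited to the given pitch.

    The thresholds are illustrative; the source does not define range boundaries.
    """
    if midi_note > mid_max:
        return "110a"  # created from a high-pitched voice
    if midi_note > low_max:
        return "110b"  # created from a middle-pitched voice
    return "110c"      # created from a low-pitched voice


def groups_for_melody(midi_notes):
    """Designation sequence for one part, allowing the group to change during the piece."""
    return [select_group(note) for note in midi_notes]
```

For example, `groups_for_melody([48, 60, 72])` yields one designation per note, so a part that climbs from the low range to the high range switches groups along the way.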
[0086]
(Modification 5)
In the above modification, the speech sample data groups 110a, 110b, and 110c created from voices of different pitch ranges are stored in the speech sample database 110. In singing, however, the pitch may fluctuate greatly while the same phoneme is being uttered. Therefore, speech segment sample data may be created from a voice that changes pitch while uttering the same phoneme, for example “a”, and stored in the speech sample database 110. In this way, the speech sample database 110 may hold not only speech segment sample data for phonemes or phoneme chains at a constant pitch, as described in the above embodiments, but also speech segment sample data created with the various pitch fluctuations that can occur during singing taken into account.
[0087]
(Modification 6)
In the first embodiment described above, the singing generators 120, 121, and 122 each use a different one of the speech sample data groups 110a, 110b, and 110c stored in the speech sample database 110, and in the second embodiment they use the same speech segment sample data from portions corresponding to different times, thereby synthesizing a chorus sound signal that gives a more natural impression. In the chorus synthesizer according to the first and second embodiments, parameter changing means may additionally be provided that changes some value (that is, a parameter that determines the sound) indicated in the speech segment sample data read from the speech sample database 110 and supplies the changed data to each of the singing generators 120, 121, and 122. FIG. 11 shows the configuration when such parameter changing means is added to the chorus synthesizer according to the first embodiment.
[0088]
As shown in the figure, in addition to the configuration of the chorus synthesizer 100 in the first embodiment, the chorus synthesizer 100″ includes parameter changing units 220, 221, and 222 provided in correspondence with the singing generators 120, 121, and 122. Under this configuration, the chorus control unit 140 outputs to the singing generators 120, 121, and 122, as in the first embodiment, the lyrics information and melody information of each part and the designation information designating which speech sample data group is to be used, and also outputs to each of the parameter changing units 220, 221, and 222 change information indicating how a parameter is to be changed. The change information output to the parameter changing units 220, 221, and 222 causes each of them to apply a different change to the speech segment sample data read from the speech sample database 110.
[0089]
Each of the parameter changing units 220, 221, and 222 reads the speech segment sample data required by its corresponding singing generator 120, 121, or 122 from the speech sample data group indicated by the designation information, and changes the read speech segment sample data in accordance with the change information supplied from the chorus control unit 140. It then supplies the changed speech segment sample data to the corresponding singing generator. Each of the singing generators 120, 121, and 122 generates a singing sound signal using the changed speech segment sample data.
[0090]
The change processing that the parameter changing units 220, 221, and 222 apply to the speech segment sample data read from the speech sample database 110 may be any of various processes that change the timbre or the like to an extent that does not impair phonemic identity. For example, the formant structure of the voice represented by speech segment sample data read from the speech sample database 110 may be modeled, and the timbre changed slightly by changing the bandwidth of a formant by several percent or shifting the center frequency of the band by about 10 Hz. In this case, by supplying each of the parameter changing units 220, 221, and 222 with change information that specifies a different ratio of bandwidth change and a different amount of center-frequency shift, the timbres of the voices represented by the changed speech segment sample data differ slightly from one another.
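The formant adjustment described above can be sketched with a simple data model. The `Formant` class, the `change_formant` function, and the particular per-unit values in `CHANGE_INFO` are hypothetical; the text gives only the orders of magnitude (several percent of bandwidth, about 10 Hz of center frequency) and does not specify how the formant structure is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Formant:
    """One formant of a modeled spectral envelope (illustrative representation)."""
    center_hz: float
    bandwidth_hz: float


def change_formant(formant, bandwidth_scale, center_shift_hz):
    """Return a slightly altered copy: bandwidth scaled, center frequency shifted."""
    return Formant(
        center_hz=formant.center_hz + center_shift_hz,
        bandwidth_hz=formant.bandwidth_hz * bandwidth_scale,
    )


# Assumed change information for parameter changing units 220, 221, and 222:
# (bandwidth scale, center-frequency shift in Hz). Each unit gets different values.
CHANGE_INFO = {220: (1.00, 0.0), 221: (1.03, 10.0), 222: (0.97, -10.0)}


def apply_changes(formants, unit):
    """Apply one unit's change information to every formant of a speech segment."""
    scale, shift = CHANGE_INFO[unit]
    return [change_formant(f, scale, shift) for f in formants]
```

Keeping the changes this small is what preserves the identity of the phoneme while still making the three timbres distinguishable.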
[0091]
(Modification 7)
In each of the embodiments described above, the sounding timing of the singing sound produced by the singing sound signal generated for each part may also be shifted, so that the chorus sound generated by the singing generators 120, 121, and 122 gives the listener a more natural impression. In this case, the chorus control unit 140 supplies the singing generators 120, 121, and 122 with timing designation information designating how far the sounding timing of each part is to be shifted, such that the sounding timings of the singing generators differ slightly. For example, if the singing generator 120 outputs the singing sound signal generated according to the lyrics information and melody information supplied from the chorus control unit 140 to the adder 130 without delay, the singing generator 121 outputs its singing sound signal to the adder 130 with a delay of 10 msec, and the singing generator 122 outputs its singing sound signal to the adder 130 with a delay of 20 msec, the singing sounds of the parts are produced slightly offset from one another, giving the listener a more natural impression.
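The timing offsets and the subsequent addition in the adder 130 can be sketched as follows. The function name `delay_and_mix` and the representation of each singing sound signal as a list of samples are assumptions; the 0, 10, and 20 msec delays come from the example in the text.

```python
def delay_and_mix(part_signals, delays_ms, sample_rate):
    """Delay each part by its designated time, then sum the parts sample by sample."""
    offsets = [round(delay * sample_rate / 1000) for delay in delays_ms]
    length = max(offset + len(signal) for offset, signal in zip(offsets, part_signals))
    mixed = [0.0] * length
    for offset, signal in zip(offsets, part_signals):
        for i, sample in enumerate(signal):
            mixed[offset + i] += sample
    return mixed


# Example: three parts delayed by 0, 10, and 20 msec.
# A 1000 Hz sample rate is used only to keep the numbers small.
PARTS = [[1.0] * 30, [1.0] * 30, [1.0] * 30]
MIXED = delay_and_mix(PARTS, delays_ms=[0, 10, 20], sample_rate=1000)
```

Changing the order of the delays partway through a piece, as the next paragraph allows, only requires calling the function with a different `delays_ms` list for each section.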
[0092]
When the singing sound signal of one piece of music is generated as described above, the relationship among the sounding timings of the singing generators 120, 121, and 122 may be fixed, or it may be changed partway through the piece. For example, in the first half of the piece the parts may sound in the order of the singing generators 120, 121, 122 as in the above example, and in the second half in the order of the singing generators 122, 121, 120.
[0093]
(Modification 8)
In the first embodiment described above, the speech sample database 110 stores as many kinds of speech sample data groups as there are singing generators 120, 121, and 122 (three), but a larger number of kinds of speech sample data groups than the number of singing generators may be stored.
[0094]
Further, when three singing generators such as the singing generators 120, 121, and 122 are provided and only two speech sample data groups 110a and 110b are stored in the speech sample database 110, it suffices that at least two of the singing generators generate singing sound signals using the different speech sample data groups 110a and 110b. In this case, if the singing generator 120 uses the speech sample data group 110a, the singing generator 121 uses the speech sample data group 110b, and the singing generator 122 uses one of the speech sample data groups 110a and 110b starting from a portion corresponding to a time different from that used by the singing generators 120 and 121, the three singing generators 120, 121, and 122 in effect generate singing sound signals from different speech segment sample data, and a chorus sound signal that gives a natural impression can be synthesized as in the above embodiments.
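One way to combine the two techniques when there are fewer data groups than singing generators is sketched below. The function name `assign_sources` and the `start_step` value are hypothetical; the text requires only that no two generators end up using the same data from the same time position.

```python
def assign_sources(generators, groups, start_step=3):
    """Give each generator a (data group, start frame) pair, with no pair repeated.

    Groups are handed out in turn; once every group has been used, later
    generators reuse a group but begin from a later frame.
    """
    assignments = {}
    for index, generator in enumerate(generators):
        group = groups[index % len(groups)]
        start_frame = (index // len(groups)) * start_step
        assignments[generator] = (group, start_frame)
    return assignments


ASSIGNMENTS = assign_sources([120, 121, 122], ["110a", "110b"])
```

With three generators and two groups, the third generator shares group 110a with the first but starts three frames later, so all three effectively work from different data.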
[0095]
(Modification 9)
The chorus synthesizer in each of the embodiments and modifications described above may be implemented as a dedicated hardware circuit, or in software on a computer system as shown in the figure. This computer system includes a CPU (Central Processing Unit) 320 that controls the entire apparatus, a ROM (Read Only Memory) 321 that stores various control data and programs, a RAM (Random Access Memory) 322 used as a work area, an external storage device 323 such as a hard disk or CD-ROM (Compact Disc Read Only Memory) drive that stores music information and programs, an operation unit 324 such as a keyboard and mouse, a display unit 325 that displays various information to the user, a D/A converter 326, an amplifier 327, and a speaker 328.
[0096]
In accordance with programs stored in the ROM 321 or in the external storage device 323 such as a hard disk, the CPU 320 constructs the speech sample database 110 in the RAM 322 or the external storage device 323 and, using the speech sample database 110, performs singing sound signal synthesis processing for each part as in the above embodiments and modifications. The CPU 320 then adds the singing sound signals generated for the parts and outputs the resulting chorus sound signal to the D/A converter 326. The D/A converter 326 converts the chorus sound signal into an analog signal, which is amplified by the amplifier 327 and then emitted from the speaker 328.
[0097]
As described above, the chorus synthesizer in each of the above embodiments and modifications can be implemented in software using a computer system, and a program for causing a computer system to perform the same chorus sound synthesis processing as in the above embodiments may be provided to users. Such a program may be provided stored on various recording media such as a CD-ROM or floppy disk, or via a communication line such as the Internet.
[0098]
[Effect of the invention]
As described above, according to the present invention, it is possible to synthesize a chorus sound that can give a listener a more natural impression.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a basic configuration of a choral synthesizer according to a first embodiment of the present invention.
FIG. 2 is a diagram for explaining a method of creating a voice sample database, which is a component of the choral synthesizer.
FIG. 3 is a block diagram showing a functional configuration of a song generator that is a component of the chorus synthesizer.
FIG. 4 is a block diagram showing a basic configuration of a choral synthesizer according to a second embodiment of the present invention.
FIG. 5 is a diagram for explaining a singing sound signal generation method by the chorus synthesizer according to the second embodiment.
FIG. 6 is a diagram for explaining a singing sound signal generation method by the chorus synthesizer according to the second embodiment.
FIG. 7 is a diagram for explaining a singing sound signal generation method by the chorus synthesizer according to the second embodiment.
FIG. 8 is a diagram for explaining a singing sound signal generation method by the chorus synthesizer according to the second embodiment.
FIG. 9 is a block diagram showing a basic configuration of a modified example of the choral synthesizer according to the first embodiment.
FIG. 10 is a block diagram showing a basic configuration of a modified example of the choral synthesizer according to the second embodiment.
FIG. 11 is a block diagram showing a basic configuration of another modification of the choral synthesizer according to the first embodiment.
FIG. 12 is a block diagram showing a configuration of a computer system for realizing the function of the choral synthesizer by software.
[Explanation of symbols]
100, 100', 100'' ... chorus synthesizer; 110 ... voice sample database; 110a, 110b, 110c ... voice sample data group; 120, 121, 122 ... song generator; 130 ... adder; 140 ... choral control unit; 200 ... SMS analysis unit; 201 ... segment extraction unit; 220, 221, 222 ...; 302 ... pitch determination unit; 303 ... duration time length adjustment unit; 304 ... speech unit connection unit; 305 ... harmonic component generation unit; 306 ... addition unit; 307 ... hanging part; 309 ... overlapping part; 400, 400' ...

Claims

A chorus synthesizer that synthesizes a chorus sound signal based on music data, comprising:
a database that stores, for each pitch range, voice sample data groups each consisting of a plurality of voice sample data and respectively created based on a plurality of different voices;
a plurality of singing generating means, each of which generates a singing sound signal according to the music data by reading out the required voice sample data from the database and using it to generate the singing sound signal; and
singing synthesis means for synthesizing a chorus sound signal from the singing sound signals generated by the plurality of singing generating means,
wherein, when the music data is composed of a plurality of parts and each of the plurality of singing generating means generates a singing sound signal corresponding to a respective one of the parts, each of at least two of the singing generating means reads out from the database the voice sample data included in the voice sample data group for the pitch range corresponding to its part and uses that data to generate the singing sound signal.

The chorus synthesizer according to claim 1, wherein, when the music data is composed of a plurality of parts and each of the plurality of singing generating means generates a singing sound signal corresponding to a respective one of the parts, each of at least two of the singing generating means successively changes, in the course of the music, the voice sample data group that it uses, in accordance with the pitch of its part as determined by the melody information of that part.

The chorus synthesizer according to claim 1 or 2, wherein the database stores voice sample data for speech units, each speech unit being a phoneme or a phoneme chain of two or more phonemes, the voice sample data being created based on a plurality of different voices for the same phoneme or phoneme chain, and
the singing generating means reads out from the database and connects the voice sample data corresponding to the lyrics indicated in the music data, adjusts the connected voice sample data according to the pitch indicated in the music data, and thereby generates the singing sound signal.

The chorus synthesizer according to any one of claims 1 to 3, wherein the database stores, as voice sample data, vibrato voice sample data that are each created based on a plurality of different voices and that indicate characteristics of a vibrato portion of a voice, and
the singing generating means, when generating the singing sound signal of a vibrato portion, reads out and uses the vibrato voice sample data stored in the database.

A chorus synthesis method for synthesizing a chorus sound signal from a plurality of singing sound signals generated based on music data, comprising:
when generating singing sound signals corresponding to a plurality of parts according to music data consisting of the plurality of parts, reading out the required voice sample data from a database that stores, for each pitch range, voice sample data groups each consisting of a plurality of voice sample data and respectively created based on a plurality of different voices; and
when generating the singing sound signals corresponding to at least two of the parts, reading out from the database, for each such part, the voice sample data included in the voice sample data group for the pitch range corresponding to that part, and using that data to generate the singing sound signal.

A program for causing a computer to function as:
singing sound generating means for generating a singing sound signal by reading out, in accordance with music data, the required voice sample data from a database that stores, for each pitch range, voice sample data groups each consisting of a plurality of voice sample data and respectively created based on a plurality of different voices, wherein, when the music data consists of a plurality of parts and singing sound signals corresponding to the plurality of parts are generated, the singing sound generating means, in generating the singing sound signals corresponding to at least two of the parts, reads out from the database, for each such part, the voice sample data included in the voice sample data group for the pitch range corresponding to that part and generates the singing sound signal; and
singing synthesis means for synthesizing a chorus sound signal from the generated singing sound signals.