JP5552797B2

JP5552797B2 - Speech synthesis apparatus and speech synthesis method

Info

Publication number: JP5552797B2
Application number: JP2009256027A
Authority: JP
Inventors: 隼人大下; 靖雄吉岡; 雅史吉田; 橘　　誠
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2009-11-09
Filing date: 2009-11-09
Publication date: 2014-07-16
Anticipated expiration: 2029-11-09
Also published as: JP2011100055A

Description

The present invention relates to a technique for synthesizing speech (typically a singing voice).

Techniques for synthesizing a desired voice by using a set of segment data representing speech segments (hereinafter a "speech library") have been proposed (for example, Patent Document 1). A speech library is created by recording actual speech and then dividing it into speech segments and analyzing each segment.

Patent Document 1: JP 2002-202790 A

Under the technique of Patent Document 1, a separate speech library is required for each characteristic of the voice to be synthesized. Therefore, to synthesize a voice whose characteristics differ from those of an existing speech library (for example, the singing voice of another singer), a new speech library must be created. In addition, because many speech libraries are used to synthesize diverse voices, the storage capacity needed to hold them increases. In view of these circumstances, an object of the present invention is to synthesize diverse voices while reducing both the effort of creating speech libraries and the capacity required to store them.

To solve the above problems, a speech synthesizer according to a first aspect of the present invention comprises: storage means for storing a speech library that includes a plurality of segment data representing speech segments, and attached information in which segment usage information, which defines how segment data is used, is set for each of a plurality of units, each unit consisting of one or more segment data in the speech library; segment selection means for sequentially selecting segment data from the speech library according to music information indicating a time series of designated sounds (sounds designated as targets of synthesis); segment processing means for processing each segment data selected by the segment selection means according to the segment usage information set for that segment data in the attached information; and synthesis processing means for synthesizing speech from the segment data processed by the segment processing means. In this configuration, the synthesized sound is generated by applying the attached information to the speech library, so a synthesized sound whose acoustic characteristics differ from those of sound synthesized from the existing speech library alone can be generated without adding a new speech library. That is, diverse voices can be synthesized while reducing the effort of creating speech libraries and the capacity required to store them.

The storage means is a concept that covers both a single recording medium that stores the speech library and the attached information, and a plurality of separate recording media that store the speech library and the attached information individually. The storage means and the means for storing the music information may be separate recording media or separate storage areas set on a single recording medium.

In a preferred aspect of the present invention, the attached information includes section information indicating the section of the segment data that is used for speech synthesis, and the segment processing means extracts the section indicated by the section information from the segment data selected by the segment selection means. In this aspect, diverse synthesized sounds can be generated by making the use section of each segment data differ from that of the segment data in the existing speech library. In another aspect, the attached information includes characteristic information indicating a feature quantity within the speech segment corresponding to the segment data, and the segment processing means controls the feature quantity of the segment data selected by the segment selection means according to the characteristic information. In this aspect, diverse synthesized sounds can be generated by making the variation of the feature quantity of each segment data differ from that of the segment data in the existing speech library. Specific examples of these aspects are described later as the first embodiment.

In a preferred aspect of the present invention, the storage means stores a plurality of speech libraries, the attached information specifies a mixing ratio for the segment data of each of the speech libraries, the segment selection means selects segment data from each of the speech libraries, and the segment processing means mixes the segment data selected from the respective speech libraries at the mixing ratio indicated by the attached information. In this aspect, because the segment data selected from each speech library is mixed at the mixing ratio specified by the attached information, a synthesized sound that reflects the characteristics of the segment data of each speech library can be generated. A specific example of this aspect is described later as the second embodiment.
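As a non-limiting illustration of mixing at a specified ratio, the following Python sketch blends segment data taken from several speech libraries. It assumes, purely for illustration, that each segment data is an equal-length list of numeric values (for example, spectral-envelope samples) and that the mixing ratios are normalized before use; the function name `mix_segments` is hypothetical and does not appear in this specification.

```python
def mix_segments(segments, ratios):
    """Blend equal-length segments using the given mixing ratios.

    Assumption: each segment is a list of floats of the same length,
    and ratios are non-negative weights (normalized here to sum to 1).
    """
    if not segments or len(segments) != len(ratios):
        raise ValueError("need exactly one ratio per segment")
    total = sum(ratios)
    if total <= 0:
        raise ValueError("ratios must sum to a positive value")
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("segments must have equal length")
    weights = [r / total for r in ratios]
    # Weighted sum, position by position.
    return [
        sum(w * s[i] for w, s in zip(weights, segments))
        for i in range(length)
    ]


# Segment from library 1 weighted 3, segment from library 2 weighted 1.
mixed = mix_segments([[0.0, 1.0], [1.0, 0.0]], [3, 1])
```

A weighted sum is only one possible reading of "mixing"; an actual implementation might interpolate in another domain.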

A speech synthesizer according to a second aspect of the present invention comprises: first storage means for storing a plurality of speech libraries, each including a plurality of segment data representing speech segments, and attached information indicating a setting value of a control variable for each segment data of each speech library; second storage means for storing music information indicating a time series of designated sounds; variable instruction means for sequentially specifying instruction values of the control variable; segment selection means for selecting, from among the segment data in each of the speech libraries that correspond to the music information, the segment data whose setting value in the attached information is close to the instruction value specified by the variable instruction means; and synthesis processing means for synthesizing speech from the segment data selected by the segment selection means. In this configuration, the synthesized sound is generated using segment data selected from the respective speech libraries, so a synthesized sound whose acoustic characteristics differ from those of sound synthesized from any single existing speech library can be generated without adding a new speech library. That is, diverse voices can be synthesized while reducing the effort of creating speech libraries and the capacity required to store them. A specific example of this aspect is described later as the third embodiment.
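As a non-limiting illustration of selecting the segment data whose setting value is close to the instruction value, the following Python sketch picks the nearest candidate. The representation of candidates as `(library name, setting value)` pairs and the name `select_nearest` are assumptions made for this example only.

```python
def select_nearest(candidates, instruction_value):
    """Return the candidate whose setting value is closest to the instruction.

    Assumption: candidates is a non-empty list of (library_name, setting_value)
    pairs, one per speech library, all for the same speech segment.
    """
    return min(candidates, key=lambda c: abs(c[1] - instruction_value))


# The same speech segment exists in three libraries with different settings.
candidates = [("L1", 0.0), ("L2", 0.5), ("L3", 1.0)]
chosen = select_nearest(candidates, 0.6)
```

When two candidates are equally close, `min` returns the first one in list order; the specification does not state a tie-breaking rule.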

The speech synthesizer according to each of the above aspects may be realized by hardware (electronic circuitry) such as a DSP (Digital Signal Processor) dedicated to speech synthesis, or by cooperation between a general-purpose processing unit such as a CPU (Central Processing Unit) and a program. A program according to the first aspect of the present invention causes a computer, which comprises storage means for storing a speech library that includes a plurality of segment data representing speech segments and attached information in which segment usage information defining how segment data is used is set for each of a plurality of units, each unit consisting of one or more segment data in the speech library, to execute: segment selection processing for sequentially selecting segment data from the speech library according to music information indicating a time series of designated sounds; segment processing for processing each segment data selected in the segment selection processing according to the segment usage information set for that segment data in the attached information; and synthesis processing for synthesizing speech from the segment data processed in the segment processing. This program achieves the same operation and effects as the speech synthesizer according to the first aspect.

A program according to the second aspect of the present invention causes a computer, which comprises first storage means for storing a plurality of speech libraries, each including a plurality of segment data representing speech segments, and attached information indicating a setting value of a control variable for each segment data of each speech library, and second storage means for storing music information indicating a time series of designated sounds, to execute: variable instruction processing for sequentially specifying instruction values of the control variable; segment selection processing for selecting, from among the segment data in each of the speech libraries that correspond to the music information, the segment data whose setting value in the attached information is close to the instruction value specified in the variable instruction processing; and synthesis processing for synthesizing speech from the segment data selected in the segment selection processing. This program achieves the same operation and effects as the speech synthesizer according to the second aspect.

The program according to each aspect of the present invention may be provided to users stored on a computer-readable recording medium and installed on a computer, or may be provided from a server device by distribution over a communication network and installed on a computer.

FIG. 1 is a block diagram of a speech synthesizer according to a first embodiment.
FIG. 2 is a schematic diagram of the waveform of a speech segment.
FIG. 3 is a schematic diagram of attached information.
FIG. 4 is a schematic diagram of an edit image.
FIG. 5 is a schematic diagram for explaining processing by a speech synthesis unit.
FIG. 6 is a schematic diagram for explaining processing by the speech synthesis unit in a second embodiment.
FIG. 7 is a schematic diagram for explaining processing by the speech synthesis unit in a third embodiment.
FIG. 8 is a schematic diagram of an edit image in the third embodiment.
FIG. 9 is a schematic diagram showing the configuration of a virtual library in a modification.
FIG. 10 is a schematic diagram showing the configuration of a virtual library in a modification.
FIG. 11 is a schematic diagram showing the configuration of a virtual library in a modification.

<A: First Embodiment>
FIG. 1 is a block diagram of a speech synthesizer 100 according to the first embodiment of the present invention. The speech synthesizer 100 is a device that synthesizes various voices such as singing voices (hereinafter "synthesized sound"). As shown in FIG. 1, it is realized by a computer system comprising a control device 10, a storage device 12, an input device 14, a display device 16, and a sound emitting device 18. The following description assumes that the speech synthesizer 100 is used to synthesize the singing voice of a piece of music.

By executing the program PG stored in the storage device 12, the control device (CPU) 10 implements a plurality of functions (display control unit 22, information generation unit 24, speech synthesis unit 26) needed to generate the audio signal SOUT. The audio signal SOUT is a signal representing the waveform of the synthesized sound. A configuration in which the functions of the control device 10 are implemented by a dedicated electronic circuit (DSP), or one in which they are distributed over a plurality of integrated circuits, may also be adopted.

The input device 14 is a device (for example, a mouse or a keyboard) that accepts instructions from the user. The display device 16 (for example, a liquid crystal display) displays images as instructed by the control device 10. The sound emitting device 18 (for example, a speaker or headphones) emits sound waves corresponding to the audio signal SOUT generated by the control device 10.

The storage device 12 stores the program PG executed by the control device 10 and the various data used by the control device 10 (music information DS, speech library L, attached information A). Any known recording medium, such as a semiconductor recording medium or a magnetic recording medium, or a combination of several types of recording media may be used as the storage device 12. A configuration in which the program PG and the data (DS, L, A) are distributed across a plurality of recording media may also be adopted.

The music information DS is information (score data) indicating the time series of the notes that make up a piece of music (hereinafter "designated sounds"). Specifically, the music information DS specifies, for each designated sound in the piece, its pitch (note number), its sounding period, and its pronunciation character. The sounding period is defined, for example, by the time at which sounding starts and the length of time for which it continues. The pronunciation character is a character that indicates the content to be pronounced (the lyrics) in units of syllables.
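As a non-limiting illustration, score data of this kind could be represented as in the following Python sketch. The class name `DesignatedSound`, the field names, and the use of seconds as the time unit are assumptions made for this example; the specification only states that pitch, sounding period, and pronunciation character are specified per designated sound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignatedSound:
    note_number: int  # pitch, as a note number
    onset: float      # time at which sounding starts (seconds, assumed unit)
    duration: float   # length of time sounding continues (seconds, assumed unit)
    lyric: str        # pronunciation character, one syllable

    @property
    def end(self) -> float:
        """Time at which sounding stops."""
        return self.onset + self.duration


# Music information DS as a time series of designated sounds.
music_information = [
    DesignatedSound(note_number=60, onset=0.0, duration=0.5, lyric="あ"),
    DesignatedSound(note_number=62, onset=0.5, duration=0.5, lyric="さ"),
]
```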

The speech library L in FIG. 1 is a set of segment data V corresponding to mutually different speech segments. Each segment data V is used as material for the synthesized sound. A speech segment is, for example, a phoneme, which is the smallest unit by which speech can be aurally distinguished, or a phoneme chain formed by concatenating a plurality of phonemes. A phoneme chain is typically a chain of two phonemes (consonant-vowel, vowel-consonant, consonant-consonant, vowel-vowel), but the concept also covers chains of three or more phonemes (for example, consonant-vowel-consonant) and syllables.

As shown in FIG. 1, the segment data V of each speech segment in the speech library L comprises the waveform W of that speech segment (hereinafter "segment waveform") and section information Q that specifies the initial use section of the segment waveform W. The use section is the section of the segment waveform W that is actually used to generate the synthesized sound. FIG. 2 is a schematic diagram of the segment waveform W of a speech segment (phoneme chain) [a_s] in which the vowel phoneme [a] is followed by the consonant phoneme [s]. Within the entire span of the segment waveform W from its start point s0 to its end point e0, the section information Q specifies the initial start point of the use section (hereinafter "initial start point") qS and the initial end point of the use section (hereinafter "initial end point") qE.

The attached information A in FIG. 1 is applied to the processing of each segment data V in the speech library L. As shown in FIG. 3, the attached information A contains at least one of section information P and characteristic information F for each speech segment that the user has chosen to edit, among the speech segments (segment data V) recorded in the speech library L. For speech segments that the user has not edited, the attached information A contains neither section information P nor characteristic information F.

The section information P specifies the use section that is actually used for speech synthesis within the segment waveform W indicated by the segment data V in the speech library L. As shown in FIG. 2, the section information P variably specifies the start point pS and the end point pE of the use section. The start point pS specified by the section information P may differ from the initial start point qS specified by the section information Q in the speech library L. Likewise, the end point pE of the section information P may differ from the initial end point qE. The start point pS is specified as an amount of change relative to the initial start point qS, and the end point pE as an amount of change relative to the initial end point qE.

The characteristic information F in FIG. 3 indicates a feature quantity (in particular, its variation over time) within the segment waveform W indicated by each segment data V in the speech library L. Specifically, the characteristic information F specifies the temporal variation of the volume, pitch, formant frequency, or timbre within the segment waveform W. The temporal variation of timbre is defined, for example, by the transition of the spectrum or of the MFCCs (mel-frequency cepstral coefficients).

The display control unit 22 in FIG. 1 causes the display device 16 to display an edit image 40 that the user views in order to create and edit the music information DS and the attached information A. FIG. 4 is a schematic diagram of the edit image 40. As illustrated in FIG. 4, the edit image 40 comprises a score area 42, which displays the time series of designated sounds, and an editing area 44, which is used to edit the attached information A.

The score area 42 is a piano-roll image area with a vertical axis corresponding to pitch (pitch axis) and a horizontal axis corresponding to time (time axis). While viewing the score area 42, the user operates the input device 14 to specify the pitch, sounding period, and pronunciation character of each designated sound. The display control unit 22 places a sound indicator 51 corresponding to each designated sound specified by the user in the score area 42. The position of the sound indicator 51 along the pitch axis is determined by the pitch specified by the user, and the two ends of the sound indicator 51 along the time axis correspond to the start and end of the sounding period specified by the user. The pronunciation character specified by the user is attached to each sound indicator 51. A configuration in which an image of a score with the designated sounds notated on a staff is placed in the score area 42 may also be adopted.

The information generation unit 24 in FIG. 1 stores the pitch, sounding period, and pronunciation character that the user specified in the score area 42 for a designated sound, in association with one another, in the music information DS in the storage device 12. By repeating this processing, music information DS indicating the time series of designated sounds specified by the user is generated in the storage device 12, and the time series of sound indicators for the designated sounds is displayed in the score area 42 as illustrated in FIG. 4.

The editing area 44 comprises a waveform area 441 and a characteristic area 443. The display control unit 22 arranges, in time series within the waveform area 441, the segment waveform W of each segment data V used to synthesize the designated sound specified by the user. The segment data V whose segment waveforms W are displayed in the waveform area 441 are selected according to the pronunciation characters specified for the designated sounds. For example, when 「あさ」 (asa, "morning") is specified as the pronunciation characters as illustrated in FIG. 4, the display control unit 22 obtains from the speech library L in the storage device 12 the segment data V corresponding to each of the speech segments [#_a] ("#" denotes silence), [a], [a_s], [s_a], [a], and [a_#], and arranges the segment waveforms W in time series within the waveform area 441.
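As a non-limiting illustration, the following Python sketch reproduces the segment sequence of the 「あさ」 example from a phoneme list. The expansion rule used here (pad with silence "#", emit a chain for every adjacent pair, and add a sustained segment after each transition into a vowel) is an assumption inferred from that single example, not a rule stated in this specification; the conversion from kana to phonemes is assumed to have been done already, and the names `VOWELS` and `segment_sequence` are hypothetical.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumed vowel inventory


def segment_sequence(phonemes):
    """Expand a phoneme list into speech-segment labels.

    Assumed rule: silence "#" surrounds the phrase, every adjacent pair
    yields a phoneme chain "x_y", and each vowel reached by a chain is
    followed by its own sustained segment.
    """
    padded = ["#"] + list(phonemes) + ["#"]
    labels = []
    for prev, cur in zip(padded, padded[1:]):
        labels.append(f"{prev}_{cur}")
        if cur in VOWELS:
            labels.append(cur)
    return labels


labels = segment_sequence(["a", "s", "a"])
```

For the phonemes of 「あさ」 this yields the six labels listed in the example above.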

For each segment waveform W, the display control unit 22 places a start point indicator 532 indicating the start point of the use section and an end point indicator 534 indicating the end point of the use section. The initial position of the start point indicator 532 is set to the initial start point qS indicated by the section information Q of the segment data V in the speech library L. Likewise, the initial position of the end point indicator 534 is set to the initial end point qE indicated by the section information Q of the segment data V. By operating the input device 14, the user can select any of the segment waveforms W arranged in the waveform area 441 and instruct that its start point indicator 532 and end point indicator 534 be moved. In response to the user's instructions to the input device 14, the display control unit 22 moves the start point indicator 532 and the end point indicator 534 within the range from the start point s0 to the end point e0 of each segment waveform W.
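As a non-limiting illustration, restricting an indicator to the range from s0 to e0 can be sketched in Python as a simple clamp. The function name `clamp_indicator` and the use of integer sample positions are assumptions for this example.

```python
def clamp_indicator(position, s0, e0):
    """Keep an indicator position within [s0, e0] of the segment waveform."""
    return max(s0, min(position, e0))


# A drag past either end of the waveform stops at that end.
moved_start = clamp_indicator(-5, 0, 100)
moved_end = clamp_indicator(150, 0, 100)
```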

The display control unit 22 also places in the characteristic area 443, for each segment waveform W, a characteristic transition image 55 showing the transition of a feature quantity of the segment data V used to synthesize each designated sound. For example, as illustrated in FIG. 4, the display control unit 22 causes the display device 16 to display, as the characteristic transition image 55, a graph (line graph) showing the temporal transition of the volume of each segment data V. The characteristic transition image 55 corresponding to each segment waveform W is displayed on the same time axis as that segment waveform W in the waveform area 441. By operating the input device 14, the user can instruct that the characteristic transition image 55 be edited (changed). The display control unit 22 edits the characteristic transition image 55 in response to the user's instructions.

The information generation unit 24 updates the attached information A in response to the user's instructions in the editing area 44. Specifically, the information generation unit 24 identifies the positions of the start point indicator 532 and the end point indicator 534 specified by the user for the segment waveform W of each segment data V, generates information indicating the start point pS corresponding to the position of the start point indicator 532 and the end point pE corresponding to the position of the end point indicator 534 as the section information P of that segment data V, and stores it in the attached information A. That is, the use section specified by the section information Q in the speech library L is left unchanged, while the use section specified by the section information P of the attached information A is variably set according to the user's instructions.

The information generation unit 24 also generates characteristic information F from the characteristic transition image 55 edited by the user for the segment waveform W of each segment data V, and stores it in the attached information A. That is, the segment waveform W of each segment data V in the speech library L is left unchanged, while the feature quantity that the characteristic information F of the attached information A specifies for the segment waveform W is variably set according to the user's instructions. As described above, even when the user instructs that the start point indicator 532 or the end point indicator 534 be moved or that the characteristic transition image 55 be changed, only the attached information A is updated, and the segment data V in the speech library L are not changed in any way.
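As a non-limiting illustration of recording edits only in the attached information while leaving the speech library untouched, the following Python sketch keeps the two in separate mappings. The dictionary layout, the keys "W", "Q", "P", and "F", and the function name `record_edit` are assumptions for this example.

```python
import copy


def record_edit(attached, label, section=None, characteristic=None):
    """Return new attached information with one segment's edits recorded.

    Assumptions: section is a (start offset, end offset) pair relative to
    (qS, qE), and characteristic is a sequence of values such as a volume
    curve. Segments that were never edited get no entry at all.
    """
    updated = copy.deepcopy(attached)
    if section is None and characteristic is None:
        return updated
    entry = updated.setdefault(label, {})
    if section is not None:
        entry["P"] = tuple(section)
    if characteristic is not None:
        entry["F"] = list(characteristic)
    return updated


# The library is only read; every edit lands in the attached information.
library = {"a_s": {"W": [0.0, 0.1, 0.2, 0.1], "Q": (0, 4)}}
attached = record_edit({}, "a_s", section=(1, -1))
```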

The speech synthesis unit 26 in FIG. 1 generates the audio signal SOUT by synthesizing the designated sounds indicated by the music information DS stored in the storage device 12. In outline, the speech synthesis unit 26 generates the audio signal SOUT by processing the segment data V selected from the speech library L according to the music information DS in accordance with the attached information A, and then concatenating the results. As shown in FIG. 5, the speech synthesis unit 26 comprises a segment selection unit 32, a segment processing unit 34, and a synthesis processing unit 36. The segment selection unit 32 sequentially selects, from the speech library L in the storage device 12, the segment data V of each speech segment corresponding to the pronunciation character specified for each designated sound in the music information DS.

The segment processing unit 34 processes each segment data V selected by the segment selection unit 32. When the attached information A contains neither section information P nor characteristic information F for the segment data V selected by the segment selection unit 32, the segment processing unit 34 takes the use section of the segment waveform W specified by the section information Q of the speech library L (the section from the start point qS to the end point qE in FIG. 2) and adjusts it to the pitch and sounding period specified by the music information DS.

On the other hand, when the attached information A contains section information P for the segment data V selected by the segment selection unit 32, the segment processing unit 34 extracts, from the segment waveform W indicated by that segment data V, the use section designated by the section information P (the section from the start point pS to the end point pE), and adjusts the extracted use section to the pitch and sounding period specified by the music information DS. That is, processing according to the section information P of the attached information A (extraction of the use section) is performed on the segment data V. Any known technique may be used to adjust the pitch and sounding period of the segment data V. A configuration in which the use section according to the section information P is extracted after the pitch and sounding period of the segment data V have been adjusted may also be adopted.

When the attached information A contains characteristic information F for the segment data V selected by the segment selection unit 32, the segment processing unit 34 adjusts the use section of the segment waveform W indicated by that segment data V to the pitch and sounding period specified by the music information DS, and also processes it according to the characteristic information F. Specifically, the segment processing unit 34 controls the feature quantity of the segment waveform W according to the characteristic information F so that the characteristic designated by the characteristic information F is imparted to the use section of the segment waveform W. For example, when the characteristic information F indicates a time series of volume, the segment processing unit 34 processes the segment data V so that the volume within the use section of the segment waveform W follows that time series. The use section that is subject to processing according to the characteristic information F is the one indicated by the section information Q of the speech library L for segment data V whose attached information A contains no section information P, and the one indicated by the section information P for segment data V whose attached information A does contain section information P.
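
The choice of use section and the volume example above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the names `AttachedEntry`, `section_p`, `volume_f`, and `process_segment` are hypothetical, sections are assumed to be sample-index pairs, the volume time series is assumed to be a per-point gain stretched over the use section, and the pitch and sounding-period adjustment is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class AttachedEntry:
    """Hypothetical per-segment entry of the attached information A."""
    section_p: Optional[Tuple[int, int]] = None   # (pS, pE) as sample indices
    volume_f: Optional[Sequence[float]] = None    # gain time series over the use section


def process_segment(waveform: Sequence[float],
                    section_q: Tuple[int, int],
                    entry: AttachedEntry) -> List[float]:
    # Section information P, when present, takes the place of the library's Q.
    start, end = entry.section_p if entry.section_p is not None else section_q
    # Copy the use section: the stored segment waveform W is never modified.
    section = list(waveform[start:end])
    if entry.volume_f:
        n, m = len(section), len(entry.volume_f)
        # Stretch the gain series over the use section (nearest earlier point).
        section = [s * entry.volume_f[i * m // n] for i, s in enumerate(section)]
    return section
```

Because the function only reads `waveform`, editing an `AttachedEntry` changes the result without touching the library data, which mirrors the point that only the attached information A is updated.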

The synthesis processing unit 36 in FIG. 5 generates the audio signal SOUT by concatenating, on the time axis, the items of segment data V processed by the segment processing unit 34. As understood from the above description, the result is an audio signal SOUT of a singing voice that sings the piece of music formed by the notes indicated by the sound indicators 51 in the score area 42, using the phonetic characters of the designated sounds. Any known technique may be used to generate the audio signal SOUT from the segment data V.

As described above, in the first embodiment the synthesized sound is generated by applying the attached information A to the existing speech library L. It is therefore possible to generate an audio signal SOUT equivalent to one obtained with a new speech library (a virtual speech library) whose recorded-sound characteristics differ from those of the speech library L, without actually preparing any speech library separate from the speech library L. In other words, speech with different characteristics can be synthesized without preparing a separate speech library L for each voice. This has the advantage that a variety of speech can be synthesized, just as if a new speech library L had been created and used, while reducing the effort of creating speech libraries L. Moreover, because the attached information A contains less data than the speech library L, the capacity required of the storage device 12 is smaller than when a new speech library L is prepared.

<B: Second Embodiment>
Next, a second embodiment of the present invention will be described. In each of the following examples, elements whose operation and function are equivalent to those of the first embodiment are given the same reference signs as above, and their detailed description is omitted as appropriate.

FIG. 6 is a schematic diagram for explaining speech synthesis in the second embodiment. As shown in FIG. 6, the storage device 12 of the second embodiment stores a plurality of (two in the following example) speech libraries L (L1, L2). Each speech library L (L1, L2) is a set of segment data V (V1, V2), one item per speech segment, as in the first embodiment. The speech library L1 and the speech library L2 are generated from speech with different characteristics. For example, the speaker whose voice is the source of the speech segments differs between the speech library L1 and the speech library L2. The speech library L1 and the speech library L2 may also be generated from a plurality of voices uttered by a single speaker with different characteristics.

The segment selection unit 32 in FIG. 6 sequentially selects, from each of the speech library L1 and the speech library L2, the segment data V of each speech segment corresponding to the phonetic characters specified for each designated sound in the music information DS. Accordingly, segment data V1 in the speech library L1 and segment data V2 in the speech library L2 are selected in sequence for each speech segment corresponding to the phonetic characters.

As shown in FIG. 6, the attached information A of the second embodiment specifies, for each of the speech segments recorded in the speech library L1 and the speech library L2, a mixing ratio R between the segment data V1 in the speech library L1 and the segment data V2 in the speech library L2. Each mixing ratio R specified by the attached information A is set variably, for example according to instructions given by the user through the input device 14.

The segment processing unit 34 generates segment data VA by mixing (adding) the segment data V1 that the segment selection unit 32 selected from the speech library L1 and the segment data V2 that it selected from the speech library L2, at the mixing ratio R that the attached information A specifies for that speech segment. The adjustment of pitch and sounding period according to the music information DS is performed before or after the mixing by the segment processing unit 34, using the method exemplified in the first embodiment. The synthesis processing unit 36 generates the audio signal SOUT from the segment data VA produced by the segment processing unit 34 (that is, after mixing).
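
The mixing step can be sketched as a weighted sum. This is only an illustrative sketch: the source says the two items of segment data are mixed (added) at the mixing ratio R but does not define R numerically, so treating R as the weight of V1 and 1 - R as the weight of V2 is an assumption, as are the function name `mix_segments`, the sample-list representation, and the truncation to the shorter input.

```python
from typing import List, Sequence


def mix_segments(v1: Sequence[float], v2: Sequence[float], ratio: float) -> List[float]:
    """Mix two time-aligned segments; `ratio` is the assumed weight of v1."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    # zip stops at the shorter input, so unequal lengths are truncated
    # (a simplifying assumption for this sketch).
    return [ratio * a + (1.0 - ratio) * b for a, b in zip(v1, v2)]
```

With a per-segment table such as `{"a_s": 0.25}` standing in for the attached information A, the ratio can be looked up by speech segment before each call, so neither library needs to be rewritten to obtain a new blend.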

In the above embodiment, the segment data V1 of the speech library L1 and the segment data V2 of the speech library L2 are mixed at the mixing ratio R of the attached information A and then used to generate the audio signal SOUT. It is therefore possible, without creating a new speech library L, to generate an audio signal SOUT equivalent to one obtained with a speech library that reflects the characteristics of both the speech library L1 and the speech library L2 (that is, a virtual speech library composed of segment data obtained by mixing the segment data V1 of the speech library L1 with the segment data V2 of the speech library L2). As in the first embodiment, this has the advantage that a variety of speech can be synthesized, just as if a new speech library L had been prepared, while reducing both the effort of creating speech libraries L and the capacity required of the storage device 12.

<C: Third Embodiment>
FIG. 7 is a schematic diagram for explaining speech synthesis in a third embodiment of the present invention. As shown in FIG. 7, the storage device 12 of the third embodiment stores, as in the second embodiment, a plurality of (two in the following example) speech libraries L (L1, L2) generated from speech with different characteristics.

The storage device 12 also stores variable information DP indicating how the value of a control variable (control parameter) X applied to speech synthesis changes over time. The control variable X is a variable for controlling the musical expression given to the synthesized sound. Specifically, any known variable applied to speech synthesis may be adopted as the control variable X, for example the strength with which a designated sound is uttered (velocity), volume (dynamics), strength of the breath component (breathiness), clarity (brightness, clearness), degree of mouth opening during utterance (opening), gender of the speaker (gender factor), the point in time at which the pitch starts to change continuously, that is, portamento (portamento timing), small pitch variation (pitch bend), and the maximum width of small pitch variation (pitch-bend sensitivity). A configuration may also be adopted in which a combination of several variables selected from these examples (for example, several variables specified by the user) is designated by the variable information DP as a new (virtual) control variable X.

FIG. 8 is a schematic diagram of the edit image 40 in the third embodiment. As illustrated in FIG. 8, the display control unit 22 places an image 57 showing the time series of the control variable X indicated by the variable information DP (hereinafter referred to as the "variable transition image") in the editing area 46, on a time axis shared with the time series of the sound indicators 51 in the score area 42. Specifically, a graph showing the transition of the value of the control variable X (for example, a line graph) is displayed as the variable transition image 57. The display control unit 22 changes the variable transition image 57 as needed according to instructions given by the user through the input device 14. The information generation unit 24 updates the variable information DP in the storage device 12 so that it represents the time series of the control variable X corresponding to the changed variable transition image 57. That is, the variable information DP is set variably according to instructions from the user.

As shown in FIG. 7, the attached information A of the third embodiment specifies, for each speech segment, a value xA of the control variable X (hereinafter referred to as the "set value") for each of the segment data V1 of the speech library L1 and the segment data V2 of the speech library L2. A set value xA1 is specified for each item of segment data V1, and a set value xA2 for each item of segment data V2. Segment data V1 and segment data V2 for the same speech segment have different set values xA (xA1, xA2). For example, FIG. 7 illustrates a case in which the set value xA1 of the segment data V1 for the speech segment [a_s] is 0.2 and the set value xA2 of the segment data V2 for the same speech segment [a_s] is 0.6. Each set value xA1 and each set value xA2 is set variably, for example according to instructions given by the user through the input device 14.

As shown in FIG. 7, the speech synthesis unit 26 of the third embodiment includes a variable instruction unit 38 in addition to the segment selection unit 32, the segment processing unit 34, and the synthesis processing unit 36. The variable instruction unit 38 sequentially specifies a value xB of the control variable X (hereinafter referred to as the "instruction value") to the segment selection unit 32. Specifically, the variable instruction unit 38 sequentially obtains from the storage device 12 the values of the control variable X that the variable information DP specifies in time series, and passes each one to the segment selection unit 32 as the instruction value xB.

For each speech segment corresponding to the phonetic characters of the music information DS, the segment selection unit 32 sequentially selects one of the corresponding items of segment data V in the speech library L1 and the speech library L2 (segment data V1 or segment data V2), according to the set values xA (xA1, xA2) defined for them in the attached information A and the instruction value xB from the variable instruction unit 38. Specifically, of the segment data V1 and the segment data V2 corresponding to the phonetic characters, the segment selection unit 32 selects as segment data VA the one whose set value xA (xA1 or xA2) in the attached information A is closer to the instruction value xB from the variable instruction unit 38.

For example, suppose that synthesis of the speech segment [a_s] is instructed while the attached information A has the content illustrated in FIG. 7. When the instruction value xB from the variable instruction unit 38 is, for example, 0.3, the set value closer to xB (0.3) among the set value xA1 (0.2) and the set value xA2 (0.6) defined in the attached information A for the speech segment [a_s] is xA1, so the segment selection unit 32 selects the corresponding segment data V1 from the speech library L1 as the segment data VA. When the instruction value xB is, for example, 0.5, the set value closer to xB (0.5) is xA2 (0.6), so the segment selection unit 32 selects the corresponding segment data V2 from the speech library L2 as the segment data VA. The speech library L from which the segment selection unit 32 selects the segment data VA therefore switches between the speech library L1 and the speech library L2 according to the instruction value xB from the variable instruction unit 38 (that is, the time series of the control variable X defined by the variable information DP).
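
The selection rule in this example can be sketched as a nearest-value lookup. This is only an illustrative sketch: the function name `select_library` and the dictionary representation of the set values are hypothetical, and the source does not say what happens when the instruction value is equally close to both set values, so tie-breaking here is simply whatever `min` returns.

```python
from typing import Dict


def select_library(set_values: Dict[str, float], xb: float) -> str:
    """Return the name of the library whose set value xA is closest to xB."""
    return min(set_values, key=lambda name: abs(set_values[name] - xb))


# Set values for the speech segment [a_s], as in the FIG. 7 example.
a_s = {"L1": 0.2, "L2": 0.6}
choice_low = select_library(a_s, 0.3)   # closer to xA1 = 0.2
choice_high = select_library(a_s, 0.5)  # closer to xA2 = 0.6
```

Feeding successive instruction values from the variable information DP into this function makes the chosen library switch over time, as described above.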

The segment processing unit 34 adjusts the segment data VA selected by the segment selection unit 32 to the pitch and sounding period specified by the music information DS. The same methods as in the first embodiment may be used for the processing of segment data by the segment processing unit 34. The synthesis processing unit 36 generates the audio signal SOUT by concatenating the segment data VA processed by the segment processing unit 34.

As described above, in the third embodiment the segment data VA is selected from one of the plurality of speech libraries L (L1, L2), according to how the set values xA specified for the speech libraries compare with the instruction value xB from the variable instruction unit 38, and is used to generate the synthesized sound. It is therefore possible, without creating a new speech library L, to generate a variety of audio signals SOUT equivalent to those obtained with a speech library that reflects the characteristics of both the speech library L1 and the speech library L2 (that is, a virtual speech library composed of segment data taken, for each speech segment, from either the speech library L1 or the speech library L2). As in the first embodiment, this has the advantage that a variety of speech can be synthesized, just as if a new speech library L had been prepared, while reducing both the effort of creating speech libraries L and the capacity required of the storage device 12.

<D: Modifications>
Each of the above embodiments may be modified in various ways. Specific modifications are exemplified below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate.

(1) Modification 1
In each of the above embodiments, a configuration in which the speech library L contains one item of segment data V per speech segment was illustrated for convenience, but a configuration in which the segment data V is further subdivided may also be adopted. For example, the speech library L may contain, for each speech segment, a plurality of items of segment data V that differ in an acoustic attribute such as pitch (frequency) or volume (hereinafter referred to as a "segment attribute"). From the plurality of items of segment data V corresponding to the speech segment of the phonetic characters specified for a designated sound, the segment selection unit 32 selects the segment data V having the segment attribute specified for that designated sound (for example, the segment data V of the pitch specified by the music information DS).

This configuration has the advantage that a greater variety of synthesized sounds can be generated than with a configuration that provides one item of segment data V per speech segment. On the other hand, the data volume of the speech library L grows with the number of items of segment data V. The effect of each embodiment, namely that synthesized sounds can be diversified while reducing the capacity required of the storage device 12, is therefore especially pronounced under Modification 1, in which segment data V is prepared according to segment attributes (pitch and volume) as well as speech segments.

(2) Modification 2
In each of the above embodiments, the information that governs how the speech synthesis unit 26 uses the segment data V (selection by the segment selection unit 32 or processing by the segment processing unit 34), hereinafter referred to as "segment usage information", was set in the attached information A for each item of segment data V (that is, for each speech segment). As exemplified below, a configuration in which segment usage information is set for a group of several items of segment data V may also be adopted. Segment usage information is a concept that encompasses the section information P and the characteristic information F of the first embodiment, the mixing ratio R of the second embodiment, and the set value xA of the control variable X of the third embodiment.

For example, a configuration may be adopted in which segment usage information is set in the attached information A for each class of speech segment. Possible classifications include classification by phoneme structure (single phoneme or phoneme chain) and classification by the presence or absence of vowels or consonants. For example, segment usage information is set in the attached information A for each of two sets: the segment data V of the speech segments that consist of a single phoneme, and the segment data V of the speech segments that consist of a phoneme chain. The same segment usage information is applied to the segment data V of all speech segments in the same class.

A configuration in which common segment usage information is set for all the segment data V in the speech library L may also be adopted. In the first embodiment, for example, a single item of section information P or a single item of characteristic information F is applied in common to all the segment data V. In the second embodiment, the segment data V1 in the speech library L1 and the segment data V2 in the speech library L2 are mixed at a common mixing ratio R regardless of the speech segment. In the third embodiment, either the segment data V1 of the speech library L1 or the segment data V2 of the speech library L2 is selected according to common set values xA (xA1, xA2) regardless of the speech segment.

When segment data V is prepared for each segment attribute such as pitch or volume, as in Modification 1, segment usage information may be set either for each speech segment regardless of the segment attribute, or for each segment attribute. In the former configuration, common segment usage information is applied to all items of segment data V for the same speech segment, regardless of their segment attributes. In the latter configuration, separate segment usage information is applied to items of segment data V with different segment attributes, even when they are for the same speech segment.
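
The granularities described in Modification 2 can be sketched as a single lookup table. This is only an illustrative sketch: the source presents these as alternative configurations and does not prescribe combining them, so the most-specific-first fallback order below is an assumption, as are the names `UsageTable` and `resolve_usage` and the use of plain dictionaries for segment usage information.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class UsageTable:
    """Hypothetical container for segment usage information in attached information A."""
    per_attribute: Dict[Tuple[str, str], Any] = field(default_factory=dict)  # (segment, attribute)
    per_segment: Dict[str, Any] = field(default_factory=dict)                # speech segment
    per_class: Dict[str, Any] = field(default_factory=dict)                  # segment class
    common: Optional[Any] = None                                             # whole library


def resolve_usage(table: UsageTable, segment: str,
                  attribute: Optional[str] = None,
                  segment_class: Optional[str] = None) -> Optional[Any]:
    # Assumed order: the most specific entry wins, then progressively coarser ones.
    if attribute is not None and (segment, attribute) in table.per_attribute:
        return table.per_attribute[(segment, attribute)]
    if segment in table.per_segment:
        return table.per_segment[segment]
    if segment_class is not None and segment_class in table.per_class:
        return table.per_class[segment_class]
    return table.common
```

A table that fills in only `common` corresponds to the library-wide configuration, and one that fills in only `per_class` corresponds to the per-classification configuration.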

(3) Modification 3
The first to third embodiments may be combined as appropriate, as exemplified below. In the following description, the virtual speech library composed of the segment data VA obtained by applying the attached information A to each item of segment data V in the speech library L is referred to, for convenience, as a "virtual library". As understood from the description of each embodiment, a set of segment data VA covering all speech segments (a speech library) is not actually generated in the storage device 12; rather, the segment data VA is generated item by item as the attached information A is applied to each item of segment data V of the speech library L. The word "virtual" is used for this reason.

In the first embodiment, for example, the virtual library LV corresponds to the set of segment data VA that would be generated if the attached information A were applied to all the segment data V of the existing speech library L. In the second embodiment, the virtual library LV corresponds to the set of segment data VA that would be obtained if the mixing of segment data V1 in the speech library L1 with segment data V2 in the speech library L2 at the mixing ratio R were performed for every pair of segment data V1 and segment data V2 for the same speech segment. Similarly, in the third embodiment, the virtual library LV corresponds to the set of segment data VA that would be obtained if the selection of either segment data V1 in the speech library L1 or segment data V2 in the speech library L2 according to the set values xA (xA1, xA2) were performed for every pair of segment data V1 and segment data V2 for the same speech segment.

First, as shown in FIG. 9, a virtual library LV3 is formed by applying attached information A3 of the second or third embodiment to a virtual library LV1 and a virtual library LV2. The virtual library LV1 is formed, for example, by applying attached information A1 of the first embodiment, which contains section information P and characteristic information F, to the existing speech library L1. Similarly, the virtual library LV2 is formed by applying attached information A2 of the first embodiment to the existing speech library L2. Further, as shown in FIG. 10, a virtual library LV4 is formed by applying attached information A4 of the second or third embodiment to the virtual library LV1 and the existing speech library L2. By combining the first to third embodiments in this way, a variety of virtual libraries LV corresponding to speech with various characteristics can be constructed.

As shown in FIG. 11, a configuration in which a plurality of items of attached information A (A1, A2) are prepared for one existing speech library L may also be adopted. Applying the attached information A1 to the speech library L yields the virtual library LV1, and applying the attached information A2 to the speech library L yields the virtual library LV2. That is, as many virtual libraries LV as there are items of attached information A are generated from a single speech library L.

(4) Modification 4
In the first embodiment, the section information P of the attached information A designated the use section of the segment waveform W of the segment data V. A configuration may also be adopted in which the section information P designates the section of the segment data V used for interpolation (crossfading) of a vowel for which sustained sounding is specified in the music information DS. For example, suppose that speech for the phonetic characters "asaga" (あさが, "the morning") is generated from the speech segments [#_a], [a], [a_s], [s_a], [a], [a_g], [g_a], and [a_#]. The [a] between [s_a] and [a_g] is then synthesized by interpolating between the rear section of the speech segment [s_a] indicated by its section information P and the front section of the speech segment [a_g] indicated by its section information P.
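
The interpolation in Modification 4 can be sketched as a linear crossfade. This is only an illustrative sketch: the source does not specify the interpolation method or the domain in which it operates, so the linear weighting, the sample-list representation, the nearest-earlier-sample stretching, and the name `crossfade` are all assumptions.

```python
from typing import List, Sequence


def crossfade(tail: Sequence[float], head: Sequence[float], length: int) -> List[float]:
    """Interpolate from `tail` (rear section of [s_a]) to `head` (front section of [a_g])."""
    if length < 2 or not tail or not head:
        raise ValueError("need length >= 2 and non-empty sections")
    out = []
    for i in range(length):
        t = i / (length - 1)                    # weight runs from 0.0 to 1.0
        a = tail[i * len(tail) // length]       # stretch tail over the output
        b = head[i * len(head) // length]       # stretch head over the output
        out.append((1.0 - t) * a + t * b)
    return out
```

Here `length` stands for the duration of the sustained vowel required by the music information DS, and the two input sections are the ones that the section information P would designate.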

(5) Modification 5
In the third embodiment, the variable information DP is stored in the storage device 12, but the way in which the variable instruction unit 38 designates the instruction value xB of the control variable X may be changed as appropriate. For example, a configuration in which the variable instruction unit 38 designates the instruction value xB in time series according to input to the input device 14, or a configuration in which the variable instruction unit 38 sequentially passes to the segment selection unit 32 the instruction values xB that are sequentially received from a communication network, may also be adopted. That is, the variable instruction unit 38 is broadly any element that sequentially designates the instruction value xB of the control variable X, and the configuration in which the variable information DP is prepared in advance and stored in the storage device 12 may be omitted.
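
One way to picture the variable instruction unit 38 as "any element that sequentially designates xB" is as an iterator whose source is interchangeable. The sketch below is only an assumption-laden illustration: the function names, the use of `None` as an end-of-stream marker, and the queue that stands in for an input device or network receiver are all hypothetical, and what the segment selection unit 32 does with each value is deliberately not modeled.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Optional


def from_stored(variable_info: Iterable[float]) -> Iterator[float]:
    """Values prepared in advance, as with stored variable information DP."""
    yield from variable_info


def from_callback(read_value: Callable[[], Optional[float]]) -> Iterator[float]:
    """Values polled from a live source until it returns None (hypothetical)."""
    # iter(callable, sentinel) calls read_value() until it returns the sentinel.
    yield from iter(read_value, None)


def feed_selection_unit(instruction_values: Iterator[float]) -> List[float]:
    """Toy consumer: records each xB in the order it is designated."""
    return [x for x in instruction_values]


# Source 1: values stored beforehand.
stored_values = feed_selection_unit(from_stored([0.0, 0.5, 1.0]))

# Source 2: a queue standing in for an input device or a network receiver.
incoming = deque([0.2, 0.8])
live_values = feed_selection_unit(
    from_callback(lambda: incoming.popleft() if incoming else None)
)

print(stored_values, live_values)
```

Because the consumer sees only an iterator, the stored-in-advance source can be dropped without changing the downstream code, which is the substitution this modification describes.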

(6) Modification 6
In each of the above embodiments, the segment data V represents the segment waveform W, but the content of the segment data V may be changed as appropriate. For example, information indicating the result of analyzing a speech segment by a known method (for example, information on the frequency spectrum of the speech segment) may be used as the segment data V.

(7) Modification 7
In each of the above embodiments, the music information DS is edited in response to instructions from the user, but the editing of the music information DS may be omitted. That is, a configuration may also be adopted in which music information DS stored in advance in the storage device 12, or music information DS provided to the speech synthesis apparatus 100 via a portable recording medium or a communication network, is used to generate the synthesized sound. Accordingly, the information generation unit 24 in each of the above embodiments may be omitted.

100 ... speech synthesis apparatus, 10 ... control device, 12 ... storage device, 14 ... input device, 16 ... display device, 18 ... sound emitting device, 22 ... display control unit, 24 ... information generation unit, 26 ... speech synthesis unit, 32 ... segment selection unit, 34 ... segment processing unit, 36 ... synthesis processing unit, 38 ... variable instruction unit.

Claims

A speech synthesis apparatus comprising:
storage means for storing a speech library containing a plurality of pieces of segment data each indicating a speech segment, and attached information in which segment usage information defining how segment data is to be used is set for each of a plurality of targets, each target being a unit of one or more pieces of segment data in the speech library;
segment selection means for sequentially selecting segment data from the speech library in accordance with music information indicating a time series of designated sounds;
segment processing means for processing each piece of segment data selected by the segment selection means in accordance with the segment usage information set for that segment data in the attached information; and
synthesis processing means for synthesizing speech from the segment data processed by the segment processing means.

The speech synthesis apparatus according to claim 1, further comprising display control means for displaying, on a display device, a time series of the segment waveforms of the segment data corresponding to the designated sounds specified by the music information, a start point indicator indicating the start point of a use section used for speech synthesis in each segment waveform, and an end point indicator indicating the end point of the use section, and for moving each of the start point indicator and the end point indicator in accordance with an instruction from a user, wherein
the segment usage information set for each piece of segment data in the attached information includes section information indicating the use section defined by the start point indicator and the end point indicator in the segment waveform of that segment data, and
the segment processing means extracts the section indicated by the section information from the segment data selected by the segment selection means.

The speech synthesis apparatus according to claim 2, wherein the display control means displays, on the display device, an image indicating the time series of the designated sounds indicated by the music information, in parallel with the time series of the segment waveforms.

The speech synthesis apparatus according to claim 2 or 3, wherein
the display control means displays on the display device, for each segment waveform, a characteristic transition image showing the transition of a feature amount of the segment data corresponding to the designated sound specified by the music information, on a time axis common to the segment waveform, and edits the characteristic transition image in accordance with an instruction from the user,
the segment usage information set for each piece of segment data in the attached information includes characteristic information indicating a feature amount according to the characteristic transition image of that segment data, and
the segment processing means controls the feature amount of the segment data selected by the segment selection means in accordance with the characteristic information.

The speech synthesis apparatus according to claim 4, wherein
the storage means stores the attached information in which the segment usage information is set for each speech-segment classification of the segment data in the speech library, and
the segment processing means commonly applies the segment usage information set for one classification in the attached information to the processing of the segment data of every speech segment belonging to that classification.

A speech synthesis method in which a computer, comprising storage means for storing a speech library containing a plurality of pieces of segment data each indicating a speech segment, and attached information in which segment usage information defining how segment data is to be used is set for each of a plurality of targets, each target being a unit of one or more pieces of segment data in the speech library:
sequentially selects segment data from the speech library in accordance with music information indicating a time series of designated sounds;
processes each selected piece of segment data in accordance with the segment usage information set for that segment data in the attached information; and
synthesizes speech from the processed segment data.