JP2017520016A5 - Method for forming the excitation signal for a glottal pulse model based parametric speech synthesis system

Info

Publication number: JP2017520016A5
Application number: JP2016567717A
Authority: JP
Filing date: 2014-05-28
Publication date: 2018-08-16
Anticipated expiration: 2034-05-28

Description

The present invention relates generally to telecommunications systems and methods, as well as to speech synthesis. More particularly, it relates to forming the excitation signal in a hidden Markov model (HMM) based statistical parametric speech synthesis system.

A method is provided for forming the excitation signal for a glottal pulse model based parametric speech synthesis system. In one embodiment, fundamental frequency values are used to form the excitation signal. The excitation is modeled using source pulses selected from a database for a given speaker. The source signal is segmented into glottal segments, which are used in a vector representation to identify the glottal pulses used to form the excitation signal. Using a novel distance metric and preserving the original signal extracted from the speaker's voice samples helps capture the low-frequency information of the excitation signal. In addition, segment-edge artifacts are removed by applying a unique segment joining method, which improves the quality of the synthesized speech while accurately representing the speaker's voice quality.

In yet another embodiment, a method of synthesizing speech from input text is presented, comprising the steps of: a) converting the input text into context-dependent phoneme labels; b) processing the phoneme labels created in step (a) using a trained parametric model to predict fundamental frequency values, the duration of the synthesized speech, and the spectral characteristics of the phoneme labels; c) creating an excitation signal using an eigen glottal pulse and one or more of the predicted fundamental frequency values, the spectral characteristics of the phoneme labels, and the duration of the synthesized speech; and d) combining the excitation signal with the spectral characteristics of the phoneme labels using a filter to create the synthesized speech output.

FIG. 1 is a diagram illustrating an embodiment of a hidden Markov model based text-to-speech system.
FIG. 2 is a diagram illustrating an embodiment of a signal.
FIG. 3 is a diagram illustrating an embodiment of excitation signal creation.
FIG. 4 is a diagram illustrating an embodiment of excitation signal creation.
FIG. 5 is a diagram illustrating an embodiment of overlapping boundaries.
FIG. 6 is a diagram illustrating an embodiment of excitation signal creation.
FIG. 7 is a diagram illustrating an embodiment of glottal pulse identification.
FIG. 8 is a diagram illustrating an embodiment of glottal pulse database creation.

The excitation is generally assumed to be a quasi-periodic sequence of impulses in voiced regions. Each impulse is separated from the previous one by a time interval T₀ = 1/F₀, where T₀ is the pitch period and F₀ is the fundamental frequency. In unvoiced regions, the excitation is modeled as white noise. In voiced regions, however, the excitation is not actually an impulse train. Rather, it is a sequence of source pulses produced by the vibration of the vocal folds. The shape of these pulses may vary with factors such as the speaker, the speaker's mood, the linguistic context, and emotion.
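The idealized excitation described above (an impulse train in voiced regions, white noise in unvoiced regions) can be sketched as follows. This is only an illustration of the conventional model that the invention improves on; the function name, the 16 kHz sampling rate, and the unit noise variance are assumptions made for the example and do not come from the patent.

```python
import random

def classical_excitation(f0_hz, n_samples, fs=16000, voiced=True, seed=0):
    """Idealized excitation: impulse train (voiced) or white noise (unvoiced)."""
    if not voiced:
        rng = random.Random(seed)
        # Unvoiced: zero-mean Gaussian white noise (unit variance is an assumption).
        return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # Voiced: one impulse per pitch period T0 = 1/F0, expressed in samples.
    period = round(fs / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

# 100 Hz at 16 kHz gives a pitch period of 160 samples.
exc = classical_excitation(100.0, 800)
```

In this sketch every pulse is a single sample, which is exactly the simplification the text points out: real source pulses have a shape that depends on the speaker and context.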

As described in European patent EP2242045 (granted June 27, 2012; inventors Thomas Drugman et al.), source pulses have been treated mathematically as vectors through length normalization (by resampling) and impulse alignment. The final length of the normalized source pulse signal is resampled to match the target pitch. The source pulse is not selected from a database but is obtained through a series of calculations that modify the pulse characteristics in the frequency domain. In addition, because no pre-filtering is performed when determining the linear prediction (LP) coefficients, which are used for inverse filtering, the approximate excitation signal used to create the pulse database does not capture the low-frequency source content.

In statistical parametric speech synthesis, speech unit signals are represented by a set of parameters that can be used to synthesize speech. The parameters may be learned by a statistical model such as an HMM. In one embodiment, speech may be represented as a source-filter model, in which the source (excitation) is a signal that produces a given sound when passed through an appropriate filter. FIG. 1 is a diagram illustrating an embodiment of a hidden Markov model (HMM) based text-to-speech (TTS) system. An embodiment of the exemplary system may include two phases, for example a training phase and a synthesis phase.

The speech database 105 may contain an amount of speech data for use in speech synthesis. During the training phase, the speech signal 106 is converted into parameters, which may include excitation parameters and spectral parameters. Excitation parameter extraction 110 and spectral parameter extraction 115 operate on the speech signal 106 supplied from the speech database 105. The hidden Markov model 120 may be trained using these extracted parameters and the labels 107 from the speech database 105. Any number of HMMs may result from training, and these context-dependent HMMs are stored in a database 125.

The synthesis phase begins with the context-dependent HMMs 125, which are used for parameter generation 140. Parameter generation 140 may use input from a corpus of text 130 from which speech is to be synthesized. The text 130 may undergo analysis 135, and the extracted labels 136 are used in parameter generation 140. In one embodiment, excitation parameters and spectral parameters may be generated at 140.

The excitation parameters may be used to generate the excitation signal 145, which is input to the synthesis filter 150 together with the spectral parameters. The filter parameters are generally Mel-frequency cepstral coefficients (MFCCs) and are often modeled as a statistical time series using HMMs. The predicted time-series values of the filter and of the fundamental frequency may be used for synthesis: the excitation signal is created from the fundamental frequency values, and the MFCC values are used to form the filter.

Synthesized speech 155 is produced when the excitation signal passes through the filter. The formation of the excitation signal 145 is essential to the quality of the output, the synthesized speech 155. In conventional approaches, the low-frequency information of the excitation is not captured. It will therefore be appreciated that a method is needed for capturing the low-frequency source content of the excitation signal and improving the quality of the synthesized speech.

FIG. 2 is a graph of an embodiment of the signal regions of a speech segment, indicated generally at 200. The signal is classified into segments based on the fundamental frequency values, namely voiced segments, unvoiced segments, and pause segments. The vertical axis 205 shows the fundamental frequency in hertz (Hz), and the horizontal axis 210 shows elapsed time in milliseconds (ms). The time series F₀ 215 represents the fundamental frequency. The voiced region 220 appears as a series of peaks and may be regarded as a non-zero segment. As described in further detail below, the non-zero segments 220 may be concatenated to form the excitation signal for the entire utterance. The unvoiced region 225 shows no peaks in graph 200 and may be regarded as a zero segment. A zero segment may represent a pause or an unvoiced segment, as given by the phoneme labels.
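The split of an F₀ contour into non-zero (voiced) and zero segments can be illustrated with a short sketch. The function name and the frame-level list representation of F₀ are assumptions for this example; the patent does not prescribe a data layout.

```python
from itertools import groupby

def split_f0_segments(f0):
    """Split a frame-level F0 contour into runs of non-zero and zero frames.

    Returns a list of (kind, start, end) tuples with `end` exclusive.
    """
    segments, start = [], 0
    for is_voiced, run in groupby(f0, key=lambda v: v > 0):
        length = len(list(run))
        kind = "voiced" if is_voiced else "zero"
        segments.append((kind, start, start + length))
        start += length
    return segments

f0 = [0, 0, 120, 125, 130, 0, 0, 0, 110, 112]
segments = split_f0_segments(f0)
```

The F₀ values alone cannot tell a pause from an unvoiced sound, so each "zero" run would still need the phoneme labels to decide which of the two it represents.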

FIG. 3 is a diagram illustrating an embodiment of excitation signal creation, indicated generally at 300. FIG. 3 shows the creation of the excitation signal for both unvoiced segments and pause segments. The fundamental frequency time series, denoted F₀, represents the signal region 305, which is classified into voiced, unvoiced, and pause segments based on the F₀ values.

The excitation signal 320 is created for the unvoiced and pause segments. Where a pause occurs, zeros (0) are placed in the excitation signal. In unvoiced regions, white noise of appropriate energy is used as the excitation signal (in one embodiment, the energy may be determined empirically through listening tests).
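A minimal sketch of this rule for zero segments follows. The `noise_gain` parameter stands in for the "appropriate energy" that the text says is tuned by listening tests; its default value, like the function name, is a placeholder chosen for the example.

```python
import random

def zero_segment_excitation(kind, n_samples, noise_gain=0.1, seed=0):
    """Excitation for a zero-F0 segment: silence for pauses, noise for unvoiced."""
    if kind == "pause":
        # Pause: the excitation is filled with zeros.
        return [0.0] * n_samples
    if kind == "unvoiced":
        # Unvoiced: scaled Gaussian white noise (gain is a hypothetical value).
        rng = random.Random(seed)
        return [noise_gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    raise ValueError("kind must be 'pause' or 'unvoiced'")
```

For instance, `zero_segment_excitation("pause", 4)` returns four zeros, while the `"unvoiced"` case returns a noise burst of the requested length.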

The signal region 305 is used together with the glottal pulse 310 for excitation generation 315, which in turn produces the excitation signal 320. The glottal pulse 310 comprises an eigen glottal pulse identified from the glottal pulse database, whose creation is described in further detail with reference to FIG. 8 below.

FIG. 4 is a diagram illustrating an embodiment of excitation signal creation for a voiced segment, indicated generally at 400. It is assumed that the eigen glottal pulse has been identified from the glottal pulse database (described in further detail with reference to FIG. 7 below). The signal region 405 contains the F₀ values for the voiced segment, which may be predicted by the model. The length of the F₀ segment, which may be denoted N_f, is used to determine the length of the excitation signal using a mathematical equation.

Glottal boundaries 410 are created from the F₀ values; they mark the pitch boundaries of the excitation signal for the voiced segment in the signal region 405. The pitch period array can be calculated using the following mathematical equation.
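The patent's own equation for the pitch period array is not reproduced in this text. As a rough illustration of the idea only, the sketch below places pitch boundaries by stepping through the voiced segment and advancing by the local pitch period fs/F₀ each time. The frame shift of 80 samples, the 16 kHz sampling rate, and the function name are all assumptions, and this common scheme should not be read as the patent's formula.

```python
def pitch_boundaries(f0_frames, fs=16000, frame_shift=80):
    """Place pitch boundaries (sample indices) across one voiced segment.

    Assumed scheme: start at sample 0 and repeatedly advance by the pitch
    period fs / F0 of the frame that contains the current position.
    """
    total = len(f0_frames) * frame_shift
    marks, pos = [], 0.0
    while pos < total:
        marks.append(int(round(pos)))
        frame = min(int(pos // frame_shift), len(f0_frames) - 1)
        pos += fs / f0_frames[frame]  # local pitch period in samples
    return marks

# Ten frames at a constant 100 Hz: 800 samples, one boundary every 160 samples.
marks = pitch_boundaries([100.0] * 10)
```

With a varying F₀ contour the spacing between boundaries changes from one period to the next, which is what makes the boundaries "pitch boundaries" rather than a fixed grid.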

The glottal pulse 415 is used together with the identified glottal boundaries 410 in the overlap-add 420 of glottal pulses starting at each glottal boundary. The excitation signal 425 is then created through a process of "stitching", or segment joining, to avoid boundary effects, as described further with reference to FIGS. 5 and 6.
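Overlap-add of a glottal pulse at each glottal boundary can be sketched as below. The function name and the list-based signal representation are assumptions for illustration; samples that would fall past the end of the segment are simply dropped here, which is a choice of this sketch.

```python
def overlap_add(pulse, boundaries, n_samples):
    """Add a copy of `pulse` starting at every boundary; overlaps are summed."""
    excitation = [0.0] * n_samples
    for b in boundaries:
        for k, value in enumerate(pulse):
            if b + k < n_samples:
                excitation[b + k] += value
    return excitation

# A 3-sample pulse placed every 2 samples overlaps its neighbor by one sample.
exc = overlap_add([1.0, 0.5, 0.25], [0, 2, 4], 6)
```

Where a pulse is longer than the spacing between boundaries, its tail is summed with the head of the next pulse, which is the overlap illustrated in FIG. 5.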

FIG. 5 is a diagram illustrating an embodiment of overlapping boundaries, indicated generally at 500. Diagram 500 shows a series of glottal pulses 515 in a segment and the overlapping glottal pulses 520. The vertical axis 505 represents the amplitude of the excitation. The horizontal axis 510 may represent the frame number.

FIG. 6 is a diagram illustrating an embodiment of excitation signal creation for a voiced segment, indicated generally at 600. "Stitching" may be used to form the final excitation signal of the voiced segment (from FIG. 4), ideally free of boundary effects. In one embodiment, any number of different excitation signals may be formed through the overlap-add method illustrated in FIG. 4 and diagram 500 (FIG. 5). The different excitation signals may have a constantly increasing amount of shift in the glottal boundaries 605 and an equal amount of circular left shift 630 applied to the glottal pulse signal. In one embodiment, if the glottal pulse signal 615 is shorter than the corresponding pitch period, the glottal pulse may be zero-extended 625 to the length of the pitch period before the circular left shift 630 is performed. Different arrays of pitch boundaries, denoted P^m(i), m = 1, 2, ..., M-1, are formed, each of the same length as P^0. The arrays are calculated using the following mathematical equation.

For each set of pitch boundaries, an excitation signal 635 is formed, starting from a signal initialized to zero (0). Beginning at each pitch boundary value of the array P^m(i), i = 1, 2, ..., K, overlap-add 610 is used to add the glottal pulse 620 to the first N samples of the excitation. The resulting signal is the single stitched excitation corresponding to shift m.

In one embodiment, the arithmetic mean 640 of all the single stitched excitation signals is calculated, and this mean represents the final excitation signal 645 for the voiced segment.
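The stitching procedure (shifted boundaries, an equal circular left shift of the pulse, overlap-add, then averaging) can be sketched as follows. The patent's equation for the shifted boundary arrays P^m is not reproduced in this text, so the sketch makes several simplifying assumptions that are not from the source: a single constant pitch period, a shift step of `period // n_shifts`, and P^m(i) = P^0(i) + m * step.

```python
def circular_left_shift(x, s):
    """Rotate list x to the left by s positions."""
    s %= len(x)
    return x[s:] + x[:s]

def stitched_excitation(pulse, boundaries, period, n_samples, n_shifts):
    """Average of n_shifts overlap-add excitations with shifted cut points."""
    # Zero-extend the pulse to the pitch period if it is shorter.
    g = pulse + [0.0] * max(0, period - len(pulse))
    step = period // n_shifts  # assumed constant shift increment
    total = [0.0] * n_samples
    for m in range(n_shifts):
        s = m * step
        shifted = circular_left_shift(g, s)  # same shift as the boundaries
        for b in boundaries:
            for k, value in enumerate(shifted):
                if b + s + k < n_samples:
                    total[b + s + k] += value
    # Arithmetic mean over all single stitched excitations.
    return [v / n_shifts for v in total]

out = stitched_excitation([1.0, 0.5], [0, 4, 8], period=4, n_samples=12, n_shifts=2)
```

Because the boundaries move right by the same amount that the pulse is rotated left, each shifted excitation reproduces the same waveform in the interior of the segment and only the cut points differ, so averaging leaves the interior intact. In this simplified version the first period is attenuated because the shifted copies start later; how the patent treats the segment edges is not stated in this text.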

FIG. 8 is a diagram illustrating an embodiment of glottal pulse database creation, indicated generally at 800. The speech signal 805 undergoes pre-filtering, such as pre-emphasis 810. Linear prediction (LP) analysis 815 is performed on the pre-filtered signal to obtain the LP coefficients. In this way, the low-frequency information of the excitation can be captured. Once the coefficients have been determined, they are used to inverse filter 820 the original speech signal 805, which has not been pre-filtered, to compute the integrated linear prediction residual (ILPR) signal 825. The ILPR signal 825 can be used as an approximation to the excitation signal, or source signal. The ILPR signal 825 is segmented 835 into glottal pulses using glottal segment/cycle boundaries determined from the speech signal 805. The segmentation 835 may be performed using the zero-frequency filtering (ZFF) technique. The resulting glottal pulses may then be energy normalized. All the glottal pulses from the entire speech training data are combined to form the glottal pulse database 840.
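The database pipeline of FIG. 8 can be sketched in a few functions: pre-emphasis, LP analysis on the pre-emphasized signal, inverse filtering of the original signal to obtain the ILPR, segmentation, and energy normalization. This is a simplified illustration under stated assumptions: LP analysis is done over the whole signal with the autocorrelation method (a practical system would likely work frame by frame), the pre-emphasis coefficient 0.97 and LP order are placeholder values, and the glottal boundaries are taken as given rather than computed with ZFF.

```python
import math

def pre_emphasis(x, alpha=0.97):
    """First-order pre-filter: y[n] = x[n] - alpha * x[n-1]."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def lp_coefficients(x, order):
    """LP analysis (autocorrelation method, Levinson-Durbin recursion).

    Returns [1, a1, ..., a_order], the coefficients of the inverse filter A(z).
    """
    r = [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]
    a, err = [1.0] + [0.0] * order, r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def inverse_filter(x, a):
    """Residual e[n] = sum_j a[j] * x[n-j], with x[n] = 0 for n < 0."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n >= j)
            for n in range(len(x))]

def glottal_pulses(speech, boundaries, order=10):
    """Energy-normalized glottal pulses cut from the ILPR of `speech`."""
    a = lp_coefficients(pre_emphasis(speech), order)  # LP on pre-filtered signal
    residual = inverse_filter(speech, a)              # inverse filter the ORIGINAL
    pulses = []
    for start, end in zip(boundaries, boundaries[1:]):
        segment = residual[start:end]
        energy = math.sqrt(sum(v * v for v in segment))
        pulses.append([v / energy for v in segment] if energy > 0 else segment)
    return pulses
```

The key design point taken from the text is the asymmetry: the coefficients come from the pre-emphasized signal, but they are applied to the signal that has not been pre-filtered. Collecting the output of `glottal_pulses` over all training utterances would give the pulse database.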

Claims

A method of forming a parametric model, comprising:
a. Calculating a glottal pulse distance metric between multiple glottal pulses;
b. Clustering a plurality of glottal pulses stored in a glottal pulse database into a number of clusters to determine the glottal pulse centroid;
c. Forming a corresponding vector database by associating a vector with each glottal pulse in the glottal pulse database, wherein the glottal pulse centroid and the distance metric are mathematically defined to determine an association;
d. Determining eigenvectors of the vector database;
e. Forming a parametric model by associating glottal pulses with each determined eigenvector from the glottal pulse database.

The method of claim 1, wherein the number of glottal pulses is two.

The method of claim 1, wherein step (a) comprises:
a. decomposing the number of glottal pulses into corresponding subband components;
b. calculating a subband distance metric between the corresponding subband components of each glottal pulse; and
c. mathematically calculating the glottal pulse distance metric using the subband distance metrics.

The method of claim 3, wherein the calculation of step (c) is performed using the mathematical equation

wherein d(x_i, y_i) represents the distance metric and d_s²(x_i⁽ⁿ⁾, y_i⁽ⁿ⁾) represents the subband distance metric.

The method of claim 1, wherein the number of clusters is 256.

The method of claim 1, wherein the clustering of step (b) is performed using a modified k-means calculation that uses the glottal pulse distance metric.

The method of claim 6, wherein the modified k-means calculation further comprises updating a cluster centroid with the element of the cluster that has the minimum sum of squared distances from all other elements of the cluster.

The method of claim 7, further comprising terminating the clustering iterations if none of the cluster centroids shifts.
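The modified k-means calculation recited in the preceding claims (centroid replaced by the cluster element with the minimum sum of squared distances, iteration stopped when no centroid shifts) can be sketched as follows. This is an illustrative sketch only: the distance function is passed in as a parameter, and the example uses plain Euclidean distance as a stand-in for the glottal pulse distance metric, whose equation is not reproduced in this text. The function names and the choice of initial centroids are assumptions.

```python
import math

def medoid(cluster, dist):
    """Element with the minimum sum of squared distances to the cluster."""
    return min(cluster, key=lambda c: sum(dist(c, o) ** 2 for o in cluster))

def modified_kmeans(items, centroids, dist, max_iter=100):
    """k-means variant whose centroids are always elements of the data."""
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in items:
            nearest = min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))
            clusters[nearest].append(x)
        # Update step: replace each centroid by the medoid of its cluster.
        updated = [medoid(c, dist) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:  # no centroid shifted: stop iterating
            break
        centroids = updated
    return centroids, clusters

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
centers, groups = modified_kmeans(points, [(0.0,), (12.0,)], math.dist)
```

Choosing an actual element as the centroid means only pairwise distances are needed, which suits a custom metric for which a mean pulse may not be well defined.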

The method of claim 1, wherein the determination of the eigenvectors in step (d) is performed using principal component analysis.

The method of claim 1, wherein step (e) comprises:
a. determining the eigenvector;
b. determining the vector from the vector database that best matches the eigenvector;
c. determining the best-matching glottal pulse from the glottal pulse database; and
d. designating the glottal pulse from the glottal pulse database that best matches the eigenvector as the eigen glottal pulse associated with that eigenvector.

The method of claim 1, further comprising training the formed parametric model for use in speech synthesis.

The method of claim 11, wherein the training comprises:
a. defining a training text corpus;
b. obtaining speech data by recording a voice talent speaking the training text;
c. converting the training text into context-dependent phoneme labels;
d. determining a plurality of spectral characteristics of the speech data using the phoneme labels;
e. predicting the fundamental frequency of the speech data; and
f. performing parameter prediction on the audio stream using the spectral characteristics, the fundamental frequency, and a duration of the audio stream.

A method of synthesizing speech using input text, comprising the steps of:
a. converting the input text into context-dependent phoneme labels;
b. processing the phoneme labels created in step (a) using a trained parametric model to predict fundamental frequency values, a duration of the synthesized speech, and spectral characteristics of the phoneme labels;
c. creating an excitation signal using an eigen glottal pulse and one or more of the predicted fundamental frequency values, the spectral characteristics of the phoneme labels, and the duration of the synthesized speech; and
d. combining the excitation signal with the spectral characteristics of the phoneme labels using a filter to create a synthesized speech output,
wherein the step of creating the excitation signal further comprises:
e. classifying signal regions of the excitation into types of segments; and
f. creating the excitation signal for each type,
wherein the types of segments comprise one or more of voiced, unvoiced, and pause, and wherein creating the excitation signal for a voiced segment comprises:
g. creating glottal boundaries, which indicate the pitch boundaries of the excitation signal, using the fundamental frequency values predicted by the model;
h. adding glottal pulses starting at each glottal boundary using an overlap-add method; and
i. avoiding boundary effects in the excitation signal, comprising:
i. creating a number of different excitation signals through the overlap-add method, with a constantly increasing amount of shift in the glottal boundaries and an equal amount of circular left shift applied to the glottal pulse, wherein, if the glottal pulse is shorter than the corresponding pitch period, the glottal pulse is zero-extended to the length of the pitch period before the circular left shift;
ii. determining the arithmetic mean of the number of different excitation signals; and
iii. declaring the arithmetic mean to be the final excitation signal for the voiced segment.

A method of synthesizing speech using input text, comprising the steps of:
a. converting the input text into context-dependent phoneme labels;
b. processing the phoneme labels created in step (a) using a trained parametric model to predict fundamental frequency values, a duration of the synthesized speech, and spectral characteristics of the phoneme labels;
c. creating an excitation signal using an eigen glottal pulse and one or more of the predicted fundamental frequency values, the spectral characteristics of the phoneme labels, and the duration of the synthesized speech; and
d. combining the excitation signal with the spectral characteristics of the phoneme labels using a filter to create a synthesized speech output,
wherein the eigen glottal pulse is identified from a glottal pulse database, the identification comprising:
e. calculating a glottal pulse distance metric between a number of glottal pulses;
f. clustering the glottal pulse database into a number of clusters to determine glottal pulse centroids;
g. forming a corresponding vector database by associating a vector with each glottal pulse in the glottal pulse database, wherein the glottal pulse centroids and the distance metric are mathematically defined to determine the association;
h. determining eigenvectors of the vector database; and
i. forming a parametric model by associating a glottal pulse from the glottal pulse database with each determined eigenvector.

The method of claim 14, wherein the number of glottal pulses is two.

The method of claim 14, wherein step (e) comprises:
a. decomposing the number of glottal pulses into corresponding subband components;
b. calculating a subband distance metric between the corresponding subband components of each glottal pulse; and
c. mathematically calculating the distance metric using the subband distance metrics.

The method of claim 16, wherein the calculation of step (c) is performed using the mathematical equation

wherein d(x_i, y_i) represents the distance metric and d_s²(x_i⁽ⁿ⁾, y_i⁽ⁿ⁾) represents the subband distance metric.

The method of claim 14, wherein the number of clusters is 256.

The method of claim 14, wherein the clustering of step (f) is performed using a modified k-means calculation that uses the glottal pulse distance metric.

The method of claim 19, wherein the modified k-means calculation further comprises updating a cluster centroid with the element of the cluster that has the minimum sum of squared distances from all other elements of the cluster.

The method of claim 20, further comprising terminating the clustering iterations if none of the cluster centroids shifts.

The method of claim 14, wherein the determination of the eigenvectors in step (h) is performed using principal component analysis.

The method of claim 14, wherein step (i) comprises:
a. determining the eigenvector;
b. determining the vector from the vector database that best matches the eigenvector;
c. determining the best-matching glottal pulse from the glottal pulse database; and
d. designating the glottal pulse from the glottal pulse database that best matches the eigenvector as the eigen glottal pulse associated with that eigenvector.

The method of claim 14, further comprising constructing the glottal pulse database from a speech signal, the constructing comprising:
a. performing pre-filtering on the speech signal to obtain a pre-filtered signal;
b. analyzing the pre-filtered signal to obtain inverse filtering parameters;
c. performing inverse filtering of the speech signal using the inverse filtering parameters;
d. calculating an integrated linear prediction residual signal using the inverse-filtered speech signal;
e. identifying glottal segment boundaries in the speech signal;
f. segmenting the integrated linear prediction residual signal into glottal pulses using the glottal segment boundaries identified from the speech signal;
g. performing normalization of the glottal pulses; and
h. forming the glottal pulse database by collecting all the normalized glottal pulses obtained for the speech signal.

The method of claim 24, wherein the analysis of step (b) is performed using linear prediction.

The method of claim 24, wherein the inverse filtering parameters of step (b) include linear prediction coefficients.

The method of claim 24, wherein the identification of step (e) is performed using a zero-frequency filtering technique.

The method of claim 24, wherein the pre-filtering of step (a) includes pre-emphasis.