JP4313728B2

JP4313728B2 - Voice recognition method, apparatus and program thereof, and recording medium thereof

Info

Publication number: JP4313728B2
Application number: JP2004179723A
Authority: JP
Inventors: 敏高橋; 哲小橋川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-06-17
Filing date: 2004-06-17
Publication date: 2009-08-12
Anticipated expiration: 2024-06-17
Also published as: JP2006003617A

Description

The present invention relates to a speech recognition method, an apparatus and a program therefor, and a recording medium therefor. It is used in an apparatus, such as a voice response device, that performs speech recognition on a speech signal picked up by a microphone and emits from a speaker a synthesized speech signal corresponding to the recognition result. The input speech signal is recognized by computing, for the feature vector sequence of the input speech signal from the microphone, the likelihood of a probability model built from the feature vectors of each recognition category.

In conventional speech recognition, the approach of modeling each recognition category of speech units, such as the phonemes, syllables, and words that make up recognition result candidates, with a probability model such as a Hidden Markov Model (hereinafter HMM) gives high recognition performance and is the mainstream of current speech recognition technology. A conventional speech recognition apparatus using HMMs is briefly described with reference to FIG. 1. The speech signal input from the input terminal 11 is converted into a digital signal by the A/D conversion unit 12. The feature vector extraction unit 13 extracts speech feature vectors from the digital signal. HMMs created in advance for the speech units of each recognition category are read from the model memory 14, and the likelihood calculation unit 15 calculates the matching likelihood of each model for the extracted speech feature vectors. The speech unit (recognition category) represented by the model with the largest matching likelihood is output from the output unit 16 as the recognition result. Corresponding parts in this specification and the drawings carry the same reference numerals, and duplicate description is omitted.

Two methods for recognizing speech on which additive noise such as background noise is superimposed are described. The first recognizes the speech after suppressing the noise superimposed on the input speech. Various noise suppression methods have been proposed; here the spectral subtraction method (hereinafter SS method) is described (see, for example, Non-Patent Document 1). Because two signals that are additive in the time domain are also additive in the linear power spectrum, the SS method extracts the speech component by subtracting the estimated noise component from the noise-superimposed speech signal in the linear power spectrum domain.

A speech recognition apparatus using the SS method is briefly described with reference to FIG. 2. The speech/noise determination unit 21 determines whether the digitized input speech signal is noise or noise-superimposed speech. If it is judged to be noise, the determination unit 21 sets the speech/noise switch 22 to the noise terminal 22a, connecting the output of the A/D conversion unit 12 to the average noise power spectrum calculation unit 23, which calculates the average power spectrum over the noise section of the input speech signal. If the determination unit 21 judges the signal to be a noise-superimposed speech section to be recognized, the speech/noise switch 22 is switched to the speech terminal 22b, connecting the output of the A/D conversion unit 12 to the noise-superimposed speech power spectrum calculation unit 24, which calculates the power spectrum of the noise-superimposed speech in the input speech signal. The suppression processing unit 25 subtracts the average noise power spectrum from the power spectrum of the noise-superimposed speech at each time. The noise-suppressed power spectrum Y^D(t, f) at time t and frequency f is calculated as follows.

$$D(Y(t,f)) = Y(t,f) - \alpha \hat{N}(f)$$

$$Y^{D}(t,f) = \begin{cases} D(Y(t,f)) & \text{if } D(Y(t,f)) > \beta Y(t,f) \\ \beta Y(t,f) & \text{otherwise} \end{cases} \qquad (1)$$

Here, $Y(t,f)$ is the power spectrum of the input noise-superimposed speech at time $t$ and frequency $f$,
$\hat{N}(f)$ is the estimated time-averaged noise power spectrum at frequency $f$,
$\alpha$ is the subtraction coefficient and is usually greater than 1, and
$\beta$ is the flooring coefficient and is less than 1.
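As a minimal sketch of equation (1), the following NumPy function applies the subtraction and flooring to a power spectrogram. The function name, the array layout (frames by frequency bins), and the default values of `alpha` and `beta` are assumptions chosen for illustration; the source only states that α is usually greater than 1 and β is less than 1.

```python
import numpy as np

def spectral_subtract(Y, N_hat, alpha=2.0, beta=0.1):
    """Apply equation (1) to a power spectrogram.

    Y     : array (frames, bins), power spectrum Y(t, f) of the noisy speech
    N_hat : array (bins,), time-averaged noise power spectrum N^(f)
    alpha : subtraction coefficient (illustrative default, > 1)
    beta  : flooring coefficient (illustrative default, < 1)
    """
    Y = np.asarray(Y, dtype=float)
    D = Y - alpha * np.asarray(N_hat, dtype=float)  # D(Y(t, f))
    floor = beta * Y                                # beta * Y(t, f)
    # Keep the subtracted value only where it exceeds the floor.
    return np.where(D > floor, D, floor)
```

The flooring branch keeps every output bin positive, which matters when a logarithm is taken later during feature extraction.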

From the power spectrum output by the suppression processing unit 25, the feature vector extraction unit 13 calculates the feature parameters for speech recognition (for example, 12-dimensional Mel-Frequency Cepstrum Coefficients (MFCC)). The subsequent processing is as described for FIG. 1.
As a second example, recognition of noise-superimposed speech by the HMM synthesis method is described. High recognition performance is obtained if the noise data expected to be superimposed on the speech signal to be recognized is superimposed on a clean, noise-free speech training data set, HMMs are created from the result, and the noise-superimposed speech signal is recognized with those HMMs.

However, the ambient noise in environments where speech recognition is used varies widely and is difficult to predict in advance. Moreover, the clean speech training data set used to create the HMMs is enormous, so superimposing the presumed noise data on it and creating a noise-superimposed speech model takes a long computation time, for example 100 hours. Recording the ambient noise of the usage environment at recognition time and creating HMMs from it is therefore impractical because of the long processing time.
Therefore, as shown for example in Patent Document 1, clean speech HMMs are created in advance from a large noise-free clean speech training data set; at recognition time the background noise is observed to create a noise HMM, which is combined with the clean speech HMMs. The resulting noise-superimposed speech HMM approximates a speech model that includes the background noise at recognition time, and recognition is performed with it. Creating the noise model and combining the models takes from several seconds to several tens of seconds. Because the HMM is a probability model, variations in the speech and in the noise can also be taken into account.
Patent Document 1: Japanese Patent No. 3247746
Non-Patent Document 1: Steven F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-27, No. 2, pp. 113-120, April 1979

For example, in a human-machine voice response device that uses speech recognition, a speaker installed in the device often emits speech or sounds that give guidance to the user. In such a configuration, the speech recognition microphone installed in the device frequently picks up not only the ambient background noise but also the guidance speech emitted by the voice response device itself, which wraps around and enters as an echo. Like ambient noise, this echo is noise for the speech recognition apparatus. These noises degrade speech recognition performance.
The present invention has been made in view of the above. Its object is to provide a speech recognition method, an apparatus, a program, and a recording medium that achieve high recognition performance despite the presence of not only ambient noise but also wraparound echoes of speech and sounds emitted by a speech synthesizer used together with the speech recognition apparatus, as in a voice response device.

The present invention is a speech recognition method that uses a probability model to recognize a speech signal picked up by a microphone in an apparatus that picks up an acoustic signal with the microphone and emits an acoustic signal from a speaker.
The component of the signal emitted from the speaker (hereinafter the echo signal) in the input signal from the microphone is suppressed using the signal supplied to the speaker, and the noise signal component in the input signal is also suppressed. The signal so suppressed is converted into a feature vector sequence, and it is determined whether the feature vector sequence belongs to a speech section containing a speech signal to be recognized. If it does not, a noise model is learned from the feature vector sequence, and this noise model is combined with a clean speech model created in advance from clean, noise-free speech data to generate a noise-superimposed speech model. If it does, the likelihood for each recognition category is calculated using the feature vector sequence and the noise-superimposed speech model, and a recognition result is output based on the calculated likelihoods.

According to the present invention, not only is the noise signal component in the input signal from the microphone suppressed, but the echo signal is also suppressed using the signal supplied to the speaker. A noise model is generated from the signal after both suppressions and combined with the clean speech model, and the combined model is used to recognize the noise-superimposed speech signal, whose echo and noise have been suppressed and whose S/N (signal-to-noise ratio) has been improved. Recognition is therefore less affected by echo as well as by environmental noise, and a high recognition rate is obtained. Moreover, because the noise-superimposed speech model is obtained by combining the noise model with the clean speech model, it can be created in a short time.

[First Embodiment]
FIG. 3 shows an example of the functional configuration of the first embodiment of the present invention, and FIG. 4 shows an example of its processing procedure. The invention is applied, for example, to speech recognition in a voice response device. In such a device, guidance speech that prompts the user to speak, or a guidance sound such as a beep that cues the user's utterance, is emitted from the speaker 31. To emit this guidance speech or guidance sound (hereinafter the system voice), the output system voice generation unit 32 synthesizes a digital system voice signal, which the voice reproduction unit 33 converts into an analog system voice signal and supplies to the speaker 31.

Speech uttered by the user is picked up by the microphone 34, and the picked-up speech signal is input to the A/D conversion unit 12 through the input terminal 11. The microphone 34 also picks up ambient noise and the wraparound echo of the system voice emitted from the speaker 31. The input signal supplied from the microphone 34 to the input terminal 11 is therefore the user's speech signal to be recognized with an ambient noise signal and an echo signal superimposed on it.
In the first embodiment, the digital input signal from the A/D conversion unit 12 and the digital system voice signal from the output system voice generation unit 32 are input to the echo/noise suppression unit 35, which suppresses the ambient noise signal and the echo signal superimposed on the input signal (step S1). In this example, the echo suppression unit 35a first suppresses the echo signal using the system voice signal (step S1a). This echo suppression can use the method of an echo canceller such as those used in teleconferencing and video conferencing systems. For example, the transfer characteristic from the speaker 31 through the microphone 34 to the echo suppression unit 35a is estimated adaptively, the estimated transfer characteristic is convolved with the system voice signal to generate a pseudo echo signal, and the pseudo echo signal is subtracted from the input signal to obtain an echo-suppressed input signal.
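The adaptive estimation, convolution, and subtraction described for step S1a can be sketched with a normalized LMS (NLMS) filter. NLMS is an assumption here: the text only says the transfer characteristic is estimated adaptively and does not name an algorithm. The function name, tap count, and step size `mu` are likewise hypothetical.

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """Suppress the echo of `ref` (system voice signal) contained in `mic`.

    Returns the echo-suppressed signal and the estimated echo-path
    impulse response. NLMS is an illustrative choice, not the patent's.
    """
    mic = np.asarray(mic, dtype=float)
    ref = np.asarray(ref, dtype=float)
    w = np.zeros(taps)       # estimated transfer characteristic (echo path)
    buf = np.zeros(taps)     # most recent reference samples, newest first
    out = np.empty_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        pseudo_echo = w @ buf            # estimated path convolved with reference
        e = mic[n] - pseudo_echo         # subtract pseudo echo from the input
        out[n] = e
        w += mu * e * buf / (buf @ buf + eps)   # normalized LMS update
    return out, w
```

In a real device the update would typically be frozen while the user is speaking, since the user's speech would otherwise disturb the path estimate; that control logic is omitted from this sketch.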

Next, the echo-suppressed input signal is input to the noise suppression unit 35b, which suppresses the ambient (background) noise component superimposed on the input signal (step S1b). For this noise suppression, for example, the average minimum level of the input signal is regarded as the background noise level, and signals at or below this level are removed.
Furthermore, in this example, the echo- and noise-suppressed signal and the digital system voice signal are input to the residual echo suppression unit 35c. This unit removes, from the echo- and noise-suppressed input signal, the residual echo signal that the echo suppression unit 35a (step S1a) could not remove because of the influence of background noise, such as echo components other than those at the background noise level (step S1c). This residual echo suppression can also use the same techniques as those used in, for example, video conferencing systems. As shown, for example, in Japanese Patent No. 3420705, Japanese Patent No. 3507020, and Japanese Patent Application Laid-Open No. 2003-284183, the amount of acoustic (echo path) coupling is obtained from the input signal and the system voice signal, and a corresponding suppression, that is, a loss, is applied to the echo- and noise-suppressed input signal.
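The level-based noise suppression mentioned for step S1b could look roughly like the following frame gate. This is only one loose reading of "average minimum level": the background level is taken as the mean RMS of the quietest frames, and frames at or below a multiple of that level are zeroed. The frame length, the fraction of frames treated as quietest, and the `margin` factor are all assumptions introduced for illustration.

```python
import numpy as np

def gate_below_noise_floor(x, frame_len=160, lowest_fraction=0.1, margin=2.0):
    """Zero out frames whose level does not exceed the estimated noise level.

    Trailing samples that do not fill a whole frame are dropped.
    All parameters are illustrative assumptions, not values from the source.
    """
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    level = np.sqrt(np.mean(frames ** 2, axis=1))        # RMS level per frame
    k = max(1, int(np.ceil(lowest_fraction * n_frames)))
    noise_level = np.mean(np.sort(level)[:k])            # mean of the quietest frames
    keep = level > margin * noise_level                  # frames above the noise level
    return (frames * keep[:, None]).reshape(-1), noise_level
```

A band-wise variant in the frequency domain, as the modification section of the text suggests, would apply the same idea per frequency band rather than per whole frame.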

The echo- and noise-suppressed input signal from the echo/noise suppression unit 35 is input to the feature vector extraction unit 36 and converted into the feature vector sequence required for learning the probability model, in this example an HMM (step S2). This feature vector sequence is input to the section determination unit 37, which determines from it whether the current input signal is a noise section containing only noise signal components, that is, only the ambient noise signal or the ambient noise signal and the echo signal, or a speech section of a noise-superimposed speech signal in which the noise signal components and a speech signal are superimposed (step S3).

The determination result from the section determination unit 37 is input to the speech/noise switch 38. If the result indicates a noise section, the switch 38 is set to the terminal 38_N and the feature vector sequence from the feature vector extraction unit 36 is input to the noise model learning unit 39. The noise model learning unit 39 learns from the input feature vectors of a plurality of analysis frames and generates a noise HMM (step S4). This noise HMM corresponds to the ambient noise signal, or the ambient noise signal and the echo signal, after echo and noise suppression.
The clean speech model memory 41 stores clean speech HMMs learned for each recognition category, in the speech units to be recognized, from a large amount of clean, noise-free speech data. The clean speech HMMs and the noise HMM are input to the model synthesis unit 42, where they are combined, and the result is stored in the noise-superimposed speech model memory 43 as noise-superimposed speech HMMs (step S5).

If the determination result from the section determination unit 37 indicates a speech section, the speech/noise switch 38 is set to the terminal 38_S, and the feature vector sequence of the echo- and noise-suppressed noise-superimposed speech signal from the feature vector extraction unit 36 is input to the likelihood calculation unit 44. The likelihood calculation unit 44 calculates the likelihood of each noise-superimposed speech model in the noise-superimposed speech model memory 43 for the input feature vector sequence (step S6). The likelihoods calculated for the recognition categories are input to the output unit 16, and the recognition category of the model with the maximum likelihood is output as the recognition result (step S7).
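The control flow of steps S3 to S7 can be sketched as follows. To keep the example short, each HMM is replaced by a single diagonal Gaussian, and model combination simply adds means and variances, which holds for independent components in a feature domain where speech and noise add. Both simplifications are assumptions made for illustration, as are all class and function names; the source does not specify how the model synthesis unit 42 combines the HMMs.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Gauss:
    """Single diagonal Gaussian, a toy stand-in for an HMM."""
    mean: np.ndarray
    var: np.ndarray

    def loglik(self, frames):
        # Sum of per-frame diagonal-Gaussian log densities.
        d = frames - self.mean
        return float(np.sum(-0.5 * (np.log(2 * np.pi * self.var) + d * d / self.var)))

def learn_noise_model(frames):
    # Step S4: fit the noise model to frames from a noise section.
    return Gauss(frames.mean(axis=0), frames.var(axis=0) + 1e-6)

def combine(clean, noise):
    # Step S5 (toy version): means and variances of independent additive parts add.
    return Gauss(clean.mean + noise.mean, clean.var + noise.var)

class Recognizer:
    def __init__(self, clean_models):
        self.clean = clean_models          # role of clean speech model memory 41
        self.noisy = dict(clean_models)    # role of noise-superimposed model memory 43

    def process(self, frames, is_speech):
        if not is_speech:                  # switch 38 set to terminal 38_N
            noise = learn_noise_model(frames)
            self.noisy = {c: combine(m, noise) for c, m in self.clean.items()}
            return None
        # Switch 38 set to terminal 38_S: steps S6 and S7.
        scores = {c: m.loglik(frames) for c, m in self.noisy.items()}
        return max(scores, key=scores.get)
```

Calling `process` with `is_speech=False` before each utterance refreshes the combined models, which mirrors the option of relearning the noise model immediately before the user speaks.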

The noise HMM may be generated during the preparation period before the voice response device starts operating (while idling), with the system voice being emitted, or in the section before the user speaks. In the latter case it may always be generated immediately before each utterance of the user; the noise-superimposed speech models in the noise-superimposed speech model memory 43 are then updated with the noise-superimposed speech models produced by the model synthesis unit 42 (step S5). In this way, even if the user's position relative to the voice response device changes, the effect of the S/N (signal-to-noise ratio) is small, the echo path is estimated better, and the recognition rate improves.

As described above, in the first embodiment the echo/noise suppression unit 35 suppresses the echo signal, a noise model is generated in noise sections from the feature vector sequence of the echo- and noise-suppressed input signal, and in speech sections the likelihood of the noise-superimposed speech HMMs is calculated for the feature vector sequence of the noise-superimposed speech signal, whose echo and noise have been suppressed and whose S/N (signal-to-noise ratio) has been improved. Because the noise model is learned during use, a noise model matching the ambient (background) noise at that location can always be generated without predicting the noise of the usage environment in advance, and when the state of the ambient noise changes a corresponding noise model is obtained, so the recognition rate improves. Furthermore, because the noise-superimposed speech model is generated by combining the noise model with the clean speech model, the processing time is short.

[Second Embodiment]
FIG. 5 shows an example of the functional configuration of the second embodiment of the present invention, and FIG. 6 shows an example of its processing procedure. The differences from the first embodiment are described.
The feature vector sequence from the feature vector extraction unit 36 is input to the section determination unit 51, which also receives from the output system voice generation unit 32 an echo presence signal indicating whether the system voice is currently being emitted. From the input feature vector sequence and the echo presence signal, the section determination unit 51 determines whether the current input signal is a noise section containing only the ambient (background) noise signal, a noise/echo section containing the ambient (background) noise signal and the echo signal, or a speech section of a noise-superimposed speech signal on which the ambient noise signal, or the ambient noise signal and the echo signal, is superimposed. For example, after step S2 it is determined whether the section is a speech section (step S11), and if not, whether it is a noise section (step S12).

When the determination result indicates a noise section, the switch 52 is set to the terminal 52_N, and the feature vector sequence from the feature vector extraction unit 36 is input to the noise model learning unit 53, which learns from it a noise HMM corresponding to the ambient noise signal after noise and echo suppression (step S13).
When the determination result indicates a noise/echo section, the switch 52 is set to the terminal 52_E, and the feature vector sequence from the feature vector extraction unit 36 is input to the noise/echo model learning unit 54, which learns from it a noise/echo HMM corresponding to the superposition of the ambient noise signal and the echo signal after noise and echo suppression (step S14).

The noise HMM from the noise model learning unit 53 and the noise/echo HMM from the noise/echo model learning unit 54 are input to the model synthesis unit 55, where each is combined with the clean speech HMMs from the clean speech model memory 41 to generate noise-superimposed speech HMMs, which are stored in the noise-superimposed speech model memory 43 or used to update its contents (step S15).
When the determination result indicates a speech section, the switch 52 is set to the terminal 52_S, and the feature vector sequence from the feature vector extraction unit 36 is input to the likelihood calculation unit 44. The rest is the same as in the first embodiment.
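The three-way routing of the second embodiment (steps S11 to S15) can be sketched as below. As a simplification, each HMM is again replaced by a single diagonal Gaussian stored as a (mean, variance) pair, and combination adds means and variances; these stand-ins, the section labels, and every name in the code are assumptions for illustration only. The sketch also assumes at least one background model has been learned before the first speech section arrives.

```python
import numpy as np

def classify_section(is_speech, echo_present):
    # Steps S11 and S12: speech first, then noise versus noise/echo.
    if is_speech:
        return "speech"
    return "noise_echo" if echo_present else "noise"

def fit(frames):
    """Mean and variance of a block of frames (toy stand-in for HMM learning)."""
    return frames.mean(axis=0), frames.var(axis=0) + 1e-6

def combine(clean, background):
    """Add means and variances, as for independent additive components."""
    return clean[0] + background[0], clean[1] + background[1]

def loglik(model, frames):
    mean, var = model
    d = frames - mean
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + d * d / var)))

class TwoConditionRecognizer:
    def __init__(self, clean_models):
        self.clean = clean_models   # category -> (mean, var)
        self.noisy = {}             # (category, condition) -> combined model

    def process(self, frames, section):
        if section in ("noise", "noise_echo"):          # steps S13 / S14
            background = fit(frames)
            for cat, clean in self.clean.items():       # step S15
                self.noisy[(cat, section)] = combine(clean, background)
            return None
        if not self.noisy:
            raise RuntimeError("no background model learned yet")
        # Speech section: score every combined model and keep the best.
        scores = {key: loglik(m, frames) for key, m in self.noisy.items()}
        return max(scores, key=scores.get)              # (category, condition)
```

Scoring against both sets of combined models lets the recognizer pick whichever background condition matches the utterance, without needing to know at recognition time whether the system voice was playing.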

With this configuration, when the user speaks while the system voice is not being emitted, the likelihood obtained with the noise-superimposed speech HMM that combines the noise HMM and the clean speech HMM is high; when the user speaks while the system voice is being emitted, the likelihood obtained with the noise-superimposed speech HMM that combines the noise/echo HMM and the clean speech HMM is high. Because the input signal and the recognition model match more closely, a higher recognition rate is obtained.
[Modification]
In the first and second embodiments, the input signal undergoes echo suppression, then noise suppression, and then residual echo suppression. As shown by the broken lines in FIGS. 3 to 6, the residual echo suppression may be omitted. In that case, as shown in parentheses in those figures, noise suppression may be performed first, followed by echo suppression.

As the noise suppression method, as shown for example in Japanese Patent No. 3309895, Japanese Patent No. 3454402, and Japanese Patent No. 3459363, the input signal may be converted into a frequency-domain signal and divided into a plurality of frequency bands, and noise suppression may be applied to the signal in each frequency band while the noise component is estimated for that band. This reduces the risk of suppressing the speech signal to be recognized more than necessary in some band or, conversely, of suppressing noise insufficiently. The S/N improves, and a correspondingly higher recognition rate is obtained.

Echo suppression and residual echo suppression are also more effective when performed in the frequency domain. The model is not limited to an HMM; other probability models may be used.
The apparatus shown in FIGS. 3 and 5 may be implemented by a computer. In that case, a speech recognition program that causes the computer to execute each step of the processing procedure shown in FIG. 4 or FIG. 6 is installed on the computer from a recording medium such as a CD-ROM, a magnetic disk device, or a semiconductor storage device, or downloaded via a communication line, and the computer executes the program.

FIG. 1 is a block diagram showing the functional configuration of a conventional speech recognition apparatus.
FIG. 2 is a block diagram showing the functional configuration of a speech recognition apparatus using the conventional spectral subtraction method.
FIG. 3 is a block diagram showing an example of the functional configuration of the first embodiment of the apparatus of the present invention.
FIG. 4 is a flowchart showing an example of the processing procedure of the first embodiment of the method of the present invention.
FIG. 5 is a block diagram showing an example of the functional configuration of the second embodiment of the apparatus of the present invention.
FIG. 6 is a flowchart showing an example of the processing procedure of the second embodiment of the method of the present invention.

Claims

A speech recognition method for performing speech recognition, using a probability model, on a speech signal picked up by a microphone in a device that picks up an acoustic signal with the microphone and emits an acoustic signal from a speaker, the method comprising:
a noise/echo suppression step of suppressing, by using a supply signal to the speaker, a component of the signal emitted from the speaker (hereinafter referred to as an echo signal) in an input signal from the microphone, and suppressing an ambient noise signal in the input signal;
a feature vector extraction step of converting the signal in which the echo signal and the ambient noise signal have been suppressed into a feature vector sequence;
a section determination step of determining whether the feature vector sequence belongs to a speech section containing a recognition target speech signal, a noise section containing only the ambient noise signal, or a noise/echo section in which the ambient noise signal and the echo signal are present;
a noise model learning step of learning a noise model using the feature vector sequence determined to belong to a noise section in the section determination step;
a noise/echo model learning step of learning a noise/echo model using the feature vector sequence determined to belong to a noise/echo section in the section determination step;
a noise-superimposed speech model synthesis step of generating a noise-superimposed speech model by combining the noise model and the noise/echo model with a clean speech model created in advance from clean speech data containing neither noise signals nor echo signals;
a likelihood calculation step of calculating a likelihood for each recognition category using the feature vector sequence determined to belong to a speech section in the section determination step and the noise-superimposed speech model; and
an output step of outputting a recognition result based on the calculated likelihood.
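The model learning, model synthesis, likelihood calculation, and output steps can be illustrated with a deliberately simplified sketch. It is not the claimed implementation: each model is assumed to be a single diagonal Gaussian rather than an HMM, features are assumed to lie in a linear power domain where independent sources add (so means and variances add), and only one interference model is combined with the clean speech models. All function names and values are hypothetical.

```python
import math


def learn_gaussian(frames):
    """Fit a diagonal Gaussian (mean, variance per dimension) to frames."""
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    # A small variance floor avoids division by zero for constant features.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dims)]
    return mean, var


def superimpose(clean, interference):
    """Combine two independent additive sources: means add, variances add."""
    (m1, v1), (m2, v2) = clean, interference
    return ([a + b for a, b in zip(m1, m2)],
            [a + b for a, b in zip(v1, v2)])


def log_likelihood(frames, model):
    """Sum of per-frame diagonal-Gaussian log-likelihoods."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def recognize(speech_frames, clean_models, interference):
    """Score every category with its superimposed model; return the best."""
    scores = {cat: log_likelihood(speech_frames, superimpose(m, interference))
              for cat, m in clean_models.items()}
    return max(scores, key=scores.get), scores


# Usage with made-up two-dimensional features.
noise_model = learn_gaussian([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]])
clean_models = {"yes": ([5.0, 1.0], [1.0, 1.0]),
                "no": ([1.0, 5.0], [1.0, 1.0])}
best, scores = recognize([[6.1, 2.0], [5.9, 2.1]], clean_models, noise_model)
```

The category whose superimposed model gives the largest likelihood is returned as the recognition result, mirroring the output step.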

The speech recognition method according to claim 1, wherein the noise/echo suppression step includes: a step of suppressing the echo signal in the input signal by using the supply signal to the speaker; a step of suppressing the ambient noise signal in the input signal in which the echo signal has been suppressed; and a step of suppressing the residual echo signal in the input signal in which the echo signal and the ambient noise signal have been suppressed.

A speech recognition apparatus that is used in a device that picks up an acoustic signal with a microphone and emits an acoustic signal from a speaker, and that recognizes a speech signal picked up by the microphone by using a probability model, the apparatus comprising:
a noise/echo suppression unit that receives an input signal from the microphone and a supply signal to the speaker, suppresses a component of the signal emitted from the speaker (hereinafter referred to as an echo signal) in the input signal, and suppresses an ambient noise signal in the input signal;
a feature vector extraction unit that receives the input signal in which the echo signal and the ambient noise signal have been suppressed and converts it into a feature vector sequence;
a section determination unit that receives the feature vector sequence and a signal indicating whether an emission signal is being supplied to the speaker, and determines whether the feature vector sequence belongs to a speech section containing a recognition target speech signal, a noise section containing only the ambient noise signal, or a noise/echo section containing the echo signal and the ambient noise signal;
a switch that receives the feature vector sequence and the determination result and separates the feature vector sequence into three output sequences according to the determination result;
a noise model learning unit that receives the feature vector sequence of the noise section separated by the switch and learns a noise model from that feature vector sequence;
a noise/echo model learning unit that receives the feature vector sequence of the noise/echo section separated by the switch and learns a noise/echo model from that feature vector sequence;
a clean speech model memory that stores a clean speech model created from clean speech data containing no noise;
a noise-superimposed speech model synthesis unit that receives the noise model, the noise/echo model, and the clean speech model and generates a noise-superimposed speech model by combining the noise model and the noise/echo model with the clean speech model, and a noise-superimposed speech model memory in which the noise-superimposed speech model is stored;
a likelihood calculation unit that receives the feature vector sequence of the speech section separated by the switch and the noise-superimposed speech model, and calculates the likelihood of the feature vector sequence for each recognition category based on the noise-superimposed speech model; and
a recognition result output unit that receives the likelihood for each recognition category and outputs a recognition result.
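The section determination unit and the three-way switch can be pictured with the following sketch. The decision rule is an assumption made for illustration only: a frame is treated as speech when a hypothetical energy value (taken here from the first feature element) reaches a threshold, and otherwise the speaker-activity flag selects between the noise section and the noise/echo section. The names `Section`, `determine_section`, and `route_frames` are not from the patent.

```python
from enum import Enum


class Section(Enum):
    SPEECH = "speech"
    NOISE = "noise"
    NOISE_ECHO = "noise_echo"


def determine_section(frame_energy, speaker_active, speech_threshold):
    """Classify one frame (assumed rule: energy threshold plus speaker flag)."""
    if frame_energy >= speech_threshold:
        return Section.SPEECH
    # No speech detected: the speaker flag tells whether echo may be present.
    return Section.NOISE_ECHO if speaker_active else Section.NOISE


def route_frames(frames, speaker_flags, speech_threshold,
                 energy_of=lambda frame: frame[0]):
    """Act as the switch: separate the frames into three sequences."""
    routed = {section: [] for section in Section}
    for frame, active in zip(frames, speaker_flags):
        section = determine_section(energy_of(frame), active, speech_threshold)
        routed[section].append(frame)
    return routed


# Usage: four frames, the speaker is emitting sound during the last two.
frames = [[0.1, 1], [5.0, 2], [0.2, 3], [6.0, 4]]
routed = route_frames(frames, [False, False, True, True], speech_threshold=1.0)
```

In this sketch, `routed[Section.NOISE]` would feed the noise model learning unit, `routed[Section.NOISE_ECHO]` the noise/echo model learning unit, and `routed[Section.SPEECH]` the likelihood calculation unit.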

The speech recognition apparatus according to claim 3, wherein the noise/echo suppression unit comprises:
an echo suppression unit that receives the input signal and the supply signal to the speaker and suppresses the echo signal in the input signal;
a noise suppression unit that receives the output signal of the echo suppression unit and suppresses the ambient noise signal in the input signal in which the echo signal has been suppressed; and
a residual echo suppression unit that receives the output signal of the noise suppression unit and the supply signal to the speaker and suppresses the residual echo signal in the input signal in which the echo signal and the ambient noise signal have been suppressed.
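The order of the three stages can be sketched as below. This shows only how the stages are chained; each stage is replaced by a simple per-band power subtraction, which is an assumption for illustration rather than the suppression processing of the patent. The gains `echo_gain` and `residual_gain` and the per-band `noise_power` estimate are hypothetical inputs.

```python
def subtract_with_floor(power, estimate):
    """Per-band power subtraction, clipped at zero."""
    return [max(p - e, 0.0) for p, e in zip(power, estimate)]


def suppress_noise_and_echo(mic_power, speaker_power, noise_power,
                            echo_gain, residual_gain):
    """Chain echo, noise, and residual echo suppression in that order."""
    # Stage 1: echo suppression, driven by the supply signal to the speaker.
    echo_estimate = [echo_gain * s for s in speaker_power]
    after_echo = subtract_with_floor(mic_power, echo_estimate)
    # Stage 2: ambient noise suppression on the echo-suppressed signal.
    after_noise = subtract_with_floor(after_echo, noise_power)
    # Stage 3: residual echo suppression, again using the speaker signal.
    residual_estimate = [residual_gain * s for s in speaker_power]
    return subtract_with_floor(after_noise, residual_estimate)


# Usage with made-up per-band powers.
out = suppress_noise_and_echo(mic_power=[10.0, 5.0], speaker_power=[4.0, 2.0],
                              noise_power=[1.0, 1.0],
                              echo_gain=0.5, residual_gain=0.1)
```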

A speech recognition program for causing a computer to execute each step of the speech recognition method according to claim 1 or 2.

A computer-readable recording medium on which the speech recognition program according to claim 5 is recorded.