JP2004101963A5

Info

Publication number: JP2004101963A5
Application number: JP2002264718A
Authority: JP
Filing date: 2002-09-10
Publication date: 2005-05-19

Description

[0003]
The former is achieved by adapting to and compensating for noise. The latter is achieved mainly by re-evaluating the reliability of the recognition results. In that case, either a language model (LM) more complex than the one used for recognition, or a confidence measure (CM), serves as the reliability measure for post-processing. For post-processing, a technique that re-scores using posterior probability as the CM is reported in Non-Patent Document 1, listed below. That report uses a CM based on posterior probability: after a first recognition pass, the recognition result is re-evaluated under a criterion that maximizes the product of the CM scores over the entire utterance.
[Non-Patent Document 1]
F. Wessel, R. Schluter, H. Ney, "Using Posterior word probabilities for improved speech recognition", Proceedings of ICASSP 2000, pp. 536-566
[Non-Patent Document 2]
G. Everman, P.C. Woodland, "Large vocabulary decoding and confidence estimation using word posterior probabilities", Proceedings of ICASSP 2000, pp. 2366-2369
[Non-Patent Document 3]
T. Matsui, F.K. Soong, B.-H. Juan, "Classification design for Verification of Multi-Class Recognition Design", Proceedings of the 2002 Spring Meeting of the Acoustical Society of Japan, Vol. 1, pp. 85-86, 2002
[Non-Patent Document 4]
J.G. Fiscus, "A Post-processing system to yield reduced error rates: Recognizer output voting error reduction (ROVER)"
[Non-Patent Document 5]
J. Zhang, K. Markov, T. Matsui, R. Gruhn, and S. Nakamura, "Developing Robust Baseline Acoustic Models for Noisy Speech Recognition in SPINE2 Project", Proceedings of the 2002 Spring Meeting of the Acoustical Society of Japan, Vol. 1, pp. 65-66, 2002
[Problems to be solved by the invention]
However, a CM is usually defined empirically, and in many cases it cannot be formulated probabilistically in the way a posterior probability can. In such cases, even if the CM is applied under a criterion computed over the entire utterance, as in Non-Patent Document 1, the result is not necessarily a true optimization.
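
The utterance-level criterion attributed to Non-Patent Document 1 above can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the source: W = w_1 ... w_n denotes a candidate word string for the utterance, and CM(w_i) denotes the posterior-based confidence score of word w_i.

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W = w_1 \dots w_n} \; \prod_{i=1}^{n} \mathrm{CM}(w_i)
```

When CM is an empirically defined score rather than a probability, this product has no probabilistic interpretation, which is the concern raised in the paragraph above.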

Referring again to FIG. 1, the recognition result correction device 32 includes: a word transition network creation unit 42 for creating a word transition network, described later, from the lattice output by the speech recognition circuit 30; a sub speech recognition circuit 40 for performing, on the input speech 22 and independently of the speech recognition circuit 30, sub speech recognition for word verification, and for outputting a CM for word-by-word verification (binary decision); a verification unit 44 for making, based on the word transition network and the CM output by the sub speech recognition circuit 40, a binary decision for each word of the top-ranked N-best recognition result of the speech recognition circuit 30 as to whether the word was recognized correctly; and a correction unit 46 for extracting, based on the verification result of the verification unit 44 and the word transition network, the portions of the top-ranked word string output by the speech recognition circuit 30 that were judged to be misrecognized, re-scoring the CM for correction on the corresponding portions of the second and lower ranked candidates, substituting the portion with the highest CM, and outputting the result as the corrected recognition result 24.
In the present embodiment, the re-scoring is performed with the trigram LM score. An LM score is usually expressed as the statistical probability that a specific word string of a given length appears in a given language. Two specific words appearing consecutively are called a bigram, three words appearing consecutively a trigram, and in general N words appearing consecutively an N-gram. These probabilities can be calculated, for example, by statistically processing a corpus of the language.
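
The verify-then-correct flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names, the toy corpus, the threshold value, the rule "reject when CM is below the threshold", and the add-one smoothing are all introduced for the example. The per-word CM values are taken as given inputs standing in for the output of the sub speech recognition circuit 40, and the replacement candidates are assumed to have been selected already.

```python
from collections import Counter


def find_rejected_spans(cms, threshold):
    """Return (start, end) pairs, end exclusive, of maximal runs of
    consecutive words whose CM is below the threshold (assumed rule)."""
    spans, start = [], None
    for i, cm in enumerate(cms):
        if cm < threshold:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(cms)))
    return spans


def train_trigrams(corpus):
    """Count trigrams and their two-word contexts over a list of sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        vocab.update(words)
        for a, b, c in zip(words, words[1:], words[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx, len(vocab)


def sentence_score(model, words):
    """Trigram LM score of a word string: product of P(c | a, b).
    Add-one smoothing is an assumption made for this sketch."""
    tri, ctx, vocab_size = model
    padded = ["<s>", "<s>"] + words + ["</s>"]
    score = 1.0
    for a, b, c in zip(padded, padded[1:], padded[2:]):
        score *= (tri[(a, b, c)] + 1) / (ctx[(a, b)] + vocab_size)
    return score


def correct(best, cms, threshold, alternatives, model):
    """Replace each rejected span of the top-ranked result with the
    lower-ranked candidate that gives the highest trigram LM score.
    `alternatives` maps a (start, end) span to candidate word lists."""
    result = list(best)
    # Work right to left so earlier span indices remain valid.
    for start, end in reversed(find_rejected_spans(cms, threshold)):
        candidates = alternatives.get((start, end), [])
        if not candidates:
            continue  # nothing to substitute; keep the original words
        result[start:end] = max(
            candidates,
            key=lambda c: sentence_score(model, result[:start] + c + result[end:]),
        )
    return result


if __name__ == "__main__":
    corpus = [["turn", "left", "now"], ["turn", "left", "here"], ["go", "left", "now"]]
    model = train_trigrams(corpus)
    best = ["turn", "loft", "now"]   # hypothetical top-ranked result
    cms = [0.9, 0.2, 0.8]            # hypothetical per-word CM values
    print(correct(best, cms, 0.5, {(1, 2): [["left"], ["lift"]]}, model))
```

Scoring the whole substituted sentence, rather than only the replaced span, keeps the trigram context on both sides of the span in play. A raw product of probabilities favors shorter candidates, so a practical system would likely normalize for length; that refinement is outside what the text above specifies.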

The candidate list creation unit 80 shown in FIG. 2 selects replacement candidates for a misrecognized word sequence based on the word transition network and the duration information, and compiles them into a list. As replacement candidates, the unit basically chooses those paths on the word transition network that correspond to word strings of the second and lower ranked candidates aligned with the misrecognized word sequence. In FIG. 6, the replacement candidates for the case where the word W₂⁽¹⁾ is judged to be misrecognized are shown enclosed by frame 110. In the present embodiment, a replacement candidate must satisfy the following conditions: the global duration of the word sequence on its path is no longer than the global duration of the misrecognized word sequence; its start time is at or after the start time of the misrecognized word sequence; and its end time is at or before the end time of the misrecognized word sequence. Only some of these conditions may be imposed instead. Temporal conditions different from those above may also be imposed, and in some cases no temporal condition need be imposed at all.
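
The temporal conditions on replacement candidates can be sketched as a simple filter. This is an illustration only: the `Path` class, its field names, and the use of integer frame indices for times are assumptions, and "global duration" is taken here to mean end time minus start time, which the text above does not define. The three keyword flags reflect the remark that only part of the conditions, or none, may be imposed.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """A candidate path on the word transition network (hypothetical type)."""
    words: list
    start: int  # start time in frames (assumed unit)
    end: int    # end time in frames (assumed unit)

    @property
    def duration(self):
        # Assumption: global duration = end time minus start time.
        return self.end - self.start


def select_candidates(paths, err_start, err_end,
                      check_duration=True, check_start=True, check_end=True):
    """Keep the paths that satisfy the enabled temporal conditions
    relative to the misrecognized word sequence [err_start, err_end]."""
    err_duration = err_end - err_start
    selected = []
    for p in paths:
        if check_duration and p.duration > err_duration:
            continue  # longer than the misrecognized sequence
        if check_start and p.start < err_start:
            continue  # starts before the misrecognized sequence
        if check_end and p.end > err_end:
            continue  # ends after the misrecognized sequence
        selected.append(p)
    return selected


if __name__ == "__main__":
    paths = [
        Path(["left"], 100, 140),
        Path(["lift", "now"], 90, 140),
        Path(["laughed"], 100, 160),
    ]
    for p in select_candidates(paths, err_start=100, err_end=150):
        print(p.words, p.start, p.end)
```

Under the duration definition assumed here, the start and end conditions together already imply the duration condition, so the duration check only has an independent effect when one of the other two is relaxed.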

Referring to FIG. 8, the computer system 120 further includes a printer 144 connected to the computer 140, which is not shown in FIG. 7. The computer 140 further includes a bus 166 connected to the CD-ROM drive 150 and the FD drive 152, and, all connected to the bus 166, a central processing unit (CPU) 156, a ROM (Read-Only Memory) 158 storing the boot-up program of the computer 140 and the like, a RAM (Random Access Memory) 160 providing a work area used by the CPU 156 and a storage area for programs executed by the CPU 156, and a hard disk 154.

The processing described above realizes the apparatus of the embodiment of the present invention.
-Experimental results-
The speech recognition system described above was implemented on a computer, and the experiments described below were conducted. The baseline speech recognition system used is the one described in Non-Patent Document 5. SPINE (speech in noisy environments) 2 was used as the database. The LMs used are the bigram and trigram designed for SPINE2 by CMU (Carnegie Mellon University). Utterances by 10 female and 10 male speakers, recorded in the presence of background noise at various SNRs (signal-to-noise ratios) from 5 dB to 20 dB, were collected as training data. Test data were collected under four different types of background noise from one male and one female speaker, neither of whom was among the training speakers.

[Brief Description of the Drawings]
FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the verification unit 44 and the correction unit 46 of FIG. 1 in more detail.
FIG. 3 is a diagram schematically showing the lattice and N-best information of the recognition results output from the speech recognition circuit 30.
FIG. 4 is a diagram schematically showing an example of a lattice.
FIG. 5 is a diagram schematically showing an example of a word transition network.
FIG. 6 is a diagram for explaining the principle of the verification and correction of words in a recognition result according to the present embodiment.
FIG. 7 is an external view of a computer system implementing an embodiment of the present invention.
FIG. 8 is a block diagram of the computer system shown in FIG. 7.
FIG. 9 is a flowchart of a program for operating a computer system to implement the speech recognition apparatus and the recognition result correction circuit according to an embodiment of the present invention.
FIG. 10 is a graph showing simulation results for verifying the conditions under which an embodiment of the present invention is effective.

Claims

A method for correcting speech recognition results, comprising the steps of:
operating a computer to obtain a predetermined first confidence measure for each word contained in the first-ranked recognition result candidate obtained by a speech recognition process that outputs word strings of a plurality of recognition result candidates;
operating the computer to determine, word by word, whether the first confidence measure satisfies a predetermined relationship with a predetermined threshold value; and
operating the computer to calculate, for each run of consecutive words for which the first confidence measure was determined not to satisfy the predetermined relationship with the threshold value, a predetermined second confidence measure for the corresponding word strings contained in the second and lower ranked recognition result candidates obtained by the speech recognition process, and to replace the run with the word string for which the highest second confidence measure was obtained.

The method for correcting speech recognition results according to claim 1, wherein the step of operating the computer to obtain the predetermined first confidence measure includes the step of operating the computer to obtain the likelihood that the speech recognition process outputs for each word.

The method for correcting speech recognition results according to claim 1, wherein the step of operating the computer to obtain the predetermined first confidence measure includes the step of operating the computer to calculate a likelihood for each word by processing independent of the speech recognition process.

The method for correcting speech recognition results according to claim 1, wherein the step of operating the computer to obtain the predetermined first confidence measure includes the step of operating the computer to calculate, for each word, the ratio between the likelihood that the speech recognition process outputs for the word and the likelihood of the word calculated by processing independent of the speech recognition process.

The method for correcting a speech recognition result according to claim 3 or 4, wherein the process independent of the speech recognition process includes a calculation process of likelihood for each word by a phoneme loop model.

The method for correcting speech recognition results according to any of the preceding claims, wherein the step of operating the computer to determine, word by word, whether the first confidence measure satisfies a predetermined relationship with a predetermined threshold value includes the step of operating the computer to determine, word by word, whether the first confidence measure is above the predetermined threshold value.

The method according to any one of claims 1 to 6, wherein the predetermined second confidence measure is a statistical occurrence probability of a word string by a language model.

The method according to claim 7, wherein the language model is a trigram language model.

The method for correcting speech recognition results according to any one of claims 1 to 8, wherein
the speech recognition result includes
a lattice composed of the word strings of the recognition result candidates, and
time information for each word included in each recognition result candidate, and
the step of operating the computer to replace with the word string for which the highest confidence measure was obtained includes the steps of:
operating the computer to create a word transition network for the speech recognition result based on the lattice and the time information;
operating the computer to select, for each run of consecutive words for which the first confidence measure was determined not to satisfy the predetermined relationship with the threshold value, the corresponding word strings of the second and lower ranked recognition result candidates on the word transition network;
operating the computer to calculate the second confidence measure for each of the word strings selected in the step of operating the computer to select; and
operating the computer to replace the run of consecutive words for which the first confidence measure was determined not to satisfy the predetermined relationship with the threshold value with the word string having the largest calculated second confidence measure.

The method for correcting speech recognition results according to claim 9, wherein the step of operating the computer to select includes the step of operating the computer to select, for each run of consecutive words for which the first confidence measure was determined not to satisfy the predetermined relationship with the threshold value, from the corresponding second and lower ranked recognition result candidates on the word transition network, a word string whose start time is at or after the start time of the words so determined and whose end time is at or before the end time of the words so determined.

A computer program for correcting speech recognition results, which operates a computer to carry out a speech recognition result correction method, the speech recognition result correction method comprising the steps of:
operating the computer to obtain a predetermined first confidence measure for each word contained in the first-ranked recognition result candidate obtained by a speech recognition process that outputs word strings of a plurality of recognition result candidates;
operating the computer to determine, word by word, whether the first confidence measure satisfies a predetermined relationship with a predetermined threshold value; and
operating the computer to calculate, for each run of consecutive words for which the first confidence measure was determined not to satisfy the predetermined relationship with the threshold value, a predetermined second confidence measure for the corresponding word strings contained in the second and lower ranked recognition result candidates obtained by the speech recognition process, and to replace the run with the word string for which the highest second confidence measure was obtained.

The computer program for correcting speech recognition results according to claim 11, wherein the step of operating the computer to obtain the predetermined first confidence measure includes the step of operating the computer to obtain the likelihood that the speech recognition process outputs for each word.

The computer program for correcting speech recognition results according to claim 11, wherein the step of operating the computer to obtain the predetermined first confidence measure includes the step of operating the computer to calculate a likelihood for each word by processing independent of the speech recognition process.

The computer program for correcting speech recognition results according to claim 11, wherein the step of operating the computer to obtain the predetermined first confidence measure includes the step of operating the computer to calculate, for each word, the ratio between the likelihood that the speech recognition process outputs for the word and the likelihood of the word calculated by processing independent of the speech recognition process.

The computer program for correcting a speech recognition result according to claim 13 or 14, wherein the process independent of the speech recognition process includes a calculation process of likelihood for each word by a phoneme loop model.

The computer program for correcting speech recognition results according to any of claims 11 to 15, wherein the step of operating the computer to determine, word by word, whether the first confidence measure satisfies a predetermined relationship with a predetermined threshold value includes the step of operating the computer to determine, word by word, whether the first confidence measure is above the predetermined threshold value.

The computer program for correcting speech recognition results according to any of claims 11 to 16, wherein the predetermined second confidence measure is a statistical occurrence probability of a word string by a language model.

The computer program for correcting speech recognition results according to claim 17, wherein the language model is a trigram language model.

The computer program for correcting speech recognition results according to any of claims 11 to 18, wherein
the speech recognition result includes
a lattice composed of the word strings of the recognition result candidates, and
time information for each word included in each recognition result candidate, and
the step of operating the computer to replace with the word string for which the highest confidence measure was obtained includes the steps of:
operating the computer to create a word transition network for the speech recognition result based on the lattice and the time information;
operating the computer to select, for each run of consecutive words for which the first confidence measure was determined not to satisfy the predetermined relationship with the threshold value, the corresponding word strings of the second and lower ranked recognition result candidates on the word transition network;
operating the computer to calculate the second confidence measure for each of the word strings selected in the step of operating the computer to select; and
operating the computer to replace the run of consecutive words for which the first confidence measure was determined not to satisfy the predetermined relationship with the threshold value with the word string having the largest calculated second confidence measure.

The computer program for correcting speech recognition results according to claim 19, wherein the step of operating the computer to select includes the step of operating the computer to select, for each run of consecutive words for which the first confidence measure was determined not to satisfy the predetermined relationship with the threshold value, from the corresponding second and lower ranked recognition result candidates on the word transition network, a word string whose start time is at or after the start time of the words so determined and whose end time is at or before the end time of the words so determined.