JP2999726B2

JP2999726B2 - Continuous speech recognition device

Info

Publication number: JP2999726B2
Application number: JP8246012A
Authority: JP
Inventors: 徹清水; 博史山本; 芳典匂坂
Original assignee: 株式会社エイ・ティ・アール音声翻訳通信研究所
Priority date: 1996-09-18
Filing date: 1996-09-18
Publication date: 2000-01-17
Anticipated expiration: 2016-09-18
Also published as: JPH1091185A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a continuous speech recognition apparatus that continuously recognizes speech based on the speech signal of an input uttered sentence.

[0002]

[Related Art] The present applicant has been developing a continuous speech recognition system (hereinafter, the first conventional example) aimed at recognizing spontaneous speech. See, for example, Prior Art Document 1: Nagai, Takami, Sagayama, "The SSS-LR Continuous Speech Recognition System: Integrating SSS-Derived Allophone Models and a Phoneme-Context-Dependent LR Parser", Proc. of ICSLP92, pp. 1511-1514, 1992; and Prior Art Document 2: Shimizu, Monzen, Singer, Matsunaga, "Time-Synchronous Continuous Speech Recognizer Driven by a Context-Free Grammar", Proc. of ICASSP95, pp. 584-587, 1995. In the first conventional example, speech recognition is performed on the speech signal of an input uttered sentence using phoneme hidden Markov models (hereinafter, a hidden Markov model is referred to as an HMM) and a word dictionary, while the word history and the grammar state from the start of the utterance are managed.

[0003] Meanwhile, a speech recognition method using a word graph (hereinafter, the second conventional example) has been proposed in Prior Art Document 3: Ney, Aubert, "A Word Graph Algorithm for Large Vocabulary, Continuous Speech Recognition", Proc. of ICSLP94, pp. 1355-1358, 1994; and Prior Art Document 4: Woodland, Leggetter, Odell, Valtchev, Young, "The 1994 HTK Large Vocabulary Speech Recognition System", Proc. of ICASSP95, pp. 73-76, 1995.

[0004] The main idea of the word graph in the second conventional example is to process word hypothesis candidates in regions of the speech signal where recognition ambiguity is relatively high. Its advantages are that pure speech recognition is decoupled from the application of the language model, and that a complex language model can be applied in a known step that follows the word currently being recognized. The number of word hypothesis candidates must vary with the level of recognition ambiguity. The difficulty in constructing a good word graph efficiently is as follows. The start time of a word generally depends on the preceding words. As a first approximation, this dependency is limited to the immediately preceding word, which yields the so-called word pair approximation: given a word pair and its end time, the word boundary between the two words is independent of any earlier words. The word pair approximation was originally introduced to compute multiple sentences, or the n best sentences, efficiently. The word graph is expected to be more efficient than the approach of obtaining the n best sentences (hereinafter, the n-best method). In the word graph method, multiple word hypotheses need to be generated only locally, whereas in the n-best method each local word hypothesis candidate requires a whole sentence to be added to the list of the n best sentences.

[0005] However, the first conventional example must manage the word history and the grammar state from the start of the utterance. When it is used to recognize spontaneous speech, in which interjections, hesitations, and restarts occur frequently, the computational cost of merging or splitting word hypotheses becomes extremely large. That is, the large amount of processing required for recognition calls for a storage device with a relatively large capacity, and it also lengthens the processing time.

[0006] In the word pair approximation of the second conventional example, one hypothesis represents each preceding word, but the benefit of the approximation is still relatively small. The same problems as in the first conventional example therefore arise.

[0007] To solve these problems, the present applicant proposed the following in Japanese Patent Application No. Hei 7-234043 (hereinafter, the third conventional example): "A continuous speech recognition apparatus comprising speech recognition means for continuously recognizing speech by detecting word hypotheses of an uttered sentence based on the speech signal of the input uttered sentence and calculating likelihoods, wherein, for word hypotheses of the same word that have the same end time and different start times, the speech recognition means narrows down the word hypotheses so that, for each head phoneme environment of the word, they are represented by the one word hypothesis having the highest of the total likelihoods calculated from the utterance start time to the end time of the word."

[0008] However, when the time-synchronous beam search in a continuous speech recognition apparatus such as the third conventional example uses a fixed likelihood width below the most likely candidate as its threshold, it is vulnerable to local fluctuations over time in the acoustic and language likelihoods. Absorbing such local fluctuations requires either widening the beam or looking ahead at the likelihood. A wide beam, however, directly increases the amount of computation required for the search, and likelihood look-ahead can complicate the algorithm and, in some cases, make the look-ahead itself computationally expensive. These problems are described in detail below.
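The fixed-width pruning rule described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `prune_beam`, the path labels, and the score values are all hypothetical. It shows how a path whose cumulative log likelihood dips temporarily is lost under a narrow beam and survives only when the beam is widened.

```python
from typing import Dict


def prune_beam(scores: Dict[str, float], beam_width: float) -> Dict[str, float]:
    """Keep hypotheses whose log likelihood is within beam_width of the best one."""
    if not scores:
        return {}
    best = max(scores.values())
    return {path: s for path, s in scores.items() if s >= best - beam_width}


# Hypothetical cumulative log likelihoods of three paths at one frame.
frame_scores = {"path_a": -120.0, "path_b": -123.5, "path_c": -131.0}

narrow = prune_beam(frame_scores, beam_width=5.0)   # path_c is pruned
wide = prune_beam(frame_scores, beam_width=15.0)    # path_c survives, at higher cost
```

If `path_c` would have recovered a few frames later, the narrow beam has already discarded it, which is the weakness to local fluctuation that the paragraph describes.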

[0009] For example, the time-synchronous beam search in the third conventional example has the following problems.
(a) The likelihood of the unsearched portion is unknown: a high likelihood up to some time t does not guarantee a high likelihood for the whole sentence.
(b) Intermediate states of a phoneme are pruned in the same way as the end of a phoneme: acoustic models are trained so that the likelihood of a single phoneme or a phoneme sequence is maximized, so there is no guarantee that the likelihood is maximal at an intermediate state of a phoneme. With a beam of fixed width, however, intermediate states and phoneme ends are pruned under the same condition.
(c) Short phonemes (words) are inserted: a phoneme with a short duration and a low likelihood contributes little to the cumulative likelihood, which is dominated by phonemes with long durations and high likelihoods. As a result, phonemes with short durations and low likelihoods, which contribute little to the cumulative likelihood, are frequently inserted.

[0010] As a means of solving these problems, a method of looking ahead at the likelihood (phoneme look-ahead; hereinafter, the fourth conventional example) has been proposed in Prior Art Document 5: Ney et al., "An overview of the Philips research system for large vocabulary continuous speech recognition", International Journal of Pattern Recognition and Artificial Intelligence, Vol. 8, No. 1, pp. 58-59, 1994. The fourth conventional example takes into account the likelihood of one phoneme's worth of the unsearched portion and prunes more strictly at phoneme boundaries than at intermediate states of a phoneme. That is, the acoustic likelihood of one phoneme is calculated separately in advance, and when the end of a phoneme is reached, a second beam search is performed that takes the likelihood of the following phoneme into account.

[0011] However, the fourth conventional example limits both the look-ahead length and the language constraints. If the look-ahead computation becomes too large, it cancels out the pruning benefit that the look-ahead provides. The look-ahead window is therefore limited to a few frames, and the language constraints are limited to very simple ones.

[0012] An object of the present invention is to solve the above problems and to provide a continuous speech recognition apparatus that can narrow down word hypotheses with a narrower beam width than the conventional examples and can perform continuous speech recognition of spontaneous speech at a lower computational cost.

[0013]

[Means for Solving the Problems] The continuous speech recognition apparatus according to claim 1 of the present invention comprises speech recognition means for continuously recognizing speech by detecting word hypotheses of an uttered sentence based on the speech signal of the input uttered sentence and calculating acoustic likelihoods, and is characterized in that the speech recognition means corrects the acoustic likelihood of a word hypothesis by delaying the acoustic likelihood at the temporal center of each phoneme of the word so that it moves to a time later than that center.

[0014] The continuous speech recognition apparatus according to claim 2 is the apparatus according to claim 1, characterized in that, for word hypotheses of the same word that have the same end time and different start times, the speech recognition means narrows down the word hypotheses so that, for each head phoneme environment of the word, they are represented by the one word hypothesis having the highest total likelihood among the total likelihoods, including the acoustic likelihood, calculated from the utterance start time to the end time of the word.

[0015]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. FIG. 1 is a block diagram of a continuous speech recognition apparatus according to one embodiment of the present invention. The apparatus of this embodiment includes a word matching unit 4 that uses the known one-pass Viterbi decoding method to detect word hypotheses of the uttered sentence from the feature parameters of the speech signal of the input uttered sentence, and to calculate and output acoustic likelihoods. The apparatus is characterized by further comprising:
(1) a likelihood correction unit 7 that, for each word hypothesis output from the word matching unit 4 via the buffer memory 5, corrects the acoustic likelihood of the word hypothesis by delaying the peak of the acoustic likelihood at the temporal center of each phoneme of the word so that it moves to a time later than that center; and
(2) a word hypothesis narrowing unit 6 that, based on the word hypotheses output from the likelihood correction unit 7 with total likelihoods that include the acoustic likelihood, narrows down the word hypotheses so that, for each head phoneme environment of the word, they are represented by the one word hypothesis having the highest of the total likelihoods calculated from the utterance start time to the end time of the word.

[0016] The likelihood correction performed by the likelihood correction unit 7 in this embodiment can be called a delayed-decision beam search. Rather than relying on likelihood look-ahead as in the fourth conventional example, or on mapping the likelihood through a nonlinear function, the delayed-decision beam search copes with local fluctuations in the likelihood by delaying the evaluation of the likelihood of paths that have already been searched. In the calculations below, likelihood means log likelihood. The symbols used in this embodiment are defined as follows.
(a) $t$: time;
(b) $S$: a path in the beam search;
(c) $q_A(S,t)$: the acoustic likelihood on path $S$ at time $t$;
(d) $Q_A(S,t)$: the cumulative acoustic likelihood from the start of the sentence on path $S$ at time $t$;
(e) $Q_L(S,t)$: the cumulative language likelihood from the start of the sentence on path $S$ at time $t$.

[0017] Here, the acoustic likelihood is the likelihood that the word matching unit 4 calculates with reference to the phoneme HMMs in the phoneme HMM memory 11, and the language likelihood is the likelihood that the word matching unit 4 calculates with reference to the language model in the statistical language model memory 13. With these definitions, the cumulative acoustic likelihood is generally obtained by summing the per-frame acoustic likelihoods, as in the following equation.

[0018]

[Equation 1]
$$Q_A(S,t) = Q_A(S,t-1) + q_A(S,t)$$

[0019] The cumulative total likelihood $Q_{all}(S,t)$ from the start of the sentence, which is used for the beam search, is calculated from the acoustic likelihood $Q_A(S,t)$ and the language likelihood $Q_L(S,t)$ by the following equation.

[0020]

[Equation 2]
$$Q_{all}(S,t) = Q_A(S,t) + \alpha \cdot Q_L(S,t)$$
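Equations 1 and 2 can be sketched in a few lines. This is an illustrative sketch only: the function names and the sample scores are hypothetical, and the default weight of 4.5 is the value the description gives for the preferred embodiment.

```python
from typing import List

ALPHA = 4.5  # language weight; value stated for the preferred embodiment


def cumulative_acoustic(frame_scores: List[float]) -> List[float]:
    """Equation 1: Q_A(S, t) = Q_A(S, t-1) + q_A(S, t), for every frame t."""
    total = 0.0
    cumulative = []
    for q in frame_scores:
        total += q
        cumulative.append(total)
    return cumulative


def total_likelihood(q_acoustic: float, q_language: float, alpha: float = ALPHA) -> float:
    """Equation 2: Q_all(S, t) = Q_A(S, t) + alpha * Q_L(S, t)."""
    return q_acoustic + alpha * q_language


# Hypothetical per-frame acoustic log likelihoods along one path.
q_a_history = cumulative_acoustic([-2.0, -1.0, -3.0])
score = total_likelihood(q_a_history[-1], q_language=-2.0)
```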

[0021] Here, the constant $\alpha$ is the weight of the language likelihood relative to the acoustic likelihood; in a preferred embodiment, $\alpha = 4.5$. In the delayed-decision beam search of this embodiment, $Q_A(S,t)$ in Equation 2 is replaced by the likelihood $Q_A'(S,t)$ obtained by subtracting the delayed acoustic likelihood $Q_{Ad}(S,t)$ from $Q_A(S,t)$, as shown in the following equation. Likewise, at time $t-1$, as shown in FIG. 3 and Equation 3, $Q_A(S,t-1)$ is replaced by the likelihood $Q_A'(S,t-1)$ obtained by subtracting the delayed acoustic likelihood $Q_{Ad}(S,t-1)$ from $Q_A(S,t-1)$.

[0022]

[Equation 3]
$$Q_A'(S,t) = Q_A(S,t) - Q_{Ad}(S,t)$$

[0023] Here, the likelihood $Q_{Ad}(S,t)$ in the second term on the right-hand side of Equation 3 is calculated by the following equations.

[0024]

[Equation 4]
$$D = Q_{Ad}(S,t-1) + q_A(S,t)$$

[Equation 5]
$$Q_{Ad}(S,t) = F(D) \cdot D$$

[0025] Rewriting Equation 3 with reference to Equation 1 gives the following equation.

[0026]

[Equation 6]
$$Q_A'(S,t) = Q_A'(S,t-1) + q_A'(S,t)$$

[0027] Here, the likelihood $q_A'(S,t)$ is determined by the following equation.

[0028]

[Equation 7]
$$q_A'(S,t) = f(x) = f\bigl(q_A(S,t) + Q_A(S,t-1) - Q_A'(S,t-1)\bigr)$$
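The step from Equation 3 to Equations 6 and 7 is not written out in the description. The following derivation is one consistent reading of it, using only Equations 1 and 3 to 5. The closing relation between $f$ and $F$ is an inference from those equations, not a statement made in the source.

```latex
\begin{align*}
Q_A'(S,t) &= Q_A(S,t) - Q_{Ad}(S,t)
  && \text{(Equation 3)} \\
 &= Q_A(S,t-1) + q_A(S,t) - Q_{Ad}(S,t)
  && \text{(Equation 1)} \\
 &= Q_A'(S,t-1) + Q_{Ad}(S,t-1) + q_A(S,t) - Q_{Ad}(S,t)
  && \text{(Equation 3 at } t-1\text{)} \\
 &= Q_A'(S,t-1) + D - F(D)\,D
  && \text{(Equations 4 and 5)}
\end{align*}
```

Comparing the last line with Equation 6 gives $q_A'(S,t) = D\,(1 - F(D))$. Because the argument of $f$ in Equation 7 is $x = q_A(S,t) + Q_{Ad}(S,t-1) = D$, Equation 7 holds under this reading when $f(x) = x\,(1 - F(x))$.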

[0029] Here, $\{Q_A(S,t-1) - Q_A'(S,t-1)\}$ in Equation 7 equals $Q_{Ad}(S,t-1)$, the amount by which the likelihood was underestimated one time step earlier relative to the third conventional example. This value is stored sequentially in the underestimated likelihood memory 14 connected to the likelihood correction unit 7 and is used at the next time $t$ to correct the acoustic likelihood and calculate the total likelihood. Accordingly, in this embodiment, the likelihood correction unit 7 operates as follows.
(1) At time $t-1$, for each word hypothesis, it calculates the underestimated amount $\{Q_A(S,t-1) - Q_A'(S,t-1)\}$ of Equation 7 and stores it in the underestimated likelihood memory 14.
(2) At time $t$, it uses Equations 6 and 7 to calculate the acoustic likelihood $Q_A'(S,t)$, corrected so as to be underestimated.
(3) It then uses Equation 8, which is Equation 2 rewritten, to calculate the total likelihood $Q_{all}'(S,t)$, which is a cumulative likelihood.
(4) It outputs the word hypotheses having the calculated total likelihood $Q_{all}'(S,t)$ to the word hypothesis narrowing unit 6 via the buffer memory 5.

[0030]

[Equation 8]
$$Q_{all}'(S,t) = Q_A'(S,t) + \alpha \cdot Q_L(S,t)$$
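The per-frame update of Equations 1, 3, 4, 5 and 8 can be sketched as below. Several things here are assumptions rather than part of the source: the source gives $F(D)$ only as a curve in a figure, so `delay_ratio` is a hypothetical stand-in with the shape constant `K` chosen arbitrarily; the names `PathState`, `advance`, and `corrected_total` are illustrative; and the example uses positive per-frame scores purely for readability, whereas the description works in log likelihoods.

```python
from dataclasses import dataclass

ALPHA = 4.5  # language weight given for the preferred embodiment
K = 10.0     # hypothetical shape constant for the stand-in F(D) below


def delay_ratio(d: float) -> float:
    """Hypothetical stand-in for F(D): the fraction of D that is deferred.

    It approaches 1 for large D and is 0 for D <= 0, so a large share is
    deferred when the score is high and almost none when it is low.
    """
    return d / (d + K) if d > 0.0 else 0.0


@dataclass
class PathState:
    q_acoustic: float = 0.0  # Q_A(S, t)
    q_delayed: float = 0.0   # Q_Ad(S, t), the stored underestimated amount
    q_language: float = 0.0  # Q_L(S, t)


def advance(state: PathState, frame_acoustic: float, q_language: float) -> PathState:
    """Advance one frame using Equations 1, 4 and 5."""
    d = state.q_delayed + frame_acoustic            # Equation 4
    q_delayed = delay_ratio(d) * d                  # Equation 5
    q_acoustic = state.q_acoustic + frame_acoustic  # Equation 1
    return PathState(q_acoustic, q_delayed, q_language)


def corrected_total(state: PathState, alpha: float = ALPHA) -> float:
    """Total likelihood used for pruning, from Equations 3 and 8."""
    q_corrected = state.q_acoustic - state.q_delayed  # Equation 3
    return q_corrected + alpha * state.q_language     # Equation 8


s0 = PathState()
s1 = advance(s0, frame_acoustic=10.0, q_language=0.0)  # high-scoring frame
s2 = advance(s1, frame_acoustic=0.0, q_language=0.0)   # low-scoring frame
```

In this sketch half of the score of the first frame is withheld (`s1.q_delayed` is 5.0), and part of it is released on the next, low-scoring frame, so the corrected total keeps rising even though the second frame itself contributes nothing.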

[0031] In Equation 7, the function $f(x)$ is a first function that determines the delay ratio for the likelihood $x$; an example is shown in FIG. 5. As FIG. 5 shows, $f(x)$ is a function whose slope generally decreases as $x$ increases. The function $F(D)$ in Equation 5 is a second function, related to the first, that determines the delay ratio for the likelihood $D$; an example is shown in FIG. 6.

[0032] When phoneme HMMs are used as the acoustic model, the likelihood generally tends to be low at phoneme boundaries and high at phoneme centers, as shown in FIG. 4. Using the functions of FIG. 5 and FIG. 6 therefore corrects the acoustic likelihood so that, as shown in FIG. 4, the delay is large at phoneme centers and almost zero at phoneme boundaries. In other words, the acoustic likelihood of a word hypothesis is corrected by delaying (group-delaying) the peak of the acoustic likelihood at the temporal center of each phoneme of the word so that it moves to a time later than that center. As a result, all or part of the acoustic likelihood at a phoneme center is evaluated at a time close to the phoneme boundary, and an effect similar to that of the fourth conventional example can be expected.

[0033] Next, the configuration and operation of the continuous speech recognition apparatus of FIG. 1 are described. In FIG. 1, the phoneme HMM memory 11 is connected to the word matching unit 4 and stores the phoneme HMMs in advance. Each phoneme HMM is represented as a set of states, and each state holds the following information.
(a) State number
(b) Acceptable context classes
(c) Lists of preceding states and succeeding states
(d) Parameters of the output probability density distribution
(e) Self-transition probability and transition probabilities to succeeding states
Because it is necessary to identify which speaker each distribution originates from, the phoneme HMMs used in this embodiment are created by converting a predetermined speaker-mixture HMM. The output probability density function is a Gaussian mixture distribution with 34-dimensional diagonal covariance matrices.
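The per-state record listed in items (a) to (e) maps naturally onto a small data structure. The sketch below is illustrative only: every class name, field name, and value is hypothetical, and the source does not specify how the apparatus lays out this information in memory.

```python
from dataclasses import dataclass
from typing import List, Tuple

FEATURE_DIM = 34  # dimensionality stated for the output densities


@dataclass
class GaussianComponent:
    weight: float          # mixture weight
    mean: List[float]      # FEATURE_DIM values
    variance: List[float]  # diagonal of the covariance matrix


@dataclass
class HmmState:
    state_id: int                              # (a) state number
    context_classes: List[str]                 # (b) acceptable context classes
    predecessors: List[int]                    # (c) preceding states
    successors: List[int]                      # (c) succeeding states
    mixture: List[GaussianComponent]           # (d) output density parameters
    self_transition: float                     # (e) self-transition probability
    next_transitions: List[Tuple[int, float]]  # (e) (state number, probability)

    def transitions_sum_to_one(self, tol: float = 1e-9) -> bool:
        """Check that the outgoing probabilities form a distribution."""
        total = self.self_transition + sum(p for _, p in self.next_transitions)
        return abs(total - 1.0) <= tol


# A hypothetical single-component state, for illustration only.
state = HmmState(
    state_id=7,
    context_classes=["vowel_left"],
    predecessors=[6],
    successors=[8],
    mixture=[GaussianComponent(1.0, [0.0] * FEATURE_DIM, [1.0] * FEATURE_DIM)],
    self_transition=0.6,
    next_transitions=[(8, 0.4)],
)
```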

[0034] The word dictionary memory 12 is connected to the word matching unit 4 and stores the word dictionary in advance. For each word, the word dictionary stores a symbol string that gives the reading of the word in terms of the symbols of the phoneme HMMs in the phoneme HMM memory 11. The statistical language model memory 13 is also connected to the word matching unit 4 and stores a predetermined statistical language model in advance. The statistical language model can be, for example, the language model called a variable-length N-gram, in which the length in the time direction is variable, disclosed in Prior Art Document 6: Hirokazu Masataki et al., "Variable-Length Chain Statistical Language Model for Continuous Speech Recognition", IEICE Technical Report, SP95-73, November 1995. This statistical language model is a variable-length N-gram over part-of-speech classes and words, and it is expressed as a bigram over the following three kinds of classes:
(a) part-of-speech classes,
(b) classes of words separated from a part-of-speech class, and
(c) classes formed by joining adjacent words.

[0035] In the continuous speech recognition apparatus of FIG. 1, the feature extraction unit 2, the word matching unit 4, the likelihood correction unit 7, and the word hypothesis narrowing unit 6 are implemented, for example, on a digital computer with a CPU. The buffer memories 3 and 5, the phoneme HMM memory 11, the word dictionary memory 12, the statistical language model memory 13, and the underestimated likelihood memory 14 are implemented, for example, in hard disk memory.

[0036] In FIG. 1, the speaker's uttered speech is input to the microphone 1, converted into a speech signal, and then input to the feature extraction unit 2. The feature extraction unit 2 performs A/D conversion on the input speech signal and then executes, for example, LPC analysis to extract 34-dimensional feature parameters consisting of the log power, 16 cepstrum coefficients, the delta log power, and 16 delta cepstrum coefficients. The time series of extracted feature parameters is input to the word matching unit 4 via the buffer memory 3.
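The layout of the 34-dimensional feature vector (1 + 16 + 1 + 16 values) can be sketched as follows. The sketch assumes the log power and cepstrum coefficients are already available, since the LPC analysis itself is not shown. It also assumes the delta terms are a simple difference from the previous frame, which the source does not specify. The function name and the sample values are hypothetical.

```python
from typing import List

N_CEPSTRA = 16  # cepstrum order stated in the description


def build_feature(
    log_power: float,
    cepstra: List[float],
    prev_log_power: float,
    prev_cepstra: List[float],
) -> List[float]:
    """Return [log power, 16 cepstra, delta log power, 16 delta cepstra]."""
    if len(cepstra) != N_CEPSTRA or len(prev_cepstra) != N_CEPSTRA:
        raise ValueError("expected 16 cepstrum coefficients per frame")
    # Assumption: delta = current frame minus previous frame.
    delta_power = log_power - prev_log_power
    delta_cepstra = [c - p for c, p in zip(cepstra, prev_cepstra)]
    return [log_power, *cepstra, delta_power, *delta_cepstra]


# Hypothetical values for two consecutive frames.
previous = [0.0] * N_CEPSTRA
current = [1.0] * N_CEPSTRA
feature = build_feature(2.0, current, 1.5, previous)
```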

[0037] Using the one-pass Viterbi decoding method, the word matching unit 4 detects word hypotheses from the feature parameter data input via the buffer memory 3, with reference to the phoneme HMMs in the phoneme HMM memory 11, the word dictionary in the word dictionary memory 12, and the statistical language model in the statistical language model memory 13. It calculates the acoustic likelihood based on the phoneme HMMs and the language likelihood based on the statistical language model, and outputs them together with the word hypotheses to the likelihood correction unit 7. For each HMM state at each time, the word matching unit 4 calculates the within-word likelihood and the acoustic likelihood from the start of the utterance. The likelihoods, including the acoustic and language likelihoods, are held separately for each distinct word identification number, word start time, and preceding word. To reduce the amount of computation, grid hypotheses whose total likelihood, calculated from the phoneme HMMs, the word dictionary, and the statistical language model, is low are pruned. The word matching unit 4 outputs the resulting word hypotheses and their total likelihoods to the likelihood correction unit 7, together with time information measured from the utterance start time (specifically, for example, a frame number).

[0038] In response, the likelihood correction unit 7, at time $t-1$, calculates for each word hypothesis the underestimated amount $\{Q_A(S,t-1) - Q_A'(S,t-1)\}$ of Equation 7 and stores it in the underestimated likelihood memory 14. At time $t$, it uses Equations 6 and 7 to calculate the acoustic likelihood $Q_A'(S,t)$, corrected so as to be underestimated, and then uses Equation 8 to calculate the total likelihood $Q_{all}'(S,t)$. It outputs the word hypotheses having the calculated total likelihood $Q_{all}'(S,t)$ to the word hypothesis narrowing unit 6 via the buffer memory 5.

[0039] The word hypothesis narrowing unit 6 operates on the word hypotheses, with their total likelihoods, that are output from the likelihood correction unit 7 via the buffer memory 5. For word hypotheses of the same word that have the same end time and different start times, it narrows down the hypotheses so that, for each head phoneme environment of the word, they are represented by the one word hypothesis having the highest of the total likelihoods calculated from the utterance start time to the end time of the word. It then outputs, as the recognition result, the word string of the hypothesis with the highest total likelihood among the word strings of all remaining word hypotheses. In this embodiment, the head phoneme environment of the word being processed is preferably the sequence of three phonemes consisting of the final phoneme of the word hypothesis preceding the word and the first two phonemes of the word hypothesis for the word.

【００４０】例えば、図２に示すように、（ｉ−１）番
目の単語Ｗ_i-1の次に、音素列ａ₁，ａ₂，…，ａ_nからな
るｉ番目の単語Ｗ_iがくるときに、単語Ｗ_i-1の単語仮説
として６つの仮説Ｗａ，Ｗｂ，Ｗｃ，Ｗｄ，Ｗｅ，Ｗｆ
が存在している。ここで、前者３つの単語仮説Ｗａ，Ｗ
ｂ，Ｗｃの最終音素は／ｘ／であるとし、後者３つの単
語仮説Ｗｄ，Ｗｅ，Ｗｆの最終音素は／ｙ／であるとす
る。終了時刻ｔ_eと先頭音素環境が等しい仮説（図２で
は先頭音素環境が“ｘ／ａ₁／ａ₂”である上から３つの
単語仮説）のうち総合尤度が最も高い仮説（例えば、図
２において１番上の仮説）以外を削除する。なお、上か
ら４番めの仮説は先頭音素環境が違うため、すなわち、
先行する単語仮説の最終音素がｘではなくｙであるの
で、上から４番めの仮説を削除しない。すなわち、先行
する単語仮説の最終音素毎に１つのみ仮説を残す。図２
の例では、最終音素／ｘ／に対して１つの仮説を残し、
最終音素／ｙ／に対して１つの仮説を残す。For example, as shown in FIG. 2, when the i-th word W_i, consisting of the phoneme string a_1, a_2, ..., a_n, follows the (i−1)-th word W_i-1, suppose that six hypotheses Wa, Wb, Wc, Wd, We, and Wf exist as word hypotheses of the word W_i-1. Suppose further that the final phoneme of the first three word hypotheses Wa, Wb, and Wc is /x/, and that the final phoneme of the last three word hypotheses Wd, We, and Wf is /y/. Among the hypotheses having the same end time t_e and the same head phoneme environment (in FIG. 2, the top three word hypotheses, whose head phoneme environment is "x/a_1/a_2"), all hypotheses other than the one with the highest total likelihood (for example, the top hypothesis in FIG. 2) are deleted. The fourth hypothesis from the top is not deleted, because its head phoneme environment differs: the final phoneme of its preceding word hypothesis is y, not x. In other words, only one hypothesis is kept for each final phoneme of the preceding word hypothesis. In the example of FIG. 2, one hypothesis is kept for the final phoneme /x/ and one for the final phoneme /y/.
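The narrowing rule of paragraphs [0039] and [0040] can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the word hypothesis narrowing unit 6. The `WordHypothesis` class, its field names, the function names, the frame indices, and every likelihood value are assumptions introduced here; the total likelihoods are made-up log-domain numbers chosen so that the top hypothesis of each group wins, as in the FIG. 2 example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class WordHypothesis:
    prev_hypothesis: str       # preceding word hypothesis, e.g. "Wa" (hypothetical label)
    prev_final_phoneme: str    # final phoneme of that preceding hypothesis
    word: str                  # the current word W_i
    phonemes: Tuple[str, ...]  # phoneme string a1, a2, ..., an of W_i
    start_time: int            # start frame (differs between hypotheses)
    end_time: int              # end frame t_e (shared by the hypotheses compared)
    total_likelihood: float    # total likelihood from utterance start to t_e


def head_phoneme_environment(h: WordHypothesis) -> Tuple[str, ...]:
    """Final phoneme of the preceding hypothesis plus the first two phonemes of the word."""
    return (h.prev_final_phoneme,) + h.phonemes[:2]


def narrow(hypotheses: List[WordHypothesis]) -> List[WordHypothesis]:
    """Keep the single best hypothesis per (word, end time, head phoneme environment)."""
    best: Dict[tuple, WordHypothesis] = {}
    for h in hypotheses:
        key = (h.word, h.end_time, head_phoneme_environment(h))
        if key not in best or h.total_likelihood > best[key].total_likelihood:
            best[key] = h
    return list(best.values())


# Hypothetical data modeled on FIG. 2: six preceding hypotheses, three ending in
# /x/ and three ending in /y/, all followed by the same word W_i ending at t_e = 100.
W_I = ("a1", "a2", "a3")
fig2 = [
    WordHypothesis("Wa", "x", "Wi", W_I, 60, 100, -410.0),
    WordHypothesis("Wb", "x", "Wi", W_I, 62, 100, -415.0),
    WordHypothesis("Wc", "x", "Wi", W_I, 65, 100, -420.0),
    WordHypothesis("Wd", "y", "Wi", W_I, 61, 100, -412.0),
    WordHypothesis("We", "y", "Wi", W_I, 63, 100, -430.0),
    WordHypothesis("Wf", "y", "Wi", W_I, 66, 100, -440.0),
]

survivors = narrow(fig2)
print([h.prev_hypothesis for h in survivors])
```

With these assumed values, one hypothesis survives for the context beginning with /x/ and one for the context beginning with /y/, so the fourth hypothesis from the top is retained even though the first has a higher total likelihood.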

【００４１】以上の実施形態においては、当該単語の先
頭音素環境とは、当該単語より先行する単語仮説の最終
音素と、当該単語の単語仮説の最初の２つの音素とを含
む３つの音素並びとして定義されているが、本発明はこ
れに限らず、先行する単語仮説の最終音素と、最終音素
と連続する先行する単語仮説の少なくとも１つの音素と
を含む先行単語仮説の音素列と、当該単語の単語仮説の
最初の音素を含む音素列とを含む音素並びとしてもよ
い。In the embodiment above, the head phoneme environment of the word is defined as a sequence of three phonemes consisting of the final phoneme of the word hypothesis preceding the word and the first two phonemes of the word hypothesis of the word. The present invention is not limited to this, however. The head phoneme environment may instead be a phoneme sequence consisting of (a) a phoneme string of the preceding word hypothesis that includes its final phoneme and at least one phoneme of the preceding word hypothesis contiguous with that final phoneme, and (b) a phoneme string that includes the first phoneme of the word hypothesis of the word.

【００４２】[0042]

【実施例】本発明者は、図１の連続音声認識装置の有効
性を確認するために、自然発話データベースを用いて単
語グラフ生成実験を行なった。“トラベル・プランニン
グ”をタスクとした本出願人が所有する音声言語データ
ベース（例えば、従来技術文献７「Morimoto et al.,
“A Speech and Language Database for Speech Transl
ation Research",Proc.of ICSLP94,pp.1791-1794,1994
年」参照。）の「ホテル予約」に関する７対話（対話の
申込者側発声、男性３名及び女性４名、１００発声、９
８３語）を用いた。音響分析は、標本化周波数１２ｋＨ
ｚ、フレーム間隔１０ｍｓｅｃ、ハミング窓２０ｍｓｅ
ｃを用いて行い、ここで、特徴パラメータとして、１〜
１６次ＬＰＣケプストラム、１〜１６次ΔＬＰＣケプス
トラム、ｌｏｇパワー、Δｌｏｇパワーを用いた。４０
１状態で５混合されたＨＭｎｅｔである音響モデルは、
１５０文の朗読音声を用いて学習したモデルを自然発話
約２０発声で話者適応した。また、言語モデルは、延べ
３３０，５１３語を含む８２８対話から７１３クラスの
クラスバイグラムを作成した。テストセットの単語パー
プレキシティは、４９．６である。語彙数は６，６３５
語で、評価データの語彙を全て含んでおり未知語はない
ものとした。さらに、上記数２における言語尤度の音響
尤度に対する重みαは４．５と設定した。EXAMPLES To confirm the effectiveness of the continuous speech recognition apparatus of FIG. 1, the present inventors conducted a word graph generation experiment using a spontaneous speech database. The data were seven dialogues on "hotel reservation" (the utterances of the party making the reservation; three male and four female speakers; 100 utterances; 983 words) taken from the spoken language database owned by the present applicant, whose task is "travel planning" (see, for example, Prior Art Document 7: Morimoto et al., "A Speech and Language Database for Speech Translation Research", Proc. of ICSLP94, pp. 1791-1794, 1994). Acoustic analysis was performed with a sampling frequency of 12 kHz, a frame interval of 10 msec, and a 20 msec Hamming window; the feature parameters were the 1st- to 16th-order LPC cepstrum, the 1st- to 16th-order ΔLPC cepstrum, log power, and Δlog power. The acoustic model, an HMnet with 401 states and 5 mixtures, was trained on read speech of 150 sentences and then speaker-adapted with about 20 spontaneous utterances. For the language model, a class bigram with 713 classes was built from 828 dialogues containing a total of 330,513 words. The word perplexity of the test set is 49.6. The vocabulary size is 6,635 words, covering every word in the evaluation data, so there are no unknown words. The weight α of the language likelihood relative to the acoustic likelihood in Equation 2 was set to 4.5.
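As a rough check on the analysis conditions above, the following Python sketch converts them into sample counts and a feature dimension. The constant and function names are assumptions introduced here, as is the framing convention (no padding, only full windows counted); the Hamming window uses the common 0.54/0.46 definition, which the text does not spell out.

```python
import math

SAMPLE_RATE_HZ = 12_000   # sampling frequency stated in the text
FRAME_SHIFT_MS = 10       # frame interval stated in the text
WINDOW_MS = 20            # Hamming window length stated in the text
CEPSTRUM_ORDER = 16       # 1st- to 16th-order LPC cepstrum

frame_shift = SAMPLE_RATE_HZ * FRAME_SHIFT_MS // 1000  # samples between frame starts
window_length = SAMPLE_RATE_HZ * WINDOW_MS // 1000     # samples per analysis window

# 16 cepstra + 16 delta cepstra + log power + delta log power
feature_dim = 2 * CEPSTRUM_ORDER + 2


def hamming(n: int) -> list:
    """Common Hamming window definition (an assumption; the text only names the window)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def num_frames(n_samples: int) -> int:
    """Number of full windows that fit, assuming no padding at the signal edges."""
    if n_samples < window_length:
        return 0
    return 1 + (n_samples - window_length) // frame_shift


print(frame_shift, window_length, feature_dim, num_frames(SAMPLE_RATE_HZ))
```

Under these assumptions each frame advances by 120 samples, each window spans 240 samples, and each frame yields a 34-dimensional feature vector.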

【００４３】当該装置の認識性能を、ビーム幅に対する
単語認識率（word accuracy）とＣＰＵ時間（時間）で
評価した。尤度幅一定のビームで探索を行った場合、ビ
ーム幅を広げるに従って単語認識率は向上するが、ある
程度以上広げると逆に単語認識率が低下する現象が見ら
れる。この現象は、単語仮説の探索のためのビーム幅の
拡大がビームの下限を下げるのではなく上限を押し上げ
るように働いたものと説明できる。従って、本実施例で
は、単語認識率がピークになる付近で比較を行う。The recognition performance of the apparatus was evaluated in terms of the word recognition rate (word accuracy) and the CPU time (in hours) as functions of the beam width. When the search is performed with a beam of constant likelihood width, the word recognition rate improves as the beam width is widened, but widening it beyond a certain point is observed to lower the word recognition rate instead. This phenomenon can be explained as the widening of the beam width for the word hypothesis search acting to push up the upper limit of the beam rather than to lower its lower limit. Therefore, in this example, the comparison is made in the vicinity of the beam width at which the word recognition rate peaks.
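For readers unfamiliar with a beam of constant likelihood width, the following Python sketch shows the usual form of such pruning. The text does not give the pruning rule itself, so this is an assumption: the function name `beam_prune`, the hypothesis labels, and the scores are all hypothetical, and likelihoods are treated as log-domain values.

```python
from typing import Dict


def beam_prune(scores: Dict[str, float], beam_width: float) -> Dict[str, float]:
    """Keep hypotheses whose likelihood is within beam_width of the current best."""
    if not scores:
        return {}
    upper = max(scores.values())   # upper limit of the beam: the best likelihood
    lower = upper - beam_width     # lower limit floats with the upper limit
    return {name: s for name, s in scores.items() if s >= lower}


# Made-up log likelihoods of four competing hypotheses at one frame.
scores = {"h1": -50.0, "h2": -55.0, "h3": -112.0, "h4": -125.0}

print(sorted(beam_prune(scores, 60.0)))
print(sorted(beam_prune(scores, 70.0)))
```

Because the lower limit is defined relative to the best likelihood, any change that raises the upper limit also raises the threshold that all other hypotheses must clear, which is consistent with the explanation given above for the drop in recognition rate at wide beams.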

【００４４】図７は、図１の連続音声認識装置の実験結
果であって、ビーム幅に対する単語認識率を示すグラフ
であり、図８は、図１の連続音声認識装置の実験結果で
あって、ビーム幅に対するＣＰＵ計算時間（時間）を示
すグラフである。FIG. 7 is a graph of experimental results for the continuous speech recognition apparatus of FIG. 1, showing the word recognition rate versus beam width, and FIG. 8 is a graph of experimental results for the same apparatus, showing the CPU calculation time (in hours) versus beam width.

【００４５】図７において、ビーム幅が６０から７０ま
でにおける単語認識率を比較すると、尤度補正ありの場
合は尤度補正なしの場合に比較してより狭いビーム幅で
ピークを迎えることがわかる。また、尤度補正なしの特
性曲線を、ビーム幅方向に−２程度シフトすると両者の
特性曲線がほぼ重なることから、当該尤度補正はビーム
幅を２程度狭くするのと同じ効果がある。なお、図８か
ら明らかなように、ビーム幅が同じならば認識時間は尤
度補正のあり／なしに影響を受けないので、尤度補正あ
りは、尤度補正なしに比較して計算時間が少なくて済
み、ここで、計算時間の削減率は約１０％である。Comparing the word recognition rates in FIG. 7 for beam widths from 60 to 70 shows that the recognition rate peaks at a narrower beam width with likelihood correction than without it. Moreover, shifting the characteristic curve without likelihood correction by about −2 in the beam width direction makes the two curves nearly coincide, so the likelihood correction has the same effect as narrowing the beam width by about 2. As is clear from FIG. 8, for the same beam width the recognition time is unaffected by whether likelihood correction is applied. Because the peak is reached at a narrower beam width, less calculation time is therefore needed with likelihood correction than without it; the reduction in calculation time is about 10%.

【００４６】以上説明したように、本実施形態によれ
ば、単語仮説に対して、当該単語の各音素の時間方向の
中央部の音響尤度のピークを、当該中央部よりも遅延さ
れた時刻に移動するように遅延させて、当該単語仮説の
音響尤度を補正したので、第３の従来例に比較してより
狭いビーム幅で単語仮説の絞り込みを行うことができ、
より小さい計算コストでかつより高い認識率で自然発話
の連続音声認識を行うことができる。As described above, according to the present embodiment, the acoustic likelihood of a word hypothesis is corrected by delaying the peak of the acoustic likelihood at the central part, in the time direction, of each phoneme of the word so that it moves to a time later than the central part. Compared with the third conventional example, word hypotheses can therefore be narrowed down with a narrower beam width, and continuous speech recognition of spontaneous speech can be performed at a lower calculation cost and with a higher recognition rate.

【００４７】また、本実施形態によれば、終了時刻が等
しく開始時刻が異なる同一の単語の単語仮説に対して、
当該単語の先頭音素環境毎に、発声開始時刻から当該単
語の終了時刻に至る計算された総合尤度のうちの最も高
い総合尤度を有する１つの単語仮説で代表させるように
単語仮説の絞り込みを行う。すなわち、先行単語毎に１
つの単語仮説で代表させる第２の従来例の単語ペア近似
法に比較して、単語の先頭音素の先行音素（つまり、先
行単語の最終音素）が等しいものをひとまとめに扱うた
めに、単語仮説数を削減することができ、近似効果は大
きい。特に、語彙数が増加した場合において削減効果が
大きい。従って、当該連続音声認識装置を、間投詞の挿
入や、言い淀み、言い直しが頻繁に生じる自然発話の認
識に用いた場合であっても、単語仮説の併合又は分割に
要する計算コストは従来例に比較して小さくなる。すな
わち、音声認識のために必要な処理量が小さくなり、そ
れ故、単語照合部４のワーキングメモリ（図示せ
ず。）、バッファメモリ５及び単語仮説絞込部６のワー
キングメモリ（図示せず。）などの音声認識のための記
憶装置において必要な記憶容量は小さくなる一方、処理
量が小さくなるので音声認識のための処理時間を短縮す
ることができる。Further, according to the present embodiment, among word hypotheses of the same word that have the same end time but different start times, the word hypotheses are narrowed down so that, for each head phoneme environment of the word, they are represented by the single word hypothesis having the highest total likelihood among the total likelihoods calculated from the utterance start time to the end time of the word. That is, compared with the word pair approximation method of the second conventional example, in which one word hypothesis is kept as the representative for each preceding word, hypotheses that share the same phoneme preceding the first phoneme of the word (that is, the same final phoneme of the preceding word) are handled together, so the number of word hypotheses can be reduced and the approximation has a large effect. The reduction is particularly large when the vocabulary size increases. Consequently, even when the continuous speech recognition apparatus is used to recognize spontaneous speech, in which inserted interjections, hesitations, and rephrasings occur frequently, the calculation cost required for merging or splitting word hypotheses is smaller than in the conventional examples. In other words, the amount of processing required for speech recognition is reduced. Hence the storage capacity required in the storage devices for speech recognition, such as the working memory (not shown) of the word matching unit 4, the buffer memory 5, and the working memory (not shown) of the word hypothesis narrowing unit 6, is reduced, and the processing time for speech recognition can be shortened because the amount of processing is reduced.
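The difference between the two grouping rules can be made concrete with a small Python sketch. It only counts how many representatives each rule would keep for one following word and one end time; the function name, the labels, and the assumption that the six preceding hypotheses are six distinct words are all introduced here for illustration.

```python
from typing import Callable, Hashable, List, Tuple

# (preceding word, its final phoneme); hypothetical labels modeled on FIG. 2.
Preceding = Tuple[str, str]


def count_representatives(preceding: List[Preceding],
                          key: Callable[[Preceding], Hashable]) -> int:
    """Number of hypotheses kept when one representative is retained per distinct key."""
    return len({key(p) for p in preceding})


preceding_words = [("Wa", "x"), ("Wb", "x"), ("Wc", "x"),
                   ("Wd", "y"), ("We", "y"), ("Wf", "y")]

# Word pair approximation: one representative per preceding word.
per_word = count_representatives(preceding_words, key=lambda p: p[0])

# Head phoneme environment: one representative per final phoneme of the preceding word.
per_phoneme = count_representatives(preceding_words, key=lambda p: p[1])

print(per_word, per_phoneme)
```

In this toy case the count is bounded by the number of distinct preceding words under the first rule and by the number of distinct final phonemes under the second, which suggests why the saving grows with vocabulary size while the phoneme inventory stays fixed.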

【００４８】[0048]

【発明の効果】以上詳述したように本発明に係る請求項
１記載の連続音声認識装置によれば、入力される発声音
声文の音声信号に基づいて上記発声音声文の単語仮説を
検出し音響尤度を計算することにより、連続的に音声認
識する音声認識手段を備えた連続音声認識装置におい
て、上記音声認識手段は、単語の各音素の時間方向の中
央部の音響尤度を、当該中央部よりも遅延された時刻に
移動するように遅延させて、単語仮説の音響尤度を補正
する。従って、第３の従来例に比較してより狭いビーム
幅で単語仮説の絞り込みを行うことができ、より小さい
計算コストで、すなわち音声認識のための処理時間を短
縮して、かつより高い認識率で自然発話の連続音声認識
を行うことができる。As described in detail above, in the continuous speech recognition apparatus according to claim 1 of the present invention, which comprises speech recognition means for continuously recognizing speech by detecting word hypotheses of an uttered speech sentence on the basis of a speech signal of the input uttered speech sentence and calculating acoustic likelihoods, the speech recognition means corrects the acoustic likelihood of a word hypothesis by delaying the acoustic likelihood at the central part, in the time direction, of each phoneme of the word so that it moves to a time later than the central part. Compared with the third conventional example, word hypotheses can therefore be narrowed down with a narrower beam width, and continuous speech recognition of spontaneous speech can be performed at a lower calculation cost, that is, with a shorter processing time for speech recognition, and with a higher recognition rate.

【００４９】また、請求項２記載の連続音声認識装置に
おいては、請求項１記載の連続音声認識装置において、
上記音声認識手段は、終了時刻が等しく開始時刻が異な
る同一の単語の単語仮説に対して、当該単語の先頭音素
環境毎に、発声開始時刻から当該単語の終了時刻に至る
計算された、音響尤度を含む総合尤度のうちの最も高い
総合尤度を有する１つの単語仮説で代表させるように単
語仮説の絞り込みを行う。従って、当該連続音声認識装
置を、間投詞の挿入や、言い淀み、言い直しが頻繁に生
じる自然発話の認識に用いた場合であっても、単語仮説
の併合又は分割に要する計算コストは従来例に比較して
小さくなる。すなわち、音声認識のために必要な処理量
が小さくなり、それ故、音声認識のための記憶装置にお
いて必要な記憶容量は小さくなる一方、処理量が小さく
なるので音声認識のための処理時間を短縮することがで
きる。Further, in the continuous speech recognition apparatus according to claim 2, which is the continuous speech recognition apparatus according to claim 1, the speech recognition means narrows down the word hypotheses so that, among word hypotheses of the same word that have the same end time but different start times, they are represented, for each head phoneme environment of the word, by the single word hypothesis having the highest total likelihood among the total likelihoods, including the acoustic likelihood, calculated from the utterance start time to the end time of the word. Consequently, even when the continuous speech recognition apparatus is used to recognize spontaneous speech, in which inserted interjections, hesitations, and rephrasings occur frequently, the calculation cost required for merging or splitting word hypotheses is smaller than in the conventional examples. In other words, the amount of processing required for speech recognition is reduced. Hence the storage capacity required in the storage devices for speech recognition is reduced, and the processing time for speech recognition can be shortened because the amount of processing is reduced.

[Brief description of the drawings]

【図１】本発明に係る一実施形態である連続音声認識
装置のブロック図である。FIG. 1 is a block diagram of a continuous speech recognition apparatus according to an embodiment of the present invention.

【図２】図１の連続音声認識装置における単語仮説絞
込部６の処理を示すタイミングチャートである。FIG. 2 is a timing chart showing a process of a word hypothesis narrowing section 6 in the continuous speech recognition device of FIG.

【図３】第３の従来例の連続音声認識装置と、図１の
本実施形態の連続音声認識装置とにおける音響尤度の関
係を示す図である。FIG. 3 is a diagram showing the relationship between the acoustic likelihoods in the continuous speech recognition apparatus of the third conventional example and in the continuous speech recognition apparatus of the present embodiment shown in FIG. 1.

【図４】図１の連続音声認識装置において、音素／ａ
／に対する尤度補正部７による補正前と補正後の音響尤
度の関係の一例であって、音響尤度の時間変化を示すグ
ラフである。FIG. 4 is a graph showing an example of the relationship between the acoustic likelihood before and after correction by the likelihood correction unit 7 for the phoneme /a/ in the continuous speech recognition apparatus of FIG. 1, plotting the change in acoustic likelihood over time.

【図５】図１の尤度補正部７において用いる、尤度に
対する遅延割合を求める第１の関数ｆ（ｘ）を示すグラ
フである。FIG. 5 is a graph showing a first function f (x) for calculating a delay ratio with respect to likelihood, which is used in the likelihood correction unit 7 of FIG. 1;

【図６】図１の尤度補正部７において用いる、尤度に
対する遅延割合を求める第２の関数Ｆ（Ｄ）を示すグラ
フである。FIG. 6 is a graph showing a second function F (D) for calculating a delay ratio with respect to likelihood, which is used in the likelihood correction unit 7 of FIG. 1;

【図７】図１の連続音声認識装置の実験結果であっ
て、ビーム幅に対する単語認識率を示すグラフである。FIG. 7 is a graph of experimental results for the continuous speech recognition apparatus of FIG. 1, showing the word recognition rate versus beam width.

【図８】図１の連続音声認識装置の実験結果であっ
て、ビーム幅に対するＣＰＵ計算時間（時間）を示すグ
ラフである。FIG. 8 is a graph of experimental results for the continuous speech recognition apparatus of FIG. 1, showing the CPU calculation time (in hours) versus beam width.

【符号の説明】１…マイクロホン、２…特徴抽出部、３，５…バッファメモリ、４…単語照合部、６…単語仮説絞込部、７…尤度補正部、１１…音素ＨＭＭメモリ、１２…単語辞書メモリ、１３…統計的言語モデル、１４…過小評価尤度メモリ。[Explanation of Reference Numerals] 1 ... microphone, 2 ... feature extraction unit, 3, 5 ... buffer memory, 4 ... word matching unit, 6 ... word hypothesis narrowing unit, 7 ... likelihood correction unit, 11 ... phoneme HMM memory, 12 ... word dictionary memory, 13 ... statistical language model, 14 ... underestimated likelihood memory.

───────────────────────────────────────────────────── フロントページの続き (72)発明者匂坂芳典京都府相楽郡精華町大字乾谷小字三平谷５番地株式会社エイ・ティ・アール音声翻訳通信研究所内 (56)参考文献特開平８−241094（ＪＰ，Ａ) 特開平５−341797（ＪＰ，Ａ) 特開平８−6588（ＪＰ，Ａ) 特開平８−123472（ＪＰ，Ａ) 特許2731133（ＪＰ，Ｂ２) 日本音響学会平成８年度秋季研究発表会講演論文集▲Ｉ▼ ３−３−６「Ｄｅｌａｙｅｄｄｅｃｉｓｉｏｎビーム探索の検討」ｐ．97−98（平成８年９月 25日発行) 日本音響学会平成７年度秋季研究発表会講演論文集▲Ｉ▼ ２−２−12「単語グラフを用いた連続音声認識法」ｐ．61 −62（平成７年９月28日国立国会図書館受入) 電子情報通信学会論文誌Ｖｏｌ．Ｊ 79−Ｄ−▲ＩＩ▼ Ｎｏ．12，Ｄｅｃｅｍｂｅｒ 1996，「大語い連続音声認識のための単語仮説数削減」，ｐ．2117− 2124，（平成８年12月25日発行) (58)調査した分野(Int.Cl.⁷，ＤＢ名) G10L 3/00 561 G10L 3/00 531 G10L 3/00 537 ＪＩＣＳＴファイル（ＪＯＩＳ)
Continuation of the front page
(72) Inventor: Yoshinori Sagisaka, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto
(56) References: JP-A-8-241094 (JP, A); JP-A-5-341797 (JP, A); JP-A-8-6588 (JP, A); JP-A-8-123472 (JP, A); Japanese Patent No. 2731133 (JP, B2); Proceedings of the Acoustical Society of Japan 1996 Autumn Meeting, Vol. I, 3-3-6, "A Study of Delayed Decision Beam Search", pp. 97-98 (published September 25, 1996); Proceedings of the Acoustical Society of Japan 1995 Autumn Meeting, Vol. I, 2-2-12, "Continuous Speech Recognition Method Using Word Graphs", pp. 61-62 (received by the National Diet Library on September 28, 1995); Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J79-D-II, No. 12, December 1996, "Reducing the Number of Word Hypotheses for Large-Vocabulary Continuous Speech Recognition", pp. 2117-2124 (published December 25, 1996)
(58) Fields searched (Int. Cl.⁷, DB name): G10L 3/00 561; G10L 3/00 531; G10L 3/00 537; JICST file (JOIS)

Claims

(57) [Claims]

1. A continuous speech recognition apparatus comprising speech recognition means for continuously recognizing speech by detecting word hypotheses of an uttered speech sentence on the basis of a speech signal of the input uttered speech sentence and calculating acoustic likelihoods, wherein the speech recognition means corrects the acoustic likelihood of a word hypothesis by delaying the acoustic likelihood at the central part, in the time direction, of each phoneme of the word so that it moves to a time later than the central part.

2. The continuous speech recognition apparatus according to claim 1, wherein the speech recognition means narrows down the word hypotheses so that, among word hypotheses of the same word that have the same end time but different start times, they are represented, for each head phoneme environment of the word, by the single word hypothesis having the highest total likelihood among the total likelihoods, including the acoustic likelihood, calculated from the utterance start time to the end time of the word.