WO2009139230A1 - Language model score look-ahead value imparting device, language model score look-ahead value imparting method, and program storage medium - Google Patents
- Publication number: WO2009139230A1 (PCT/JP2009/056324)
- Authority: WIPO (PCT)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/187—Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
Definitions
- The present invention relates to a speech recognition device that performs a frame-synchronized beam search using language model score look-ahead values, and to a language model score look-ahead value assigning device suitable for such a speech recognition device.
- In a high-performance speech recognition device, such as a large-vocabulary continuous speech recognition device, various hypotheses are predicted from three knowledge sources: an acoustic model, a word dictionary, and a language model.
- The acoustic similarity and linguistic similarity between these hypotheses and the unknown input speech are calculated as an acoustic model score and a language model score, and the most probable hypothesis is output as the recognition result.
- The acoustic model score and the language model score at each time are judged comprehensively, and unpromising hypotheses with poor scores are pruned, that is, excluded from subsequent hypothesis expansion; this technique is called the frame-synchronized beam search method (hereinafter simply referred to as the beam search method).
- An example of this type of speech recognition apparatus is shown in FIG. 6.
- a speech waveform that is a speech recognition target is input by the speech input unit 301 and transmitted to the acoustic analysis unit 302.
- the acoustic analysis unit 302 calculates an acoustic feature amount for each frame and outputs the acoustic feature amount to the distance calculation unit 303.
- the distance calculation unit 303 calculates the distance between the input acoustic feature quantity and each model in the acoustic model 304, and outputs an acoustic model score corresponding to the distance to the search unit 305.
- The search means 305 obtains, for every hypothesis under search, a cumulative score by adding the acoustic model score to the language model score given by the language model 402 and obtained through the language model score look-ahead value assigning device 308, and prunes hypotheses whose cumulative score is poor. Processing continues for the remaining hypotheses, and the best recognition result is output by the recognition result output means 309.
- The word dictionary 403 in this example is a tree structure dictionary, to which the language model score of each word given by the language model 402 is attached.
- the word “handshake” (reading: “akusyu”) has a phoneme sequence “a-k-u-sy-u” and a language model score of 80.
- the word “red” (reading: “akai”) has a phoneme string “a-k-a-i” and a language model score of 50.
- The smaller the language model score value, the better the score.
- Until a hypothesis reaches the end of a word, the identity of the word is not yet determined, so the language model score cannot be added to the cumulative score. If the language model score were added to the cumulative score only at the end of each word, the score would fluctuate greatly around the transitions between words depending on the hypothesis. To keep a correct hypothesis whose score fluctuates greatly from being pruned, the beam width would have to be widened, so an efficient beam search could not be performed.
- To address this, the language model score look-ahead value assigning device 308 is provided with best language model score acquisition means 401, which obtains the best value of the language model scores of the words below each branch of the tree structure dictionary and uses it as an optimistic language model score within that branch.
- Using the word dictionary 403 and the language model 402, the best language model score acquisition means 401 computes, as shown in equation (1), the look-ahead value πh(s) of a phoneme s for a hypothesis with word history h as the best value of the language model score -log p(w|h) over the words w belonging to the set W(s) of words reachable from the phoneme s in the dictionary.
- In the figure, the numerical value to the right of each terminal phoneme represents the language model score of the corresponding word, and the numerical value on each branch represents the language model score look-ahead difference value carried by that branch.
- In this way, the language model score of 50 can be added to the cumulative score already when the root of the tree structure is connected to the preceding hypothesis; compared with adding the language model score to the cumulative score only at the word end, a more efficient beam search can be performed.
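The conventional best-score look-ahead of equations (1) and (2) can be sketched as follows, using the "handshake" (a-k-u-sy-u, score 80) and "red" (a-k-a-i, score 50) example above; the tree is represented simply as the set of phoneme prefixes:

```python
# Sketch of the conventional best-score look-ahead (equations (1), (2)):
# each prefix of the tree dictionary receives the best (minimum) score of
# the words below it, and each branch carries the difference from its parent.
WORDS = {
    ("a", "k", "u", "sy", "u"): 80.0,   # "handshake" (akusyu)
    ("a", "k", "a", "i"): 50.0,         # "red" (akai)
}

def lookahead(words):
    """pi[prefix]: best word score reachable through that prefix (eq. (1));
    delta[prefix]: difference from the parent prefix (eq. (2))."""
    pi = {}
    for phonemes, score in words.items():
        for i in range(1, len(phonemes) + 1):
            p = phonemes[:i]
            pi[p] = min(score, pi.get(p, float("inf")))
    delta = {}
    for p, s in pi.items():
        parent = pi[p[:-1]] if len(p) > 1 else 0.0
        delta[p] = s - parent
    return pi, delta

pi, delta = lookahead(WORDS)
```

As in the text, the branch to "a" receives the full 50 as its difference value, the shared "k" branch receives 0, and the branch toward "u" (where "handshake" separates from "red") receives the remaining 30.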
- Non-patent document 1 describes the best language model score acquisition means 401 as described above.
- Non-Patent Document 1 describes two methods of prefetching a unigram language model score and prefetching a bigram language model score.
- In look-ahead with a unigram language model score, the unigram score is used as the look-ahead difference value; when a hypothesis reaches a word end of the tree structure dictionary and the word is confirmed, the unigram language model score used until then is discarded and the confirmed bigram language model score is added instead. This processing performed on reaching the word end is called word-end processing.
- Look-ahead with a bigram language model score is a method that uses the bigram language model score from the look-ahead stage.
- the search unit 305 shown in FIG. 6 includes a word end processing unit 307 in addition to the main search unit 306 that performs an original search, and corresponds to an example in which a prefetch method of a unigram language model score is used.
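The score correction performed by the word-end processing described above can be sketched as follows; the concrete score values are hypothetical:

```python
# Sketch of word-end processing: during look-ahead the cumulative score
# contains the unigram score of the finished word; at the word end this
# unigram contribution is discarded and replaced by the confirmed bigram
# language model score. All score values here are hypothetical.

def word_end_process(cumulative, unigram_score, bigram_score):
    """Remove the unigram look-ahead contribution and add the confirmed
    bigram language model score."""
    return cumulative - unigram_score + bigram_score

score = word_end_process(cumulative=120.0, unigram_score=50.0,
                         bigram_score=42.0)
```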
- As described above, in the conventional method, the best value of the language model scores of the words below each branch of the tree structure dictionary is used as the language model score within that branch. Consequently, when the language model scores of all the words sharing a prefix are poor, most of that poor language model score is added at an early point in time, and even a correct hypothesis may be pruned early.
- FIG. 9 is an example of the word dictionary and the language model score look-ahead values when a tree structure dictionary is used for the word dictionary 403, where "candy" (reading: "okasi") is a recognition target word and no other recognition target word starts with the phoneme chain "o"-"k".
- In this case, the best language model score acquisition means 401 assigns the look-ahead value "50" to the branch connected to "o"; since no phoneme after "k" has any branching, the look-ahead value "90" (difference value "40") is assigned to the branch connected to "k".
- In many speech recognition devices, a triphone, i.e., a three-phoneme unit that takes the left and right context into account, is used as the recognition unit. For the hypothesis of "candy", therefore, already at the initial phoneme "o" of the phoneme sequence "okasi" the right-context phoneme "k" is taken into account, and the entire poor language model score of "90" is added. Even if examining the acoustic match after "k" would improve the acoustic model score and make "candy" the correct hypothesis, the large look-ahead value added at this early stage makes pruning likely, resulting in recognition errors.
- FIG. 10 is an example of the word dictionary and the look-ahead values when a tree structure dictionary is used for the word dictionary 403, where "cancel" (reading: "kyanseru") is a recognition target word with a poor language model score of "100", and several recognition target words start with the phoneme chain "ky"-"a". In this case, the best language model score acquisition means 401 assigns the look-ahead value "50" to the branch connected to "ky" and the look-ahead value "100" (difference value "50") to the branch connected to "a".
- FIG. 11 is an example of the word dictionary and the language model score look-ahead values when the recognition target words include "belt" (reading: "beruto") and a linear dictionary is used as the word dictionary 403.
- With a linear dictionary, it is possible to give the language model score of each word as its look-ahead value from the first phoneme of every word.
- Since the language model score of "belt" is as poor as "100", the entire language model score is added to the cumulative score as soon as the head of the word is connected to the preceding hypothesis, so the hypothesis is easily pruned.
- One way to prevent the correct hypothesis from being pruned is to increase the beam width. However, if the beam width is widened, another problem arises that the number of hypotheses increases and the amount of calculation increases.
- An object of the present invention is to provide a language model score look-ahead value assigning apparatus and method, and a program recording medium, that can prevent pruning of correct hypotheses while suppressing an increase in the number of hypotheses.
- The language model score look-ahead value assigning device of the present invention includes a word dictionary that defines the phoneme string of each word, a language model that gives a score for the likelihood of word appearance, and smoothed language model score look-ahead value calculation means that, from the phoneme string defined for a word in the word dictionary and the score defined in the language model, obtains a look-ahead value for each phoneme in the word such that the look-ahead values do not concentrate at the beginning of the word.
- pruning of correct hypotheses can be prevented while suppressing an increase in the number of hypotheses.
- The reason is that the look-ahead value for each phoneme in a word is determined so that the look-ahead values do not concentrate at the beginning of the word.
- Referring to FIG. 1, the speech recognition apparatus includes voice input means 101, acoustic analysis means 102, distance calculation means 103, an acoustic model 104, search means 105, a language model score look-ahead value assigning device 108, and recognition result output means 109.
- the search means 105 includes a main search means 106 and a word end processing means 107.
- the language model score prefetching value assigning device 108 includes a smoothed language model score prefetching value calculation unit 201, a language model 202, and a word dictionary 203. Each of these has the following functions.
- the acoustic model 104 is a model that gives an acoustic feature amount to a phoneme or a phoneme string.
- the word dictionary 203 is a dictionary that defines a phoneme string of words, and this embodiment uses a tree structure dictionary.
- the tree structure dictionary is a dictionary in which correspondence between a word and its phoneme sequence is recorded, and is a tree formed by sharing a common head phoneme sequence between words.
- the language model 202 is a model that gives a score of the likelihood of appearance to a word or a word string. In the present embodiment, the language model 202 includes a unigram language model and a bigram language model.
- The smoothed language model score look-ahead value calculation means 201 obtains, from the phoneme string defined for a word in the word dictionary 203 and the language model score defined in the language model 202 (in the present embodiment, the unigram language model score), a look-ahead value for each phoneme in the word such that the look-ahead values do not concentrate at the beginning of the word. Specifically, by determining the look-ahead value of each phoneme based on the position at which that phoneme appears within the word, the look-ahead value at the first phoneme of a word, or at a phoneme close to it, is kept from becoming approximately equal to the language model score of the whole word.
- the voice input means 101 is a means for inputting a voice waveform that is a voice recognition target.
- the acoustic analysis unit 102 is a unit that calculates an acoustic feature amount from the input speech waveform for each frame.
- the distance calculation means 103 is a means for calculating an acoustic model score corresponding to the distance between the acoustic feature quantity of the input speech waveform and the acoustic model for each frame.
- The search means 105 searches among the candidate word strings (hypotheses) obtained by combining words in the word dictionary 203 for the word string with the largest cumulative score, combining the probability that the pronunciation of each word outputs the input speech waveform, calculated with the acoustic model 104 as the acoustic model score, and the probability of the word chain, calculated with the language model 202 as the language model score, and outputs that word string.
- the search means 105 includes a word end processing means 107 that performs word end processing and a main search means 106 that performs other search processing.
- the recognition result output unit 109 is a unit that outputs the recognition result output from the search unit 105.
- step S1 a speech waveform is input using the speech input means 101.
- step S2 the acoustic analysis means 102 receives the speech waveform as an input, calculates an acoustic feature quantity such as a cepstrum, and outputs it.
- step S3 the distance calculation means 103 receives the acoustic feature value as input, calculates a distance from each model of the acoustic model 104, and outputs an acoustic model score.
- step S4 the smoothed language model score prefetch value calculation means 201 calculates a language model score prefetch value in all hypotheses to be searched.
- In step S5, the main search means 106 adds the acoustic model score and the language model score look-ahead value to the cumulative score of each hypothesis, updating the cumulative score.
- In step S6, it is determined whether a hypothesis is at the end of a word. If so, in step S7 the word-end processing means 107 corrects the cumulative score by replacing the unigram language model score added as the look-ahead value with the bigram language model score obtained from the language model 202.
- In step S8, hypotheses with poor cumulative scores are pruned, for example by discarding hypotheses whose likelihood falls below a threshold, or by keeping a fixed number of top hypotheses and discarding the rest.
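The two pruning methods of step S8 can be sketched as follows; the hypothesis names and scores are hypothetical, and scores are treated as cumulative negative log probabilities, so smaller is better:

```python
# Sketch of the two pruning methods in step S8 (hypothetical hypotheses;
# scores are cumulative -log probabilities, so smaller is better).

def prune_by_threshold(hyps, beam_width):
    """Discard hypotheses whose score falls beam_width behind the best."""
    best = min(hyps.values())
    return {h: s for h, s in hyps.items() if s <= best + beam_width}

def prune_top_n(hyps, n):
    """Keep only the n best-scoring hypotheses and discard the rest."""
    kept = sorted(hyps.items(), key=lambda kv: kv[1])[:n]
    return dict(kept)

hyps = {"h1": 10.0, "h2": 55.0, "h3": 18.0, "h4": 30.0}
a = prune_by_threshold(hyps, beam_width=15.0)
b = prune_top_n(hyps, n=2)
```

Widening `beam_width` (or raising `n`) keeps more hypotheses alive, which is exactly the trade-off the text discusses: fewer pruning errors but more computation.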
- step S9 it is determined whether or not the voice input is completed. If the input is still continued, the process returns to step S1 and the same process is performed for the new input. When the input is completed, the process proceeds to step S10.
- In step S10, the recognition result output means 109 receives the result from the search means 105 and outputs the best recognition result. Not only the best recognition result but also the top several recognition results may be output.
- pruning of the correct answer hypothesis can be prevented, and thereby the recognition error rate can be reduced.
- The reason is that, since the look-ahead values are not concentrated at the beginning of the word, early pruning of correct hypotheses caused by such concentration is prevented.
- the increase in the number of hypotheses can be suppressed as compared with the case where the beam width is increased.
- The reason is that the only additional computation caused by not concentrating the look-ahead values at the beginning of the word is that of the hypotheses that would otherwise have been pruned because of such concentration, which is very small. In contrast, widening the beam width leaves hypotheses with poor acoustic model scores and hypotheses of words whose word-end scores are poor in the search space without pruning them, so the increase in computation is larger.
- the smoothed language model score prefetching value calculation unit 201 obtains a language model score prefetching value for each phoneme in a word based on the number of phonemes from the beginning of the word to the phoneme.
- Specifically, the smoothed language model score look-ahead value is defined and calculated as shown in equations (3) and (4).
- π'h(s)=min w∈W(s){-log p(w|h)} …(3)
- πh(s)=π'h(s) if π'h(s)<=T(d(s)) or s∈E; =T(d(s)) otherwise …(4)
- That is, a threshold T(n) is set for the n-th phoneme from the beginning of the word. If the phoneme s is the d(s)-th phoneme from the beginning and π'h(s) exceeds T(d(s)), the look-ahead value is added only up to the threshold value T(d(s)).
- the threshold is determined so that the smaller n is, the smaller T (n) is. As a result, it is possible to avoid concentrating the language model score look-ahead values at the beginning of the word.
- E is a set of final phonemes of words.
- FIG. 3 shows a specific example of the language model score look-ahead value when this embodiment is operated using a tree structure dictionary.
- the threshold T (d) of the language model look-ahead value is determined for each phoneme number from the beginning of the word.
- the thresholds are determined as “45”, “70”, “90”, and “100” in order from the first phoneme to the fourth phoneme.
- The threshold T(d) may be determined in advance and stored in the smoothed language model score look-ahead value calculation means 201, the word dictionary 203, or the language model 202, or it may be determined by the smoothed language model score look-ahead value calculation means 201 itself.
- When the best language model score below the first phoneme exceeds the first-phoneme threshold, the look-ahead difference value given to the branch connected to the first phoneme is set to the first-phoneme threshold, and the excess over the threshold is carried over to the branch connected to the next phoneme.
- In the example of FIG. 3, the look-ahead difference value of the branch connected to the first phoneme "a" is set to the first-phoneme threshold "45", and the excess "5" is carried over to the branch connected to the next phoneme.
- For a word-end phoneme, the look-ahead difference value is set so that the best language model score becomes the look-ahead value even if it exceeds the threshold for that phoneme position.
- Likewise, the second-phoneme threshold "70" applies at the second phoneme "k". The value "25", obtained by subtracting the look-ahead value "45" already added up to the first phoneme "a" from this threshold, is given to the branch connected to "k" as its look-ahead difference value, and the excess "20" is carried over to the branch connected to the next phoneme. As a result, no excessive look-ahead value exceeding the threshold is added.
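The threshold-clipping computation of Embodiment 1 (equations (3) and (4)) can be sketched as follows, using the FIG. 3 thresholds; the raw look-ahead values of the example word are hypothetical, and word-final phonemes receive the full raw value per the s∈E case of equation (4):

```python
# Sketch of Embodiment 1 (equations (3), (4)): the raw best-score
# look-ahead value at each phoneme position is clipped to a per-position
# threshold T(d); word-final phonemes always keep the full value.
# Thresholds follow FIG. 3; the raw values are hypothetical.

T = {1: 45.0, 2: 70.0, 3: 90.0, 4: 100.0}  # threshold T(d) per position d

def smoothed_lookahead(pi_raw, is_word_end):
    """pi_raw: raw look-ahead value per position (eq. (3));
    returns values clipped per eq. (4) and the per-branch differences."""
    pi = []
    for d, raw in enumerate(pi_raw, start=1):
        if is_word_end[d - 1] or raw <= T[d]:
            pi.append(raw)          # within threshold, or final phoneme
        else:
            pi.append(T[d])         # clip; the excess carries forward
    prev, deltas = 0.0, []
    for v in pi:
        deltas.append(v - prev)     # difference value given to each branch
        prev = v
    return pi, deltas

# A four-phoneme word whose raw look-ahead is 90 from the first phoneme on.
pi, deltas = smoothed_lookahead([90.0, 90.0, 90.0, 90.0],
                                [False, False, False, True])
```

Computing the difference values from the clipped cumulative values is equivalent to the carry-over description in the text: the first branch gets "45" and the excess is pushed onto later branches.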
- The smoothed language model score look-ahead value calculation means 201 of the second embodiment obtains the look-ahead value for each phoneme in a word based on the number of phonemes of the words reachable from that phoneme. Specifically, the smoothed language model score look-ahead value is defined and calculated as shown in equations (5) and (6).
- δh(s)=min w∈W(s) [ {-log p(w|h)-πh(s~)} / {N(w)-d(s)+1} ] …(5)
- πh(s)=πh(s~)+δh(s) …(6)
- N(w) is the number of phonemes of word w.
- d (s) represents that the phoneme s is the d (s) th phoneme from the beginning, as in the first embodiment.
- In other words, the remaining language model score divided by the number of remaining phonemes of the word is used as the look-ahead difference value. The numerator of equation (5) is the language model score minus the look-ahead value already added up to the preceding phoneme s~, and the denominator is the number of phonemes of word w from the phoneme s onward. The language model score is thus divided equally over the phonemes, and the minimum over the words w reachable from s is given to the branch connected to s as the look-ahead difference value δh(s).
- The look-ahead value πh(s) is then obtained by adding this difference value to the look-ahead value of the preceding phoneme s~ according to equation (6).
- FIG. 4 shows a specific example of the language model score look-ahead value when this embodiment is operated using the tree structure dictionary.
- For example, at the phoneme "s", the language model score not yet added after "a" is "80", i.e., the word's language model score "90" minus the look-ahead difference value "10" given to the branch up to the phoneme "a", and the number of phonemes from "s" onward is four. Dividing equally therefore gives "20" per branch. Repeating this for the following phonemes determines the look-ahead values.
- In the second embodiment, the language model score is thus distributed from the beginning to the end of the word, so the look-ahead values are smoothed and no excessive look-ahead value is added at the beginning of the word.
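The equal-division computation of Embodiment 2 (equations (5) and (6)) can be sketched over a small tree dictionary; the word set and scores below are hypothetical stand-ins chosen so that the "s" branch reproduces the "80 over 4 = 20" example from the text:

```python
# Sketch of Embodiment 2 (equations (5), (6)): the remaining score of each
# word is divided equally over its remaining phonemes, and each branch takes
# the minimum over the words reachable through it. Words/scores hypothetical.

WORDS = {
    ("a", "s", "o", "b", "i"): 90.0,
    ("a", "r", "i"): 30.0,
}

def equal_division_lookahead(words):
    """Returns pi[prefix] built incrementally: delta per eq. (5),
    pi(s) = pi(parent) + delta(s) per eq. (6)."""
    prefixes = set()
    for ph in words:
        for i in range(1, len(ph) + 1):
            prefixes.add(ph[:i])
    pi = {(): 0.0}
    # process prefixes shortest-first so each parent value already exists
    for p in sorted(prefixes, key=len):
        d = len(p)
        parent = pi[p[:-1]]
        delta = min((score - parent) / (len(ph) - d + 1)
                    for ph, score in words.items() if ph[:d] == p)
        pi[p] = parent + delta
    return pi

pi = equal_division_lookahead(WORDS)
```

Here the branch to "a" gets the minimum per-phoneme share (10, from the shorter word), and on the "s" branch the remaining 90 - 10 = 80 is spread over the four remaining phonemes, i.e., 20 per branch, as in the worked example.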
- The smoothed language model score look-ahead value calculation means 201 of the third embodiment obtains the look-ahead value for each phoneme in a word based on the number of phonemes in the branch-free phoneme string that contains the phoneme. Specifically, the smoothed language model score look-ahead value is defined and calculated as shown in equation (7).
- B is a set of branched phonemes in the tree structure dictionary.
- m(s) is the difference between the position (the number of phonemes from the beginning of the word) of the first branching phoneme in the tree structure appearing after s and the position of the preceding phoneme s~. If there is no branching after s, the difference between the position of the word-end phoneme and the position of the preceding phoneme s~ is used instead.
- The best value πh(s) of the language model score is obtained by equation (1), as in the conventional method. However, instead of using the difference of the best values as it is, the look-ahead value is smoothed by dividing the difference equally by m(s), the number of branches in the branch-free stretch.
- FIG. 5 shows a specific example of the language model score look-ahead value when this embodiment is operated using a tree structure dictionary.
- the branch “a-s” connecting “a” and “s” has a language model score look-ahead difference value of “40”.
- The phonemes "s", "o", and "b" each have only one outgoing branch, so there is no branching after "s"; the look-ahead difference value given to the branch "a-s" is therefore distributed over these branches as well. Since there is no branching after the phoneme "s", the position of the word-end phoneme is used: the word-end phoneme "i" is the fifth phoneme from the beginning and s~ is the first phoneme "a", so m(s) is the difference "4". The look-ahead difference value "40" of the branch "a-s" is thus distributed equally over the four branches "a-s", "s-o", "o-b", and "b-i", and "10" is assigned to each branch.
- The look-ahead difference values for "ku-sy-u" and "ari" are distributed in the same way, which smooths the look-ahead values so that no excessive look-ahead value is added at the beginning of a word.
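The branch-free distribution of Embodiment 3 (equation (7)) reduces, within one branch-free stretch, to dividing the entering difference value by m(s); a minimal sketch for the "a-s" example above:

```python
# Sketch of Embodiment 3 (equation (7)): the best-score look-ahead
# difference entering a branch-free phoneme string is divided equally over
# the m(s) branches of that string, as in the FIG. 5 example.

def spread_over_segment(delta_at_entry, m):
    """Divide the difference value entering a branch-free segment of m
    branches equally over those m branches."""
    return [delta_at_entry / m] * m

# "a-s" enters a branch-free run of 4 branches (a-s, s-o, o-b, b-i)
# carrying a difference value of 40, so each branch receives 10.
deltas = spread_over_segment(40.0, 4)
```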
- In the above embodiments, look-ahead is performed with a unigram language model, which is replaced with a bigram language model by word-end processing; alternatively, look-ahead may be performed with a bigram language model and replaced with a trigram language model by word-end processing. In this way, the language model used for look-ahead and the language model substituted by word-end processing can be variously changed.
- An embodiment that uses the bigram or trigram language model from the look-ahead stage, without performing word-end processing, is also conceivable.
- Although Embodiments 1 to 3 show examples in which the word dictionary 203 is a tree structure dictionary, a similar method can be used even when the word dictionary 203 is a linear dictionary.
- In the above embodiments, the smoothed language model score look-ahead value is calculated each time. Alternatively, the smoothed look-ahead values may be calculated in advance and stored in, for example, the word dictionary 203 or the language model 202; in that case, the smoothed language model score look-ahead value calculation means 201 searches the stored values and acquires the corresponding value in the course of the search.
- The smoothed language model score look-ahead value calculation means 201, the voice input means 101, the acoustic analysis means 102, the distance calculation means 103, the search means 105, and the recognition result output means 109 can of course be realized by hardware, but can also be realized by a computer and a program. Such a program is provided recorded on a computer-readable recording medium such as a magnetic disk or a semiconductor memory, is read by the computer at start-up or the like, and, by controlling the operation of the computer, causes it to function as the above-described means and to perform the processing described above.
- the present invention can be applied to voice recognition systems in general, such as automatic interpretation using voice recognition, information retrieval, and a voice dialogue system.
Description
πh(s)=min w∈W(s){-log p(w|h)} …(1)
δh(s)=πh(s)-πh(s~) …(2)
102 acoustic analysis means
103 distance calculation means
104 acoustic model
105 search means
106 main search means
107 word-end processing means
108 language model score look-ahead value assigning device
109 recognition result output means
201 smoothed language model score look-ahead value calculation means
202 language model
203 word dictionary
301 voice input means
302 acoustic analysis means
303 distance calculation means
304 acoustic model
305 search means
306 main search means
307 word-end processing means
308 language model score look-ahead value assigning device
309 recognition result output means
401 best language model score acquisition means
402 language model
403 word dictionary
Referring to FIG. 1, the speech recognition apparatus according to the first embodiment of the present invention comprises voice input means 101, acoustic analysis means 102, distance calculation means 103, an acoustic model 104, search means 105, a language model score look-ahead value assigning device 108, and recognition result output means 109. The search means 105 comprises main search means 106 and word-end processing means 107. Further, the language model score look-ahead value assigning device 108 comprises smoothed language model score look-ahead value calculation means 201, a language model 202, and a word dictionary 203. These have the following functions.
The smoothed language model score look-ahead value calculation means 201 of the first embodiment obtains the look-ahead value at each phoneme in a word based on the number of phonemes from the beginning of the word to that phoneme. Specifically, the smoothed language model score look-ahead value is defined and calculated as in equations (3) and (4).
π'h(s)=min w∈W(s){-log p(w|h)} …(3)
πh(s)=π'h(s) if π'h(s)<=T(d(s)) or s∈E
=T(d(s)) otherwise …(4)
The smoothed language model score look-ahead value calculation means 201 of the second embodiment obtains the look-ahead value at each phoneme in a word based on the number of phonemes of the words traceable from that phoneme. Specifically, the smoothed language model score look-ahead value is defined and calculated as in equations (5) and (6).
δh(s)=min w∈W(s) [ {-log p(w|h)-πh(s~)} / {N(w)-d(s)+1}] …(5)
πh(s)=πh(s~)+δh(s) …(6)
The smoothed language model score look-ahead value calculation means 201 of the third embodiment obtains the look-ahead value at each phoneme in a word based on the number of phonemes of the branch-free phoneme string containing that phoneme. Specifically, the smoothed language model score look-ahead value is defined and calculated as in equation (7).
δh(s)={πh(s)-πh(s~)}/m(s) if s~∈B
=δh(s~) otherwise …(7)
In the above embodiments, look-ahead is performed with a unigram language model and replaced with a bigram language model by word-end processing; however, the language model used for look-ahead and the language model substituted by word-end processing can be variously changed, for example performing look-ahead with a bigram language model and replacing it with a trigram language model by word-end processing. An embodiment that uses a bigram or trigram language model from the look-ahead stage, without performing word-end processing, is also conceivable.
Claims (27)
- 単語の音素列を定義する単語辞書と、単語の出現し易さのスコアを与える言語モデルと、単語について前記単語辞書で定義された音素列と前記言語モデルで定義されたスコアとから、言語モデルスコア先読み値が単語の語頭に集中しないように単語中の各音素での言語モデルスコア先読み値を求める平滑化言語モデルスコア先読み値計算手段とを備えることを特徴とする言語モデルスコア先読み値付与装置。
- 前記平滑化言語モデルスコア先読み値計算手段は、単語中の各音素での言語モデルスコア先読み値を、当該音素の当該単語中の出現順序に基づいて求めることを特徴とする請求項1に記載の言語モデルスコア先読み値付与装置。
- 前記平滑化言語モデルスコア先読み値計算手段は、単語中の各音素での言語モデルスコア先読み値を、単語先頭から当該音素までの音素数に基づいて求めることを特徴とする請求項2に記載の言語モデルスコア先読み値付与装置。
- 前記平滑化言語モデルスコア先読み値計算手段は、単語先頭から音素までの音素数に基づいて設定された言語モデルスコア先読み値の閾値以内の言語モデルスコア先読み値を求めることを特徴とする請求項3に記載の言語モデルスコア先読み値付与装置。
- 前記平滑化言語モデルスコア先読み値計算手段は、単語中の各音素での言語モデルスコア先読み値を、当該音素から辿れる単語の音素数に基づいて求めることを特徴とする請求項2に記載の言語モデルスコア先読み値付与装置。
- 前記平滑化言語モデルスコア先読み値計算手段は、当該音素から辿れる単語の音素数に基づいて、言語モデルスコア先読み差分値が当該音素から辿れる音素に等分されるように言語モデルスコア先読み値を求めることを特徴とする請求項5に記載の言語モデルスコア先読み値付与装置。
- 前記平滑化言語モデルスコア先読み値計算手段は、単語中の各音素での言語モデルスコア先読み値を、当該音素が含まれる枝分かれを持たない音素列の音素数に基づいて求めることを特徴とする請求項2に記載の言語モデルスコア先読み値付与装置。
- 前記平滑化言語モデルスコア先読み値計算手段は、当該音素が含まれる枝分かれを持たない音素列の音素数に基づいて、言語モデルスコア先読み差分値が枝分かれを持たない音素に等分されるように言語モデル先読み値を求めることを特徴とする請求項7に記載の言語モデルスコア先読み値付与装置。
- 言語モデルスコア先読み値を使ってフレーム同期ビームサーチを行う音声認識装置において、請求項1乃至8の何れか1項に記載の言語モデルスコア先読み値付与装置を備えたことを特徴とする音声認識装置。
- 単語について単語辞書で定義された音素列と言語モデルで定義されたスコアとから、言語モデルスコア先読み値が単語の語頭に集中しないように単語中の各音素での言語モデルスコア先読み値を求めることを特徴とする言語モデルスコア先読み値付与方法。
- 単語中の各音素での言語モデルスコア先読み値を、当該音素の当該単語中の出現順序に基づいて求めることを特徴とする請求項10に記載の言語モデルスコア先読み値付与方法。
- 単語中の各音素での言語モデルスコア先読み値を、単語先頭から当該音素までの音素数に基づいて求めることを特徴とする請求項11に記載の言語モデルスコア先読み値付与方法。
- 単語先頭から音素までの音素数に基づいて設定された言語モデルスコア先読み値の閾値以内の言語モデルスコア先読み値を求めることを特徴とする請求項12に記載の言語モデルスコア先読み値付与方法。
- 単語中の各音素での言語モデルスコア先読み値を、当該音素から辿れる単語の音素数に基づいて求めることを特徴とする請求項11に記載の言語モデルスコア先読み値付与方法。
- 当該音素から辿れる単語の音素数に基づいて、言語モデルスコア先読み差分値が当該音素から辿れる音素に等分されるように言語モデルスコア先読み値を求めることを特徴とする請求項14に記載の言語モデルスコア先読み値付与方法。
- 単語中の各音素での言語モデルスコア先読み値を、当該音素が含まれる枝分かれを持たない音素列の音素数に基づいて求めることを特徴とする請求項11に記載の言語モデルスコア先読み値付与方法。
- 当該音素が含まれる枝分かれを持たない音素列の音素数に基づいて、言語モデルスコア先読み差分値が枝分かれを持たない音素に等分されるように言語モデル先読み値を求めることを特徴とする請求項16に記載の言語モデルスコア先読み値付与方法。
- 請求項10乃至17の何れか1項に記載の言語モデルスコア先読み値付与方法により求められる言語モデルスコア先読み値を使ってフレーム同期ビームサーチを行う音声認識方法。
- A program recording medium on which a language model score look-ahead value imparting program is recorded so as to be readable by a computer, the program causing a computer provided with storage means that stores a word dictionary defining the phoneme sequences of words and a language model giving scores for the likelihood of occurrence of words to execute a step of obtaining, from the phoneme sequence defined for a word in the word dictionary and the score defined by the language model, the language model score look-ahead value at each phoneme in the word such that the language model score look-ahead values do not concentrate at the beginning of the word.
- The program recording medium according to claim 19, wherein in said step the language model score look-ahead value at each phoneme in a word is obtained based on the position at which that phoneme appears in the word.
- The program recording medium according to claim 20, wherein in said step the language model score look-ahead value at each phoneme in a word is obtained based on the number of phonemes from the beginning of the word to that phoneme.
- The program recording medium according to claim 21, wherein in said step a language model score look-ahead value is obtained that is kept within a threshold set based on the number of phonemes from the beginning of the word to the phoneme.
- The program recording medium according to claim 20, wherein in said step the language model score look-ahead value at each phoneme in a word is obtained based on the number of phonemes of the word that can be traced from that phoneme.
- The program recording medium according to claim 23, wherein in said step the language model score look-ahead value is obtained such that the language model score look-ahead difference value is divided equally among the phonemes that can be traced from that phoneme, based on the number of phonemes of the word that can be traced from that phoneme.
- The program recording medium according to claim 20, wherein in said step the language model score look-ahead value at each phoneme in a word is obtained based on the number of phonemes in the unbranched phoneme sequence that contains that phoneme.
- The program recording medium according to claim 25, wherein in said step the language model look-ahead value is obtained such that the language model score look-ahead difference value is divided equally among the phonemes of the unbranched sequence, based on the number of phonemes in the unbranched phoneme sequence that contains that phoneme.
- A program recording medium on which a speech recognition program is recorded so as to be readable by a computer, the program causing the computer to execute a speech recognition step of performing a frame-synchronous beam search using the language model score look-ahead values obtained by the language model score look-ahead value imparting program recorded on the program recording medium according to any one of claims 19 to 26.
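The equal-division smoothing recited in the claims above can be illustrated with a small sketch. The following Python toy is not the patented implementation: the lexicon, probabilities, and function names are invented for illustration. It first computes a conventional look-ahead (at each phoneme, the best language model log-score over the words still reachable), then redistributes each look-ahead difference equally over the phonemes reachable from the point where it arises, so that the score drop no longer concentrates at the head of the word:

```python
import math

# Toy lexicon (word -> phoneme sequence) and unigram LM probabilities.
# All names and numbers here are invented for illustration.
LEXICON = {
    "sweet": ["s", "w", "i", "t"],
    "swim":  ["s", "w", "i", "m"],
    "sun":   ["s", "a", "n"],
}
LM = {"sweet": 0.2, "swim": 0.1, "sun": 0.7}


def conventional_lookahead(word):
    """Best (maximum) LM log-score over all words still reachable after
    each phoneme of `word` -- the classic look-ahead, which applies the
    whole score drop at the phoneme where words diverge."""
    ph = LEXICON[word]
    values = []
    for i in range(1, len(ph) + 1):
        reachable = [w for w, p in LEXICON.items() if p[:i] == ph[:i]]
        values.append(max(math.log(LM[w]) for w in reachable))
    return values


def smoothed_lookahead(word):
    """Redistribute each look-ahead difference value equally over the
    phonemes reachable from the point where it arises."""
    conv = conventional_lookahead(word)
    n = len(conv)
    root = max(math.log(p) for p in LM.values())  # look-ahead before phoneme 1
    increments = [0.0] * n
    prev = root
    for i, v in enumerate(conv):
        diff = v - prev             # score drop introduced at phoneme i
        if diff:
            share = diff / (n - i)  # split equally over phonemes i .. n-1
            for j in range(i, n):
                increments[j] += share
        prev = v
    values, acc = [], root
    for inc in increments:
        acc += inc
        values.append(acc)
    return values
```

For the toy word "sweet", the conventional look-ahead drops from log 0.7 to log 0.2 at the second phoneme, where it diverges from "sun"; the smoothed version spreads that drop over the remaining three phonemes. Both end at the same word score, so only the early, pruning-sensitive part of the word changes, not the total language model score.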
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/992,760 US8682668B2 (en) | 2008-05-16 | 2009-03-27 | Language model score look-ahead value imparting device, language model score look-ahead value imparting method, and program storage medium |
JP2010511918A JP5447373B2 (ja) | 2008-05-16 | 2009-03-27 | Language model score look-ahead value imparting device, method therefor, and program recording medium |
CN200980117762.7A CN102027534B (zh) | 2008-05-16 | 2009-03-27 | Language model score look-ahead value assignment method and device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2008-129937 | 2008-05-16 | ||
JP2008129937 | 2008-05-16 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2009139230A1 (ja) | 2009-11-19 |
Family
ID=41318603
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/056324 WO2009139230A1 (ja) | Language model score look-ahead value imparting device, method therefor, and program recording medium |
Country Status (4)
Country | Link |
---|---|
US (1) | US8682668B2 (ja) |
JP (1) | JP5447373B2 (ja) |
CN (1) | CN102027534B (ja) |
WO (1) | WO2009139230A1 (ja) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9683862B2 (en) | 2015-08-24 | 2017-06-20 | International Business Machines Corporation | Internationalization during navigation |
TWI731921B (zh) * | 2017-01-20 | 2021-07-01 | Alibaba Group Services Limited | Speech recognition method and device |
CN108733739B (zh) * | 2017-04-25 | 2021-09-07 | Shanghai Cambricon Information Technology Co., Ltd. | Computing device and method supporting beam search |
CN108959421B (zh) * | 2018-06-08 | 2021-04-13 | Tencent Technology (Shenzhen) Co., Ltd. | Candidate reply evaluation device, query reply device, methods therefor, and storage medium |
KR102177741B1 (ko) * | 2018-10-26 | 2020-11-11 | Ajou University Industry-Academic Cooperation Foundation | Apparatus and method for interpreting communication messages based on recurrent neural networks and branch prediction |
CN112242144A (zh) * | 2019-07-17 | 2021-01-19 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speech recognition decoding method, apparatus, and device based on a streaming attention model, and computer-readable storage medium |
CN113838462B (zh) * | 2021-09-09 | 2024-05-10 | Beijing Jietong Huasheng Technology Co., Ltd. | Voice wake-up method and apparatus, electronic device, and computer-readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05197395A (ja) * | 1991-09-14 | 1993-08-06 | Philips Gloeilampenfab:Nv | Method and apparatus for recognizing word sequences in a speech signal |
JPH07295587A (ja) * | 1994-04-14 | 1995-11-10 | Philips Electron Nv | Word string recognition method and apparatus |
JPH10105189A (ja) * | 1996-09-27 | 1998-04-24 | Philips Electron Nv | Sequence retrieval method and apparatus |
JP2003140685A (ja) * | 2001-10-30 | 2003-05-16 | Nippon Hoso Kyokai <Nhk> | Continuous speech recognition device and program therefor |
JP2007163896A (ja) * | 2005-12-14 | 2007-06-28 | Canon Inc | Speech recognition apparatus and method |
Family Cites Families (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE4130633A1 (de) * | 1991-09-14 | 1993-03-18 | Philips Patentverwaltung | Method for recognizing the spoken words in a speech signal |
JP2905674B2 (ja) | 1993-10-04 | 1999-06-14 | ATR Interpreting Telecommunications Research Laboratories | Speaker-independent continuous speech recognition method |
JP3454959B2 (ja) * | 1995-03-15 | 2003-10-06 | Toshiba Corporation | Mobile phone device |
US5799065A (en) * | 1996-05-06 | 1998-08-25 | Matsushita Electric Industrial Co., Ltd. | Call routing device employing continuous speech |
US5822730A (en) * | 1996-08-22 | 1998-10-13 | Dragon Systems, Inc. | Lexical tree pre-filtering in speech recognition |
JP3061114B2 (ja) * | 1996-11-25 | 2000-07-10 | NEC Corporation | Speech recognition device |
JP3027543B2 (ja) | 1996-12-11 | 2000-04-04 | ATR Interpreting Telecommunications Research Laboratories | Continuous speech recognition device |
US6285786B1 (en) * | 1998-04-30 | 2001-09-04 | Motorola, Inc. | Text recognizer and method using non-cumulative character scoring in a forward search |
JP2938865B1 (ja) | 1998-08-27 | 1999-08-25 | ATR Interpreting Telecommunications Research Laboratories | Speech recognition device |
JP3252815B2 (ja) * | 1998-12-04 | 2002-02-04 | NEC Corporation | Continuous speech recognition apparatus and method |
US6928404B1 (en) * | 1999-03-17 | 2005-08-09 | International Business Machines Corporation | System and methods for acoustic and language modeling for automatic speech recognition with large vocabularies |
US6963837B1 (en) * | 1999-10-06 | 2005-11-08 | Multimodal Technologies, Inc. | Attribute-based word modeling |
US6871341B1 (en) * | 2000-03-24 | 2005-03-22 | Intel Corporation | Adaptive scheduling of function cells in dynamic reconfigurable logic |
JP4105841B2 (ja) * | 2000-07-11 | 2008-06-25 | International Business Machines Corporation | Speech recognition method, speech recognition apparatus, computer system, and storage medium |
US6980954B1 (en) * | 2000-09-30 | 2005-12-27 | Intel Corporation | Search method based on single triphone tree for large vocabulary continuous speech recognizer |
US7043422B2 (en) * | 2000-10-13 | 2006-05-09 | Microsoft Corporation | Method and apparatus for distribution-based language model adaptation |
JP2002215187A (ja) * | 2001-01-23 | 2002-07-31 | Matsushita Electric Ind Co Ltd | Speech recognition method and apparatus |
GB2384901B (en) * | 2002-02-04 | 2004-04-21 | Zentian Ltd | Speech recognition circuit using parallel processors |
US7181398B2 (en) * | 2002-03-27 | 2007-02-20 | Hewlett-Packard Development Company, L.P. | Vocabulary independent speech recognition system and method using subword units |
US7930181B1 (en) * | 2002-09-18 | 2011-04-19 | At&T Intellectual Property Ii, L.P. | Low latency real-time speech transcription |
JP2004191705A (ja) | 2002-12-12 | 2004-07-08 | Renesas Technology Corp | Speech recognition device |
US7031915B2 (en) * | 2003-01-23 | 2006-04-18 | Aurilab Llc | Assisted speech recognition by dual search acceleration technique |
US20040158468A1 (en) * | 2003-02-12 | 2004-08-12 | Aurilab, Llc | Speech recognition with soft pruning |
US7725319B2 (en) * | 2003-07-07 | 2010-05-25 | Dialogic Corporation | Phoneme lattice construction and its application to speech recognition and keyword spotting |
US7904296B2 (en) * | 2003-07-23 | 2011-03-08 | Nexidia Inc. | Spoken word spotting queries |
JP4583772B2 (ja) | 2004-02-05 | 2010-11-17 | NEC Corporation | Speech recognition system, speech recognition method, and speech recognition program |
JP4541781B2 (ja) * | 2004-06-29 | 2010-09-08 | Canon Inc. | Speech recognition apparatus and method |
US8036893B2 (en) * | 2004-07-22 | 2011-10-11 | Nuance Communications, Inc. | Method and system for identifying and correcting accent-induced speech recognition difficulties |
US7734460B2 (en) * | 2005-12-20 | 2010-06-08 | Microsoft Corporation | Time asynchronous decoding for long-span trajectory model |
US7774197B1 (en) * | 2006-09-27 | 2010-08-10 | Raytheon Bbn Technologies Corp. | Modular approach to building large language models |
2009
- 2009-03-27 WO PCT/JP2009/056324 patent/WO2009139230A1/ja active Application Filing
- 2009-03-27 JP JP2010511918A patent/JP5447373B2/ja not_active Expired - Fee Related
- 2009-03-27 US US12/992,760 patent/US8682668B2/en not_active Expired - Fee Related
- 2009-03-27 CN CN200980117762.7A patent/CN102027534B/zh not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
US8682668B2 (en) | 2014-03-25 |
JP5447373B2 (ja) | 2014-03-19 |
CN102027534B (zh) | 2013-07-31 |
US20110191100A1 (en) | 2011-08-04 |
CN102027534A (zh) | 2011-04-20 |
JPWO2009139230A1 (ja) | 2011-09-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10210862B1 (en) | Lattice decoding and result confirmation using recurrent neural networks | |
JP5739718B2 (ja) | Dialogue device | |
JP4465564B2 (ja) | Speech recognition apparatus, speech recognition method, and recording medium | |
JP5447373B2 (ja) | Language model score look-ahead value imparting device, method therefor, and program recording medium | |
Alleva et al. | An improved search algorithm using incremental knowledge for continuous speech recognition | |
US6961701B2 (en) | Voice recognition apparatus and method, and recording medium | |
JP4757936B2 (ja) | Pattern recognition method and apparatus, pattern recognition program, and recording medium therefor | |
WO2001065541A1 (fr) | Speech recognition device, speech recognition method, and recording medium | |
KR20140028174A (ko) | Speech recognition method and electronic device applying the same | |
US20100100379A1 (en) | Voice recognition correlation rule learning system, voice recognition correlation rule learning program, and voice recognition correlation rule learning method | |
US11705116B2 (en) | Language and grammar model adaptation using model weight data | |
JPWO2009081895A1 (ja) | Speech recognition system, speech recognition method, and speech recognition program | |
JP5276610B2 (ja) | Language model generation device, program therefor, and speech recognition system | |
JP2007047412A (ja) | Recognition grammar model creation device, recognition grammar model creation method, and speech recognition device | |
JP6690484B2 (ja) | Computer program for speech recognition, speech recognition device, and speech recognition method | |
JP6027754B2 (ja) | Adaptation device, speech recognition device, and program therefor | |
JP5309343B2 (ja) | Pattern recognition method and apparatus, pattern recognition program, and recording medium therefor | |
JP4528540B2 (ja) | Speech recognition method and apparatus, speech recognition program, and storage medium storing the speech recognition program | |
JP2004109535A (ja) | Speech synthesis method, speech synthesis apparatus, and speech synthesis program | |
US20050049873A1 (en) | Dynamic ranges for viterbi calculations | |
JP5008078B2 (ja) | Pattern recognition method and apparatus, pattern recognition program, and recording medium therefor | |
JP2008242059A (ja) | Speech recognition dictionary creation device and speech recognition device | |
JP2000075885A (ja) | Speech recognition device | |
JP4600705B2 (ja) | Speech recognition apparatus, speech recognition method, and recording medium | |
JP2005134442A (ja) | Speech recognition apparatus and method, recording medium, and program | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| WWE | Wipo information: entry into national phase | Ref document number: 200980117762.7; Country of ref document: CN |
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09746437; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 2010511918; Country of ref document: JP |
| WWE | Wipo information: entry into national phase | Ref document number: 12992760; Country of ref document: US |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 09746437; Country of ref document: EP; Kind code of ref document: A1 |