
CN1201286C - Speech recognizer with a lexical tree based N-gram language model - Google Patents

Speech recognizer with a lexical tree based N-gram language model Download PDF

Info

Publication number
CN1201286C
CN1201286C CN99817058.5A
Authority
CN
China
Prior art keywords
probability
word
gram
phoneme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN99817058.5A
Other languages
Chinese (zh)
Other versions
CN1406374A (en)
Inventor
Lin Zhiwei (transliterated)
Yan Yonghong (transliterated)
Zhao Qingwei (transliterated)
Yuan Baosheng (transliterated)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp filed Critical Intel Corp
Publication of CN1406374A
Application granted granted Critical
Publication of CN1201286C
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L15/197 Probabilistic grammars, e.g. word n-grams
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

In some embodiments, the invention comprises a method of creating a lexical tree and identifying beginning phonemes in the lexical tree. The method further comprises estimating the probabilities of words in the lexical tree that have particular beginning phonemes and storing at least some of the estimated probabilities, wherein backoff weights are not stored together with the estimated probabilities. The estimated probabilities may be stored in a lookup table. In other embodiments, the invention comprises a method of receiving phonemes and identifying them in the lexical tree. That method further comprises estimating the probabilities of words that include those phonemes by using estimated probabilities retrieved from a storage area, wherein the retrieved probabilities do not include backoff weights stored with the estimated probabilities. Again, the estimated probabilities may be stored in a lookup table and may be used in establishing pruning thresholds. The methods may be implemented through instructions on a computer-readable medium.

Description

Method of performing speech recognition using a lexical-tree-based n-gram language model
Technical field
The present invention relates to speech recognition systems and, more particularly, to a lexical-tree-based n-gram language model.
Background technology
One component of a speech recognizer is the language model. A language model comprises the probabilities that words in a vocabulary occur and that a word follows another word or words. Indeed, a popular way to capture the syntactic structure of a given language is to use conditional probabilities to capture the sequential information embedded in the word strings of sentences. For example, if the current word is $w_1$, a language model can describe the probability that some other word $w_2, w_3, \dots, w_N$ will follow $w_1$. The conditional probabilities are typically computed from how often words neighbor one another in a training corpus (for example, newspapers). For example, the conditional probability $P_{21} = P(w_2 \mid w_1)$ is the probability that word $w_2$ follows word $w_1$. The probability $P_{21}$ is called a bigram (2-gram). A trigram (3-gram) is the conditional probability of a word following two other words in order. For example, $P_{210} = P(w_2 \mid w_1 w_0)$ is the probability that $w_2$ follows $w_1$, where $w_1$ in turn follows $w_0$. A unigram (1-gram) probability is the probability that a word occurs at all. For example, $p_1 = p(w_1)$ is the probability that word $w_1$ occurs at a given time without regard to the previous words.
The number of possible word combinations involved grows geometrically across unigrams, bigrams, trigrams, and so on. The terms "lower-order gram" and "higher-order gram" as used herein refer to the order of the gram: a unigram is lower order than a bigram, and a bigram is lower order than a trigram. For a large vocabulary, the total number of trigram combinations, and even of bigram combinations, is too large to manage. Moreover, many of these trigrams and bigrams have conditional probabilities so small (nearly zero) that they are not worth placing in the language model. Backoff weights have therefore been used to adjust the probabilities of lower-order grams: for example, when a trigram probability is not included in the language model, a bigram probability multiplied by a backoff weight (bowt) is used instead. If the backoff weight does not exist, the lower-order gram simply replaces the higher-order gram. Accordingly, a word-based n-gram language model can be expressed as equation (1), as follows:
$$P(w_n \mid w_{n-1} \dots w_1) = \begin{cases} p(w_n \mid w_{n-1} \dots w_1) & \text{if the } n\text{-gram exists} \\ \text{bowt}(w_{n-1} \dots w_1) \cdot P(w_n \mid w_{n-1} \dots w_2) & \text{if the backoff weight exists} \\ P(w_n \mid w_{n-1} \dots w_2) & \text{otherwise} \end{cases} \qquad (1)$$
As noted above, although equation (1) is expressed for a general n-gram, orders higher than trigrams are rarely considered.
A typical n-gram language model file is stored in the following format:
For 1-grams: $p(w_1) \; w_1 \; \text{bowt}(w_1)$
For i-grams (for $i = 1, \dots, n-1$): $p(w_i \mid w_{i-1} \dots w_1) \; w_1 \dots w_i \; \text{bowt}(w_1 \dots w_i)$
For n-grams: $p(w_n \mid w_{n-1} w_{n-2} \dots w_1) \; w_1 \dots w_n$
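To make this storage scheme concrete, the sketch below (hypothetical Python; the class name BackoffLM and the toy numbers are invented for illustration and do not come from the patent) holds probabilities and backoff weights keyed by word tuples, mirroring the file format above, and answers a query by backing off to lower-order grams as in equation (1). Log probabilities are used so that products become sums.

```python
import math

class BackoffLM:
    """Minimal backoff n-gram model (illustrative sketch, not the patent's code).

    probs maps a word tuple (w1, ..., wi) to log p(wi | w1 ... wi-1);
    bowts maps a context tuple (w1, ..., wi) to its log backoff weight.
    """
    def __init__(self, probs, bowts):
        self.probs = probs
        self.bowts = bowts

    def logprob(self, words):
        """Return log P(words[-1] | words[:-1]) using equation (1)-style backoff."""
        words = tuple(words)
        if words in self.probs:                 # explicit n-gram stored
            return self.probs[words]
        if len(words) == 1:                     # unseen unigram: floor value
            return -99.0
        context = words[:-1]
        bowt = self.bowts.get(context, 0.0)     # missing bowt acts as weight 1
        return bowt + self.logprob(words[1:])   # back off to the (n-1)-gram

# Toy bigram model over a small vocabulary.
lm = BackoffLM(
    probs={("fund",): math.log(0.2), ("funds",): math.log(0.1),
           ("mutual",): math.log(0.05), ("mutual", "fund"): math.log(0.5)},
    bowts={("mutual",): math.log(0.8)},
)
print(lm.logprob(("mutual", "fund")))    # stored bigram
print(lm.logprob(("mutual", "funds")))   # backs off: bowt(mutual) + log p(funds)
```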
A lexical tree is used to organize the possible words. For example, suppose that in a lexical tree any of the words $w_2, w_3, \dots, w_N$ may follow word $w_1$. Conditional probabilities can be computed to help decide which of $w_2, w_3, \dots, w_N$ follows $w_1$. For a large vocabulary, the number of possibilities is enormous. Various techniques have been developed to reduce the number of possibilities involved, for example by "pruning" with a beam: cutting paths whose conditional probability is low relative to a threshold tied to the maximum.
A word arrives as a series of phonemes to be detected; a phoneme herein means a digital electrical signal representing a sound. Until the last phoneme of a word has been detected, however, which word is being spoken is normally unknown. As a result, pruning of the incoming word is delayed, which slows the overall speed at which received words are decoded.
In S. Ortmanns et al., "Language-Model Look-Ahead for Large Vocabulary Speech Recognition," ICSLP '96 (1996), pp. 2095-98, a look-ahead technique is proposed that incorporates language model probabilities early in the pruning process of a beam search strategy. The authors of that article, however, fail to recognize how best to keep the stored estimated probabilities of the lexical tree at a manageable level. Indeed, the Ortmanns et al. article concludes that the size of the table storing the computed (estimated) probabilities would be exceptionally large. See p. 2097 of the article.
Accordingly, large-vocabulary continuous speech recognition (LVCSR) systems need a better lexical-tree n-gram language model format.
Summary of the invention
According to one aspect of the present invention, there is provided a method of performing speech recognition, comprising: creating a lexical tree; identifying a first phoneme in the lexical tree; estimating the probabilities of words in the lexical tree that have the first phoneme; and storing at least some of the estimated probabilities, wherein the stored estimated probabilities include only estimated probabilities derived directly from n-gram probabilities, and estimated probabilities that would be derived from backoff probabilities are approximated by the backed-off (n-1)-gram estimated probabilities.
According to another aspect of the present invention, there is also provided a method of performing speech recognition, comprising: receiving phonemes and identifying them in a lexical tree; and estimating the probabilities of words that include those phonemes by using estimated probabilities retrieved from a storage area, wherein the retrieved estimated probabilities include only estimated probabilities derived directly from n-gram probabilities, and estimated probabilities that would be derived from backoff probabilities are approximated by the backed-off (n-1)-gram estimated probabilities.
In some embodiments, the invention includes a method of creating a lexical tree and identifying beginning phonemes in that lexical tree. The method further includes estimating the probabilities of words in the lexical tree that have particular beginning phonemes, and storing at least some of the estimated probabilities, wherein backoff weights are not stored with the estimated probabilities. The estimated probabilities may be stored in a lookup table.
In other embodiments, the invention includes a method of receiving phonemes and identifying them in a lexical tree. The method further includes estimating the probabilities of words that include those phonemes by using estimated probabilities retrieved from a storage area, wherein the retrieved probabilities do not include backoff weights stored together with the estimated probabilities. Again, the estimated probabilities may be stored in a lookup table.
The estimated probabilities may be used in establishing pruning thresholds.
These methods may be implemented through instructions on a computer-readable medium.
Further embodiments are described herein and are summarized in the claims.
Description of drawings
The invention will be understood more fully from the following detailed description and from the accompanying drawings of embodiments of the invention which, however, should not be taken to limit the invention to the specific embodiments described; these embodiments are for explanation and understanding only.
Fig. 1 is a schematic representation of a lexical tree according to some embodiments of the present invention.
Fig. 2 is a high-level block diagram of a computer system that can be used in some embodiments of the present invention.
Fig. 3 is a high-level schematic representation of a handheld computer system that can be used in some embodiments of the present invention.
Embodiment
The present invention relates to a lexical-tree-based n-gram language model format for LVCSR. With the present invention, the probability of a word can be estimated as soon as a beginning phoneme is detected. Pruning of paths that fall below a threshold can then begin before the successor word is identified. The invention is used to speed up the search process in LVCSR. The language model plays a crucial role in decoding, both for accuracy and for performance; the performance of a speech recognition system is therefore tied to its language model.
The present invention relates to various ways of organizing a lexical tree. As an example, Fig. 1 shows a schematic of part of a lexical tree. The lexical tree of Fig. 1 links many words together through their phonemes, and different words may share the same phonemes. A predecessor word $w_0$ is represented by a rectangle; $w_0$ may or may not itself have preceding words. Some of the phonemes in the vocabulary can begin a successor word. These beginning phonemes, $Bph_1, Bph_2, \dots, Bph_x$, may be fewer than the total number of phonemes.
Several words can begin with the same phoneme. For ease of discussion, words sharing the same phonemes carry similar labels. For example, words $w_{11}$, $w_{12}$, and $w_{13}$ each begin with beginning phoneme $Bph_1$. More particularly, phonemes $Bph_1$, $ph_2$, $ph_3$, and $ph_4$ make up word $w_{11}$ (for example, the word "fund"); phonemes $Bph_1$, $ph_2$, $ph_3$, $ph_4$, and $ph_5$ make up word $w_{12}$ (for example, "funds"); and phonemes $Bph_1$, $ph_2$-$ph_4$, and $ph_6$-$ph_{10}$ make up word $w_{13}$ (for example, "fundamental"). (Note that the actual number of phonemes in these words may differ from what is shown here.) In practice, many more words typically begin with the same phoneme, but for ease of discussion only three words associated with $Bph_1$ are shown. In the example of Fig. 1, assume that word $w_{12}$ is the word finally detected. In that case, the path through $w_0$ and $w_{12}$ is the actual path, and the other paths are potential paths. A sketch of how such a tree can be built appears below.
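As a rough illustration of how such a tree can be built from a pronunciation dictionary (the patent prescribes no particular implementation; TreeNode, add_word, and the abbreviated phoneme strings are all illustrative assumptions), the sketch below shares phoneme prefixes among words, so that "fund", "funds", and "fundamental" hang off one common branch as in Fig. 1:

```python
class TreeNode:
    """One phoneme node (a state) of the lexical tree."""
    def __init__(self, phoneme=None):
        self.phoneme = phoneme
        self.children = {}   # phoneme -> TreeNode
        self.word = None     # set on the node that ends a word

def add_word(root, phonemes, word):
    """Insert a word's phoneme string, sharing any existing prefix."""
    node = root
    for ph in phonemes:
        node = node.children.setdefault(ph, TreeNode(ph))
    node.word = word

# Toy dictionary: three words sharing the beginning phoneme "f".
root = TreeNode()
add_word(root, ["f", "ah", "n", "d"], "fund")
add_word(root, ["f", "ah", "n", "d", "z"], "funds")
add_word(root, ["f", "ah", "n", "d", "ax", "m"], "fundamental")  # abbreviated
print(sorted(root.children))  # one shared branch: ['f']
```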
In some embodiments, as soon as the first phoneme following word $w_0$ is identified, the probabilities of the successor words can be estimated, so that pruning can begin before the successor word is known with certainty.
In some embodiments of the invention, a lexical-tree-based n-gram language model format can be used that applies efficiently to, for example, a tree-based Viterbi decoding algorithm with a language model look-ahead mechanism. For a tree-based Viterbi beam search algorithm, the estimated language model probability $\pi_v(s)$ for a tree state $s$ and predecessor word string $w_{n-1} w_{n-2} \dots w_1$ can typically be estimated by equation (2), as follows:
$$\pi_v(s) = \max_{w \in W(s)} \big( \lambda_w \cdot p(w \mid w_{n-1} w_{n-2} \dots w_1) \big) \qquad (2)$$
where $W(s)$ is the set of words that can be reached from lexical tree state $s$, $\lambda_w$ denotes a weight (expressed as a fraction), $v$ is the predecessor word, and $p(w \mid w_{n-1} w_{n-2} \dots w_1)$ denotes the n-gram word conditional probability. $\pi_v(s)$ may also be called the estimated probability $P_{estimated}$ used in establishing the pruning threshold, or the look-ahead probability. As a result of applying language model look-ahead, a tighter pruning beam can be obtained to speed up the decoding process. The fractional weight $\lambda_w$ can be set to 1 or can lie between 0 and 1; in some embodiments, $\lambda_w$ may be greater than 1. The fractional weight can be determined empirically, through trial and error, or by calculation, and may be the same or different for each beginning phoneme. Although the invention is expressed in terms of n-grams, in practice trigrams, bigrams, unigrams, and/or other gram orders may be used.
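A minimal sketch of the look-ahead computation of equation (2), assuming the conditional probabilities for the current predecessor context have already been fetched (all names and numbers here are illustrative, not the patent's):

```python
def lookahead_prob(cond_probs, reachable_words, weight=1.0):
    """Equation (2): pi_v(s) = max over w in W(s) of lambda_w * p(w | context).

    cond_probs maps each word to p(word | w_{n-1} ... w_1) for the current
    predecessor context; reachable_words is W(s), the words still reachable
    from tree state s. A single weight stands in for lambda_w here, though
    the patent allows a per-word weight. (Illustrative sketch.)
    """
    return weight * max(cond_probs.get(w, 0.0) for w in reachable_words)

# The three words below beginning phoneme Bph1 in Fig. 1 (toy probabilities):
cond_probs = {"fund": 0.5, "funds": 0.2, "fundamental": 0.05}
print(lookahead_prob(cond_probs, {"fund", "funds", "fundamental"}))  # 0.5
```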
Viewed from the tree, each phoneme node is a tree state. As more and more phonemes are detected while speech proceeds through the tree, the estimated probability may need to be recomputed so that pruning can continue.
Ordinarily, the estimated (computed) language model probabilities mentioned above must be computed and generated dynamically at run time. This process is very time-consuming, even when caching is introduced to reduce the overall computational cost. Computing the estimated probabilities in advance and storing them in a lookup table can accelerate this process significantly.
In the example of Fig. 1, suppose that $Bph_1$ is the first phoneme of the successor word. In that case, the bigram instance of equation (2) is given by equation (3), as follows:
$$P_{estimated} = \lambda_w \cdot \max\{P(w_{11} \mid w_0),\ P(w_{12} \mid w_0),\ P(w_{13} \mid w_0)\} \qquad (3)$$
Depending on the case, words whose probabilities or conditional probabilities fall below the threshold, or at or below the threshold, are pruned away. The threshold can be derived in various ways, for example by multiplying $P_{estimated}$ by a number or by subtracting a number from $P_{estimated}$.
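One plausible reading of that threshold rule, sketched in Python (the beam factor, the function names, and the toy numbers are assumptions, since the patent leaves the exact derivation open):

```python
def prune(hypotheses, p_estimated, beam=0.1):
    """Keep hypotheses whose probability clears a threshold derived from
    P_estimated; here threshold = beam * P_estimated, which is equivalent to
    subtracting a fixed margin in the log domain. Sketch only; the patent
    allows other derivations."""
    threshold = beam * p_estimated
    return {h: p for h, p in hypotheses.items() if p > threshold}

hyps = {"fund": 0.5, "funds": 0.2, "fundamental": 0.004}
print(prune(hyps, p_estimated=0.5))  # "fundamental" falls below 0.05 and is cut
```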
To speed up the decoding process, we define a lexical-tree-based n-gram language model format for storing the precomputed estimated probabilities, keeping the memory requirement within a controllable range by deploying a backoff mechanism. In general, the estimated probability $P_{estimated}$ can be obtained by equation (4), as follows:
$$P_{estimated}(s_j \mid w_{n-1} w_{n-2} \dots w_1) = \begin{cases} \max_{w \in W(s_j)} \big( \lambda_w \cdot p(w \mid w_{n-1} \dots w_1) \big) & \text{if an explicit } n\text{-gram exists} \\ \text{bowt}(w_{n-1} \dots w_1) \cdot P_{estimated}(s_j \mid w_{n-1} \dots w_2) & \text{if the backoff weight exists} \\ P_{estimated}(s_j \mid w_{n-1} \dots w_2) & \text{otherwise} \end{cases} \qquad (4)$$
where $s_j$ is the j-th state of the potential successor word. Equation (4) comprises the three rows within the brace; in general, the top row of equation (4) is simply equation (2). Naturally, equation (4) can also be applied at different gram orders, such as unigrams, bigrams, and trigrams. Equation (4) provides an approximation of equation (2). Only when the top row of equation (4) is met is $P_{estimated}$ stored in the storage area, for example in a lookup table; in this way the lookup table can be kept at a manageably small size.
In equation (4), we need not store the backoff weights, because they are identical to the weights stored in the standard word-based n-gram language model. During decoding, the backoff weights can be obtained from a conventional file. Thus, in decoding, if the first row of equation (4) is not met, the lower-order estimated probability with the backoff weight is used, if appropriate.
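The storage rule can be sketched as follows (hypothetical Python; explicit_ngrams, the state naming, and the toy data are assumptions): an entry for a tree state under a given context is written to the table only when the top row of equation (4) is met, that is, when at least one word below that state has an explicit n-gram in that context; backed-off values are never stored.

```python
def build_lookahead_table(tree_states, explicit_ngrams, weight=1.0):
    """Precompute P_estimated per equation (4), storing only entries whose
    top row is met (an explicit n-gram exists). Backed-off entries are NOT
    stored, which keeps the table small. (Sketch under stated assumptions.)

    tree_states: dict state_id -> set of words W(s) reachable from that state.
    explicit_ngrams: dict (context, word) -> p(word | context), explicit only.
    """
    table = {}
    contexts = {ctx for (ctx, _w) in explicit_ngrams}
    for ctx in contexts:
        for state, words in tree_states.items():
            probs = [explicit_ngrams[(ctx, w)] for w in words
                     if (ctx, w) in explicit_ngrams]
            if probs:                      # top row of equation (4) is met
                table[(ctx, state)] = weight * max(probs)
    return table

states = {"Bph1": {"fund", "funds", "fundamental"}}
ngrams = {(("w0",), "fund"): 0.5, (("w0",), "funds"): 0.2}
print(build_lookahead_table(states, ngrams))  # {(('w0',), 'Bph1'): 0.5}
```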
The probability used for pruning can be the estimated probability of the successor word, or the estimated probability added to the probability of the predecessor word (for example, in Fig. 1, $p(w_0) + P_{estimated}$).
In some embodiments, the lookup table stores the tree-based n-gram language model estimated probabilities as follows. Other formats, however, may also be used.
1-grams:
$p(s_1) \; s_1$
...
i-grams (for $i = 1, \dots, n-1$):
$p(s_i \mid w_{i-1} \dots w_1) \; w_1 \dots w_{i-1} \; s_i$
...
n-grams:
$p(s_n \mid w_{n-1} \dots w_1) \; w_1 \dots w_{n-1} \; s_n$
Because the total number of nodes in the compressed lexical tree is comparable to the total number of words in the dictionary, the total storage for the lexical-tree-based n-gram language model using equation (4) as the approximation is of the same order as that of a conventional word-based n-gram language model. The processing techniques used for ordinary n-gram language models can be applied to the new lexical-tree-based language model file of the present invention.
In some embodiments, the estimated probabilities are computed before recognition and stored in a lookup table. To shrink the size of the table, however, in some embodiments only those entries derived directly from n-gram probabilities (not through backoff) are stored. Entries that would be derived from backoff probabilities are approximated by the backed-off (n-1)-gram estimated probabilities (an n-gram backs off to an (n-1)-gram). Through this compression, the size of the table can be reduced to a manageable level.
The successor word can be identified when its last phoneme (or terminal node) is reached. For example, in Fig. 1, once phoneme $ph_5$ is reached, the word is known to be $w_{12}$. Once the word is known, the estimated probability can be replaced with the actual probability. This can be done by adding the full conditional probability (for example, $p(w_{12} \mid w_0)$ in Fig. 1) and subtracting the estimated probability. In some embodiments, probabilities can be accumulated from the first word of the search, for example, $p(w_1 w_2 w_3 \dots w_i) = p(w_1) + p(w_2 \mid w_1) + p(w_3 \mid w_2) + \dots + p(w_i \mid w_{i-1})$, using logarithms of the probabilities so that multiplication becomes addition: $\log(p_1 \cdot p_2) = \log(p_1) + \log(p_2)$.
The true probability can be determined once the last phoneme is identified, and can be expressed as $P_{true} = p(w_{predecessor}) + P_{estimated} + P(w_{actual} \mid w_{predecessor}) - P_{estimated}$. In the example of Fig. 1, if word $w_{12}$ is the actual word, the true probability is $P_{true} = p(w_0) + P_{estimated} + P(w_{12} \mid w_0) - P_{estimated}$, where $P_{estimated}$ can be obtained as described above.
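In the log domain the correction is just a pair of additions, as this small sketch shows (the numbers are illustrative only):

```python
import math

# Running log-probability of the path in Fig. 1, scored at beginning
# phoneme Bph1 with the look-ahead estimate (all numbers illustrative).
p_predecessor = math.log10(0.3)   # log p(w0)
p_estimated   = math.log10(0.5)   # look-ahead estimate made at Bph1
score = p_predecessor + p_estimated

# Last phoneme ph5 reached: the word is known to be w12, so add the full
# conditional probability p(w12 | w0) and subtract the estimate (P_true).
p_actual = math.log10(0.2)        # log p(w12 | w0)
score = score + p_actual - p_estimated
print(math.isclose(score, p_predecessor + p_actual))  # True: estimate replaced
```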
The nodes of the lexical tree can be folded, or compressed, by eliminating redundant nodes. For example, in Fig. 1, phonemes $Bph_1$, $ph_2$, $ph_3$, and $ph_4$ could be folded into one state (node). In practice, however, $Bph_1$ usually has other branch words, so the folding of $Bph_1$ with $ph_2$-$ph_4$ may not be possible. Phonemes $ph_6$-$ph_{10}$ can be folded into one state. In some embodiments there are two lexical trees: the original one is used by the speech recognizer, and the compressed lexical tree is used for the language model. The compressed lexical tree can be used during training to create the lookup table. In training, the lexical tree can be created from a dictionary according to known techniques.
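A rough sketch of the folding step (the fold function is hypothetical; the merge rule shown, collapsing any non-word-ending node that has exactly one child, is one straightforward way to realize the compression described above):

```python
class Node:
    """Minimal lexical-tree node for the folding sketch (illustrative)."""
    def __init__(self, label):
        self.label = label
        self.children = []   # list of Node
        self.is_word_end = False

def fold(node):
    """Collapse linear chains: merge each child that has exactly one child of
    its own and ends no word, so ph6 ... ph10 in Fig. 1 become one state."""
    for child in node.children:
        while len(child.children) == 1 and not child.is_word_end:
            only = child.children[0]
            child.label += "+" + only.label   # merged state label
            child.is_word_end = only.is_word_end
            child.children = only.children
        fold(child)

# A chain a-b-c-d collapses to a single state "a+b+c+d".
root = Node("root")
a, b, c, d = Node("a"), Node("b"), Node("c"), Node("d")
root.children = [a]; a.children = [b]; b.children = [c]; c.children = [d]
d.is_word_end = True
fold(root)
print(root.children[0].label)  # a+b+c+d
```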
Various computer systems can be used for training and for the speech recognition system. Merely as an example, Fig. 2 shows a schematic of a computer system 10 having a processor 14, memory 16, and an input/output and control block 18. A substantial amount of memory may reside in processor 14, and memory 16 may represent memory off the chip of processor 14, or memory that is partly on and partly off that chip. (Alternatively, memory 16 could be entirely on the chip of processor 14.) At least some of the input/output and control block 18 may be on the same chip as processor 14, or on a separate chip. A microphone 26, a monitor 30, additional memory 34, an input device (such as a keyboard and mouse 38), a network connection 42, and a speaker 44 can be coupled to the I/O and control block 18. Memory 34 represents various kinds of storage, such as a hard disk, a CD-ROM, or a DVD disc. The term lookup table is used broadly and does not restrict the form of storage. The stored estimated probabilities may be held together or distributed across different locations, and some or all of the table may be duplicated in different memories. The lookup table may reside in memory 16, memory 34, or elsewhere; lookup tables 22 and 24 represent all or part of the lookup table. It is emphasized that the system of Fig. 2 is merely illustrative, and the invention is not limited to use with such a computer system. Computer system 10 and other computer systems used to carry out the invention may take various forms, such as desktop, mainframe, and portable computers.
As an example, Fig. 3 shows a handheld device 60 with a display 62, which can implement some or all of the functions of Fig. 2. The handheld device may at times connect to another computer system, such as that of Fig. 2. The shapes and relative sizes of the objects in Figs. 2 and 3 are not intended to suggest actual shapes and relative sizes.
The various memories can be considered computer-readable media on which instructions can be stored that, when executed, implement some embodiments of the present invention.
Other information and embodiment
A bigram language model adopting the above lexical-tree-based format has been implemented. By using precomputed language model look-ahead, we not only saved the computation of the estimated probabilities, a saving of up to 15% of the total decoding computation time, but also saved the roughly 50 MB of memory that caching would otherwise require when these probabilities are generated dynamically. (These figures are merely examples, not requirements.) In addition, the new language model format lets us handle higher-order language model look-ahead in reasonable time and memory.
" embodiment ", " embodiment ", " some embodiment " or " other embodiment " mentioned in this explanation are meant at least in some embodiments of the invention, specific function, structure or the feature related with embodiment that not necessarily comprise in all embodiments.Said " embodiment ", " embodiment " or " some embodiment " differ to establish a capital and are meant identical embodiment.
If the specification states that a component, feature, structure, or characteristic "may," "might," or "could" be included, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claims refer to "a" or "an" element, that does not mean there is only one of the element. If the specification or claims refer to an "additional" element, that does not preclude there being more than one of the additional element.
Those skilled in the art will appreciate that many variations may be made to the foregoing description and drawings within the scope of the present invention. Accordingly, it is the following claims, including any amendments thereto, that define the scope of the invention.

Claims (15)

1. A method of performing speech recognition, comprising:
creating a lexical tree;
identifying a first phoneme in the lexical tree;
estimating the probabilities of words in the lexical tree that have the first phoneme; and
storing at least some of the estimated probabilities, wherein the stored estimated probabilities include only estimated probabilities derived directly from n-gram probabilities, and estimated probabilities derived from backoff probabilities are approximated by the backed-off (n-1)-gram estimated probabilities.
2. The method according to claim 1, wherein an estimated probability is stored only if the corresponding n-gram exists.
3. The method according to claim 1, wherein the estimated probabilities are stored in a lookup table.
4. The method according to claim 3, wherein the lookup table includes the following information:
1-grams: $p(s_1) \; s_1$
i-grams (for $i = 1, \dots, n-1$): $p(s_i \mid w_{i-1} \dots w_1) \; w_1 \dots w_{i-1} \; s_i$
n-grams: $p(s_n \mid w_{n-1} \dots w_1) \; w_1 \dots w_{n-1} \; s_n$
5. The method according to claim 1, wherein the estimated probability $P_{estimated}$ is obtained according to the following equation:
$$P_{estimated}(s_j \mid w_{n-1} \dots w_1) = \begin{cases} \max_{w \in W(s_j)} \big( \lambda_w \cdot p(w \mid w_{n-1} \dots w_1) \big) & \text{if an explicit } n\text{-gram exists} \\ \text{bowt}(w_{n-1} \dots w_1) \cdot P_{estimated}(s_j \mid w_{n-1} \dots w_2) & \text{if the backoff weight exists} \\ P_{estimated}(s_j \mid w_{n-1} \dots w_2) & \text{otherwise} \end{cases}$$
where $s_j$ is the j-th state of the word associated with the first phoneme, $W(s)$ is the set of words derivable from lexical tree state $s$, and $\lambda_w$ denotes a fractional weight, and wherein the estimated probability is stored only if the first row of the above equation is satisfied.
6. The method according to claim 5, wherein $\lambda_w$ is 1.
7. The method according to claim 5, wherein $\lambda_w$ is between 0 and 1 and is selected for each first phoneme.
8. A method of performing speech recognition, comprising:
receiving phonemes and identifying them in a lexical tree; and
estimating the probabilities of words that include the phonemes by using estimated probabilities retrieved from a storage area, wherein the retrieved estimated probabilities include only estimated probabilities derived directly from n-gram probabilities, and estimated probabilities derived from backoff probabilities are approximated by the backed-off (n-1)-gram estimated probabilities.
9. The method according to claim 8, wherein the estimated probabilities are stored in a lookup table.
10. The method according to claim 9, wherein the lookup table includes the following information, where $s$ is a state of the lexical tree and $p$ is a probability:
1-grams: $p(s_1) \; s_1$
i-grams (for $i = 1, \dots, n-1$): $p(s_i \mid w_{i-1} \dots w_1) \; w_1 \dots w_{i-1} \; s_i$
n-grams: $p(s_n \mid w_{n-1} \dots w_1) \; w_1 \dots w_{n-1} \; s_n$
11. The method according to claim 8, wherein backoff weight information can be derived from the weights stored in a word-based n-gram language model.
12. The method according to claim 8, wherein the estimated probabilities are used in establishing a pruning threshold.
13. The method according to claim 8, wherein the estimated probabilities are determined according to the following equation:
$$P_{estimated}(s_j \mid w_{n-1} \dots w_1) = \begin{cases} \max_{w \in W(s_j)} \big( \lambda_w \cdot p(w \mid w_{n-1} \dots w_1) \big) & \text{if an explicit } n\text{-gram exists} \\ \text{bowt}(w_{n-1} \dots w_1) \cdot P_{estimated}(s_j \mid w_{n-1} \dots w_2) & \text{if the backoff weight exists} \\ P_{estimated}(s_j \mid w_{n-1} \dots w_2) & \text{otherwise} \end{cases}$$
where $s_j$ is the j-th state of the word associated with the first phoneme, $W(s)$ is the set of words derivable from lexical tree state $s$, and $\lambda_w$ denotes a fractional weight, and wherein only the result of the first row is stored.
CN99817058.5A 1999-12-23 1999-12-23 Speech recognizer with a lexical tree based N-gram language model Expired - Fee Related CN1201286C (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN1999/000217 WO2001048737A2 (en) 1999-12-23 1999-12-23 Speech recognizer with a lexical tree based n-gram language model

Publications (2)

Publication Number Publication Date
CN1406374A CN1406374A (en) 2003-03-26
CN1201286C true CN1201286C (en) 2005-05-11

Family

ID=4575158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN99817058.5A Expired - Fee Related CN1201286C (en) 1999-12-23 1999-12-23 Speech recognizer with a lexical tree based N-gram language model

Country Status (3)

Country Link
CN (1) CN1201286C (en)
AU (1) AU1767600A (en)
WO (1) WO2001048737A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271450B (en) * 2007-03-19 2010-09-29 株式会社东芝 Method and device for cutting language model

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0420464D0 (en) 2004-09-14 2004-10-20 Zentian Ltd A speech recognition circuit and method
GB2453366B (en) * 2007-10-04 2011-04-06 Toshiba Res Europ Ltd Automatic speech recognition method and apparatus
CN102422245B (en) 2009-03-19 2016-05-04 谷歌公司 Input method editor
JP5362095B2 (en) * 2009-03-19 2013-12-11 グーグル・インコーポレーテッド Input method editor
US8655647B2 (en) 2010-03-11 2014-02-18 Microsoft Corporation N-gram selection for practical-sized language models
US8589164B1 (en) * 2012-10-18 2013-11-19 Google Inc. Methods and systems for speech recognition processing using search query information
CN111128172B (en) * 2019-12-31 2022-12-16 达闼机器人股份有限公司 Voice recognition method, electronic equipment and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3009709B2 (en) * 1990-07-13 2000-02-14 日本電信電話株式会社 Japanese speech recognition method
DE4130631A1 (en) * 1991-09-14 1993-03-18 Philips Patentverwaltung METHOD FOR RECOGNIZING THE SPOKEN WORDS IN A VOICE SIGNAL
JPH0772840B2 (en) * 1992-09-29 1995-08-02 日本アイ・ビー・エム株式会社 Speech model configuration method, speech recognition method, speech recognition device, and speech model training method
US5621859A (en) * 1994-01-19 1997-04-15 Bbn Corporation Single tree method for grammar directed, very large vocabulary speech recognizer
JPH08123479A (en) * 1994-10-26 1996-05-17 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Continuous speech recognition device
JP3304665B2 (en) * 1995-02-17 2002-07-22 松下電器産業株式会社 Voice recognition device
JP4180110B2 (en) * 1995-03-07 2008-11-12 ブリティッシュ・テレコミュニケーションズ・パブリック・リミテッド・カンパニー Language recognition
US5832428A (en) * 1995-10-04 1998-11-03 Apple Computer, Inc. Search engine for phrase recognition based on prefix/body/suffix architecture
US5758024A (en) * 1996-06-25 1998-05-26 Microsoft Corporation Method and system for encoding pronunciation prefix trees
US5822730A (en) * 1996-08-22 1998-10-13 Dragon Systems, Inc. Lexical tree pre-filtering in speech recognition
KR100509797B1 (en) * 1998-04-29 2005-08-23 마쯔시다덴기산교 가부시키가이샤 Method and apparatus using decision trees to generate and score multiple pronunciations for a spelled word
WO1999059141A1 (en) * 1998-05-11 1999-11-18 Siemens Aktiengesellschaft Method and array for introducing temporal correlation in hidden markov models for speech recognition
JPH11344991A (en) * 1998-05-30 1999-12-14 Brother Ind Ltd Voice recognition device and storage medium

Also Published As

Publication number Publication date
WO2001048737A2 (en) 2001-07-05
WO2001048737A3 (en) 2002-11-14
AU1767600A (en) 2001-07-09
CN1406374A (en) 2003-03-26

Similar Documents

Publication Publication Date Title
US7120582B1 (en) Expanding an effective vocabulary of a speech recognition system
US9734823B2 (en) Method and system for efficient spoken term detection using confusion networks
US8311825B2 (en) Automatic speech recognition method and apparatus
US9292487B1 (en) Discriminative language model pruning
US7711561B2 (en) Speech recognition system and technique
US7831911B2 (en) Spell checking system including a phonetic speller
JP5214461B2 (en) Word clustering for input data
EP1484744A1 (en) Speech recognition language models
WO2003010754A1 (en) Speech input search system
KR20080069990A (en) Speech index pruning
EP0800158A1 (en) Word spotting
CN101271450B (en) Method and device for cutting language model
Chelba et al. Query language modeling for voice search
CN1201286C (en) Speech recognizer with a lexical tree based N-gram language model
KR20230156125A (en) Lookup table recursive language model
Wester et al. A comparison of data-derived and knowledge-based modeling of pronunciation variation
KR100480790B1 (en) Method and apparatus for continuous speech recognition using bi-directional n-gram language model
JP2000259645A (en) Speech processor and speech data retrieval device
Larson Sub-word-based language models for speech recognition: implications for spoken document retrieval
JP2938865B1 (en) Voice recognition device
Maskey et al. A phrase-level machine translation approach for disfluency detection using weighted finite state transducers
JP2000267693A (en) Voice processor and index preparation device
Shao et al. A fast fuzzy keyword spotting algorithm based on syllable confusion network
JP2005265967A (en) Recording medium where tree structure dictionary is recorded and language score table generating program for tree structure dictionary
Reichl Language model adaptation using minimum discrimination information.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20050511

Termination date: 20121223