
CN108417210A - Word embedding language model training method, word recognition method and system - Google Patents

Word embedding language model training method, word recognition method and system

Info

Publication number
CN108417210A
Authority
CN
China
Prior art keywords
word
vocabulary
parts
words
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810022130.3A
Other languages
Chinese (zh)
Other versions
CN108417210B (en)
Inventor
俞凯
陈瑞年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Shanghai Jiaotong University
Suzhou Speech Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University, Suzhou Speech Information Technology Co Ltd filed Critical Shanghai Jiaotong University
Priority to CN201810022130.3A priority Critical patent/CN108417210B/en
Publication of CN108417210A publication Critical patent/CN108417210A/en
Application granted granted Critical
Publication of CN108417210B publication Critical patent/CN108417210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0631 - Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a word embedding language model training method, including: determining the attributes of all words in a corpus to generate a vocabulary, the attributes including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes; generating word vectors for all words in the vocabulary; generating part-of-speech class vectors corresponding to the part-of-speech classes of all words in the vocabulary; and taking the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, training to obtain the word embedding language model. With the language model of the embodiment of the present invention, even an OOV word can be identified accurately through its morphological information and the lexical-level information of its part-of-speech class.

Description

Word embedding language model training method, word recognition method and system
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a word embedding language model training method, a word recognition method, and corresponding systems.
Background technology
Language models currently used in speech recognition systems mainly score the words and sentences produced by the recognizer; combined with the score of the acoustic model, the best recognition result is selected. Existing neural-network-based language models train slowly and require a fixed, known vocabulary. In a traditional language model, each word in the training vocabulary is represented by a one-hot vector. For example, if the vocabulary size is 10,000 (i.e., the vocabulary contains 10,000 words), each word is represented by a 10,000-dimensional vector whose only non-zero entry, equal to 1, corresponds to that word. This vector is fed to the neural network and multiplied with a word embedding matrix to obtain a real-valued vector, on which the language model is finally trained; word recognition likewise converts words into real-valued vectors before recognition.
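As a minimal sketch of this conventional scheme (matrix name, sizes, and word ids are illustrative assumptions, not taken from the patent), the one-hot multiplication reduces to selecting one column of the embedding matrix:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(embed_dim, vocab_size))  # word embedding matrix

def embed(word_id: int) -> np.ndarray:
    """One-hot vector times E: equivalent to picking column `word_id` of E."""
    one_hot = np.zeros(vocab_size)
    one_hot[word_id] = 1.0
    return E @ one_hot

x = embed(42)                      # real-valued vector fed to the neural network
assert np.allclose(x, E[:, 42])    # the lookup view of the same operation
```

This is exactly why an OOV word is unrepresentable in the conventional scheme: without a column in E, no real-valued vector exists for it.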
However, the inventors found during development of the present invention that the vocabulary used to train a language model often cannot cover all words. When a conventional language model encounters a word outside its vocabulary (an out-of-vocabulary, OOV, word, which frequently appears in practical applications), it cannot identify the word correctly and reliably: since the word to be identified is absent from the vocabulary, no vector corresponds to it, so no real-valued vector can be obtained by multiplying with the word embedding matrix. The only remedy is to add the word to the vocabulary and retrain a new language model with the new vocabulary.
The most common existing workaround for this problem of conventional language models is to use a special token <unk> to represent all words beyond the vocabulary. A fixed vocabulary is obtained first and a special <unk> symbol is added; all OOV words in the training set are then replaced with <unk> before training, and at inference time all OOV words are likewise replaced with <unk>.
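A hedged sketch of this <unk> workaround (the toy vocabulary and tokenization are assumptions for illustration only):

```python
def replace_oov(tokens: list[str], vocab: set[str], unk: str = "<unk>") -> list[str]:
    """Map every out-of-vocabulary token to the shared <unk> symbol."""
    return [t if t in vocab else unk for t in tokens]

vocab = {"the", "model", "speech"}
print(replace_oov(["the", "zeitgeist", "model"], vocab))
# ['the', '<unk>', 'model']
```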
The inventors further found that words beyond the vocabulary are usually rare words, so their training data is inherently scarce. The prior-art approach discards the linguistic information these OOV words carry, and consequently the recognition results for such words are highly inaccurate.
Summary of the invention
Embodiments of the present invention provide a method and system for adding new words to a neural network language model, to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a word embedding language model training method, including:
determining the attributes of all words in a corpus to generate a vocabulary, the attributes including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes;
generating word vectors for all words in the vocabulary;
generating part-of-speech class vectors corresponding to the part-of-speech classes of all words in the vocabulary;
taking the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, training to obtain the word embedding language model.
In a second aspect, an embodiment of the present invention further provides a word recognition method using the word embedding language model of the above embodiment of the present invention, the method including:
generating the word vector of a word to be identified;
determining the part-of-speech class vector of the part-of-speech class of the word to be identified;
inputting the word vector and the part-of-speech class vector of the word to be identified into the word embedding language model, to obtain the probability distribution of the part-of-speech class of the word to be identified and the probability distribution of the word to be identified within its part-of-speech class.
In a third aspect, an embodiment of the present invention further provides a word embedding language model training system, including:
a vocabulary generation program module, configured to determine the attributes of all words in a corpus to generate a vocabulary, the attributes including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes;
a word vector generation program module, configured to generate word vectors for all words in the vocabulary;
a part-of-speech class vector generation program module, configured to generate part-of-speech class vectors corresponding to the part-of-speech classes of all words in the vocabulary;
a model training program module, configured to take the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, and train to obtain the word embedding language model.
In a fourth aspect, an embodiment of the present invention further provides a word recognition system, including:
the word embedding language model;
a word vector generation program module, configured to generate the word vector of a word to be identified;
a vocabulary generation program module, configured to determine the part-of-speech class vector of the part-of-speech class of the word to be identified;
a word recognition program module, configured to input the word vector and the part-of-speech class vector of the word to be identified into the word embedding language model, to obtain the probability distribution of the part-of-speech class of the word to be identified and the probability distribution of the word to be identified within its part-of-speech class.
In a fifth aspect, an embodiment of the present invention provides a non-volatile computer-readable storage medium storing one or more programs including executable instructions, which can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any of the above word embedding language model training methods and/or word recognition methods of the present invention.
In a sixth aspect, an electronic device is provided, including: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform any of the above word embedding language model training methods and/or word recognition methods of the present invention.
In a seventh aspect, an embodiment of the present invention further provides a computer program product, including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform any of the above word embedding language model training methods and/or word recognition methods.
The advantageous effects of the embodiments of the present invention are as follows. When training the language model, the words in the corpus are not used directly; instead, the attributes of all words are determined first, including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes. The training thus takes into account both the morphological information of words and their syntactic class information. In particular, the introduction of syntactic class information exploits the commonalities of words belonging to the same part-of-speech class, so that in practical applications the trained language model can accurately identify even an OOV word through its morphological information and the syntactic-level information of its part-of-speech class.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Evidently, the drawings described below illustrate some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of the word embedding language model training method of the present invention;
Fig. 2 is a flowchart of another embodiment of the word embedding language model training method of the present invention;
Fig. 3 is a flowchart of an embodiment of the word recognition method of the present invention;
Fig. 4 is a flowchart of another embodiment of the word recognition method of the present invention;
Fig. 5 is a flowchart of yet another embodiment of the word recognition method of the present invention;
Fig. 6 is a structural diagram of an embodiment of the word embedding language model of the present invention;
Fig. 7 is a functional block diagram of an embodiment of the word embedding language model training system of the present invention;
Fig. 8 is a functional block diagram of another embodiment of the word embedding language model training system of the present invention;
Fig. 9 is a functional block diagram of an embodiment of the word recognition system of the present invention;
Fig. 10 is a functional block diagram of another embodiment of the word recognition system of the present invention;
Fig. 11 is a functional block diagram of yet another embodiment of the word recognition system of the present invention;
Fig. 12 is a structural diagram of an embodiment of the electronic device of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features therein may be combined with each other.
The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present invention may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
In the present invention, "module", "device", "system", and the like refer to related entities applied to a computer, such as hardware, a combination of hardware and software, software, or software in execution. In particular, an element may be, but is not limited to, a process running on a processor, a processor, an object, an executable element, a thread of execution, a program, and/or a computer. An application or script running on a server, or the server itself, may also be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers, and may be operated by various computer-readable media. Elements may also communicate through local and/or remote processes according to signals having one or more data packets, for example signals from data interacting with another element in a local system or a distributed system, and/or interacting with other systems through signals over a network such as the Internet.
Finally, it should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" cover not only the listed elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device including that element.
As shown in Fig. 1, an embodiment of the present invention provides a word embedding language model training method, including:
S11: determining the attributes of all words in a corpus to generate a vocabulary, the attributes including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes.
In this embodiment, all words in the corpus are stored in the vocabulary according to their part-of-speech classes. Part-of-speech classes may include nouns, adjectives, verbs, adverbs, and so on. The proportion of the words belonging to each part-of-speech class among all words in the corpus is determined by counting; this gives the probability distribution over the part-of-speech classes.
Then, for each part-of-speech class, the proportion of each word among all words of that class is further counted, which gives the probability distribution of the words within the class. A minimal counting sketch is given below.
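A sketch of the counting in step S11, assuming the corpus is already available as (word, POS) pairs (the tagging step itself is outside this snippet, and the tiny corpus is illustrative):

```python
from collections import Counter

tagged_corpus = [("dog", "NOUN"), ("runs", "VERB"), ("cat", "NOUN"), ("dog", "NOUN")]

pos_counts = Counter(pos for _, pos in tagged_corpus)
total = sum(pos_counts.values())
# P(class): probability distribution over all part-of-speech classes
p_class = {pos: n / total for pos, n in pos_counts.items()}

word_counts = Counter(tagged_corpus)
# P(word | class): distribution of the words within their part-of-speech class
p_word_given_class = {(w, pos): n / pos_counts[pos] for (w, pos), n in word_counts.items()}

print(p_class)                               # {'NOUN': 0.75, 'VERB': 0.25}
print(p_word_given_class[("dog", "NOUN")])   # 0.666... (2 of 3 nouns)
```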
S12: generating word vectors for all words in the vocabulary. The word vector captures the morphological information of the word.
S13: generating part-of-speech class vectors corresponding to the part-of-speech classes of all words in the vocabulary. Determining part-of-speech class vectors captures the syntactic class information (that is, semantic information) of the words belonging to each part-of-speech class, so that all words of one part-of-speech class share the same part-of-speech class vector.
S14: taking the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, training to obtain the word embedding language model.
The two vectors are simply concatenated as the input. At the output layer, a factorized softmax function is used to compute the probability of each word: the probability distribution over the part-of-speech classes is computed first, then the probability distribution of the words within each class; multiplying these two distributions gives the required word probability distribution.
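A minimal sketch of such a factorized output layer (PyTorch-style; the module and head names are assumptions, and the patent does not prescribe this code). `word_to_class` maps each word id to its single part-of-speech class id:

```python
import torch
import torch.nn as nn

class FactorizedOutput(nn.Module):
    """P(word) = P(class | h) * P(word | class, h): the two-stage factorization."""
    def __init__(self, hidden: int, n_classes: int, n_words: int):
        super().__init__()
        self.class_head = nn.Linear(hidden, n_classes)
        self.word_head = nn.Linear(hidden, n_words)

    def forward(self, h: torch.Tensor, word_to_class: torch.Tensor) -> torch.Tensor:
        p_class = torch.softmax(self.class_head(h), dim=-1)   # over POS classes
        logits = self.word_head(h)
        # softmax over words, renormalized within each word's own class
        p_word = torch.zeros_like(logits)
        for c in word_to_class.unique():
            idx = (word_to_class == c).nonzero(as_tuple=True)[0]
            p_word[..., idx] = torch.softmax(logits[..., idx], dim=-1)
        # multiply each word's in-class probability by its class probability
        return p_class.index_select(-1, word_to_class) * p_word
```

Each word's probability is thus the product of its class probability and its probability within that class, which is the multiplication described above.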
When training the language model in the embodiment of the present invention, the words in the corpus are not used directly; instead, the attributes of all words are determined first, including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes. The training takes into account both the morphological information of words and their syntactic class information. In particular, the introduction of syntactic class information exploits the semantic properties that words share: through parameter sharing, OOV words can also use the parameters of words in the vocabulary, and the commonalities of words belonging to the same part-of-speech class are taken into account. As a result, in practical applications the trained language model can accurately identify even an OOV word through its morphological information and the lexical-level information of its part-of-speech class.
In addition, because additional information (semantic classes and morphological decomposition) is introduced, the method can directly model unseen new words, and the modeling is more accurate. In conventional methods, modeling a new word requires collecting new data and retraining the model, which is very time-consuming; in actual use, the present model greatly reduces the time needed to add new words. The method can further be integrated into a speech recognition system to achieve fast vocabulary updates, while also improving the recognition accuracy of low-frequency words.
As shown in Fig. 2, in some embodiments, generating the word vectors of all words in the vocabulary includes:
S21: judging whether a word taken from the vocabulary is a low-frequency word;
S22: if so, decomposing the word taken from the vocabulary into characters, and encoding the resulting characters to determine the corresponding word vector;
S23: if not, extracting the vector of the word taken from the vocabulary as its word vector.
In the embodiment of the present invention, high-frequency and low-frequency words are processed differently (see the sketch after this paragraph). Each high-frequency word has its own independent word vector (for example, a one-hot vector may be used). A low-frequency word is first decomposed morphologically (for Chinese, into characters), and the character sequence is then converted into a fixed-length vector by a sequence-encoding method; common encoding methods include character-level Fixed-size Ordinally Forgetting Encoding (FOFE), directly summing the character vectors, and encoding with recurrent or convolutional neural networks. In the embodiment of the present invention, a word is not treated merely as a one-hot vector: it is classified at the syntactic level with a part-of-speech tag (i.e., a one-hot representation at the syntactic level), and at the morphological level it is encoded with FOFE.
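A hedged sketch of the high/low-frequency dispatch (the frequency threshold, dictionary shapes, and `encode_chars` helper are assumptions; the FOFE recursion itself is shown later alongside equation (7)):

```python
def word_representation(word: str, freq: dict, word_vec: dict,
                        encode_chars, threshold: int = 100):
    """High-frequency words keep their own independent vector; low-frequency
    words are decomposed into characters and encoded to a fixed-length vector."""
    if freq.get(word, 0) >= threshold:
        return word_vec[word]            # independent word vector
    return encode_chars(list(word))      # morphological decomposition + encoding
```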
High-frequency and low-frequency words are distinguished because not every word's meaning can be expressed well by its characters; the embodiment of the present invention thereby avoids the impact such cases would have on language model performance.
In addition, a conventional language model needs one group of parameters per word. Since in the method of the embodiment of the present invention all low-frequency words are decomposed into characters (morphological decomposition), parameters only need to be set for the high-frequency words and the characters. The effect is that the required number of parameters is greatly reduced (typically by about 80%), and the benefit of fewer parameters is that the resulting word embedding language model can be embedded in smaller devices (such as mobile phones).
The morphological decomposition of words could, in principle, also use phonemes; however, because some homophones differ greatly in meaning, phoneme decomposition does not work well, and the method of the embodiment of the present invention overcomes these problems.
In some embodiments, taking the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, training to obtain the word embedding language model includes:
inputting the word vector and the part-of-speech class vector of the word in the vocabulary into a long short-term memory (LSTM) network;
feeding the output of the LSTM network into a part-of-speech classifier to obtain the probability distribution of the part-of-speech class of the word;
feeding the output of the LSTM network into a word classifier to obtain the probability distribution of the word within its part-of-speech class.
The trained word embedding language model thus includes the LSTM network, the part-of-speech classifier, and the word classifier, as sketched below.
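Under the same assumptions as the earlier sketches (module names and dimensions are hypothetical), the three components compose as follows:

```python
import torch
import torch.nn as nn

class WordEmbeddingLM(nn.Module):
    """LSTM trunk with two heads: a POS-class classifier and a word classifier."""
    def __init__(self, in_dim: int, hidden: int, n_classes: int, n_words: int):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.pos_classifier = nn.Linear(hidden, n_classes)
        self.word_classifier = nn.Linear(hidden, n_words)

    def forward(self, word_vecs: torch.Tensor, pos_vecs: torch.Tensor):
        # in_dim must equal word_vecs.size(-1) + pos_vecs.size(-1)
        x = torch.cat([word_vecs, pos_vecs], dim=-1)   # concatenated input
        h, _ = self.lstm(x)
        return self.pos_classifier(h), self.word_classifier(h)
```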
As shown in Fig. 3, an embodiment of the present invention further provides a word recognition method using the word embedding language model described in the embodiments of the present invention, the method including:
S31: generating the word vector of a word to be identified;
S32: determining the part-of-speech class vector of the part-of-speech class of the word to be identified;
S33: inputting the word vector and the part-of-speech class vector of the word to be identified into the word embedding language model, to obtain the probability distribution of the part-of-speech class of the word to be identified and the probability distribution of the word to be identified within its part-of-speech class.
When the language model employed in the embodiment of the present invention is trained, the words in the corpus are not used directly; instead, the attributes of all words are determined first, including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes. The training takes into account both the morphological information of words and their syntactic class information. In particular, the introduction of syntactic class information exploits the semantic properties that words share: through parameter sharing, OOV words can also use the parameters of words in the vocabulary, and the commonalities of words belonging to the same part-of-speech class are taken into account. As a result, in practical applications the trained language model can accurately identify even an OOV word through its morphological information and the lexical-level information of its part-of-speech class.
In addition, because additional information (semantic classes and morphological decomposition) is introduced, the method can directly model unseen new words, and the modeling is more accurate. In conventional methods, modeling a new word requires collecting new data and retraining the model, which is very time-consuming; in actual use, the present model greatly reduces the time needed to add new words. The method can further be integrated into a speech recognition system to achieve fast vocabulary updates, while also improving the recognition accuracy of low-frequency words.
As shown in Fig. 4, in some embodiments, when the word to be identified belongs to the vocabulary used for training the word embedding language model, generating the word vector of the word to be identified includes:
S41: judging whether the word to be identified is a low-frequency word;
S42: if so, decomposing the word to be identified into characters, and encoding the resulting characters to determine the corresponding word vector;
S43: if not, extracting the vector of the word to be identified as its word vector.
In the embodiment of the present invention, high-frequency and low-frequency words are processed differently. Each high-frequency word has its own independent word vector (for example, a one-hot vector may be used). A low-frequency word is first decomposed morphologically (for Chinese, into characters), and the character sequence is then converted into a fixed-length vector by a sequence-encoding method; common encoding methods include character-level Fixed-size Ordinally Forgetting Encoding (FOFE), directly summing the character vectors, and encoding with recurrent or convolutional neural networks. In the embodiment of the present invention, a word is not treated merely as a one-hot vector: it is classified at the syntactic level with a part-of-speech tag (i.e., a one-hot representation at the syntactic level), and at the morphological level it is encoded with FOFE.
High-frequency and low-frequency words are distinguished because not every word's meaning can be expressed well by its characters; the embodiment of the present invention thereby avoids the impact such cases would have on language model performance.
In addition, a conventional language model needs one group of parameters per word. Since in the method of the embodiment of the present invention all low-frequency words are decomposed into characters (morphological decomposition), parameters only need to be set for the high-frequency words and the characters. The effect is that the required number of parameters is greatly reduced (typically by about 80%), and the benefit of fewer parameters is that the resulting word embedding language model can be embedded in smaller devices (such as mobile phones).
As shown in Fig. 5, in some embodiments, when the word to be identified does not belong to the vocabulary used for training the word embedding language model, generating the word vector of the word to be identified includes:
S51: determining the attributes of the word to be identified to update the vocabulary;
S52: decomposing the word to be identified into characters, and encoding the resulting characters to determine the corresponding word vector.
This embodiment realizes fast addition of new words to the word embedding language model. The method can further be integrated into a speech recognition system to achieve fast vocabulary updates, while also improving the recognition accuracy of low-frequency words.
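A hedged sketch of this new-word path (S51 and S52): the vocabulary entry is built from the word's part-of-speech class, and its vector comes purely from its characters, so no retraining is needed. The function names, entry layout, and `encode_chars` helper are assumptions:

```python
def add_new_word(word: str, pos_tag: str, vocab: dict, encode_chars) -> None:
    """Register an OOV word: store its part-of-speech class and a
    character-derived vector; existing model weights stay untouched."""
    vocab[word] = {
        "pos": pos_tag,                     # updates the vocabulary (S51)
        "vector": encode_chars(list(word)), # decomposition + encoding (S52)
    }
```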
The embodiment of the present invention is further described below by comparing the handling of OOV words in traditional LSTM (Long Short-Term Memory) language models with the technical solution of the embodiment of the present invention.
The LSTM language model:
Deep learning methods are widely used in language modeling and have achieved great success. The long short-term memory (LSTM) network is a recurrent neural network (RNN) architecture particularly suited to sequences. Let V be the vocabulary. At each timestep t, the input word $w_t$ is represented by a one-hot vector $e_t$, from which the word embedding $x_t$ is obtained as:

$$x_t = E^i e_t \quad (1)$$

where $E^i \in \mathbb{R}^{m \times |V|}$ is the input word embedding matrix and m is the dimension of the input word embedding. One LSTM step takes $x_t$, $h_{t-1}$, $c_{t-1}$ as input and produces $h_t$, $c_t$ (the computational details are omitted here). The probability distribution of the next word is computed at the output layer by an affine transformation of the hidden layer followed by a softmax:

$$P(w_{t+1}=j \mid w_{1:t}) = \frac{\exp({E^o_j}^\top h_t + b_j)}{\sum_{k=1}^{|V|} \exp({E^o_k}^\top h_t + b_k)} \quad (2)$$

where $E^o_j$ is the j-th column of $E^o \in \mathbb{R}^{m \times |V|}$, also called the output embedding, and $b_j$ is a bias term. It has been found that the bias terms of the output layer play an important role and are highly correlated with word frequency.
Since most of the computational cost lies in the output layer, a factorized softmax output layer has been proposed to speed up the language model. The method is based on the assumption that words can be mapped to classes. Let S be the set of classes. Unlike equation (2), the factorized output layer computes the probability distribution of the next word as:

$$P(w_{t+1}=j \mid w_{1:t}) = P(s_{t+1}=s_j \mid h_t)\, P(w_{t+1}=j \mid s_j, h_t) \quad (3)$$

where $s_j$ denotes the class of word $w_{t+1}$ and $V_{s_j}$ is the set of all words belonging to class $s_j$. Here the probability of a word is computed in two stages: the probability distribution over the classes is estimated first, and the probability of the specific word is then computed within the predicted class. A real word may belong to multiple classes, but here each word is mapped to a single class, i.e., all classes are mutually exclusive. Commonly used classes are frequency-based classes or classes obtained by data-driven methods.
Handling of OOV words:
As mentioned above, two methods are used in classical LSTM language models to handle the OOV word problem:
1. A special <UNK> token replaces all OOV words, and an alternative metric known as adjusted perplexity, equation (6), is used:
where $V_{OOV}$ is the set of all OOV words. We refer to this method as "unk" in the experiments.
2. The model is retrained with the updated vocabulary. Since OOV words have no or few positive examples in the training set, they are assigned very small probabilities after training. This is similar to the smoothing methods used in n-gram language models. We refer to this method as "retrain" in the experiments.
Both conventional methods have drawbacks. In the unk LSTM language model, the frequency of OOV words mismatches between training data and test data, so the probabilities of OOV words are misjudged; in addition, the method ignores the linguistic information of the OOV words. The main problem of the retrained LSTM language model is that retraining is time-consuming.
In a traditional LSTM language model, the word embedding of each word is independent, which causes two problems: first, new words cannot reuse trained word embeddings; second, rare words lack training data. The motivation of the structured word embedding is to solve both problems through parameter sharing. Unlike data-driven methods, parameter sharing must be based on explicit rules. By using syntactic and morphological rules, the shared parameters of OOV words can easily be found, and structured word embeddings can be built in our model.
Morphological and syntactic structured embeddings:
At the syntactic level, each word is assigned to a part-of-speech (POS) class. All words in the same POS (part-of-speech) class share the same POS class embedding, called the syntax embedding. A part of speech is a category of words with similar grammatical features; we therefore assume that the syntax embedding represents the basic syntactic function of a word.
For each word, several example sentences are tagged with POS labels and the most common label is selected as its part of speech (POS labels may also be obtained from a dictionary). For in-vocabulary (IV) words, the example sentences are selected from the training set. For OOV words, example sentences may be composed or selected from other data sources, such as web data. Unlike data-driven methods, the POS-label-based syntax embedding can easily be generated for OOV words by rule.
Character (or sub-word) representations are widely used in many NLP (Natural Language Processing) tasks as complementary features for improving the performance on low-frequency words, especially in morphologically rich languages; for high-frequency words, however, the improvement is limited. Here, to further capture the semantics of low-frequency words, a morphology embedding is built, based on the assumption that data sparsity is less severe at the character level. For high-frequency words, the word embedding is retained. The mixed embeddings, i.e., the morphology embeddings of low-frequency words and the word embeddings of high-frequency words, must therefore have the same dimension.
In previous literature, word embeddings were combined with sub-word-level features to obtain enhanced embeddings for all words. In contrast, the morphology embedding of low-frequency words proposed here depends only on character-level features; it is therefore able to model OOV words.
The proposed morphology embedding encodes character information with character-level Fixed-size Ordinally Forgetting Encoding (FOFE). In our model, every low-frequency word is represented by a character sequence $e_{1:T}$, where $e_t$ is the one-hot representation of the character at timestep t. FOFE encodes the entire sequence with a simple recursion ($z_0 = 0$):

$$z_t = \alpha z_{t-1} + e_t \quad (1 \le t \le T) \quad (7)$$

where $0 < \alpha < 1$ is a constant forgetting factor controlling how strongly history influences the final timestep. A feedforward neural network (FNN) is then used to convert the character-level FOFE code into the final morphology embedding.
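A direct sketch of recursion (7) over one-hot character vectors (the character inventory size and ids are illustrative; the follow-up FNN is omitted):

```python
import numpy as np

def fofe(char_ids: list[int], n_chars: int, alpha: float = 0.7) -> np.ndarray:
    """z_t = alpha * z_{t-1} + e_t, with z_0 = 0 (equation (7))."""
    z = np.zeros(n_chars)
    for c in char_ids:
        e = np.zeros(n_chars)
        e[c] = 1.0
        z = alpha * z + e
    return z  # fixed-size code, later mapped by an FNN to the morphology embedding

code = fofe([3, 1, 3], n_chars=5)
# contributions decay with distance from the end: code[3] = 0.7**2 + 1, code[1] = 0.7
```

Because earlier characters are discounted by powers of alpha, the code is unique for the sequence yet always has the same fixed size, regardless of word length.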
Combining the structured embedding with the LSTM language model:
Fig. 6 shows the structural diagram of the word embedding language model in one embodiment of the present invention.
At the input layer, the structured embedding of the input word is obtained by concatenating its syntax embedding with its word embedding (for high-frequency words) or its morphology embedding (for low-frequency words).
The factorized softmax structure is readily used at the output layer. The output class embedding matrix $E^c$ in formula (4) is replaced by the syntax embeddings, and the output embedding matrix $E^o$ in formula (5) is replaced by the word and morphology embeddings.
Once training is complete, the syntax and morphology embeddings of OOV words can be obtained at any time. To compute the probabilities of OOV words, the output layer parameters in formula (5), $E^o$ and b, must be rebuilt: all embeddings in $E^o$ and bias terms in b are retained for IV words, and the embeddings of OOV words in $E^o$ are filled with their morphology embeddings. In experiments it was found that the bias term is highly correlated with word frequency, meaning that the higher the word frequency, the larger the bias; here, the bias term of an OOV word is set to a small empirical constant. A sketch of this rebuild follows.
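A hedged sketch of the output-layer rebuild, assuming $E^o$ and b have already been enlarged to the expanded vocabulary size (function name, array layout, and the bias value are assumptions; the patent only says the constant is small and empirical):

```python
import numpy as np

def rebuild_output_layer(E_o: np.ndarray, b: np.ndarray,
                         oov_ids, morph_embs, oov_bias: float = -2.0):
    """Patch E^o and b for OOV words without retraining: IV rows keep their
    trained values, OOV rows are filled from morphology embeddings."""
    E_new, b_new = E_o.copy(), b.copy()
    for i, emb in zip(oov_ids, morph_embs):
        E_new[i] = emb          # fill the OOV embedding from its morphology embedding
        b_new[i] = oov_bias     # small empirical constant bias for OOV words
    return E_new, b_new
```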
By using the structured word embedding, OOV words can be incorporated into the LSTM language model without retraining. As described above, sharing parameters in the proposed model during training also mitigates the data sparsity of OOV words.
The structured word embedding language model proposed in the embodiment of the present invention also achieves parameter compression. In an LSTM language model, the word embeddings of low-frequency words occupy a large fraction of the model parameters yet are undertrained; by replacing low-frequency words with character representations, the number of parameters can be greatly reduced.
Let V be the set of all words and H the hidden layer size. In an LSTM language model, the number of word embedding parameters is $2 \times |V| \times H$, whereas in the structured-embedding LSTM language model the total is $(|V_h| + |V_{char}| + |S|) \times H$, where $V_h$ denotes the high-frequency words, $V_{char}$ the character set, and S the POS tag set. Experiments show that with $|V| = 60000$, $|V_h| = 8000$, $|V_{char}| = 5000$, and $|S| = 32$, nearly 90% of the parameters can be eliminated, as the short calculation below confirms.
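Plugging the quoted sizes into the two counts confirms the stated reduction; H cancels out. (The structured count is read here as $(|V_h| + |V_{char}| + |S|) \times H$, which is the reading consistent with the quoted "nearly 90%"; the source text is garbled at this point.)

```python
V, Vh, Vchar, S = 60_000, 8_000, 5_000, 32
baseline = 2 * V               # 2 x |V| x H, per hidden unit
structured = Vh + Vchar + S    # (|Vh| + |Vchar| + |S|) x H, per hidden unit
print(1 - structured / baseline)   # ~0.891, i.e. nearly 90% fewer parameters
```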
To verify the effectiveness of the method and system of the present invention, the inventors evaluated the word embedding language model of the embodiment of the present invention (hereinafter the structured-embedding LSTM language model) on a Chinese short message service (SMS) dataset.
Table 1: Dataset information
Table 1 gives the details of the datasets. Two vocabularies of different sizes are used for each dataset. The full vocabulary $V_f$ covers all words appearing in the corpus; the small vocabulary $V_s$ is a subset of $V_f$. Here, in-vocabulary (IV) is defined as the words in $V_s$, and out-of-vocabulary (OOV) means the words in $V_f$ but not in $V_s$. The sms-30m dataset is also used as the training set for a rescoring task in ASR (automatic speech recognition) on a spontaneous conversational Mandarin test set (about 25 hours, 3K utterances).
1. Train the LSTM language model with the small vocabulary $V_s$, treating all OOV words as a single <UNK> symbol; this is referred to as "unk".
2. Retrain the LSTM language model with the full vocabulary $V_f$; this is referred to as "retrain".
For the structured word embedding LSTM language model, the small vocabulary $V_s$ is used at the training stage, and the model vocabulary is updated to $V_f$ at the testing stage.
To keep the model sizes comparable with the proposed model, the input embedding size of the LSTM baseline is set to 600 and the output embedding size to 300. In the structured-embedding LSTM language model, the syntax embedding size is set to 300. FOFE coding uses a one-layer 5000-300 FNN, where 5000 is the size of the character set $V_c$ and 300 is the dimension of the morphology embedding. The α of FOFE is set to 0.7 and the bias term of new words to 0; these two empirical parameters are fine-tuned on the development set. The 8192 most frequent words are chosen as high-frequency words, and all other words are treated as low-frequency words in our model. All models are trained by stochastic gradient descent (SGD) with the same hyperparameters.
Perplexity evaluation
The perplexity (PPL) results are shown in Table 2. In particular, for the "unk" LSTM the PPL of OOV words is computed by equation (6) instead. The results show that the proposed structured embedding (SE) method performs similarly to the unk LSTM, whereas the retrained LSTM performs worse. For further investigation, the PPL of in-vocabulary (IV) and out-of-vocabulary (OOV) words is computed separately for each model; the results are shown in Table 3. The unk LSTM performs best on IV words at the cost of OOV words, whose PPL is very high. Relative to the unk LSTM, the retrained LSTM greatly improves the PPL of OOV words while degrading on IV words. Our method further improves the PPL on OOV words while performing similarly on IV words.
Table 2: Perplexity comparison of different OOV handling methods
Table 3: Perplexity breakdown for in-vocabulary and out-of-vocabulary words
Fast vocabulary update in ASR
In an automatic speech recognition (ASR) system, a back-off n-gram model is used as the language model for generating lattices, from which n-best lists are produced. The n-best lists are then rescored with a neural network language model to obtain better performance. In general, the n-gram and neural network language models share the same vocabulary; therefore, when the vocabulary is updated, both must be retrained. Compared with the neural network language model, the training time of the n-gram language model is negligible.
This experiment has two stages. In the first stage, the unk LSTM language model and the structured-embedding (SE, Structure Embedding) LSTM language model are trained with the small vocabulary. An n-gram language model trained with $V_s$ is used to generate n-best lists, which are then rescored with the unk LSTM. In the second stage, the vocabulary $V_s$ is expanded to the larger vocabulary $V_f$. Since the vocabulary changes, the unk LSTM and the n-gram model must be retrained, whereas the vocabulary of the LSTM with SE is rebuilt without retraining. The retrained LSTM and the LSTM with SE are then used to rescore the n-best lists generated by the new n-gram model.
Table 4: Character error rate comparison and breakdown for in-vocabulary and out-of-vocabulary sentences
The experimental results are shown in Table 4. Benefiting from the vocabulary expansion, the retrained LSTM achieves an absolute CER improvement of 0.38% over all sentences, and the proposed structured-embedding LSTM model (LSTM with SE) achieves the best performance. To study which sentences gain the most from the proposed model, the rescored sentences are split into two classes according to whether all of their words appear in $V_s$: in-vocabulary sentences (IVS) and out-of-vocabulary sentences (OOVS). As shown in Table 4, the unk LSTM trained with $V_s$ has a higher CER on OOV sentences, because the n-gram model built with $V_s$ cannot generate these OOV words. By expanding the vocabulary, the retrained LSTM obtains a significant CER improvement on OOV sentences. Compared with the retrained LSTM, the proposed model achieves better CER on both IV and OOV sentences. Moreover, the CER improvement on OOV sentences (1.13% absolute) is significantly larger than that on IV sentences (0.13% absolute), which means the LSTM with SE models OOV words better. Note that by using the proposed structured word embedding LSTM language model, the model retraining time of conventional methods is saved while better performance is obtained.
It should be noted that, for each of the foregoing method embodiments, for simplicity of description they are expressed as a series of combined actions; however, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
As shown in Fig. 7, an embodiment of the present invention further provides a word embedding language model training system 700, including:
a vocabulary generation program module 710, configured to determine the attributes of all words in a corpus to generate a vocabulary, the attributes including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes;
a word vector generation program module 720, configured to generate word vectors for all words in the vocabulary;
a part-of-speech class vector generation program module 730, configured to generate part-of-speech class vectors corresponding to the part-of-speech classes of all words in the vocabulary;
a model training program module 740, configured to take the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, and train to obtain the word embedding language model.
As shown in Fig. 8, in some embodiments the word vector generation program module 720 includes:
a frequency determination program unit 721, configured to judge whether a word taken from the vocabulary is a low-frequency word;
a first word vector generation program unit 722, configured to, when the word taken from the vocabulary is judged to be a low-frequency word, decompose the word into characters and encode the resulting characters to determine the corresponding word vector;
a second word vector generation program unit 723, configured to, when the word taken from the vocabulary is judged to be a high-frequency word, extract the vector of the word as its word vector.
As shown in Fig. 9, an embodiment of the present invention further provides a word recognition system 900, including:
the word embedding language model 910 described in the above embodiment of the present invention;
a word vector generation program module 920, configured to generate the word vector of a word to be identified;
a vocabulary generation program module 930, configured to determine the part-of-speech class vector of the part-of-speech class of the word to be identified;
a word recognition program module 940, configured to input the word vector and the part-of-speech class vector of the word to be identified into the word embedding language model, to obtain the probability distribution of the part-of-speech class of the word to be identified and the probability distribution of the word to be identified within its part-of-speech class.
As shown in Fig. 10, in some embodiments, when the word to be identified belongs to the vocabulary used for training the word embedding language model, the word vector generation program module 920 includes:
a frequency determination program unit 921, configured to judge whether the word to be identified is a low-frequency word;
a first word vector generation program unit 922, configured to, when the word to be identified is judged to be a low-frequency word, decompose the word to be identified into characters and encode the resulting characters to determine the corresponding word vector;
a second word vector generation program unit 923, configured to, when the word to be identified is judged to be a high-frequency word, extract the vector of the word to be identified as its word vector.
As shown in Fig. 11, in some embodiments, when the word to be identified does not belong to the vocabulary used for training the word embedding language model, the word vector generation program module 920 includes:
a vocabulary update program unit 921', configured to determine the attributes of the word to be identified to update the vocabulary;
a word vector generation program unit 922', configured to decompose the word to be identified into characters and encode the resulting characters to determine the corresponding word vector.
In some embodiments, an embodiment of the present invention provides a non-volatile computer-readable storage medium storing one or more programs including executable instructions, which can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any of the above word embedding language model training methods and/or word recognition methods of the present invention.
In some embodiments, an embodiment of the present invention further provides a computer program product, including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform any of the above word embedding language model training methods and/or word recognition methods.
In some embodiments, an embodiment of the present invention further provides an electronic device, including: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of the word embedding language model training method and/or the word recognition method.
In some embodiments, an embodiment of the present invention further provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, performs the steps of the word embedding language model training method and/or the word recognition method.
The system and/or input method system of the realization structure language model of the embodiments of the present invention can be used for executing this hair The method and/or input method of the realization structure language model of bright embodiment, and reach the reality of the embodiments of the present invention accordingly The technique effect that now method and/or input method of structure language model are reached, which is not described herein again.
Correlation function mould can be realized in the embodiment of the present invention by hardware processor (hardware processor) Block.
Figure 12 is a hardware architecture diagram of an electronic device for performing the word-embedding language model training method and/or the word recognition method provided by another embodiment of the present application. As shown in Figure 12, the device includes:
one or more processors 1210 and a memory 1220; in Figure 12, one processor 1210 is taken as an example.
The device for performing the word-embedding language model training method and/or the word recognition method may further include an input device 1230 and an output device 1240.
The processor 1210, the memory 1220, the input device 1230, and the output device 1240 may be connected by a bus or in other ways; connection by a bus is taken as an example in Figure 12.
As a non-volatile computer-readable storage medium, the memory 1220 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the word-embedding language model training method and/or the word recognition method in the embodiments of the present application. By running the non-volatile software programs, instructions, and modules stored in the memory 1220, the processor 1210 executes the various functional applications and data processing of the server, that is, implements the word-embedding language model training method and/or the word recognition method of the above method embodiments.
The memory 1220 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the word-embedding language model training apparatus and/or the word recognition apparatus, and the like. In addition, the memory 1220 may include a high-speed random access memory, and may further include a non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some embodiments, the memory 1220 optionally includes memories remotely located relative to the processor 1210, and these remote memories may be connected to the word-embedding language model training apparatus and/or the word recognition apparatus through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 1230 may receive input numeric or character information and generate signals related to user settings and function control of the word-embedding language model training apparatus and/or the word recognition apparatus. The output device 1240 may include a display device such as a display screen.
The one or more modules are stored in the memory 1220 and, when executed by the one or more processors 1210, perform the word-embedding language model training method and/or the word recognition method of any of the above method embodiments.
The above product can perform the methods provided by the embodiments of the present application, and has functional modules and beneficial effects corresponding to the performed methods. For technical details not described in detail in this embodiment, reference may be made to the methods provided by the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication functions and take voice and data communication as their main goal. This type of terminal includes smartphones (such as the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. This type of terminal includes PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices: such devices can display and play multimedia content. This type of device includes audio and video players (such as the iPod), handheld devices, e-book readers, intelligent toys, and portable in-vehicle navigation devices.
(4) Servers: devices that provide computing services. A server is composed of a processor, a hard disk, a memory, a system bus, and the like; its architecture is similar to that of a general-purpose computer, but because highly reliable services must be provided, the requirements on processing capability, stability, reliability, security, scalability, manageability, and the like are higher.
(5) Other electronic devices with data interaction functions.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the essence of the above technical solutions, or the part that contributes to the related art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (12)

1. A word-embedding language model training method, comprising:
determining attributes of all words in a corpus to generate a vocabulary, wherein the attributes include the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of each word within its part-of-speech class;
generating word vectors of all the words in the vocabulary;
generating part-of-speech class vectors corresponding to the part-of-speech classes of all the words in the vocabulary;
training with the word vectors of the words in the vocabulary and the part-of-speech class vectors of the words in the vocabulary as input, and with the probability distribution over the part-of-speech classes of the words in the vocabulary and the probability distribution of each word within its part-of-speech class as output, to obtain the word-embedding language model.
2. The method according to claim 1, wherein generating the word vectors of all the words in the vocabulary comprises:
judging whether a word obtained from the vocabulary is a low-frequency word;
if so, decomposing the word obtained from the vocabulary into characters, and encoding the resulting characters to determine the corresponding word vector;
if not, extracting the vector of the word obtained from the vocabulary as the word vector.
3. A word recognition method, wherein the method uses the word-embedding language model according to claim 1 or 2, the method comprising:
generating a word vector of a word to be recognized;
determining a part-of-speech class vector of the part-of-speech class of the word to be recognized;
inputting the word vector and the part-of-speech class vector of the word to be recognized into the word-embedding language model to obtain the probability distribution over the part-of-speech class to which the word to be recognized belongs and the probability distribution of the word to be recognized within its part-of-speech class.
4. The method according to claim 3, wherein, when the word to be recognized belongs to the vocabulary used for training the word-embedding language model, generating the word vector of the word to be recognized comprises:
judging whether the word to be recognized is a low-frequency word;
if so, decomposing the word to be recognized into characters, and encoding the resulting characters to determine the corresponding word vector;
if not, extracting the vector of the word to be recognized as the word vector.
5. The method according to claim 3, wherein, when the word to be recognized does not belong to the vocabulary used for training the word-embedding language model, generating the word vector of the word to be recognized comprises:
determining the attributes of the word to be recognized to update the vocabulary;
decomposing the word to be recognized into characters, and encoding the resulting characters to determine the corresponding word vector.
6. A word-embedding language model training system, comprising:
a vocabulary generation program module, configured to determine attributes of all words in a corpus to generate a vocabulary, wherein the attributes include the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of each word within its part-of-speech class;
a word vector generation program module, configured to generate word vectors of all the words in the vocabulary;
a part-of-speech class vector generation program module, configured to generate part-of-speech class vectors corresponding to the part-of-speech classes of all the words in the vocabulary;
a model training program module, configured to train with the word vectors of the words in the vocabulary and the part-of-speech class vectors of the words in the vocabulary as input, and with the probability distribution over the part-of-speech classes of the words in the vocabulary and the probability distribution of each word within its part-of-speech class as output, to obtain the word-embedding language model.
7. The system according to claim 6, wherein the word vector generation program module comprises:
a frequency determination program unit, configured to judge whether a word obtained from the vocabulary is a low-frequency word;
a first word vector generation program unit, configured to, when the word obtained from the vocabulary is judged to be a low-frequency word, decompose the word into characters and encode the resulting characters to determine the corresponding word vector;
a second word vector generation program unit, configured to, when the word obtained from the vocabulary is judged to be a high-frequency word, extract the vector of the word as the word vector.
8. A word recognition system, comprising:
the word-embedding language model according to claim 6 or 7;
a word vector generation program module, configured to generate a word vector of a word to be recognized;
a vocabulary generation program module, configured to determine a part-of-speech class vector of the part-of-speech class of the word to be recognized;
a word recognition program module, configured to input the word vector and the part-of-speech class vector of the word to be recognized into the word-embedding language model to obtain the probability distribution over the part-of-speech class to which the word to be recognized belongs and the probability distribution of the word to be recognized within its part-of-speech class.
9. The system according to claim 8, wherein, when the word to be recognized belongs to the vocabulary used for training the word-embedding language model, the word vector generation program module comprises:
a frequency determination program unit, configured to judge whether the word to be recognized is a low-frequency word;
a first word vector generation program unit, configured to, when the word is judged to be a low-frequency word, decompose the word to be recognized into characters and encode the resulting characters to determine the corresponding word vector;
a second word vector generation program unit, configured to, when the word is judged to be a high-frequency word, extract the vector of the word to be recognized as the word vector.
10. The system according to claim 8, wherein, when the word to be recognized does not belong to the vocabulary used for training the word-embedding language model, the word vector generation program module comprises:
a vocabulary update program unit, configured to determine the attributes of the word to be recognized to update the vocabulary;
a word vector generation program unit, configured to decompose the word to be recognized into characters and encode the resulting characters to determine the corresponding word vector.
11. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor can perform the steps of the method according to any one of claims 1-5.
12. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-5.
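As an illustrative aid (not part of the claims), the vocabulary construction in claim 1 can be sketched as follows; the `pos_tag` function and the relative-frequency probability estimates are assumptions for exposition, since the claim does not fix how the attributes are obtained.

```python
from collections import Counter, defaultdict

def build_vocabulary(corpus_tokens, pos_tag):
    """Store each word with its part-of-speech class, the distribution over
    classes, and the distribution of words within each class."""
    word_class = {w: pos_tag(w) for w in set(corpus_tokens)}
    class_counts = Counter(word_class[w] for w in corpus_tokens)
    word_counts = Counter(corpus_tokens)

    total = sum(class_counts.values())
    p_class = {c: n / total for c, n in class_counts.items()}      # P(class)
    p_word_in_class = defaultdict(dict)
    for w, n in word_counts.items():
        c = word_class[w]
        p_word_in_class[c][w] = n / class_counts[c]                # P(word | class)
    return word_class, p_class, p_word_in_class
```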
CN201810022130.3A 2018-01-10 2018-01-10 Word embedding language model training method, word recognition method and system Active CN108417210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810022130.3A CN108417210B (en) 2018-01-10 2018-01-10 Word embedding language model training method, word recognition method and system

Publications (2)

Publication Number Publication Date
CN108417210A true CN108417210A (en) 2018-08-17
CN108417210B CN108417210B (en) 2020-06-26

Family

ID=63125464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810022130.3A Active CN108417210B (en) 2018-01-10 2018-01-10 Word embedding language model training method, word recognition method and system

Country Status (1)

Country Link
CN (1) CN108417210B (en)

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197083A (en) * 2006-12-06 2008-06-11 英业达股份有限公司 Method and apparatus for learning English vocabulary and computer readable memory medium
EP2624149A2 (en) * 2012-02-02 2013-08-07 Xerox Corporation Document processing employing probabilistic topic modeling of documents represented as text words transformed to a continuous space
US8515745B1 (en) * 2012-06-20 2013-08-20 Google Inc. Selecting speech data for speech recognition vocabulary
US9779722B2 (en) * 2013-11-05 2017-10-03 GM Global Technology Operations LLC System for adapting speech recognition vocabulary
CN103810999A (en) * 2014-02-27 2014-05-21 清华大学 Linguistic model training method and system based on distributed neural networks
CN105654946A (en) * 2014-12-02 2016-06-08 三星电子株式会社 Method and apparatus for speech recognition
CN104598441A (en) * 2014-12-25 2015-05-06 上海科阅信息技术有限公司 Method for splitting Chinese sentences through computer
CN105244029A (en) * 2015-08-28 2016-01-13 科大讯飞股份有限公司 Voice recognition post-processing method and system
US9799327B1 (en) * 2016-02-26 2017-10-24 Google Inc. Speech recognition with attention-based recurrent neural networks
CN107221325A (en) * 2016-03-22 2017-09-29 华硕电脑股份有限公司 Aeoplotropism keyword verification method and the electronic installation using this method
CN106126596A (en) * 2016-06-20 2016-11-16 中国科学院自动化研究所 A kind of answering method based on stratification memory network
CN106383877A (en) * 2016-09-12 2017-02-08 电子科技大学 On-line short text clustering and topic detection method of social media
CN106503146A (en) * 2016-10-21 2017-03-15 江苏理工学院 Computer text feature selection method, classification feature selection method and system
CN106782560A (en) * 2017-03-06 2017-05-31 海信集团有限公司 Determine the method and device of target identification text
CN107180026A (en) * 2017-05-02 2017-09-19 苏州大学 The event phrase learning method and device of a kind of word-based embedded Semantic mapping
CN107291795A (en) * 2017-05-03 2017-10-24 华南理工大学 A kind of dynamic word insertion of combination and the file classification method of part-of-speech tagging
CN107391575A (en) * 2017-06-20 2017-11-24 浙江理工大学 A kind of implicit features recognition methods of word-based vector model
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN107544958A (en) * 2017-07-12 2018-01-05 清华大学 Terminology extraction method and apparatus
CN107357789A (en) * 2017-07-14 2017-11-17 哈尔滨工业大学 Merge the neural machine translation method of multi-lingual coding information
CN107562715A (en) * 2017-07-18 2018-01-09 阿里巴巴集团控股有限公司 Term vector processing method, device and electronic equipment
CN107491531A (en) * 2017-08-18 2017-12-19 华南师范大学 Chinese network comment sensibility classification method based on integrated study framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TOMAS MIKOLOV et al.: "Efficient Estimation of Word Representations in Vector Space", arXiv preprint *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110895935A (en) * 2018-09-13 2020-03-20 阿里巴巴集团控股有限公司 Speech recognition method, system, device and medium
CN110895935B (en) * 2018-09-13 2023-10-27 阿里巴巴集团控股有限公司 Speech recognition method, system, equipment and medium
CN109346064B (en) * 2018-12-13 2021-07-27 思必驰科技股份有限公司 Training method and system for end-to-end speech recognition model
CN109346064A (en) * 2018-12-13 2019-02-15 苏州思必驰信息科技有限公司 Training method and system for end-to-end speech identification model
CN111324731A (en) * 2018-12-13 2020-06-23 百度(美国)有限责任公司 Computer implemented method for embedding words in corpus
CN111324731B (en) * 2018-12-13 2023-10-17 百度(美国)有限责任公司 Computer-implemented method for embedding words of corpus
CN109902309A (en) * 2018-12-17 2019-06-18 北京百度网讯科技有限公司 Interpretation method, device, equipment and storage medium
CN113330510A (en) * 2019-02-05 2021-08-31 国际商业机器公司 Out-of-vocabulary word recognition in direct acoustic-to-word speech recognition using acoustic word embedding
CN110196975A (en) * 2019-02-27 2019-09-03 北京金山数字娱乐科技有限公司 Problem generation method, device, equipment, computer equipment and storage medium
CN111783431B (en) * 2019-04-02 2024-05-24 北京地平线机器人技术研发有限公司 Method and device for training predicted word occurrence probability and language model by using language model
CN111783431A (en) * 2019-04-02 2020-10-16 北京地平线机器人技术研发有限公司 Method and device for predicting word occurrence probability by using language model and training language model
CN111797631A (en) * 2019-04-04 2020-10-20 北京猎户星空科技有限公司 Information processing method and device and electronic equipment
CN113646834A (en) * 2019-04-08 2021-11-12 微软技术许可有限责任公司 Automatic speech recognition confidence classifier
CN110010129A (en) * 2019-04-09 2019-07-12 山东师范大学 A kind of voice interactive system based on hexapod robot
CN110852112A (en) * 2019-11-08 2020-02-28 语联网(武汉)信息技术有限公司 Word vector embedding method and device
CN110852112B (en) * 2019-11-08 2023-05-05 语联网(武汉)信息技术有限公司 Word vector embedding method and device
CN110909551B (en) * 2019-12-05 2023-10-27 北京知道创宇信息技术股份有限公司 Language pre-training model updating method and device, electronic equipment and storage medium
CN110909551A (en) * 2019-12-05 2020-03-24 北京知道智慧信息技术有限公司 Language pre-training model updating method and device, electronic equipment and storage medium
US11422798B2 (en) 2020-02-26 2022-08-23 International Business Machines Corporation Context-based word embedding for programming artifacts
US11663402B2 (en) 2020-07-21 2023-05-30 International Business Machines Corporation Text-to-vectorized representation transformation
CN112632999A (en) * 2020-12-18 2021-04-09 北京百度网讯科技有限公司 Named entity recognition model obtaining method, named entity recognition device and named entity recognition medium
CN112528682A (en) * 2020-12-23 2021-03-19 北京百度网讯科技有限公司 Language detection method and device, electronic equipment and storage medium
CN112735380A (en) * 2020-12-28 2021-04-30 苏州思必驰信息科技有限公司 Scoring method and voice recognition method for re-scoring language model
CN112612892A (en) * 2020-12-29 2021-04-06 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN112612892B (en) * 2020-12-29 2022-11-01 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN113326693B (en) * 2021-05-28 2024-04-16 智者四海(北京)技术有限公司 Training method and system of natural language model based on word granularity
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
WO2023065544A1 (en) * 2021-10-18 2023-04-27 平安科技(深圳)有限公司 Intention classification method and apparatus, electronic device, and computer-readable storage medium

Also Published As

Publication number Publication date
CN108417210B (en) 2020-06-26

Similar Documents

Publication Publication Date Title
CN108417210A (en) A kind of word insertion language model training method, words recognition method and system
Gao et al. Retrieval-augmented generation for large language models: A survey
CN110334339B (en) Sequence labeling model and labeling method based on position perception self-attention mechanism
CN104143327B (en) A kind of acoustic training model method and apparatus
Balaraman et al. Recent neural methods on dialogue state tracking for task-oriented dialogue systems: A survey
CN108962224A (en) Speech understanding and language model joint modeling method, dialogue method and system
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN106407113B (en) A kind of bug localization method based on the library Stack Overflow and commit
CN113268609A (en) Dialog content recommendation method, device, equipment and medium based on knowledge graph
CN105843801A (en) Multi-translation parallel corpus construction system
CN108664465A (en) One kind automatically generating text method and relevant apparatus
CN110349597A (en) A kind of speech detection method and device
US20230094730A1 (en) Model training method and method for human-machine interaction
CN110162789A (en) A kind of vocabulary sign method and device based on the Chinese phonetic alphabet
CN110334186A (en) Data query method, apparatus, computer equipment and computer readable storage medium
Dinarelli et al. Discriminative reranking for spoken language understanding
Dethlefs Domain transfer for deep natural language generation from abstract meaning representations
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN105868187A (en) A multi-translation version parallel corpus establishing method
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN115510188A (en) Text keyword association method, device, equipment and storage medium
CN116881457A (en) Small sample text classification method based on knowledge contrast enhancement prompt
CN112417890B (en) Fine granularity entity classification method based on diversified semantic attention model
CN109614541A (en) A kind of event recognition method, medium, device and calculate equipment
CN110197521B (en) Visual text embedding method based on semantic structure representation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20200617

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Co-patentee after: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

Patentee after: AI SPEECH Co.,Ltd.

Address before: Suzhou City, Jiangsu Province, Suzhou Industrial Park 215123 Xinghu Street No. 328 Creative Industry Park 9-703

Co-patentee before: SHANGHAI JIAO TONG University

Patentee before: AI SPEECH Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20201027

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee after: AI SPEECH Co.,Ltd.

Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee before: AI SPEECH Co.,Ltd.

Patentee before: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

TR01 Transfer of patent right
CP01 Change in the name or title of a patent holder

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee after: Sipic Technology Co.,Ltd.

Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee before: AI SPEECH Co.,Ltd.

CP01 Change in the name or title of a patent holder
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Method for Training a Word Embedding Language Model, a Method for Word Recognition, and a System

Effective date of registration: 20230726

Granted publication date: 20200626

Pledgee: CITIC Bank Limited by Share Ltd. Suzhou branch

Pledgor: Sipic Technology Co.,Ltd.

Registration number: Y2023980049433

PE01 Entry into force of the registration of the contract for pledge of patent right