
CN108417210A - Word embedding language model training method, word recognition method and system - Google Patents

Word embedding language model training method, word recognition method and system

Info

Publication number
CN108417210A
Authority
CN
China
Prior art keywords
word
vocabulary
parts
words
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810022130.3A
Other languages
Chinese (zh)
Other versions
CN108417210B (en)
Inventor
俞凯
陈瑞年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Shanghai Jiaotong University
Suzhou Speech Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University, Suzhou Speech Information Technology Co Ltd filed Critical Shanghai Jiaotong University
Priority to CN201810022130.3A priority Critical patent/CN108417210B/en
Publication of CN108417210A publication Critical patent/CN108417210A/en
Application granted granted Critical
Publication of CN108417210B publication Critical patent/CN108417210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L2015/0631 - Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a word embedding language model training method, including: determining the attributes of all words in a corpus to generate a vocabulary, the attributes including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes; generating word vectors for all words in the vocabulary; generating part-of-speech class vectors corresponding to the part-of-speech classes of all words in the vocabulary; and taking the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, training to obtain the word embedding language model. With the language model of the embodiment of the present invention, even an OOV word can be identified accurately through its morphological information and the lexical-level information of its part-of-speech class.

Description

Word embedding language model training method, word recognition method and system
Technical field
The present invention relates to the field of artificial intelligence, and more particularly to a word embedding language model training method, a word recognition method, and corresponding systems.
Background technology
Language models currently used in speech recognition systems mainly score the words and sentences produced by the recognizer; combined with the score of the acoustic model, the best recognition result is selected. Existing neural-network-based language models train slowly and require a fixed, known vocabulary. In a traditional language model, each word in the training vocabulary is represented by a one-hot vector. For example, if the vocabulary size is 10,000 (i.e., the vocabulary contains 10,000 words), each word is represented by a 10,000-dimensional vector whose only non-zero entry, equal to 1, corresponds to that word. This vector is fed to the neural network and multiplied with a word embedding matrix to obtain a real-valued vector, on which the language model is finally trained; word recognition likewise converts words into real-valued vectors before recognition.
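As a minimal sketch of this conventional scheme (matrix name, sizes, and word ids are illustrative assumptions, not taken from the patent), the one-hot multiplication reduces to selecting one column of the embedding matrix:

```python
import numpy as np

vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(embed_dim, vocab_size))  # word embedding matrix

def embed(word_id: int) -> np.ndarray:
    """One-hot vector times E: equivalent to picking column `word_id` of E."""
    one_hot = np.zeros(vocab_size)
    one_hot[word_id] = 1.0
    return E @ one_hot

x = embed(42)                      # real-valued vector fed to the neural network
assert np.allclose(x, E[:, 42])    # the lookup view of the same operation
```

This is exactly why an OOV word is unrepresentable in the conventional scheme: without a column in E, no real-valued vector exists for it.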
However, the inventors found during development of the present invention that the vocabulary used to train a language model often cannot cover all words. When a conventional language model encounters a word outside its vocabulary (an out-of-vocabulary, OOV, word, which frequently appears in practical applications), it cannot identify the word correctly and reliably: since the word to be identified is absent from the vocabulary, no vector corresponds to it, so no real-valued vector can be obtained by multiplying with the word embedding matrix. The only remedy is to add the word to the vocabulary and retrain a new language model with the new vocabulary.
The most common existing workaround for this problem of conventional language models is to use a special token <unk> to represent all words beyond the vocabulary. A fixed vocabulary is obtained first and a special <unk> symbol is added; all OOV words in the training set are then replaced with <unk> before training, and at inference time all OOV words are likewise replaced with <unk>.
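A hedged sketch of this <unk> workaround (the toy vocabulary and tokenization are assumptions for illustration only):

```python
def replace_oov(tokens: list[str], vocab: set[str], unk: str = "<unk>") -> list[str]:
    """Map every out-of-vocabulary token to the shared <unk> symbol."""
    return [t if t in vocab else unk for t in tokens]

vocab = {"the", "model", "speech"}
print(replace_oov(["the", "zeitgeist", "model"], vocab))
# ['the', '<unk>', 'model']
```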
The inventors further found that words beyond the vocabulary are usually rare words, so their training data is inherently scarce. The prior-art approach discards the linguistic information these OOV words carry, and consequently the recognition results for such words are highly inaccurate.
Summary of the invention
Embodiments of the present invention provide a method and system for adding new words to a neural network language model, to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a word embedding language model training method, including:
determining the attributes of all words in a corpus to generate a vocabulary, the attributes including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes;
generating word vectors for all words in the vocabulary;
generating part-of-speech class vectors corresponding to the part-of-speech classes of all words in the vocabulary;
taking the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, training to obtain the word embedding language model.
In a second aspect, an embodiment of the present invention further provides a word recognition method using the word embedding language model of the above embodiment of the present invention, the method including:
generating the word vector of a word to be identified;
determining the part-of-speech class vector of the part-of-speech class of the word to be identified;
inputting the word vector and the part-of-speech class vector of the word to be identified into the word embedding language model, to obtain the probability distribution of the part-of-speech class of the word to be identified and the probability distribution of the word to be identified within its part-of-speech class.
In a third aspect, an embodiment of the present invention further provides a word embedding language model training system, including:
a vocabulary generation program module, configured to determine the attributes of all words in a corpus to generate a vocabulary, the attributes including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes;
a word vector generation program module, configured to generate word vectors for all words in the vocabulary;
a part-of-speech class vector generation program module, configured to generate part-of-speech class vectors corresponding to the part-of-speech classes of all words in the vocabulary;
a model training program module, configured to take the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, and train to obtain the word embedding language model.
In a fourth aspect, an embodiment of the present invention further provides a word recognition system, including:
the word embedding language model;
a word vector generation program module, configured to generate the word vector of a word to be identified;
a vocabulary generation program module, configured to determine the part-of-speech class vector of the part-of-speech class of the word to be identified;
a word recognition program module, configured to input the word vector and the part-of-speech class vector of the word to be identified into the word embedding language model, to obtain the probability distribution of the part-of-speech class of the word to be identified and the probability distribution of the word to be identified within its part-of-speech class.
In a fifth aspect, an embodiment of the present invention provides a non-volatile computer-readable storage medium storing one or more programs including executable instructions, which can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any of the above word embedding language model training methods and/or word recognition methods of the present invention.
In a sixth aspect, an electronic device is provided, including: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform any of the above word embedding language model training methods and/or word recognition methods of the present invention.
In a seventh aspect, an embodiment of the present invention further provides a computer program product, including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform any of the above word embedding language model training methods and/or word recognition methods.
The advantageous effects of the embodiments of the present invention are as follows. When training the language model, the words in the corpus are not used directly; instead, the attributes of all words are determined first, including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes. The training thus takes into account both the morphological information of words and their syntactic class information. In particular, the introduction of syntactic class information exploits the commonalities of words belonging to the same part-of-speech class, so that in practical applications the trained language model can accurately identify even an OOV word through its morphological information and the syntactic-level information of its part-of-speech class.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Evidently, the drawings described below illustrate some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of the word embedding language model training method of the present invention;
Fig. 2 is a flowchart of another embodiment of the word embedding language model training method of the present invention;
Fig. 3 is a flowchart of an embodiment of the word recognition method of the present invention;
Fig. 4 is a flowchart of another embodiment of the word recognition method of the present invention;
Fig. 5 is a flowchart of yet another embodiment of the word recognition method of the present invention;
Fig. 6 is a structural diagram of an embodiment of the word embedding language model of the present invention;
Fig. 7 is a functional block diagram of an embodiment of the word embedding language model training system of the present invention;
Fig. 8 is a functional block diagram of another embodiment of the word embedding language model training system of the present invention;
Fig. 9 is a functional block diagram of an embodiment of the word recognition system of the present invention;
Fig. 10 is a functional block diagram of another embodiment of the word recognition system of the present invention;
Fig. 11 is a functional block diagram of yet another embodiment of the word recognition system of the present invention;
Fig. 12 is a structural diagram of an embodiment of the electronic device of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features therein may be combined with each other.
The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present invention may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
In the present invention, "module", "device", "system", and the like refer to related entities applied to a computer, such as hardware, a combination of hardware and software, software, or software in execution. In particular, an element may be, but is not limited to, a process running on a processor, a processor, an object, an executable element, a thread of execution, a program, and/or a computer. An application or script running on a server, or the server itself, may also be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers, and may be operated by various computer-readable media. Elements may also communicate through local and/or remote processes according to signals having one or more data packets, for example signals from data interacting with another element in a local system or a distributed system, and/or interacting with other systems through signals over a network such as the Internet.
Finally, it should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" cover not only the listed elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or device. In the absence of further limitations, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device including that element.
As shown in Fig. 1, an embodiment of the present invention provides a word embedding language model training method, including:
S11: determining the attributes of all words in a corpus to generate a vocabulary, the attributes including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes.
In this embodiment, all words in the corpus are stored in the vocabulary according to their part-of-speech classes. Part-of-speech classes may include nouns, adjectives, verbs, adverbs, and so on. The proportion of the words belonging to each part-of-speech class among all words in the corpus is determined by counting; this gives the probability distribution over the part-of-speech classes.
Then, for each part-of-speech class, the proportion of each word among all words of that class is further counted, which gives the probability distribution of the words within the class. A minimal counting sketch is given below.
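A sketch of the counting in step S11, assuming the corpus is already available as (word, POS) pairs (the tagging step itself is outside this snippet, and the tiny corpus is illustrative):

```python
from collections import Counter

tagged_corpus = [("dog", "NOUN"), ("runs", "VERB"), ("cat", "NOUN"), ("dog", "NOUN")]

pos_counts = Counter(pos for _, pos in tagged_corpus)
total = sum(pos_counts.values())
# P(class): probability distribution over all part-of-speech classes
p_class = {pos: n / total for pos, n in pos_counts.items()}

word_counts = Counter(tagged_corpus)
# P(word | class): distribution of the words within their part-of-speech class
p_word_given_class = {(w, pos): n / pos_counts[pos] for (w, pos), n in word_counts.items()}

print(p_class)                               # {'NOUN': 0.75, 'VERB': 0.25}
print(p_word_given_class[("dog", "NOUN")])   # 0.666... (2 of 3 nouns)
```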
S12: generating word vectors for all words in the vocabulary. The word vector captures the morphological information of the word.
S13: generating part-of-speech class vectors corresponding to the part-of-speech classes of all words in the vocabulary. Determining part-of-speech class vectors captures the syntactic class information (that is, semantic information) of the words belonging to each part-of-speech class, so that all words of one part-of-speech class share the same part-of-speech class vector.
S14: taking the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, training to obtain the word embedding language model.
The two vectors are simply concatenated as the input. At the output layer, a factorized softmax function is used to compute the probability of each word: the probability distribution over the part-of-speech classes is computed first, then the probability distribution of the words within each class; multiplying these two distributions gives the required word probability distribution.
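A minimal sketch of such a factorized output layer (PyTorch-style; the module and head names are assumptions, and the patent does not prescribe this code). `word_to_class` maps each word id to its single part-of-speech class id:

```python
import torch
import torch.nn as nn

class FactorizedOutput(nn.Module):
    """P(word) = P(class | h) * P(word | class, h): the two-stage factorization."""
    def __init__(self, hidden: int, n_classes: int, n_words: int):
        super().__init__()
        self.class_head = nn.Linear(hidden, n_classes)
        self.word_head = nn.Linear(hidden, n_words)

    def forward(self, h: torch.Tensor, word_to_class: torch.Tensor) -> torch.Tensor:
        p_class = torch.softmax(self.class_head(h), dim=-1)   # over POS classes
        logits = self.word_head(h)
        # softmax over words, renormalized within each word's own class
        p_word = torch.zeros_like(logits)
        for c in word_to_class.unique():
            idx = (word_to_class == c).nonzero(as_tuple=True)[0]
            p_word[..., idx] = torch.softmax(logits[..., idx], dim=-1)
        # multiply each word's in-class probability by its class probability
        return p_class.index_select(-1, word_to_class) * p_word
```

Each word's probability is thus the product of its class probability and its probability within that class, which is the multiplication described above.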
When training the language model in the embodiment of the present invention, the words in the corpus are not used directly; instead, the attributes of all words are determined first, including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes. The training takes into account both the morphological information of words and their syntactic class information. In particular, the introduction of syntactic class information exploits the semantic properties that words share: through parameter sharing, OOV words can also use the parameters of words in the vocabulary, and the commonalities of words belonging to the same part-of-speech class are taken into account. As a result, in practical applications the trained language model can accurately identify even an OOV word through its morphological information and the lexical-level information of its part-of-speech class.
In addition, because additional information (semantic classes and morphological decomposition) is introduced, the method can directly model unseen new words, and the modeling is more accurate. In conventional methods, modeling a new word requires collecting new data and retraining the model, which is very time-consuming; in actual use, the present model greatly reduces the time needed to add new words. The method can further be integrated into a speech recognition system to achieve fast vocabulary updates, while also improving the recognition accuracy of low-frequency words.
As shown in Fig. 2, in some embodiments, generating the word vectors of all words in the vocabulary includes:
S21: judging whether a word taken from the vocabulary is a low-frequency word;
S22: if so, decomposing the word taken from the vocabulary into characters, and encoding the resulting characters to determine the corresponding word vector;
S23: if not, extracting the vector of the word taken from the vocabulary as its word vector.
In the embodiment of the present invention, high-frequency and low-frequency words are processed differently (see the sketch after this paragraph). Each high-frequency word has its own independent word vector (for example, a one-hot vector may be used). A low-frequency word is first decomposed morphologically (for Chinese, into characters), and the character sequence is then converted into a fixed-length vector by a sequence-encoding method; common encoding methods include character-level Fixed-size Ordinally Forgetting Encoding (FOFE), directly summing the character vectors, and encoding with recurrent or convolutional neural networks. In the embodiment of the present invention, a word is not treated merely as a one-hot vector: it is classified at the syntactic level with a part-of-speech tag (i.e., a one-hot representation at the syntactic level), and at the morphological level it is encoded with FOFE.
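A hedged sketch of the high/low-frequency dispatch (the frequency threshold, dictionary shapes, and `encode_chars` helper are assumptions; the FOFE recursion itself is shown later alongside equation (7)):

```python
def word_representation(word: str, freq: dict, word_vec: dict,
                        encode_chars, threshold: int = 100):
    """High-frequency words keep their own independent vector; low-frequency
    words are decomposed into characters and encoded to a fixed-length vector."""
    if freq.get(word, 0) >= threshold:
        return word_vec[word]            # independent word vector
    return encode_chars(list(word))      # morphological decomposition + encoding
```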
High-frequency and low-frequency words are distinguished because not every word's meaning can be expressed well by its characters; the embodiment of the present invention thereby avoids the impact such cases would have on language model performance.
In addition, a conventional language model needs one group of parameters per word. Since in the method of the embodiment of the present invention all low-frequency words are decomposed into characters (morphological decomposition), parameters only need to be set for the high-frequency words and the characters. The effect is that the required number of parameters is greatly reduced (typically by about 80%), and the benefit of fewer parameters is that the resulting word embedding language model can be embedded in smaller devices (such as mobile phones).
The morphological decomposition of words could, in principle, also use phonemes; however, because some homophones differ greatly in meaning, phoneme decomposition does not work well, and the method of the embodiment of the present invention overcomes these problems.
In some embodiments, taking the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, training to obtain the word embedding language model includes:
inputting the word vector and the part-of-speech class vector of the word in the vocabulary into a long short-term memory (LSTM) network;
feeding the output of the LSTM network into a part-of-speech classifier to obtain the probability distribution of the part-of-speech class of the word;
feeding the output of the LSTM network into a word classifier to obtain the probability distribution of the word within its part-of-speech class.
The trained word embedding language model thus includes the LSTM network, the part-of-speech classifier, and the word classifier, as sketched below.
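Under the same assumptions as the earlier sketches (module names and dimensions are hypothetical), the three components compose as follows:

```python
import torch
import torch.nn as nn

class WordEmbeddingLM(nn.Module):
    """LSTM trunk with two heads: a POS-class classifier and a word classifier."""
    def __init__(self, in_dim: int, hidden: int, n_classes: int, n_words: int):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.pos_classifier = nn.Linear(hidden, n_classes)
        self.word_classifier = nn.Linear(hidden, n_words)

    def forward(self, word_vecs: torch.Tensor, pos_vecs: torch.Tensor):
        # in_dim must equal word_vecs.size(-1) + pos_vecs.size(-1)
        x = torch.cat([word_vecs, pos_vecs], dim=-1)   # concatenated input
        h, _ = self.lstm(x)
        return self.pos_classifier(h), self.word_classifier(h)
```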
As shown in Fig. 3, an embodiment of the present invention further provides a word recognition method using the word embedding language model described in the embodiments of the present invention, the method including:
S31: generating the word vector of a word to be identified;
S32: determining the part-of-speech class vector of the part-of-speech class of the word to be identified;
S33: inputting the word vector and the part-of-speech class vector of the word to be identified into the word embedding language model, to obtain the probability distribution of the part-of-speech class of the word to be identified and the probability distribution of the word to be identified within its part-of-speech class.
When the language model employed in the embodiment of the present invention is trained, the words in the corpus are not used directly; instead, the attributes of all words are determined first, including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes. The training takes into account both the morphological information of words and their syntactic class information. In particular, the introduction of syntactic class information exploits the semantic properties that words share: through parameter sharing, OOV words can also use the parameters of words in the vocabulary, and the commonalities of words belonging to the same part-of-speech class are taken into account. As a result, in practical applications the trained language model can accurately identify even an OOV word through its morphological information and the lexical-level information of its part-of-speech class.
In addition, because additional information (semantic classes and morphological decomposition) is introduced, the method can directly model unseen new words, and the modeling is more accurate. In conventional methods, modeling a new word requires collecting new data and retraining the model, which is very time-consuming; in actual use, the present model greatly reduces the time needed to add new words. The method can further be integrated into a speech recognition system to achieve fast vocabulary updates, while also improving the recognition accuracy of low-frequency words.
As shown in Fig. 4, in some embodiments, when the word to be identified belongs to the vocabulary used for training the word embedding language model, generating the word vector of the word to be identified includes:
S41: judging whether the word to be identified is a low-frequency word;
S42: if so, decomposing the word to be identified into characters, and encoding the resulting characters to determine the corresponding word vector;
S43: if not, extracting the vector of the word to be identified as its word vector.
In the embodiment of the present invention, high-frequency and low-frequency words are processed differently. Each high-frequency word has its own independent word vector (for example, a one-hot vector may be used). A low-frequency word is first decomposed morphologically (for Chinese, into characters), and the character sequence is then converted into a fixed-length vector by a sequence-encoding method; common encoding methods include character-level Fixed-size Ordinally Forgetting Encoding (FOFE), directly summing the character vectors, and encoding with recurrent or convolutional neural networks. In the embodiment of the present invention, a word is not treated merely as a one-hot vector: it is classified at the syntactic level with a part-of-speech tag (i.e., a one-hot representation at the syntactic level), and at the morphological level it is encoded with FOFE.
High-frequency and low-frequency words are distinguished because not every word's meaning can be expressed well by its characters; the embodiment of the present invention thereby avoids the impact such cases would have on language model performance.
In addition, a conventional language model needs one group of parameters per word. Since in the method of the embodiment of the present invention all low-frequency words are decomposed into characters (morphological decomposition), parameters only need to be set for the high-frequency words and the characters. The effect is that the required number of parameters is greatly reduced (typically by about 80%), and the benefit of fewer parameters is that the resulting word embedding language model can be embedded in smaller devices (such as mobile phones).
As shown in Fig. 5, in some embodiments, when the word to be identified does not belong to the vocabulary used for training the word embedding language model, generating the word vector of the word to be identified includes:
S51: determining the attributes of the word to be identified to update the vocabulary;
S52: decomposing the word to be identified into characters, and encoding the resulting characters to determine the corresponding word vector.
This embodiment realizes fast addition of new words to the word embedding language model. The method can further be integrated into a speech recognition system to achieve fast vocabulary updates, while also improving the recognition accuracy of low-frequency words.
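A hedged sketch of this new-word path (S51 and S52): the vocabulary entry is built from the word's part-of-speech class, and its vector comes purely from its characters, so no retraining is needed. The function names, entry layout, and `encode_chars` helper are assumptions:

```python
def add_new_word(word: str, pos_tag: str, vocab: dict, encode_chars) -> None:
    """Register an OOV word: store its part-of-speech class and a
    character-derived vector; existing model weights stay untouched."""
    vocab[word] = {
        "pos": pos_tag,                     # updates the vocabulary (S51)
        "vector": encode_chars(list(word)), # decomposition + encoding (S52)
    }
```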
The embodiment of the present invention is further described below by comparing the handling of OOV words in traditional LSTM (Long Short-Term Memory) language models with the technical solution of the embodiment of the present invention.
The LSTM language model:
Deep learning methods are widely used in language modeling and have achieved great success. The long short-term memory (LSTM) network is a recurrent neural network (RNN) architecture particularly suited to sequences. Let V be the vocabulary. At each timestep t, the input word $w_t$ is represented by a one-hot vector $e_t$, from which the word embedding $x_t$ is obtained as:

$$x_t = E^i e_t \quad (1)$$

where $E^i \in \mathbb{R}^{m \times |V|}$ is the input word embedding matrix and m is the dimension of the input word embedding. One LSTM step takes $x_t$, $h_{t-1}$, $c_{t-1}$ as input and produces $h_t$, $c_t$ (the computational details are omitted here). The probability distribution of the next word is computed at the output layer by an affine transformation of the hidden layer followed by a softmax:

$$P(w_{t+1}=j \mid w_{1:t}) = \frac{\exp({E^o_j}^\top h_t + b_j)}{\sum_{k=1}^{|V|} \exp({E^o_k}^\top h_t + b_k)} \quad (2)$$

where $E^o_j$ is the j-th column of $E^o \in \mathbb{R}^{m \times |V|}$, also called the output embedding, and $b_j$ is a bias term. It has been found that the bias terms of the output layer play an important role and are highly correlated with word frequency.
Since most of the computational cost lies in the output layer, a factorized softmax output layer has been proposed to speed up the language model. The method is based on the assumption that words can be mapped to classes. Let S be the set of classes. Unlike equation (2), the factorized output layer computes the probability distribution of the next word as:

$$P(w_{t+1}=j \mid w_{1:t}) = P(s_{t+1}=s_j \mid h_t)\, P(w_{t+1}=j \mid s_j, h_t) \quad (3)$$

where $s_j$ denotes the class of word $w_{t+1}$ and $V_{s_j}$ is the set of all words belonging to class $s_j$. Here the probability of a word is computed in two stages: the probability distribution over the classes is estimated first, and the probability of the specific word is then computed within the predicted class. A real word may belong to multiple classes, but here each word is mapped to a single class, i.e., all classes are mutually exclusive. Commonly used classes are frequency-based classes or classes obtained by data-driven methods.
Handling of OOV words:
As mentioned above, two methods are used in classical LSTM language models to handle the OOV word problem:
1. A special <UNK> token replaces all OOV words, and an alternative metric known as adjusted perplexity, equation (6), is used:
where $V_{OOV}$ is the set of all OOV words. We refer to this method as "unk" in the experiments.
2. The model is retrained with the updated vocabulary. Since OOV words have no or few positive examples in the training set, they are assigned very small probabilities after training. This is similar to the smoothing methods used in n-gram language models. We refer to this method as "retrain" in the experiments.
Both conventional methods have drawbacks. In the unk LSTM language model, the frequency of OOV words mismatches between training data and test data, so the probabilities of OOV words are misjudged; in addition, the method ignores the linguistic information of the OOV words. The main problem of the retrained LSTM language model is that retraining is time-consuming.
In a traditional LSTM language model, the word embedding of each word is independent, which causes two problems: first, new words cannot reuse trained word embeddings; second, rare words lack training data. The motivation of the structured word embedding is to solve both problems through parameter sharing. Unlike data-driven methods, parameter sharing must be based on explicit rules. By using syntactic and morphological rules, the shared parameters of OOV words can easily be found, and structured word embeddings can be built in our model.
Morphological and syntactic structured embeddings:
At the syntactic level, each word is assigned to a part-of-speech (POS) class. All words in the same POS (part-of-speech) class share the same POS class embedding, called the syntax embedding. A part of speech is a category of words with similar grammatical features; we therefore assume that the syntax embedding represents the basic syntactic function of a word.
For each word, several example sentences are tagged with POS labels and the most common label is selected as its part of speech (POS labels may also be obtained from a dictionary). For in-vocabulary (IV) words, the example sentences are selected from the training set. For OOV words, example sentences may be composed or selected from other data sources, such as web data. Unlike data-driven methods, the POS-label-based syntax embedding can easily be generated for OOV words by rule.
Character (or sub-word) representations are widely used in many NLP (Natural Language Processing) tasks as complementary features for improving the performance on low-frequency words, especially in morphologically rich languages; for high-frequency words, however, the improvement is limited. Here, to further capture the semantics of low-frequency words, a morphology embedding is built, based on the assumption that data sparsity is less severe at the character level. For high-frequency words, the word embedding is retained. The mixed embeddings, i.e., the morphology embeddings of low-frequency words and the word embeddings of high-frequency words, must therefore have the same dimension.
In previous literature, word embeddings were combined with sub-word-level features to obtain enhanced embeddings for all words. In contrast, the morphology embedding of low-frequency words proposed here depends only on character-level features; it is therefore able to model OOV words.
The proposed morphology embedding encodes character information with character-level Fixed-size Ordinally Forgetting Encoding (FOFE). In our model, every low-frequency word is represented by a character sequence $e_{1:T}$, where $e_t$ is the one-hot representation of the character at timestep t. FOFE encodes the entire sequence with a simple recursion ($z_0 = 0$):

$$z_t = \alpha z_{t-1} + e_t \quad (1 \le t \le T) \quad (7)$$

where $0 < \alpha < 1$ is a constant forgetting factor controlling how strongly history influences the final timestep. A feedforward neural network (FNN) is then used to convert the character-level FOFE code into the final morphology embedding.
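A direct sketch of recursion (7) over one-hot character vectors (the character inventory size and ids are illustrative; the follow-up FNN is omitted):

```python
import numpy as np

def fofe(char_ids: list[int], n_chars: int, alpha: float = 0.7) -> np.ndarray:
    """z_t = alpha * z_{t-1} + e_t, with z_0 = 0 (equation (7))."""
    z = np.zeros(n_chars)
    for c in char_ids:
        e = np.zeros(n_chars)
        e[c] = 1.0
        z = alpha * z + e
    return z  # fixed-size code, later mapped by an FNN to the morphology embedding

code = fofe([3, 1, 3], n_chars=5)
# contributions decay with distance from the end: code[3] = 0.7**2 + 1, code[1] = 0.7
```

Because earlier characters are discounted by powers of alpha, the code is unique for the sequence yet always has the same fixed size, regardless of word length.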
Combining the structured embedding with the LSTM language model:
Fig. 6 shows the structural diagram of the word embedding language model in one embodiment of the present invention.
At the input layer, the structured embedding of the input word is obtained by concatenating its syntax embedding with its word embedding (for high-frequency words) or its morphology embedding (for low-frequency words).
The factorized softmax structure is readily used at the output layer. The output class embedding matrix $E^c$ in formula (4) is replaced by the syntax embeddings, and the output embedding matrix $E^o$ in formula (5) is replaced by the word and morphology embeddings.
Once training is complete, the syntax and morphology embeddings of OOV words can be obtained at any time. To compute the probabilities of OOV words, the output layer parameters in formula (5), $E^o$ and b, must be rebuilt: all embeddings in $E^o$ and bias terms in b are retained for IV words, and the embeddings of OOV words in $E^o$ are filled with their morphology embeddings. In experiments it was found that the bias term is highly correlated with word frequency, meaning that the higher the word frequency, the larger the bias; here, the bias term of an OOV word is set to a small empirical constant. A sketch of this rebuild follows.
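A hedged sketch of the output-layer rebuild, assuming $E^o$ and b have already been enlarged to the expanded vocabulary size (function name, array layout, and the bias value are assumptions; the patent only says the constant is small and empirical):

```python
import numpy as np

def rebuild_output_layer(E_o: np.ndarray, b: np.ndarray,
                         oov_ids, morph_embs, oov_bias: float = -2.0):
    """Patch E^o and b for OOV words without retraining: IV rows keep their
    trained values, OOV rows are filled from morphology embeddings."""
    E_new, b_new = E_o.copy(), b.copy()
    for i, emb in zip(oov_ids, morph_embs):
        E_new[i] = emb          # fill the OOV embedding from its morphology embedding
        b_new[i] = oov_bias     # small empirical constant bias for OOV words
    return E_new, b_new
```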
By using the structured word embedding, OOV words can be incorporated into the LSTM language model without retraining. As described above, sharing parameters in the proposed model during training also mitigates the data sparsity of OOV words.
The structured word embedding language model proposed in the embodiment of the present invention also achieves parameter compression. In an LSTM language model, the word embeddings of low-frequency words occupy a large fraction of the model parameters yet are undertrained; by replacing low-frequency words with character representations, the number of parameters can be greatly reduced.
Let V be the set of all words and H the hidden layer size. In an LSTM language model, the number of word embedding parameters is $2 \times |V| \times H$, whereas in the structured-embedding LSTM language model the total is $(|V_h| + |V_{char}| + |S|) \times H$, where $V_h$ denotes the high-frequency words, $V_{char}$ the character set, and S the POS tag set. Experiments show that with $|V| = 60000$, $|V_h| = 8000$, $|V_{char}| = 5000$, and $|S| = 32$, nearly 90% of the parameters can be eliminated, as the short calculation below confirms.
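Plugging the quoted sizes into the two counts confirms the stated reduction; H cancels out. (The structured count is read here as $(|V_h| + |V_{char}| + |S|) \times H$, which is the reading consistent with the quoted "nearly 90%"; the source text is garbled at this point.)

```python
V, Vh, Vchar, S = 60_000, 8_000, 5_000, 32
baseline = 2 * V               # 2 x |V| x H, per hidden unit
structured = Vh + Vchar + S    # (|Vh| + |Vchar| + |S|) x H, per hidden unit
print(1 - structured / baseline)   # ~0.891, i.e. nearly 90% fewer parameters
```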
To verify the effectiveness of the method and system of the present invention, the inventors evaluated the word embedding language model of the embodiment of the present invention (hereinafter the structured-embedding LSTM language model) on a Chinese short message service (SMS) dataset.
Table 1: Dataset information
Table 1 gives the details of the datasets. Two vocabularies of different sizes are used for each dataset. The full vocabulary $V_f$ covers all words appearing in the corpus; the small vocabulary $V_s$ is a subset of $V_f$. Here, in-vocabulary (IV) is defined as the words in $V_s$, and out-of-vocabulary (OOV) means the words in $V_f$ but not in $V_s$. The sms-30m dataset is also used as the training set for a rescoring task in ASR (automatic speech recognition) on a spontaneous conversational Mandarin test set (about 25 hours, 3K utterances).
1. Train the LSTM language model with the small vocabulary $V_s$, treating all OOV words as a single <UNK> symbol; this is referred to as "unk".
2. Retrain the LSTM language model with the full vocabulary $V_f$; this is referred to as "retrain".
For the structured word embedding LSTM language model, the small vocabulary $V_s$ is used at the training stage, and the model vocabulary is updated to $V_f$ at the testing stage.
To keep the model sizes comparable with the proposed model, the input embedding size of the LSTM baseline is set to 600 and the output embedding size to 300. In the structured-embedding LSTM language model, the syntax embedding size is set to 300. FOFE coding uses a one-layer 5000-300 FNN, where 5000 is the size of the character set $V_c$ and 300 is the dimension of the morphology embedding. The α of FOFE is set to 0.7 and the bias term of new words to 0; these two empirical parameters are fine-tuned on the development set. The 8192 most frequent words are chosen as high-frequency words, and all other words are treated as low-frequency words in our model. All models are trained by stochastic gradient descent (SGD) with the same hyperparameters.
Perplexity evaluation
The perplexity (PPL) results are shown in Table 2. In particular, for the "unk" LSTM the PPL of OOV words is computed by equation (6) instead. The results show that the proposed structured embedding (SE) method performs similarly to the unk LSTM, whereas the retrained LSTM performs worse. For further investigation, the PPL of in-vocabulary (IV) and out-of-vocabulary (OOV) words is computed separately for each model; the results are shown in Table 3. The unk LSTM performs best on IV words at the cost of OOV words, whose PPL is very high. Relative to the unk LSTM, the retrained LSTM greatly improves the PPL of OOV words while degrading on IV words. Our method further improves the PPL on OOV words while performing similarly on IV words.
Table 2: Perplexity comparison of different OOV handling methods
Table 3: Perplexity breakdown for in-vocabulary and out-of-vocabulary words
Fast vocabulary update in ASR
In an automatic speech recognition (ASR) system, a back-off n-gram model is used as the language model for generating lattices, from which n-best lists are produced. The n-best lists are then rescored with a neural network language model to obtain better performance. In general, the n-gram and neural network language models share the same vocabulary; therefore, when the vocabulary is updated, both must be retrained. Compared with the neural network language model, the training time of the n-gram language model is negligible.
This experiment has two stages. In the first stage, the unk LSTM language model and the structured-embedding (SE, Structure Embedding) LSTM language model are trained with the small vocabulary. An n-gram language model trained with $V_s$ is used to generate n-best lists, which are then rescored with the unk LSTM. In the second stage, the vocabulary $V_s$ is expanded to the larger vocabulary $V_f$. Since the vocabulary changes, the unk LSTM and the n-gram model must be retrained, whereas the vocabulary of the LSTM with SE is rebuilt without retraining. The retrained LSTM and the LSTM with SE are then used to rescore the n-best lists generated by the new n-gram model.
Table 4: Character error rate comparison and breakdown for in-vocabulary and out-of-vocabulary sentences
The experimental results are shown in Table 4. Benefiting from the vocabulary expansion, the retrained LSTM achieves an absolute CER improvement of 0.38% over all sentences, and the proposed structured-embedding LSTM model (LSTM with SE) achieves the best performance. To study which sentences gain the most from the proposed model, the rescored sentences are split into two classes according to whether all of their words appear in $V_s$: in-vocabulary sentences (IVS) and out-of-vocabulary sentences (OOVS). As shown in Table 4, the unk LSTM trained with $V_s$ has a higher CER on OOV sentences, because the n-gram model built with $V_s$ cannot generate these OOV words. By expanding the vocabulary, the retrained LSTM obtains a significant CER improvement on OOV sentences. Compared with the retrained LSTM, the proposed model achieves better CER on both IV and OOV sentences. Moreover, the CER improvement on OOV sentences (1.13% absolute) is significantly larger than that on IV sentences (0.13% absolute), which means the LSTM with SE models OOV words better. Note that by using the proposed structured word embedding LSTM language model, the model retraining time of conventional methods is saved while better performance is obtained.
It should be noted that, for each of the foregoing method embodiments, for simplicity of description they are expressed as a series of combined actions; however, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of other embodiments.
As shown in Fig. 7, an embodiment of the present invention further provides a word embedding language model training system 700, including:
a vocabulary generation program module 710, configured to determine the attributes of all words in a corpus to generate a vocabulary, the attributes including the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of the words within their respective part-of-speech classes;
a word vector generation program module 720, configured to generate word vectors for all words in the vocabulary;
a part-of-speech class vector generation program module 730, configured to generate part-of-speech class vectors corresponding to the part-of-speech classes of all words in the vocabulary;
a model training program module 740, configured to take the word vector and the part-of-speech class vector of a word in the vocabulary as input, and the probability distribution of the part-of-speech class of the word and the probability distribution of the word within its part-of-speech class as output, and train to obtain the word embedding language model.
As shown in Fig. 8, in some embodiments the word vector generation program module 720 includes:
a frequency determination program unit 721, configured to judge whether a word taken from the vocabulary is a low-frequency word;
a first word vector generation program unit 722, configured to, when the word taken from the vocabulary is judged to be a low-frequency word, decompose the word into characters and encode the resulting characters to determine the corresponding word vector;
a second word vector generation program unit 723, configured to, when the word taken from the vocabulary is judged to be a high-frequency word, extract the vector of the word as its word vector.
As shown in Fig. 9, an embodiment of the present invention further provides a word recognition system 900, including:
the word embedding language model 910 described in the above embodiment of the present invention;
a word vector generation program module 920, configured to generate the word vector of a word to be identified;
a vocabulary generation program module 930, configured to determine the part-of-speech class vector of the part-of-speech class of the word to be identified;
a word recognition program module 940, configured to input the word vector and the part-of-speech class vector of the word to be identified into the word embedding language model, to obtain the probability distribution of the part-of-speech class of the word to be identified and the probability distribution of the word to be identified within its part-of-speech class.
As shown in Fig. 10, in some embodiments, when the word to be identified belongs to the vocabulary used for training the word embedding language model, the word vector generation program module 920 includes:
a frequency determination program unit 921, configured to judge whether the word to be identified is a low-frequency word;
a first word vector generation program unit 922, configured to, when the word to be identified is judged to be a low-frequency word, decompose the word to be identified into characters and encode the resulting characters to determine the corresponding word vector;
a second word vector generation program unit 923, configured to, when the word to be identified is judged to be a high-frequency word, extract the vector of the word to be identified as its word vector.
As shown in Fig. 11, in some embodiments, when the word to be identified does not belong to the vocabulary used for training the word embedding language model, the word vector generation program module 920 includes:
a vocabulary update program unit 921', configured to determine the attributes of the word to be identified to update the vocabulary;
a word vector generation program unit 922', configured to decompose the word to be identified into characters and encode the resulting characters to determine the corresponding word vector.
In some embodiments, an embodiment of the present invention provides a non-volatile computer-readable storage medium storing one or more programs including executable instructions, which can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device) to perform any of the above word embedding language model training methods and/or word recognition methods of the present invention.
In some embodiments, an embodiment of the present invention further provides a computer program product, including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform any of the above word embedding language model training methods and/or word recognition methods.
In some embodiments, an embodiment of the present invention further provides an electronic device, including: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of the word embedding language model training method and/or the word recognition method.
In some embodiments, an embodiment of the present invention further provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, performs the steps of the word embedding language model training method and/or the word recognition method.
The system and/or input method system of the realization structure language model of the embodiments of the present invention can be used for executing this hair The method and/or input method of the realization structure language model of bright embodiment, and reach the reality of the embodiments of the present invention accordingly The technique effect that now method and/or input method of structure language model are reached, which is not described herein again.
Correlation function mould can be realized in the embodiment of the present invention by hardware processor (hardware processor) Block.
Figure 12 is a hardware architecture diagram of an electronic device for performing the word-embedding language model training method and/or the word recognition method provided by another embodiment of the present application. As shown in Figure 12, the device includes:
one or more processors 1210 and a memory 1220; in Figure 12, one processor 1210 is taken as an example.
The device for performing the word-embedding language model training method and/or the word recognition method may further include an input device 1230 and an output device 1240.
The processor 1210, the memory 1220, the input device 1230, and the output device 1240 may be connected by a bus or in other ways; connection by a bus is taken as an example in Figure 12.
As a non-volatile computer-readable storage medium, the memory 1220 can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the word-embedding language model training method and/or the word recognition method in the embodiments of the present application. By running the non-volatile software programs, instructions, and modules stored in the memory 1220, the processor 1210 executes the various functional applications and data processing of the server, that is, implements the word-embedding language model training method and/or the word recognition method of the above method embodiments.
The memory 1220 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the word-embedding language model training apparatus and/or the word recognition apparatus, and the like. In addition, the memory 1220 may include a high-speed random access memory, and may further include a non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some embodiments, the memory 1220 optionally includes memories remotely located relative to the processor 1210, and these remote memories may be connected to the word-embedding language model training apparatus and/or the word recognition apparatus through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 1230 may receive input numeric or character information and generate signals related to user settings and function control of the word-embedding language model training apparatus and/or the word recognition apparatus. The output device 1240 may include a display device such as a display screen.
The one or more modules are stored in the memory 1220 and, when executed by the one or more processors 1210, perform the word-embedding language model training method and/or the word recognition method of any of the above method embodiments.
The above product can perform the methods provided by the embodiments of the present application, and has functional modules and beneficial effects corresponding to the performed methods. For technical details not described in detail in this embodiment, reference may be made to the methods provided by the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication functions and take voice and data communication as their main goal. This type of terminal includes smartphones (such as the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. This type of terminal includes PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices: such devices can display and play multimedia content. This type of device includes audio and video players (such as the iPod), handheld devices, e-book readers, intelligent toys, and portable in-vehicle navigation devices.
(4) Servers: devices that provide computing services. A server is composed of a processor, a hard disk, a memory, a system bus, and the like; its architecture is similar to that of a general-purpose computer, but because highly reliable services must be provided, the requirements on processing capability, stability, reliability, security, scalability, manageability, and the like are higher.
(5) Other electronic devices with data interaction functions.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the essence of the above technical solutions, or the part that contributes to the related art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (12)

1. A word-embedding language model training method, comprising:
determining attributes of all words in a corpus to generate a vocabulary, wherein the attributes include the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of each word within its part-of-speech class;
generating word vectors of all the words in the vocabulary;
generating part-of-speech class vectors corresponding to the part-of-speech classes of all the words in the vocabulary;
training with the word vectors of the words in the vocabulary and the part-of-speech class vectors of the words in the vocabulary as input, and with the probability distribution over the part-of-speech classes of the words in the vocabulary and the probability distribution of each word within its part-of-speech class as output, to obtain the word-embedding language model.
2. The method according to claim 1, wherein generating the word vectors of all the words in the vocabulary comprises:
judging whether a word obtained from the vocabulary is a low-frequency word;
if so, decomposing the word obtained from the vocabulary into characters, and encoding the resulting characters to determine the corresponding word vector;
if not, extracting the vector of the word obtained from the vocabulary as the word vector.
3. A word recognition method, wherein the method uses the word-embedding language model according to claim 1 or 2, the method comprising:
generating a word vector of a word to be recognized;
determining a part-of-speech class vector of the part-of-speech class of the word to be recognized;
inputting the word vector and the part-of-speech class vector of the word to be recognized into the word-embedding language model to obtain the probability distribution over the part-of-speech class to which the word to be recognized belongs and the probability distribution of the word to be recognized within its part-of-speech class.
4. The method according to claim 3, wherein, when the word to be recognized belongs to the vocabulary used for training the word-embedding language model, generating the word vector of the word to be recognized comprises:
judging whether the word to be recognized is a low-frequency word;
if so, decomposing the word to be recognized into characters, and encoding the resulting characters to determine the corresponding word vector;
if not, extracting the vector of the word to be recognized as the word vector.
5. The method according to claim 3, wherein, when the word to be recognized does not belong to the vocabulary used for training the word-embedding language model, generating the word vector of the word to be recognized comprises:
determining the attributes of the word to be recognized to update the vocabulary;
decomposing the word to be recognized into characters, and encoding the resulting characters to determine the corresponding word vector.
6. A word-embedding language model training system, comprising:
a vocabulary generation program module, configured to determine attributes of all words in a corpus to generate a vocabulary, wherein the attributes include the part-of-speech class of each word, the probability distribution over all part-of-speech classes, and the probability distribution of each word within its part-of-speech class;
a word vector generation program module, configured to generate word vectors of all the words in the vocabulary;
a part-of-speech class vector generation program module, configured to generate part-of-speech class vectors corresponding to the part-of-speech classes of all the words in the vocabulary;
a model training program module, configured to train with the word vectors of the words in the vocabulary and the part-of-speech class vectors of the words in the vocabulary as input, and with the probability distribution over the part-of-speech classes of the words in the vocabulary and the probability distribution of each word within its part-of-speech class as output, to obtain the word-embedding language model.
7. The system according to claim 6, wherein the word vector generation program module comprises:
a frequency determination program unit, configured to judge whether a word obtained from the vocabulary is a low-frequency word;
a first word vector generation program unit, configured to, when the word obtained from the vocabulary is judged to be a low-frequency word, decompose the word into characters and encode the resulting characters to determine the corresponding word vector;
a second word vector generation program unit, configured to, when the word obtained from the vocabulary is judged to be a high-frequency word, extract the vector of the word as the word vector.
8. A word recognition system, comprising:
the word-embedding language model according to claim 6 or 7;
a word vector generation program module, configured to generate a word vector of a word to be recognized;
a vocabulary generation program module, configured to determine a part-of-speech class vector of the part-of-speech class of the word to be recognized;
a word recognition program module, configured to input the word vector and the part-of-speech class vector of the word to be recognized into the word-embedding language model to obtain the probability distribution over the part-of-speech class to which the word to be recognized belongs and the probability distribution of the word to be recognized within its part-of-speech class.
9. The system according to claim 8, wherein, when the word to be recognized belongs to the vocabulary used for training the word-embedding language model, the word vector generation program module comprises:
a frequency determination program unit, configured to judge whether the word to be recognized is a low-frequency word;
a first word vector generation program unit, configured to, when the word is judged to be a low-frequency word, decompose the word to be recognized into characters and encode the resulting characters to determine the corresponding word vector;
a second word vector generation program unit, configured to, when the word is judged to be a high-frequency word, extract the vector of the word to be recognized as the word vector.
10. The system according to claim 8, wherein, when the word to be recognized does not belong to the vocabulary used for training the word-embedding language model, the word vector generation program module comprises:
a vocabulary update program unit, configured to determine the attributes of the word to be recognized to update the vocabulary;
a word vector generation program unit, configured to decompose the word to be recognized into characters and encode the resulting characters to determine the corresponding word vector.
11. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor can perform the steps of the method according to any one of claims 1-5.
12. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-5.
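As an illustrative aid (not part of the claims), the vocabulary construction in claim 1 can be sketched as follows; the `pos_tag` function and the relative-frequency probability estimates are assumptions for exposition, since the claim does not fix how the attributes are obtained.

```python
from collections import Counter, defaultdict

def build_vocabulary(corpus_tokens, pos_tag):
    """Store each word with its part-of-speech class, the distribution over
    classes, and the distribution of words within each class."""
    word_class = {w: pos_tag(w) for w in set(corpus_tokens)}
    class_counts = Counter(word_class[w] for w in corpus_tokens)
    word_counts = Counter(corpus_tokens)

    total = sum(class_counts.values())
    p_class = {c: n / total for c, n in class_counts.items()}      # P(class)
    p_word_in_class = defaultdict(dict)
    for w, n in word_counts.items():
        c = word_class[w]
        p_word_in_class[c][w] = n / class_counts[c]                # P(word | class)
    return word_class, p_class, p_word_in_class
```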
CN201810022130.3A 2018-01-10 2018-01-10 Word embedding language model training method, word recognition method and system Active CN108417210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810022130.3A CN108417210B (en) 2018-01-10 2018-01-10 Word embedding language model training method, word recognition method and system

Publications (2)

Publication Number Publication Date
CN108417210A true CN108417210A (en) 2018-08-17
CN108417210B CN108417210B (en) 2020-06-26

Family

ID=63125464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810022130.3A Active CN108417210B (en) 2018-01-10 2018-01-10 Word embedding language model training method, word recognition method and system

Country Status (1)

Country Link
CN (1) CN108417210B (en)

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197083A (en) * 2006-12-06 2008-06-11 英业达股份有限公司 Method and apparatus for learning English vocabulary and computer readable memory medium
EP2624149A2 (en) * 2012-02-02 2013-08-07 Xerox Corporation Document processing employing probabilistic topic modeling of documents represented as text words transformed to a continuous space
US8515745B1 (en) * 2012-06-20 2013-08-20 Google Inc. Selecting speech data for speech recognition vocabulary
US9779722B2 (en) * 2013-11-05 2017-10-03 GM Global Technology Operations LLC System for adapting speech recognition vocabulary
CN103810999A (en) * 2014-02-27 2014-05-21 清华大学 Linguistic model training method and system based on distributed neural networks
CN105654946A (en) * 2014-12-02 2016-06-08 三星电子株式会社 Method and apparatus for speech recognition
CN104598441A (en) * 2014-12-25 2015-05-06 上海科阅信息技术有限公司 Method for splitting Chinese sentences through computer
CN105244029A (en) * 2015-08-28 2016-01-13 科大讯飞股份有限公司 Voice recognition post-processing method and system
US9799327B1 (en) * 2016-02-26 2017-10-24 Google Inc. Speech recognition with attention-based recurrent neural networks
CN107221325A (en) * 2016-03-22 2017-09-29 华硕电脑股份有限公司 Aeoplotropism keyword verification method and the electronic installation using this method
CN106126596A (en) * 2016-06-20 2016-11-16 中国科学院自动化研究所 A kind of answering method based on stratification memory network
CN106383877A (en) * 2016-09-12 2017-02-08 电子科技大学 On-line short text clustering and topic detection method of social media
CN106503146A (en) * 2016-10-21 2017-03-15 江苏理工学院 Computer text feature selection method, classification feature selection method and system
CN106782560A (en) * 2017-03-06 2017-05-31 海信集团有限公司 Determine the method and device of target identification text
CN107180026A (en) * 2017-05-02 2017-09-19 苏州大学 The event phrase learning method and device of a kind of word-based embedded Semantic mapping
CN107291795A (en) * 2017-05-03 2017-10-24 华南理工大学 A kind of dynamic word insertion of combination and the file classification method of part-of-speech tagging
CN107391575A (en) * 2017-06-20 2017-11-24 浙江理工大学 A kind of implicit features recognition methods of word-based vector model
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN107544958A (en) * 2017-07-12 2018-01-05 清华大学 Terminology extraction method and apparatus
CN107357789A (en) * 2017-07-14 2017-11-17 哈尔滨工业大学 Merge the neural machine translation method of multi-lingual coding information
CN107562715A (en) * 2017-07-18 2018-01-09 阿里巴巴集团控股有限公司 Term vector processing method, device and electronic equipment
CN107491531A (en) * 2017-08-18 2017-12-19 华南师范大学 Chinese network comment sensibility classification method based on integrated study framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TOMAS MIKOLOV et al.: "Efficient Estimation of Word Representations in Vector Space", arXiv preprint *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110895935A (en) * 2018-09-13 2020-03-20 阿里巴巴集团控股有限公司 Speech recognition method, system, device and medium
CN110895935B (en) * 2018-09-13 2023-10-27 阿里巴巴集团控股有限公司 Speech recognition method, system, equipment and medium
CN109346064B (en) * 2018-12-13 2021-07-27 思必驰科技股份有限公司 Training method and system for end-to-end speech recognition model
CN109346064A (en) * 2018-12-13 2019-02-15 苏州思必驰信息科技有限公司 Training method and system for end-to-end speech identification model
CN111324731A (en) * 2018-12-13 2020-06-23 百度(美国)有限责任公司 Computer implemented method for embedding words in corpus
CN111324731B (en) * 2018-12-13 2023-10-17 百度(美国)有限责任公司 Computer-implemented method for embedding words of corpus
CN109902309A (en) * 2018-12-17 2019-06-18 北京百度网讯科技有限公司 Interpretation method, device, equipment and storage medium
CN113330510A (en) * 2019-02-05 2021-08-31 国际商业机器公司 Out-of-vocabulary word recognition in direct acoustic-to-word speech recognition using acoustic word embedding
CN110196975A (en) * 2019-02-27 2019-09-03 北京金山数字娱乐科技有限公司 Problem generation method, device, equipment, computer equipment and storage medium
CN111783431B (en) * 2019-04-02 2024-05-24 北京地平线机器人技术研发有限公司 Method and device for training predicted word occurrence probability and language model by using language model
CN111783431A (en) * 2019-04-02 2020-10-16 北京地平线机器人技术研发有限公司 Method and device for predicting word occurrence probability by using language model and training language model
CN111797631A (en) * 2019-04-04 2020-10-20 北京猎户星空科技有限公司 Information processing method and device and electronic equipment
CN113646834A (en) * 2019-04-08 2021-11-12 微软技术许可有限责任公司 Automatic speech recognition confidence classifier
CN110010129A (en) * 2019-04-09 2019-07-12 山东师范大学 A kind of voice interactive system based on hexapod robot
CN110852112A (en) * 2019-11-08 2020-02-28 语联网(武汉)信息技术有限公司 Word vector embedding method and device
CN110852112B (en) * 2019-11-08 2023-05-05 语联网(武汉)信息技术有限公司 Word vector embedding method and device
CN110909551B (en) * 2019-12-05 2023-10-27 北京知道创宇信息技术股份有限公司 Language pre-training model updating method and device, electronic equipment and storage medium
CN110909551A (en) * 2019-12-05 2020-03-24 北京知道智慧信息技术有限公司 Language pre-training model updating method and device, electronic equipment and storage medium
US11422798B2 (en) 2020-02-26 2022-08-23 International Business Machines Corporation Context-based word embedding for programming artifacts
US11663402B2 (en) 2020-07-21 2023-05-30 International Business Machines Corporation Text-to-vectorized representation transformation
CN112632999A (en) * 2020-12-18 2021-04-09 北京百度网讯科技有限公司 Named entity recognition model obtaining method, named entity recognition device and named entity recognition medium
CN112528682A (en) * 2020-12-23 2021-03-19 北京百度网讯科技有限公司 Language detection method and device, electronic equipment and storage medium
CN112735380A (en) * 2020-12-28 2021-04-30 苏州思必驰信息科技有限公司 Scoring method and voice recognition method for re-scoring language model
CN112612892A (en) * 2020-12-29 2021-04-06 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN112612892B (en) * 2020-12-29 2022-11-01 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN113326693B (en) * 2021-05-28 2024-04-16 智者四海(北京)技术有限公司 Training method and system of natural language model based on word granularity
CN113326693A (en) * 2021-05-28 2021-08-31 智者四海(北京)技术有限公司 Natural language model training method and system based on word granularity
WO2023065544A1 (en) * 2021-10-18 2023-04-27 平安科技(深圳)有限公司 Intention classification method and apparatus, electronic device, and computer-readable storage medium

Also Published As

Publication number Publication date
CN108417210B (en) 2020-06-26

Similar Documents

Publication Publication Date Title
CN108417210A (en) A kind of word insertion language model training method, words recognition method and system
Gao et al. Retrieval-augmented generation for large language models: A survey
CN110334339B (en) Sequence labeling model and labeling method based on position perception self-attention mechanism
CN104143327B (en) A kind of acoustic training model method and apparatus
Balaraman et al. Recent neural methods on dialogue state tracking for task-oriented dialogue systems: A survey
CN108962224A (en) Speech understanding and language model joint modeling method, dialogue method and system
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN106407113B (en) A kind of bug localization method based on the library Stack Overflow and commit
CN113268609A (en) Dialog content recommendation method, device, equipment and medium based on knowledge graph
CN105843801A (en) Multi-translation parallel corpus construction system
CN108664465A (en) One kind automatically generating text method and relevant apparatus
CN110349597A (en) A kind of speech detection method and device
US20230094730A1 (en) Model training method and method for human-machine interaction
CN110162789A (en) A kind of vocabulary sign method and device based on the Chinese phonetic alphabet
CN110334186A (en) Data query method, apparatus, computer equipment and computer readable storage medium
Dinarelli et al. Discriminative reranking for spoken language understanding
Dethlefs Domain transfer for deep natural language generation from abstract meaning representations
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN105868187A (en) A multi-translation version parallel corpus establishing method
CN116245097A (en) Method for training entity recognition model, entity recognition method and corresponding device
CN115510188A (en) Text keyword association method, device, equipment and storage medium
CN116881457A (en) Small sample text classification method based on knowledge contrast enhancement prompt
CN112417890B (en) Fine granularity entity classification method based on diversified semantic attention model
CN109614541A (en) A kind of event recognition method, medium, device and calculate equipment
CN110197521B (en) Visual text embedding method based on semantic structure representation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20200617

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Co-patentee after: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

Patentee after: AI SPEECH Co.,Ltd.

Address before: Suzhou City, Jiangsu Province, Suzhou Industrial Park 215123 Xinghu Street No. 328 Creative Industry Park 9-703

Co-patentee before: SHANGHAI JIAO TONG University

Patentee before: AI SPEECH Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20201027

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee after: AI SPEECH Co.,Ltd.

Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee before: AI SPEECH Co.,Ltd.

Patentee before: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

TR01 Transfer of patent right
CP01 Change in the name or title of a patent holder

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee after: Sipic Technology Co.,Ltd.

Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee before: AI SPEECH Co.,Ltd.

CP01 Change in the name or title of a patent holder
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Method for Training a Word Embedding Language Model, a Method for Word Recognition, and a System

Effective date of registration: 20230726

Granted publication date: 20200626

Pledgee: CITIC Bank Limited by Share Ltd. Suzhou branch

Pledgor: Sipic Technology Co.,Ltd.

Registration number: Y2023980049433

PE01 Entry into force of the registration of the contract for pledge of patent right