US20190318732A1 - Implementing a whole sentence recurrent neural network language model for natural language processing - Google Patents
- Publication number
- US20190318732A1 (application US 15/954,399)
- Authority
- US
- United States
- Prior art keywords
- sentence
- neural network
- recurrent neural
- whole
- whole sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G06N7/005—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/183—Speech classification or search using natural language modelling using context dependencies, e.g. language models
- G10L15/19—Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
- G10L15/197—Probabilistic grammars, e.g. word n-grams
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Definitions
- This invention relates in general to computing systems and more particularly to implementing a whole sentence recurrent neural network language model for natural language processing.
- a recurrent neural network is a class of neural networks that includes weighted connections within a layer, in comparison to a traditional feed-forward network, where connections feed only to subsequent layers.
- RNNs can also include loops, which enables an RNN to store information while processing new inputs, facilitating use of RNNs for processing tasks where prior inputs need to be considered, such as time series data implemented for speech recognition and natural language processing (NLP) tasks.
- a method is directed to providing, by a computer system, a whole sentence recurrent neural network language model for estimating a probability of likelihood of each whole sentence processed by natural language processing being correct.
- the method is directed to applying, by the computer system, a noise contrastive estimation sampler against at least one entire sentence from a corpus of multiple sentences to generate at least one incorrect sentence.
- the method is directed to training, by the computer system, the whole sentence recurrent neural network language model, using the at least one entire sentence from the corpus and the at least one incorrect sentence, to distinguish the at least one entire sentence as correct.
- the method is directed to applying, by the computer system, the whole sentence recurrent neural network language model to estimate the probability of likelihood of each whole sentence processed by natural language processing being correct.
- a computer system comprises one or more processors, one or more computer-readable memories, one or more computer-readable storage devices, and program instructions, stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories.
- the stored program instructions comprise program instructions to provide a whole sentence recurrent neural network language model for estimating a probability of likelihood of each whole sentence processed by natural language processing being correct.
- the stored program instructions comprise program instructions to apply a noise contrastive estimation sampler against at least one entire sentence from a corpus of a plurality of sentences to generate at least one incorrect sentence.
- the stored program instructions comprise program instructions to train the whole sentence recurrent neural network language model, using the at least one entire sentence from the corpus and the at least one incorrect sentence, to distinguish the at least one entire sentence as correct.
- the stored program instructions comprise program instructions to apply the whole sentence recurrent neural network language model to estimate the probability of likelihood of each whole sentence processed by natural language processing being correct.
- a computer program product comprises a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se.
- the computer program product comprising the program instructions executable by a computer to cause the computer to provide, by a computer, a whole sentence recurrent neural network language model for estimating a probability of likelihood of each whole sentence processed by natural language processing being correct.
- the computer program product comprising the program instructions executable by a computer to cause the computer to apply, by the computer, a noise contrastive estimation sampler against at least one entire sentence from a corpus of a plurality of sentences to generate at least one incorrect sentence.
- the computer program product comprising the program instructions executable by a computer to cause the computer to train, by the computer, the whole sentence recurrent neural network language model, using the at least one entire sentence from the corpus and the at least one incorrect sentence, to distinguish the at least one entire sentence as correct.
- the computer program product comprising the program instructions executable by a computer to cause the computer to apply, by the computer, the whole sentence recurrent neural network language model to estimate the probability of likelihood of each whole sentence processed by natural language processing being correct.
- FIG. 1 is a block diagram illustrating one example of a system for utilizing a whole sentence RNN language model for improving the accuracy of natural language processing
- FIG. 2 is a block diagram illustrating a whole sentence RNN LM for natural language processing in comparison with locally-conditional models and non-RNN architecture models for whole sentence processing;
- FIG. 3 is a block diagram illustrating one example of components of noise contrastive estimation applied by a training controller to generate incorrect sentences to use with correct sentences to train a whole sentence RNN LM;
- FIG. 4 is a block diagram illustrating a training sequence for training a whole sentence RNN LM using correct sentences in training data and incorrect sentences generated from the training data through noise contrastive estimation;
- FIG. 5 is a block diagram illustrating a testing sequence for testing a whole sentence RNN language model using entire sentences
- FIG. 6 is a block diagram illustrating one example of a performance evaluation of the accuracy of sequence identification tasks performed in an NLP system implementing a whole sentence RNN LM;
- FIG. 7 is a block diagram illustrating one example of a one layer bidirectional LSTM (BiLSTM) configuration of a whole sentence RNN language model
- FIG. 8 is a block diagram illustrating an example of the classification accuracy of an n-gram LM compared with a whole sentence RNN LM implemented in NLP systems for performing sequence identification tasks;
- FIG. 9 is a block diagram illustrating one example of a one layer unidirectional LSTM configuration of a whole sentence RNN language model
- FIG. 10 is a block diagram illustrating an example of the word error rate of an n-gram LM compared with a whole sentence RNN LM implemented by a NLP system for speech recognition tasks, applied on a unidirectional LSTM;
- FIG. 11 is a block diagram illustrating one example of a computer system in which one embodiment of the invention may be implemented.
- FIG. 12 illustrates a high level logic flowchart of a process and computer program for training a whole sentence RNN LM on an RNN LSTM architecture
- FIG. 13 illustrates a high level logic flowchart of a process and computer program product for testing an NLP system function implementing a whole sentence RNN LM on an RNN LSTM architecture.
- FIG. 1 illustrates a block diagram of one example of a system for utilizing a whole sentence RNN language model for improving the accuracy of natural language processing.
- a natural language processing (NLP) system 100 may process a sequence of words in speech 112 , as input, and generate one or more types of outputs, such as processed sequence of words 116 .
- speech 112 may represent an entire sentence or utterance with multiple words.
- natural language processing system 100 may perform one or more types of language processing including, but not limited to, automatic speech recognition, machine translation, optical character recognition, spell checking, and additional or alternate types of processing of natural language inputs.
- automatic speech recognition may include, but is not limited to, conversational interaction, conversational telephony speech transcription, multimedia captioning, and translation.
- speech 112 may include, but is not limited to, an audio signal with spoken words, an image containing a sequence of words, or a stream of text words.
- NLP system 100 may include a speech model 120 , for translating the audio signal, image, or stream of text into statistical representations of the sounds, images, or text that make up each word in a sequence of words.
- the statistical representations of a word sequence 122 may be represented by a sentence s of T words w1, w2, . . . , wT, where each wi is a statistical representation of a word, phrase, or utterance.
- speech model 120 may represent an acoustic model that is used to create statistical representations of the audio signal and the phonemes or other linguistic units within speech 112 .
- speech model 120 may be trained from a set of audio recordings and their corresponding transcripts, created by taking audio recordings of speech and their text transcriptions and using software to create statistical representations of the sounds that make up each word.
- NLP system 100 may implement a language model (LM) to generate a probability distribution over a sequence of words, such as a whole sentence.
- the accuracy at which the LM generates a probability distribution for word sequence 122 impacts the accuracy of NLP system 100 to accurately process speech 112 into processed sequence of words 116 .
- NLP system 100 may implement a whole sentence RNN language model (LM) 110 , which given a sequence of processed words of a whole sentence from speech 112 of word sequence 122 , assigns a probability to the whole sentence, illustrated as probability for entire word sequence 114 .
- providing whole sentence RNN LM 110 to estimate the relative likelihood of an entire phrase being correctly processed is useful in many natural language processing applications that may be performed by NLP system 100 .
- NLP system 100 tries to match the sounds within speech 112 with word sequences.
- whole sentence RNN LM 110 may provide context to distinguish between words and phrases that sound similar, to assign a probability that the correct sentence has been recognized.
- whole sentence RNN LM 110 directly models the probability for the whole sentence in word sequence 122 .
- whole sentence RNN LM 110 may be trained to predict the probability of a whole sentence directly, without partially computing conditional probabilities for each classified word in the sentence individually.
- whole sentence RNN LM 110 may represent a whole sentence model integrated with an RNN long short-term memory (LSTM) architecture 130 .
- the whole sentence model of whole sentence RNN LM 110 is not trained with a chain rule as a locally conditional model.
- a LM trained with a chain rule as a locally conditional model may be limited to the local conditional likelihood of generating the current word given the word context, thus making local decisions at each word, rather than exploiting whole sentence structures when computing a probability as performed by whole sentence RNN LM 110 .
- an LM run on a neural network or other type of architecture may be limited to computing probabilities for a set length of words selected when training the LM, in contrast to an RNN LSTM architecture 130 , which has a long memory and can compute the probability of a whole sentence of an arbitrary length.
- the addition of LSTM elements in the RNN within RNN LSTM architecture 130 increases the amount of time that data can remain in memory over arbitrary time intervals, increasing the ability of whole sentence RNN LM 110 to classify, process, and predict sequential series as a whole and to minimize the exploding and vanishing gradient problem that may be present when training a standard RNN.
- an RNN LSTM architecture has less relative sensitivity to gap length in comparison to a standard RNN, feedforward neural network or n-gram model.
- RNN LSTM architecture 130 may be implemented in one or more configurations including one or more layers, both unidirectional and bidirectional. While the present invention is described with reference to whole sentence RNN LM 110 implemented in RNN LSTM architecture 130, in additional or alternate embodiments, whole sentence RNN LM 110 may also be implemented in additional or alternate neural network architectures, such as a conventional recurrent neural network or a conventional neural network. In addition, RNN LSTM architecture 130 may implement additional standard RNN and NN layers. In one example, an NN layer may represent a feedforward NN in which each layer feeds into the next layer in a chain connecting the inputs to the outputs.
- in a feedforward NN, at each iteration t, the values of the input nodes are set and then fed forward through each layer of the network, overwriting previous activations.
- a standard RNN more efficiently manages inputs that may exhibit a sequential relationship, such as predicting the next word in a sentence.
- a hidden layer receives inputs from both the current inputs and from the same hidden layer at a previous time step.
- RNN LSTM architecture 130 further extends a standard RNN architecture by adding LSTM elements that increase the amount of time data can be held in memory over arbitrary periods of time.
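The recurrence described above, in which a hidden layer receives both the current input and its own previous state, with LSTM gates controlling how long information persists in the cell state, can be sketched for a single scalar-weight cell. The weight values and inputs here are illustrative only, not taken from the patent:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    # W holds scalar weights for a single-unit cell; each gate sees the
    # current input x and the previous hidden state h_prev.
    f = sigmoid(W["wf_x"] * x + W["wf_h"] * h_prev + W["bf"])    # forget gate
    i = sigmoid(W["wi_x"] * x + W["wi_h"] * h_prev + W["bi"])    # input gate
    g = math.tanh(W["wg_x"] * x + W["wg_h"] * h_prev + W["bg"])  # candidate cell state
    o = sigmoid(W["wo_x"] * x + W["wo_h"] * h_prev + W["bo"])    # output gate
    c = f * c_prev + i * g   # cell state: long-term memory carried across steps
    h = o * math.tanh(c)     # hidden state: exposed to the next layer
    return h, c

# Run a toy sequence through the cell; the state at each step depends on
# every earlier input via the recurrent connections.
W = {k: 0.5 for k in ("wf_x", "wf_h", "bf", "wi_x", "wi_h", "bi",
                      "wg_x", "wg_h", "bg", "wo_x", "wo_h", "bo")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, W)
```

The forget gate f is what lets the cell retain or discard accumulated state over arbitrary time intervals, which is the property the patent relies on for whole-sentence modeling.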
- a training controller 132 may control training of whole sentence RNN LM 110 by applying noise contrastive estimation (NCE) 134 to training data.
- NCE 134 may represent a sampling-based approach for unnormalized training of statistical models.
- rather than maximizing the likelihood of the training data, NCE 134 generates a number of noise samples for each training sample and implicitly constrains the normalization term to be “1”.
- Training controller 132 trains the parameters of whole sentence RNN LM 110 to maximize the likelihood of a binary prediction task that identifies the ground truth from the noise samples.
- NCE 134 may perform a nonlinear logistic regression to discriminate between the observed training data and the artificially-generated noise data.
- a density estimate of whole sentence RNN LM 110 may be denoted by p_m(·; θ).
- with k noise samples drawn from a noise distribution p_n for each training sentence, the NCE 134 loss may be defined, in the standard noise contrastive estimation form, as: J(θ) = −Σx∈data ln[p_m(x; θ)/(p_m(x; θ) + k·p_n(x))] − Σy∈noise ln[k·p_n(y)/(p_m(y; θ) + k·p_n(y))].
- the model p_m may learn the probability density of the training data in the limit.
- NCE 134 may implicitly constrain the variance of the normalization term to be very small during training, which may make it feasible to use unnormalized probabilities during testing. With a sufficient number of noise samples, the solution to a binary prediction model of whole sentence RNN LM 110 converges to the maximum likelihood estimate on the training data.
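The binary prediction task that training controller 132 optimizes can be sketched as follows, assuming the model and the noise sampler each return a log score for a whole sentence; the k = 2 sample scores below are invented for illustration:

```python
import math

def nce_posterior(log_p_model, log_p_noise, k):
    """Posterior probability that a sentence came from the data rather than
    from one of k noise samples, using unnormalized model scores.
    h(s) = p_m(s) / (p_m(s) + k * p_n(s)), computed in log space."""
    z = log_p_model - (log_p_noise + math.log(k))
    return 1.0 / (1.0 + math.exp(-z))

def nce_loss(data_scores, noise_scores, k):
    """Binary prediction objective: push data sentences toward label 1 and
    noise sentences toward label 0. Each score is a (log p_model, log p_noise) pair."""
    loss = 0.0
    for lm, ln_ in data_scores:
        loss -= math.log(nce_posterior(lm, ln_, k))        # ground truth -> 1
    for lm, ln_ in noise_scores:
        loss -= math.log(1.0 - nce_posterior(lm, ln_, k))  # noise sample -> 0
    return loss

# One real sentence the model scores highly, plus k=2 noise sentences it scores low:
loss = nce_loss(data_scores=[(-5.0, -9.0)],
                noise_scores=[(-12.0, -6.0), (-11.0, -7.0)], k=2)
```

Minimizing this loss with respect to the model parameters is what drives the unnormalized sentence scores toward the data distribution without ever computing a normalization term.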
- the results of whole sentence RNN LM 110 applied by NLP system 100 to perform processing tasks to output processed sequence of words 116 may be presented on a range of tasks from sequence identification tasks, such as palindrome detection, to large vocabulary automatic speech recognition (LVCSRT) and conversational interaction (CI).
- FIG. 2 illustrates a block diagram of a whole sentence RNN LM for natural language processing in comparison with locally-conditional models and non-RNN architecture models for whole sentence processing.
- NLP systems such as NLP system 100 may access one or more types of models for predicting a probability over a sequence of words, with different error rates.
- the error rate may indicate the error rate of a task performed by the NLP system, impacted by the probability predicted by the language model implemented by the NLP system.
- an NLP system implementing a whole sentence RNN LM 110 has a lowest error rate in comparison with an error rate of NLP systems implementing a whole sentence maximum entropy model 224 run in a non-RNN architecture 220 or locally conditioned models 210 .
- whole sentence RNN LM 110 represents a whole sentence recurrent language model that is not constrained by locally-conditional constraints.
- locally conditional models 210 may represent one or more types of models that are trained based on a chain rule or other locally-conditional constraints.
- a locally-conditional constraint may represent a training criteria that generates a local conditional likelihood of generating a current word given the word context, thus locally computing conditional probabilities for each word, rather than modeling the probability of a whole sentence or utterance.
- a locally-conditional design effectively limits the ability of the LM to exploit whole sentence structures and increases the error rate percentage of tasks performed by NLP systems based on the probabilities predicted by locally conditioned models 210 .
- whole sentence RNN LM 110 receives word sequence 122 and assigns a probability for entire word sequence 114 , for a whole sentence within word sequence 122 , to directly model the probability of a whole sentence or utterance and decrease the error rate percentage of tasks performed based on probabilities predicted by whole sentence RNN LM 110 .
- locally conditional models 210 may include n-gram LM 212 and standard RNN LM 214 .
- an n-gram may refer to a contiguous sequence of n items from a given sample of text or speech, and n-gram LM 212 may represent a probabilistic language model for predicting the next item in a sequence in the form of an (n−1)-order Markov model.
- standard RNN LM 214 may represent a language model implemented on a standard RNN.
- N-gram LM 212 and standard RNN LM 214 may represent language models that are constrained by locally-conditional constraints.
- n-gram LM 212 and standard RNN LM 214 may represent statistical language models that are conditional models constrained by local-conditioned constraints by estimating the probability of a word given a previous word sequence.
- the probability of a sentence s of T words w1, w2, . . . , wT may be calculated as the product of word probabilities by using the chain rule: p(s) = p(w1)·p(w2|w1)· . . . ·p(wT|w1, . . . , wT−1).
- n-gram LM 212 may estimate the conditional probability of the next word given the history using counts computed from the training data, but the history of word wt may be truncated to the previous n−1 words, which may be fewer than five words.
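The truncated-history, count-based estimate described above can be illustrated with a toy bigram model (n = 2, a one-word history); the corpus and sentences here are invented for illustration:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count-based bigram LM: the history of each word is truncated
    to the single previous word."""
    counts, context = defaultdict(int), defaultdict(int)
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            counts[(prev, cur)] += 1
            context[prev] += 1
    return lambda prev, cur: (counts[(prev, cur)] / context[prev]
                              if context[prev] else 0.0)

def sentence_prob(p, sentence):
    """Chain rule with truncated history: the sentence probability is the
    product of locally-conditional bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= p(prev, cur)
    return prob

p = train_bigram(["the cat ran", "the cat sat", "a dog ran"])
prob = sentence_prob(p, "the cat ran")  # product of local decisions at each word
```

Every factor in the product is a local decision conditioned only on the previous word, which is exactly the locally-conditional limitation the whole sentence RNN LM avoids.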
- although standard RNN LM 214 may exploit word dependencies over a longer context window than what is feasible with an n-gram language model, it is still trained with the locally-conditional design of the chain rule at each word, which limits its ability to exploit the whole sentence structure.
- standard RNN LM 214 may also refer to a feed-forward neural network LM that is cloned across time, with the hidden state at time step (t−1) concatenated with the embedding of the word wt to form the input that predicts the next word wt+1.
- a feed-forward neural network LM may embed the word history into a continuous space and use the neural network to estimate the conditional probability, such that the conditional likelihood of wt+1 is influenced by the hidden states at all previous time steps 1, . . . , t.
- although standard RNN LM 214 may have the capability to capture a longer context than n-gram LM 212, the history may be truncated to the previous 15-20 words in order to speed up training and decoding, and global sentence information may be difficult to capture without triggering exploding or vanishing gradient problems.
- the locally-conditional design of standard RNN LM 214 may make implicit interdependence assumptions that may not always be true, increasing the rate of errors.
- a whole sentence maximum entropy model 224 may directly model the probability of a sentence or utterance, but not within an RNN architecture.
- whole sentence maximum entropy model 224 may function independently of locally conditional models 210, in a non-RNN architecture 220, with the flexibility of custom sentence-level features, such as the length of the sentence, which are hard to model via locally conditional models 210.
- an NLP system implementing whole sentence maximum entropy model 224 for a task may provide processed sequences of words at an error rate that is lower than that of locally conditional models 210; however, the average error rate achieved by the NLP system implementing non-RNN architecture 220 may still be greater than the average error rate of an NLP system implementing whole sentence RNN LM 110 operating within RNN LSTM architecture 130.
- whole sentence RNN LM 110 may be trained to predict the probability of a sentence p(s) directly, without computing conditional probabilities for each word in the sentence independently as performed by locally conditional models 210 .
- whole sentence RNN LM 110 may represent an instance of whole sentence maximum entropy model 224 or another whole sentence model, extended for application in RNN LSTM architecture 130 , to create a whole sentence neural network language model.
- extending whole sentence maximum entropy model 224 to efficiently and effectively function in RNN LSTM architecture 130 may include specifying training controller 132 to train whole sentence maximum entropy model 224 to function in RNN LSTM architecture 130, applying NCE 134 to generate additional training samples.
- training controller 132 may apply additional or alternate types of training to whole sentence RNN LM 110 .
- training controller 132 may apply one or more types of softmax computations and other types of computations for training one or more models applied by natural language processing system 100 .
- whole sentence RNN LM 110 may aim to assign a probability to each whole sentence, with higher scores assigned to sentences that are more likely to occur in a domain of interest.
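One way such whole-sentence scores could be consumed, for example in speech recognition, is to rescore a list of candidate transcriptions and keep the highest-scoring one. The scorer below is a hypothetical stand-in for whole sentence RNN LM 110, and the in-domain score table is invented for illustration:

```python
def rescore(hypotheses, sentence_score):
    """Pick the candidate transcription with the highest whole-sentence score.
    `sentence_score` maps a full sentence to one unnormalized log score,
    with no per-word chain rule involved."""
    return max(hypotheses, key=sentence_score)

# Stand-in scorer: favors sentences from a tiny hand-made in-domain table.
IN_DOMAIN = {"recognize speech": 0.0, "wreck a nice beach": -8.0}
score = lambda s: IN_DOMAIN.get(s, -20.0)

# Two acoustically similar hypotheses; the whole-sentence score breaks the tie.
best = rescore(["wreck a nice beach", "recognize speech"], score)
```

Because the score is assigned to the sentence as a whole, sentence-level evidence such as overall structure or length can influence the choice, which locally conditional models cannot express directly.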
- while whole sentence RNN LM 110 may also integrate sentence-level convolutional neural network models that classify a sentence with a class label for one of N given categories, a convolutional neural network model may still only provide a conditional model for performing classification tasks based on class labels, with the limitations of locally conditional models 210, and a class label assignment may not accurately predict the likelihood of a sentence being correct.
- FIG. 3 illustrates a block diagram of one example of components of noise contrastive estimation applied by a training controller to generate incorrect sentences to use with correct sentences to train a whole sentence RNN LM.
- NCE 134 may implement one or more types of noise samplers 310 for sampling training data 330 .
- NCE 134 is specified for training whole sentence RNN LM 110 by sampling entire sentences from training data 330 , as opposed to sampling only word samples to speed up other types of computations, such as softmax computations.
- training data 330 may include one or more corpora of data for training whole sentence RNN LM 110 to generate an un-normalized probability for an entire sentence.
- training data 330 may include a corpus of data including one or more of palindrome (PAL) 350 , lexicographically-ordered words (SORT) 352 , and expressing dates (DATE) 354 .
- palindrome 350 may include a 1-million word corpus with a 10-word vocabulary of sequences which read the same backward and forward, including examples such as “the cat ran fast ran cat the”.
- lexicographically-ordered words 352 may include a 1-million word corpus with a 15-word vocabulary of sequences of words in alphabetical order, including examples such as “bottle cup haha hello kitten that what”.
- expressing dates 354 may include a 7-million word corpus with a 70-word vocabulary of words expressing dates, including examples such as “January first nineteen oh one”.
- NCE 134 may generate a sufficient number of samples for unnormalized training of whole sentence RNN LM 110 , where whole sentence RNN LM 110 may learn the data distribution with a normalization term implicitly constrained to 1.
- noise samplers 310 may include one or more back-off n-gram LMs built on training data 330 as noise samplers. In additional or alternate examples, noise samplers 310 may include additional or alternate types of LMs implemented for noise sampling.
- noise samplers 310 may generate one or more types of noise samples from training data 330 , such as, but not limited to, noise sampler model sequences 312 and edit transducer samples 314 .
- noise sampler 310 may generate noise samples from training data 330 using a single type of sampler or multiple types of samplers.
- each of the noise samples generated by noise samplers 310 may represent an incorrect sentence for use by training controller 132 with correct sentences in training data 330 to train whole sentence RNN LM 110 .
- noise sampler model sequences 312 may represent word sequences using a noise sampler model such as an n-gram LM 212 or standard RNN LM 214 , by first randomly selecting one sentence from training data 330 , such as the reference sentence illustrated at reference numeral 332 , and then randomly selecting N positions to introduce a substitution (SUB), an insertion (INS), or deletion (DEL) error.
- the SUB sampled sentence of “July twenty twentieth nineteen seventy nine” illustrated at reference numeral 340 includes a substitution of “twenty” for “the” from the reference sentence illustrated at reference numeral 332 .
- the INS sampled sentence of “July the twentieth nineteen ninety seventy nine” illustrated at reference numeral 342 includes an insertion of “ninety” between “nineteen” and “seventy” from the reference sentence illustrated at reference numeral 332 .
- the DEL sampled sentence of “July the twentieth * seventy nine” illustrated at reference numeral 344 includes a deletion of “nineteen” from the reference sentence illustrated at reference numeral 332 .
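The SUB, INS, and DEL sampling steps illustrated above may be sketched as follows. This is an illustrative sketch only: replacement and insertion words are drawn uniformly from a toy vocabulary, whereas noise samplers 310 assign word probabilities from an n-gram LM, and the function name and vocabulary here are hypothetical.

```python
import random

def make_noise_samples(sentence, vocab, seed=0):
    """Generate one SUB, INS, and DEL noise sample from a reference sentence.

    Illustrative sketch: words are drawn uniformly from `vocab`, whereas
    the noise samplers described above would draw from an n-gram LM.
    """
    rng = random.Random(seed)
    words = sentence.split()
    samples = {}

    # SUB: substitute the word at a random position with a different word.
    sub = list(words)
    i = rng.randrange(len(sub))
    sub[i] = rng.choice([w for w in vocab if w != sub[i]])
    samples["SUB"] = " ".join(sub)

    # INS: insert a random word at a random position.
    ins = list(words)
    ins.insert(rng.randrange(len(ins) + 1), rng.choice(vocab))
    samples["INS"] = " ".join(ins)

    # DEL: delete the word at a random position.
    dropped = list(words)
    del dropped[rng.randrange(len(dropped))]
    samples["DEL"] = " ".join(dropped)
    return samples

samples = make_noise_samples(
    "July the twentieth nineteen seventy nine",
    vocab=["July", "the", "twentieth", "nineteen", "seventy",
           "nine", "ninety", "first", "twenty"])
```

Each returned sample stays within an edit distance of one word from the reference sentence, matching the single-error examples at reference numerals 340, 342, and 344.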
- edit transducer samples 314 may include word sequences generated from training data 330 using a random (RAND) noise sampler model. For example, from a reference sentence from expressing dates 354 in training data 330 of “July the twentieth nineteen seventy nine” as illustrated at reference numeral 332 , noise samplers 310 may generate noise sampler model sequences 312 of “July the twenty fifth of September two-thousand eighteen” as illustrated at reference numeral 334 . In one example, the RAND noise sampler model may randomly select one sentence from the training data, and then randomly select N positions to introduce an insertion, substitution or deletion error into the sentence.
- the probability of a word to be inserted or substituted with is assigned by the noise sampler model based on the n-gram history at the position being considered to ensure that each noisy sentence, with errors, has an edit distance of at most N words from the original sentence.
- a separate noise score may be assigned to each sentence in edit transducer samples 314 by noise samplers 310 , where the noise score is the sum of all n-gram scores in the sentence.
- sampling from noise sampler model sequences 312 may limit the length of sentences, based on the length of sentence handled by the noise sampler model. For example, n-gram LM 212 based noise sampler model sequences may be limited to shorter sentences. For the types of errors that may be encountered in speech recognition tasks, however, the additional length provided by edit transducer samples 314 may cover a larger noise space and avoid reducing generalization over those error types.
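The noise score described above, computed as the sum of all n-gram scores in a sentence, may be sketched with a toy add-one-smoothed bigram model. This is a hedged sketch: noise samplers 310 would use a trained back-off n-gram LM rather than the minimal bigram counts used here, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def bigram_noise_score(sentence, corpus):
    """Score a sentence as the sum of smoothed bigram log-probabilities.

    Illustrative sketch of the 'sum of all n-gram scores' noise score,
    using add-one smoothing in place of a back-off n-gram LM.
    """
    # Collect unigram and bigram counts from the training corpus.
    uni, bi = Counter(), Counter()
    vocab = set()
    for line in corpus:
        toks = ["<s>"] + line.split() + ["</s>"]
        vocab.update(toks)
        uni.update(toks[:-1])
        bi.update(zip(toks, toks[1:]))

    # Sum the add-one smoothed bigram log-probabilities over the sentence.
    toks = ["<s>"] + sentence.split() + ["</s>"]
    V = len(vocab)
    return sum(math.log((bi[(h, w)] + 1) / (uni[h] + V))
               for h, w in zip(toks, toks[1:]))

corpus = ["July the twentieth nineteen seventy nine",
          "January first nineteen oh one"]
good = bigram_noise_score("July the twentieth nineteen seventy nine", corpus)
bad = bigram_noise_score("July twenty twentieth nineteen seventy nine", corpus)
```

A noisy sentence containing unseen bigrams receives a lower total log-score than its reference, which is the property the noise score exploits.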
- FIG. 4 illustrates a block diagram of a training sequence for training a whole sentence RNN LM using correct sentences in training data and incorrect sentences generated from the training data through noise contrastive estimation.
- training data used to train whole sentence RNN LM 110 may include a correct sentence 412 , from training data 330 , and at least one incorrect sentence 414 , generated by noise samplers 310 from training data 330 .
- training controller 132 may pass both correct sentence 412 and incorrect sentence 414 through a forward pass of RNN 416 to train whole sentence RNN LM 110 .
- RNN 416 receives inputs w 1 , w 2 , . . . , w T , for a correct sentence 412 and inputs v 1 , v 2 , . . . , v T for an incorrect sentence 414 .
- noise samplers 310 may generate N incorrect sentences based on correct sentence 412 , and each of the N incorrect sentences may be passed through a forward pass.
- RNN 416 may represent one or more layers implemented within RNN LSTM architecture 130 .
- RNN 416 may sequentially update layers based on the inputs, learning correct sentences from inputs w 1 , w 2 , . . . , w T for correct sentence 412 as distinguished from inputs v 1 , v 2 , . . . , v T for incorrect sentence 414 , to train whole sentence RNN LM 110 to classify correct sentence 412 from incorrect sentence 414 , with outputs from a hidden layer for the entire sentence illustrated by h 1 , h 2 , . . . , h T 418 .
- an NN scorer 420 receives h 1 , h 2 , . . . , h T 418 as inputs and scores a single value S for the entire sentence, where S is an unnormalized probability of the entire sentence.
- a NN 424 receives S and determines an output of “1” if the input is a probability indicating the entire sentence is correct and an output of “0” if the input is a probability indicating the entire sentence is not correct.
- training controller 132 may pass a next correct training sentence 412 and next incorrect sentence 414 through whole sentence RNN LM 110 and NN 424 for each selection of training sentences selected to train whole sentence RNN LM 110 .
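The training sequence of FIG. 4, in which correct and noise sentences are passed through the model and a final layer learns to output 1 for the correct sentence and 0 for the incorrect ones, may be sketched with a toy model. Note the hedge: the patent's scorer is an RNN LSTM trained with an NCE loss, which is replaced here by a bag-of-words logistic scorer with a log loss purely to keep the sketch self-contained; all names are hypothetical.

```python
import math
import random

def train_sentence_scorer(pairs, epochs=200, lr=0.5, seed=0):
    """Train a toy bag-of-words logistic scorer on (sentence, label)
    pairs, with label 1 for a correct sentence and 0 for a noise sentence.

    Illustrative sketch of the binary objective in FIG. 4; a linear
    model with a log loss stands in for the RNN LSTM scorer and NCE loss.
    """
    rng = random.Random(seed)
    w = {}    # per-word weight, created on first use
    b = 0.0   # bias

    def score(sentence):
        # Unnormalized sentence score: sum of word weights plus bias.
        return sum(w.get(t, 0.0) for t in sentence.split()) + b

    data = list(pairs)
    for _ in range(epochs):
        rng.shuffle(data)
        for s, y in data:
            p = 1.0 / (1.0 + math.exp(-score(s)))  # sigmoid(score)
            g = p - y                              # d(log loss)/d(score)
            for t in s.split():
                w[t] = w.get(t, 0.0) - lr * g
            b -= lr * g
    return score

scorer = train_sentence_scorer([
    ("July the twentieth nineteen seventy nine", 1),    # correct
    ("July twenty twentieth nineteen seventy nine", 0), # SUB noise
    ("July the twentieth seventy nine", 0),             # DEL noise
])
```

After training, the correct sentence scores above its noise counterparts, mirroring how whole sentence RNN LM 110 learns to classify correct sentence 412 from incorrect sentence 414.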
- FIG. 5 illustrates a block diagram of a testing sequence for testing a whole sentence RNN language model using entire sentences.
- a tester may input a word sequence 112 into whole sentence RNN LM 110 , as illustrated by inputs w 1 , w 2 , . . . , w T 512 .
- RNN 416 receives the inputs for an entire sentence of w 1 , w 2 , . . . , w T 512 which results in output from a hidden layer for the entire sentence illustrated by h 1 , h 2 , . . . , h T 418 .
- NN scorer 420 receives h 1 , h 2 , . . . , h T 518 as inputs and scores a single value s 522 for the entire sentence, where s is an unnormalized probability of the entire sentence, based on the training of whole sentence RNN LM 110 for correct sentence 412 .
- single value s 522 may be further evaluated to determine whether the probability of the entire sentence matches an expected result.
- FIG. 6 illustrates a block diagram of one example of a performance evaluation of the accuracy of sequence identification tasks performed in an NLP system implementing a whole sentence RNN LM.
- a percentage of the generated data in a training set, such as 10% of the generated data in a corpus of expressing dates 354 , may be applied as a test set 602 .
- a training set sentence may include “July the twentieth nineteen eighty” as illustrated at reference numeral 606 .
- an imposter sentence 604 is generated for each training set sentence by substituting one word, such as applied by the sub task in noise sampler model sequences 312 .
- an imposter sentence may include “July first twentieth nineteen eighty”, where the word “the” from the training set sentence has been substituted with the word “first”, as illustrated at reference numeral 608 .
- whole sentence RNN LM 110 may determine scores for each of the sentences. For example, whole sentence RNN LM 110 may assign a score 612 of “0.085” to the training set sentence illustrated at reference numeral 606 and a score 614 of “0.01” to the imposter sentence illustrated at reference numeral 608 .
- a binary linear classifier 620 may be trained to classify the scores output by whole sentence RNN LM 110 into two classes.
- binary linear classifier 620 may be trained to classify scores by using a linear boundary 626 to distinguish the linear space between a first class 622 , which represents an incorrect sentence, and a second class 624 , which represents a correct sentence.
- the performance of an NLP system in performing sequential classification tasks may be evaluated by the classification accuracy assessed by binary linear classifier 620 of classifying imposter sentences in first class 622 and classifying test data sentences in second class 624 .
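The evaluation step above, in which binary linear classifier 620 separates scores with linear boundary 626, reduces in one dimension to choosing a threshold on the sentence scores. A minimal sketch, with a hypothetical function name and toy scores modeled on FIG. 6:

```python
def fit_score_threshold(correct_scores, imposter_scores):
    """Fit a 1-D linear decision boundary on sentence scores.

    Illustrative sketch of a binary linear classifier on model scores:
    pick the threshold that maximizes classification accuracy over
    correct vs. imposter sentences.
    """
    labeled = ([(s, 1) for s in correct_scores] +
               [(s, 0) for s in imposter_scores])
    candidates = sorted(s for s, _ in labeled)
    best_t, best_acc = candidates[0], 0.0
    for t in candidates:
        # A sentence is predicted correct when its score is >= t.
        acc = sum((s >= t) == (y == 1) for s, y in labeled) / len(labeled)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Scores like those in FIG. 6: correct sentences score higher.
threshold, accuracy = fit_score_threshold(
    correct_scores=[0.085, 0.09, 0.07],
    imposter_scores=[0.01, 0.02, 0.03])
```

The resulting classification accuracy is the quantity used to evaluate the sequence identification performance of the NLP system.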
- FIG. 7 illustrates a block diagram of one example of a one layer bidirectional LSTM (BiLSTM) configuration of a whole sentence RNN language model.
- an LSTM layer 730 may be loaded once from beginning to end and once from end to beginning, which may increase the speed at which BiLSTM learns a sequential task in comparison with a one directional LSTM.
- BiLSTM 700 may receive each of inputs w 1 , w 2 , . . . , w T 710 at an embedding layer 720 , with an embedding node for each word w.
- each word is loaded through the embedding layer to two LSTMs within LSTM layer 730 , one at the beginning of a loop and one at the end of a loop.
- the first and last LSTM outputs from LSTM layer 730 may feed forward outputs to a concatenation layer 740 .
- concatenation layer 740 may represent a layer of NN scorer 420 .
- Concatenation layer 740 may concatenate the outputs, providing double the number of outputs to a next fully connected (FC) layer 742 .
- FC 742 obtains the final score of the sentence.
- BiLSTM 700 may include additional or alternate sizes of embedding layer 720 and LSTM layer 730 , such as an embedding size of two hundred in embedding layer 720 , with seven hundred hidden LSTM units in LSTM layer 730 .
- while concatenation layer 740 is illustrated receiving the first and last LSTM outputs from LSTM layer 730 and concatenating the outputs, in additional or alternate examples concatenation layer 740 may receive additional LSTM outputs, or may be replaced by an alternative NN scoring layer that applies one or more scoring functions to multiple outputs from LSTM layer 730 .
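The scoring head described above, in which concatenation layer 740 joins LSTM outputs and FC 742 produces the final sentence score, may be sketched as follows. The hidden-state values, function name, and weights are hypothetical placeholders for the outputs of LSTM layer 730, and concatenating the last forward and last backward states is one plausible reading of the "first and last LSTM outputs".

```python
def concat_fc_score(forward_states, backward_states, fc_weights, fc_bias=0.0):
    """Score a sentence from bidirectional hidden states.

    Illustrative sketch of the concatenation + FC scoring head: the
    final forward state and final backward state are concatenated
    (doubling the feature size) and passed through a fully connected
    layer to obtain a single sentence score.
    """
    features = forward_states[-1] + backward_states[-1]  # list concatenation
    assert len(fc_weights) == len(features)
    return sum(w * x for w, x in zip(fc_weights, features)) + fc_bias

# Toy 2-dimensional hidden states for a 3-word sentence.
score = concat_fc_score(
    forward_states=[[0.1, 0.2], [0.3, 0.1], [0.5, 0.4]],
    backward_states=[[0.2, 0.0], [0.1, 0.3], [0.6, 0.2]],
    fc_weights=[1.0, -1.0, 0.5, 0.5])
```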
- FIG. 8 illustrates a block diagram of an example of the classification accuracy of an n-gram LM compared with a whole sentence RNN LM implemented in NLP systems for performing sequence identification tasks.
- a table 806 illustrates the sequence identification task classification error rates for a test set 804 , which may be determined by binary linear classifier 620 in FIG. 6 .
- the test set 804 may include the corpus of training data 330 including one or more of palindrome (PAL) 350 , lexicographically-ordered words (SORT) 352 , and expressing dates (DATE) 354 .
- a percentage of test set 804 may be selected and imposter sentences generated for each of the selected sentences from test set 804 using each of the noise sampler types, including a sub task, an ins task, a del task, and a rand task, as applied in FIG. 6 .
- table 806 illustrates the classification error rates for n-gram LM 212 , set to a 4 word length, and whole sentence RNN LM 110 , as trained on BiLSTM 700 with an embedding size of 200 and 700 hidden units, trained with training data 330 using stochastic gradient descent and the NCE loss function with a mini-batch size of 512.
- a set of 20 noise samples were generated by NCE 134 per data point.
- the learning rate may be adjusted using an annealing strategy, where the learning rate may be halved if the heldout loss was worse than a previous iteration.
- the classification accuracy of whole sentence RNN LM 110 for sequence identification tasks for imposter sentences generated by the sub task, the ins task, and the del task on average is above 99%.
- the classification accuracy for n-gram LM 212 for sequence identification tasks for imposter sentences is below 99% accuracy.
- the accuracy of each model is evaluated on each model's ability to classify the true sentences from the imposter sentences.
- the difference in classification accuracy between whole sentence RNN LM 110 and n-gram LM 212 may be because whole sentence RNN LM 110 does not need to make conditional independence assumptions that are inherent in locally-conditional models like n-gram LM 212 .
- FIG. 9 illustrates a block diagram of one example of a one-layer unidirectional LSTM configuration of a whole sentence RNN language model.
- in a one-layer unidirectional LSTM 900 , an LSTM layer 930 is loaded from left to right.
- unidirectional LSTM 900 may receive each of inputs w 1 , w 2 , . . . , w T 910 at an embedding layer 920 , with an embedding node for each word w.
- each word is loaded through the embedding layer to an LSTM within LSTM layer 930 , and each LSTM passes its output to a next LSTM within LSTM layer 930 .
- each LSTM may feed forward outputs to a mean pooling layer 940 .
- mean pooling layer 940 may represent a layer of NN scorer 420 .
- Mean pooling layer 940 may pool the outputs over hidden states at each time step into a mean value passed to a next layer FC 942 , which obtains the final score of the sentence.
- unidirectional LSTM 900 may include additional or alternate sizes of embedding layer 920 and LSTM layer 930 .
- while mean pooling layer 940 is illustrated receiving all the LSTM outputs from LSTM layer 930 and taking a mean of the outputs, in additional or alternate examples mean pooling layer 940 may receive only a selection of LSTM outputs, or may be replaced by an alternative NN scoring layer that applies one or more scoring functions to multiple outputs from LSTM layer 930 .
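The mean pooling and FC scoring path described above may be sketched in the same spirit. The hidden states below stand in for the outputs of LSTM layer 930, and the function name and weights are hypothetical:

```python
def mean_pool_fc_score(hidden_states, fc_weights, fc_bias=0.0):
    """Score a sentence from unidirectional hidden states.

    Illustrative sketch of the mean pooling + FC scoring head: the
    hidden states at every time step are averaged element-wise, and
    the mean vector is passed through a fully connected layer to
    obtain the final sentence score.
    """
    dim = len(hidden_states[0])
    mean = [sum(h[i] for h in hidden_states) / len(hidden_states)
            for i in range(dim)]
    return sum(w * x for w, x in zip(fc_weights, mean)) + fc_bias

# Toy 2-dimensional hidden states for a 3-word sentence.
score = mean_pool_fc_score(
    hidden_states=[[0.2, 0.4], [0.4, 0.0], [0.6, 0.2]],
    fc_weights=[1.0, 2.0])
```

Mean pooling gives every time step equal weight in the final score, in contrast to the concatenation head, which looks only at the endpoint states.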
- FIG. 10 illustrates a block diagram of one example of the word error rate of an n-gram LM compared with a whole sentence RNN LM implemented by a NLP system for speech recognition tasks, applied on a unidirectional LSTM.
- a test set may include a Hub5 Switchboard-2000 benchmark task (SWB) and an in-house conversation interaction task (CI).
- each test set may represent a set of data with a duration of 1.5 hours, consisting of accented data covering spoken interaction in concierge and other similar application domains.
- the evaluation may be performed using the best scoring paths for 100 N-best lists.
- whole sentence RNN LM 110 may be trained for the SWB test set on unidirectional LSTM 900 including a projection layer of 512 embedding nodes in embedding layer 920 and 512 hidden layer elements in LSTM layer 930 .
- whole sentence RNN LM 110 may be trained for the CI test set on unidirectional LSTM 900 including a projection layer of 256 embedding nodes in embedding layer 920 and 256 hidden layer elements in LSTM layer 930 .
- an error rate for performing speech recognition on an NLP system implementing whole sentence RNN LM 110 trained on unidirectional LSTM 900 for a SWB test is 6.3%, which is lower than the error rate of 6.9% if N-gram LM 212 is implemented as the LM.
- an error rate for performing speech recognition on an NLP system implementing whole sentence RNN LM 110 trained on unidirectional LSTM 900 for a CI test is 8.3%, which is lower than the error rate of 8.5% if N-gram LM 212 is implemented as the LM.
- whole sentence RNN LM 110 is able to capture sufficient long-term context and correct more errors to improve the downstream performance of natural language processing applications.
- a speech recognition system implementing n-gram LM 212 may allow multiple errors in the output “actually we were looking at the Saturday I sell to”, while implementing whole sentence RNN LM 110 may allow only a single error in the output “actually we were looking at the Saturn S L too”, where the n-gram LM predicted output includes a higher error rate than the whole sentence RNN LM predicted output.
- a speech recognition system implementing n-gram LM 212 may allow errors in the output “could you send some sort of to room three four five” and implementing whole sentence RNN LM 110 may correctly output “could you send some soda to room three four five”.
- FIG. 11 illustrates a block diagram of one example of a computer system in which one embodiment of the invention may be implemented.
- the present invention may be performed in a variety of systems and combinations of systems, made up of functional components, such as the functional components described with reference to a computer system 1100 , which may be communicatively connected to a network, such as network 1102 .
- Computer system 1100 includes a bus 1122 or other communication device for communicating information within computer system 1100 , and at least one hardware processing device, such as processor 1112 , coupled to bus 1122 for processing information.
- Bus 1122 preferably includes low-latency and higher latency paths that are connected by bridges and adapters and controlled within computer system 1100 by multiple bus controllers.
- computer system 1100 may include multiple processors designed to improve network servicing power.
- Processor 1112 may be at least one general-purpose processor that, during normal operation, processes data under the control of software 1150 , which may include at least one of application software, an operating system, middleware, and other code and computer executable programs accessible from a dynamic storage device such as random access memory (RAM) 1114 , a static storage device such as Read Only Memory (ROM) 1116 , a data storage device, such as mass storage device 1118 , or other data storage medium.
- Software 1150 may include, but is not limited to, code, applications, protocols, interfaces, and processes for controlling one or more systems within a network including, but not limited to, an adapter, a switch, a server, a cluster system, and a grid environment.
- Computer system 1100 may communicate with a remote computer, such as server 1140 , or a remote client.
- server 1140 may be connected to computer system 1100 through any type of network, such as network 1102 , through a communication interface, such as network interface 1132 , or over a network link that may be connected, for example, to network 1102 .
- Network 1102 may include permanent connections such as wire or fiber optics cables and temporary connections made through telephone connections and wireless transmission connections, for example, and may include routers, switches, gateways and other hardware to enable a communication channel between the systems connected via network 1102 .
- Network 1102 may represent one or more of packet-switching based networks, telephony based networks, broadcast television networks, local area and wide area networks, public networks, and restricted networks.
- Network 1102 and the systems communicatively connected to computer 1100 via network 1102 may implement one or more layers of one or more types of network protocol stacks which may include one or more of a physical layer, a link layer, a network layer, a transport layer, a presentation layer, and an application layer.
- network 1102 may implement one or more of the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol stack or an Open Systems Interconnection (OSI) protocol stack.
- network 1102 may represent the worldwide collection of networks and gateways that use the TCP/IP suite of protocols to communicate with one another.
- Network 1102 may implement a secure HTTP protocol layer or other security protocol for securing communications between systems.
- network interface 1132 includes an adapter 1134 for connecting computer system 1100 to network 1102 through a link and for communicatively connecting computer system 1100 to server 1140 or other computing systems via network 1102 .
- network interface 1132 may include additional software, such as device drivers, additional hardware and other controllers that enable communication.
- computer system 1100 may include multiple communication interfaces accessible via multiple peripheral component interconnect (PCI) bus bridges connected to an input/output controller, for example. In this manner, computer system 1100 allows connections to multiple clients via multiple separate ports and each port may also support multiple connections to multiple clients.
- processor 1112 may control the operations of the flowcharts of FIGS. 12-13 and other operations described herein. Operations performed by processor 1112 may be requested by software 1150 or other code, or the steps of one embodiment of the invention might be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components. In one embodiment, one or more components of computer system 1100 , or other components, which may be integrated into one or more components of computer system 1100 , may contain hardwired logic for performing the operations of the flowcharts in FIGS. 12-13 .
- computer system 1100 may include multiple peripheral components that facilitate input and output. These peripheral components are connected to multiple controllers, adapters, and expansion slots, such as input/output (I/O) interface 1126 , coupled to one of the multiple levels of bus 1122 .
- input device 1124 may include, for example, a microphone, a video capture device, an image scanning system, a keyboard, a mouse, or other input peripheral device, communicatively enabled on bus 1122 via I/O interface 1126 controlling inputs.
- output device 1120 communicatively enabled on bus 1122 via I/O interface 1126 for controlling outputs may include, for example, one or more graphical display devices, audio speakers, and tactile detectable output interfaces, but may also include other output interfaces.
- additional or alternate input and output peripheral components may be added.
- the present invention may be a system, a method, and/or a computer program product.
- the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
- the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
- the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
- a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
- a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
- the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
- a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
- the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
- These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures.
- two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
- FIG. 12 illustrates a high level logic flowchart of a process and computer program for training a whole sentence RNN LM on an RNN LSTM architecture.
- Block 1202 illustrates selecting one correct sentence from training data.
- block 1204 illustrates creating N incorrect sentences by applying noise samplers.
- block 1206 illustrates applying a feed forward pass for each of the N+1 sentences through the RNN layer, to a NN scorer for generating a single value for each entire sentence, and an additional NN layer for identifying if the single value probability score is correct or not correct.
- block 1208 illustrates training the model to classify the correct sentence from others, and the process ends.
- FIG. 13 illustrates a high level logic flowchart of a process and computer program product for testing an NLP system function implementing a whole sentence RNN LM on an RNN LSTM architecture.
- Block 1302 illustrates selecting a test set from 10% of the generated data.
- block 1304 illustrates generating imposter sentences by substituting one word in the selected test set sentences.
- block 1306 illustrates assigning scores for the test set sentence and the imposter sentence by running each sentence through the model.
- block 1308 illustrates evaluating performance by the classification accuracy of the scores as determined by a trained binary linear classifier.
- each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
- the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, occur substantially concurrently, or the blocks may sometimes occur in the reverse order, depending upon the functionality involved.
Description
- This invention relates in general to computing systems and more particularly to implementing a whole sentence recurrent neural network language model for natural language processing.
- A recurrent neural network (RNN) is a class of neural networks that includes weighted connections within a layer, in comparison to a traditional feed-forward network, where connections feed only to subsequent layers. RNNs can also include loops, which enables an RNN to store information while processing new inputs, facilitating use of RNNs for processing tasks where prior inputs need to be considered, such as time series data implemented for speech recognition and natural language processing (NLP) tasks.
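The recurrence that distinguishes an RNN from a feed-forward network can be sketched minimally as follows; the hidden state h carries information from prior inputs into each new step (the weight names and dimensions here are illustrative, not from the specification):

```python
import math

def rnn_step(x, h, W_xh, W_hh):
    """One recurrent step: the new hidden state depends on the current
    input x and on the previous hidden state h, i.e. on the weighted
    connections within the layer."""
    return [math.tanh(sum(W_xh[i][j] * x[j] for j in range(len(x))) +
                      sum(W_hh[i][k] * h[k] for k in range(len(h))))
            for i in range(len(h))]

def rnn_forward(inputs, h0, W_xh, W_hh):
    """Process a sequence; each hidden state reflects all inputs seen so far,
    which is what lets an RNN consider prior inputs in NLP tasks."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh)
        states.append(h)
    return states
```

A feed-forward network would compute each output from the current input alone; here the loop over the sequence threads the same hidden state through every step.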
- In one embodiment, a method is directed to providing, by a computer system, a whole sentence recurrent neural network language model for estimating a probability of likelihood of each whole sentence processed by natural language processing being correct. The method is directed to applying, by the computer system, a noise contrastive estimation sampler against at least one entire sentence from a corpus of multiple sentences to generate at least one incorrect sentence. The method is directed to training, by the computer system, the whole sentence recurrent neural network language model, using the at least one entire sentence from the corpus and the at least one incorrect sentence, to distinguish the at least one entire sentence as correct. The method is directed to applying, by the computer system, the whole sentence recurrent neural network language model to estimate the probability of likelihood of each whole sentence processed by natural language processing being correct.
- In another embodiment, a computer system comprises one or more processors, one or more computer-readable memories, one or more computer-readable storage devices, and program instructions, stored on at least one of the one or more storage devices for execution by at least one of the one or more processors via at least one of the one or more memories. The stored program instructions comprise program instructions to provide a whole sentence recurrent neural network language model for estimating a probability of likelihood of each whole sentence processed by natural language processing being correct. The stored program instructions comprise program instructions to apply a noise contrastive estimation sampler against at least one entire sentence from a corpus of a plurality of sentences to generate at least one incorrect sentence. The stored program instructions comprise program instructions to train the whole sentence recurrent neural network language model, using the at least one entire sentence from the corpus and the at least one incorrect sentence, to distinguish the at least one entire sentence as correct. The stored program instructions comprise program instructions to apply the whole sentence recurrent neural network language model to estimate the probability of likelihood of each whole sentence processed by natural language processing being correct.
- In another embodiment, a computer program product comprises a computer readable storage medium having program instructions embodied therewith, wherein the computer readable storage medium is not a transitory signal per se. The computer program product comprising the program instructions executable by a computer to cause the computer to provide, by a computer, a whole sentence recurrent neural network language model for estimating a probability of likelihood of each whole sentence processed by natural language processing being correct. The computer program product comprising the program instructions executable by a computer to cause the computer to apply, by the computer, a noise contrastive estimation sampler against at least one entire sentence from a corpus of a plurality of sentences to generate at least one incorrect sentence. The computer program product comprising the program instructions executable by a computer to cause the computer to train, by the computer, the whole sentence recurrent neural network language model, using the at least one entire sentence from the corpus and the at least one incorrect sentence, to distinguish the at least one entire sentence as correct. The computer program product comprising the program instructions executable by a computer to cause the computer to apply, by the computer, the whole sentence recurrent neural network language model to estimate the probability of likelihood of each whole sentence processed by natural language processing being correct.
- The novel features believed characteristic of one or more embodiments of the invention are set forth in the appended claims. The one or more embodiments of the invention itself, however, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
- FIG. 1 is a block diagram illustrating one example of a system for utilizing a whole sentence RNN language model for improving the accuracy of natural language processing;
- FIG. 2 is a block diagram illustrating a whole sentence RNN LM for natural language processing in comparison with locally-conditional models and non-RNN architecture models for whole sentence processing;
- FIG. 3 is a block diagram illustrating one example of components of noise contrastive estimation applied by a training controller to generate incorrect sentences to use with correct sentences to train a whole sentence RNN LM;
- FIG. 4 is a block diagram illustrating a training sequence for training a whole sentence RNN LM using correct sentences in training data and incorrect sentences generated from the training data through noise contrastive estimation;
- FIG. 5 is a block diagram illustrating a testing sequence for testing a whole sentence RNN language model using entire sentences;
- FIG. 6 is a block diagram illustrating one example of a performance evaluation of the accuracy of sequence identification tasks performed in an NLP system implementing a whole sentence RNN LM;
- FIG. 7 is a block diagram illustrating one example of a one layer bidirectional LSTM (BiLSTM) configuration of a whole sentence RNN language model;
- FIG. 8 is a block diagram illustrating an example of the classification accuracy of an n-gram LM compared with a whole sentence RNN LM implemented in NLP systems for performing sequence identification tasks;
- FIG. 9 is a block diagram illustrating one example of a one layer unidirectional LSTM configuration of a whole sentence RNN language model;
- FIG. 10 is a block diagram illustrating an example of the word error rate of an n-gram LM compared with a whole sentence RNN LM implemented by an NLP system for speech recognition tasks, applied on a unidirectional LSTM;
- FIG. 11 is a block diagram illustrating one example of a computer system in which one embodiment of the invention may be implemented;
- FIG. 12 illustrates a high level logic flowchart of a process and computer program for training a whole sentence RNN LM on an RNN LSTM architecture; and
- FIG. 13 illustrates a high level logic flowchart of a process and computer program product for testing an NLP system function implementing a whole sentence RNN LM on an RNN LSTM architecture.
- In the following description, for the purposes of explanation, numerous specific details are set forth to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form to avoid unnecessarily obscuring the present invention.
- In addition, in the following description, for purposes of explanation, numerous systems are described. It is important to note, and it will be apparent to one skilled in the art, that the present invention may execute in a variety of systems, including a variety of computer systems and electronic devices operating any number of different types of operating systems.
- FIG. 1 illustrates a block diagram of one example of a system for utilizing a whole sentence RNN language model for improving the accuracy of natural language processing.
- In one example, a natural language processing (NLP) system 100 may process a sequence of words in speech 112, as input, and generate one or more types of outputs, such as processed sequence of words 116. In one example, speech 112 may represent an entire sentence or utterance with multiple words. In one example, natural language processing system 100 may perform one or more types of language processing including, but not limited to, automatic speech recognition, machine translation, optical character recognition, spell checking, and additional or alternate types of processing of natural language inputs. In one example, automatic speech recognition may include, but is not limited to, conversational interaction, conversational telephony speech transcription, multimedia captioning, and translation. In one example, speech 112 may include, but is not limited to, an audio signal with spoken words, an image containing a sequence of words, and a stream of text words.
- In one example, to manage processing of
speech 112, NLP system 100 may include a speech model 120, for translating the audio signal, image, or stream of text into statistical representations of the sounds, images, or text that make up each word in a sequence of words. In one example, the statistical representations of a word sequence 122 may be represented by sentence s of T words w1, w2, . . . , wT, where each w is a statistical representation of a word, phrase, or utterance. For example, speech model 120 may represent an acoustic model that is used to create statistical representations of the audio signal and the phonemes or other linguistic units within speech 112. In one example, speech model 120 may be trained from a set of audio recordings and their corresponding transcripts, created by taking audio recordings of speech and their text transcriptions and using software to create statistical representations of the sounds that make up each word.
- In one example, in processing speech 112 into processed sequence of words 116, as NLP system 100 tries to match sounds with word sequences, to increase the accuracy of processing words and phrases that sound, look, or translate similarly, NLP system 100 may implement a language model (LM) to generate a probability distribution over a sequence of words, such as a whole sentence. The accuracy with which the LM generates a probability distribution for word sequence 122 impacts the accuracy with which NLP system 100 processes speech 112 into processed sequence of words 116.
- In one embodiment of the present invention, NLP system 100 may implement a whole sentence RNN language model (LM) 110, which, given a sequence of processed words of a whole sentence from
speech 112 of word sequence 122, assigns a probability to the whole sentence, illustrated as probability for entire word sequence 114. In one example, providing whole sentence RNN LM 110 to estimate the relative likelihood of an entire phrase being correctly processed is useful in many natural language processing applications that may be performed by NLP system 100. For example, in the context of NLP system 100 performing speech recognition, NLP system 100 tries to match the sounds within speech 112 with word sequences. In this example, whole sentence RNN LM 110 may provide context to distinguish between words and phrases that sound similar, to assign a probability that the correct sentence has been recognized.
- In particular, while word sequence 122 includes multiple individual words, whole sentence RNN LM 110 directly models the probability for the whole sentence in word sequence 122. In one example, whole sentence RNN LM 110 may be trained to predict the probability of a whole sentence directly, without partially computing conditional probabilities for each classified word in the sentence individually.
- To facilitate an efficient and accurate computation of a probability of a whole sentence, whole sentence RNN LM 110 may represent a whole sentence model integrated with an RNN long short-term memory (LSTM) architecture 130.
- The whole sentence model of whole sentence RNN LM 110 is not trained with a chain rule as a locally conditional model. In particular, an LM trained with a chain rule as a locally conditional model may be limited to the local conditional likelihood of generating the current word given the word context, thus making local decisions at each word, rather than exploiting whole sentence structures when computing a probability as performed by whole sentence RNN LM 110.
- In addition, an LM run on a neural network or other type of architecture may be limited to computing probabilities for a set length of words selected when training the LM, in contrast to an RNN LSTM architecture 130, which has a long memory and can compute the probability of a whole sentence of an arbitrary length. The addition of LSTM elements in the RNN within RNN LSTM architecture 130 increases the amount of time that data can remain in memory over arbitrary time intervals, increasing the ability of whole sentence RNN LM 110 to classify, process, and predict sequential series as a whole and to minimize the exploding and vanishing gradient problem that may be present when training a standard RNN. In addition, an RNN LSTM architecture has less relative sensitivity to gap length in comparison to a standard RNN, feedforward neural network, or n-gram model.
- RNN LSTM architecture 130 may be implemented in one or more configurations including one or more layers and including unidirectional and bidirectional layers. While the present invention is described with reference to whole sentence RNN LM 110 implemented in RNN LSTM architecture 130, in additional or alternate embodiments, whole sentence RNN LM 110 may also be implemented in additional or alternate neural network architectures, such as a conventional recurrent neural network or a conventional neural network. In addition, RNN LSTM architecture 130 may implement additional standard RNN and NN layers. In one example, an NN layer may represent a feedforward NN in which each layer feeds into the next layer in a chain connecting the inputs to the outputs. In one example, in a feedforward NN, at each iteration t, values of the input nodes are set and then the inputs are fed forward at each layer in a network, which overwrites previous activations. In contrast, a standard RNN more efficiently manages inputs that may exhibit a sequential relationship, such as predicting the next word in a sentence. In a standard RNN architecture, at each time step t, a hidden layer receives inputs from both the current inputs and from the same hidden layer at a previous time step. RNN LSTM architecture 130 further extends a standard RNN architecture by adding LSTM elements that increase the amount of time data can be held in memory over arbitrary periods of time.
- In one example, in training whole
sentence RNN LM 110, to avoid a problem of normalizing the whole sentence within word sequence 122 when computing a probability of a whole sentence, a training controller 132 may control training of whole sentence RNN LM 110 by applying noise contrastive estimation (NCE) 134 to training data. In one example, NCE 134 may represent a sampling-based approach for unnormalized training of statistical models. In one example, using NCE 134, rather than maximizing the likelihood of the training data, NCE 134 generates a number of noise samples for each training sample and implicitly constrains the normalization term to be “1”. Training controller 132 trains the parameters of whole sentence RNN LM 110 to maximize the likelihood of a binary prediction task that identifies the ground truth from the noise samples. In particular, NCE 134 may perform a nonlinear logistic regression to discriminate between the observed training data and the artificially-generated noise data.
NCE 134, mathematically, let X=(x1, x2, . . . , xS) be the S sentences in training data. In addition, let Y=(y1, y2, . . . , yvS) with the v*S samples drawn from a noise sampler model with a probability of density of pn(.), where v>1. A density estimate of wholesentence RNN LM 110 may be denoted by pm(., θ). In one example, theNCE 134 loss may be defined as: -
- and G(u; θ) is the log-odds ratio between pm(., θ) and pn(.), i.e., G(u; θ)=ln pm(u; θ)−ln pn(u). In one example, by optimizing the loss function l(θ) with model parameters θ, the model pm may learn the probability density of X in the limit.
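This NCE objective can be sketched numerically as follows; the scores passed in are toy stand-ins for the log-odds ratios G(u; θ) that the model and noise sampler densities would produce:

```python
import math

def nce_loss(data_scores, noise_scores, v):
    """NCE objective over S data sentences and v*S noise sentences.

    Each score is G(u; theta) = ln pm(u; theta) - ln pn(u), the log-odds
    ratio between the model density and the noise density. h is the
    posterior probability that u came from the training data rather than
    the noise sampler."""
    def h(g):
        return 1.0 / (1.0 + v * math.exp(-g))
    loss = sum(math.log(h(g)) for g in data_scores)
    loss += sum(math.log(1.0 - h(g)) for g in noise_scores)
    return loss
```

Maximizing this quantity pushes data sentences toward h close to 1 and noise sentences toward h close to 0, which is the binary prediction task the training controller optimizes.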
- In one example, during training of whole
sentence RNN LM 110 by training controller 132 that is based on NCE 134, only the connections associated with a few words in the output layer need to be considered, allowing elimination of the need to compute the normalization over the full output vocabulary. NCE 134 may implicitly constrain the variance of the normalization term to be very small during training, which may make it feasible to use unnormalized probabilities during testing. With a sufficient number of noise samples, the solution to a binary prediction model of whole sentence RNN LM 110 converges to the maximum likelihood estimate on the training data. - The results of whole
sentence RNN LM 110 applied by NLP system 100 to perform processing tasks to output processed sequence of words 116 may be presented on a range of tasks from sequence identification tasks, such as palindrome detection, to large vocabulary automatic speech recognition (LVCSRT) and conversational interaction (CI). -
FIG. 2 illustrates a block diagram of a whole sentence RNN LM for natural language processing in comparison with locally-conditional models and non-RNN architecture models for whole sentence processing.
- In one example, NLP systems, such as NLP system 100, may access one or more types of models for predicting a probability over a sequence of words, with different error rates. In one example, the error rate may indicate the error rate of a task performed by the NLP system, impacted by the probability predicted by the language model implemented by the NLP system. In one example, an NLP system implementing a whole sentence RNN LM 110 has a lowest error rate in comparison with an error rate of NLP systems implementing a whole sentence maximum entropy model 224 run in a non-RNN architecture 220 or locally conditioned models 210.
- In one example, whole sentence RNN LM 110 represents a whole sentence recurrent language model that is not constrained by locally-conditional constraints. In contrast, locally conditional models 210 may represent one or more types of models that are trained based on a chain rule or other locally-conditional constraints. In one example, a locally-conditional constraint may represent a training criterion that generates a local conditional likelihood of generating a current word given the word context, thus locally computing conditional probabilities for each word, rather than modeling the probability of a whole sentence or utterance. A locally-conditional design effectively limits the ability of the LM to exploit whole sentence structures and increases the error rate percentage of tasks performed by NLP systems based on the probabilities predicted by locally conditioned models 210. In contrast, whole sentence RNN LM 110 receives word sequence 122 and assigns a probability for entire word sequence 114, for a whole sentence within word sequence 122, to directly model the probability of a whole sentence or utterance and decrease the error rate percentage of tasks performed based on probabilities predicted by whole sentence RNN LM 110.
- In one example, locally
conditional models 210 may include n-gram LM 212 and standard RNN LM 214. In one example, n-gram may refer to a contiguous sequence of n items from a given sample of text or speech and n-gram LM 212 may represent a probabilistic language model for predicting the next item in a sequence in the form of an n−1 order Markov model. In one example, standard RNN LM 214 may represent a language model implemented on a standard RNN. N-gram LM 212 and standard RNN LM 214 may represent language models that are constrained by locally-conditional constraints. In particular, in one example, n-gram LM 212 and standard RNN LM 214 may represent statistical language models that are conditional models constrained by locally-conditional constraints by estimating the probability of a word given a previous word sequence. For example, the probability of a sentence s of T words w1, w2, . . . , wT may be calculated as the product of word probabilities by using a chain rule,
- p(s)=p(w1, w2, . . . , wT)=Πt=1 . . . T p(wt|ht),
- where ht=w1, . . . , wt−1 is the history of word wt. A limitation of locally conditional models trained using a chain rule is that a captured context is dependent on the length of a history, which is often truncated to the previous n−1 words, since long histories are rarely observed in training data for an n-gram LM 212. For example, n-gram LM 212 may estimate the conditional probability of the next word given the history using counts computed from the training data, but the history of word wt may be truncated to the previous n−1 words, which may be less than five words. While standard RNN LM 214 may exploit word dependencies over a longer context window than what is feasible with an n-gram language model, standard RNN LM 214 is still trained with the locally-conditional design of the chain rule at each word, which limits the ability of standard RNN LM 214 to exploit the whole sentence structure. In one example, standard RNN LM 214 may also refer to a feed-forward neural network LM that is cloned across time with the hidden state at time step (t−1) concatenated with the embedding of the word wt to form the input that predicts the next word wt+1. In one example, a feed-forward neural network LM may embed the word history into a continuous space and use the neural network to estimate the conditional probability, such that the conditional likelihood of wt+1 is influenced by the hidden states at all previous time steps 1, . . . , t. While standard RNN LM 214 may have the capability to capture a longer context than n-gram LM 212, in practice when standard RNN LM 214 is trained with the local conditional likelihood of generating the current word given the word context, the history may be truncated to the previous 15-20 words in order to speed up training and decoding, and global sentence information may be difficult to capture without triggering exploding or vanishing gradient problems. In addition, the locally-conditional design of standard RNN LM 214 may make implicit interdependence assumptions that may not always be true, increasing the rate of errors.
maximum entropy model 224 may directly model the probability of a sentence or utterance, but not within an RNN architecture. In one example, whole sentencemaximum entropy model 224 may function independent of locallyconditional models 210, in anon-RNN architecture 220, with flexibility of having custom sentence-level features, such as length of sentence, which are hard to model via locallyconditional models 210. In one example, an NLP system implementing whole sentencemaximum entropy model 224 for a task may provide processed sequences of words at an error rate that is lower than locallyconditional models 210, however, the average error rate achieved by the NSP system implementingnon-RNN model 220 may still be greater than the average error rate of an NLP system implementing wholesentence RNN LM 110 operating within anRNN LSTM architecture 130. - In one example, whole
sentence RNN LM 110 may be trained to predict the probability of a sentence p(s) directly, without computing conditional probabilities for each word in the sentence independently as performed by locallyconditional models 210. In one example, wholesentence RNN LM 110 may represent an instance of whole sentencemaximum entropy model 224 or another whole sentence model, extended for application inRNN LSTM architecture 130, to create a whole sentence neural network language model. In one example, extending whole sentencemaximum entropy model 224 to efficiently and effectively function inRNN LSTM architecture 130, may including specifyingtraining controller 132 to train whole sentencemaximum entropy model 224 to function inRNN LSTM architecture 130, applyingNCE 134 for generating additional training samples. - In one example, in additional or alternate examples,
training controller 132 may apply additional or alternate types of training to wholesentence RNN LM 110. In the example, while applying a softmax computation to compute conditional probabilities of entire sentences may be problematic for training wholesentence RNN LM 110 because a calculation of z in a softmax computation may be infeasible because it may involve summing all possible sentences, in additional or alternate embodiments,training controller 132 may apply one or more types of softmax computations and other types of computations for training one or more models applied by natural language processing system 100. - In one example, whole
sentence RNN LM 110, as trained bytraining controller 132, may aim to assign a probability to each whole sentence, with higher scores assigned to sentences that are more likely to occur in a domain of interest. In contrast, while wholesentence RNN LM 110 may also integrate sentence-level convolutional neural network models that rely on classifying a sentence with a class label for one of N given categories, a convolutional neural network model may still only provide a conditional model for performing classification tasks based on class labels, with the limitations of locallyconditional models 210, and a class label assignment may not accurately predict the likelihood of a sentence being correct. -
FIG. 3 illustrates a block diagram of one example of components of noise contrastive estimation applied by a training controller to generate incorrect sentences to use with correct sentences to train a whole sentence RNN LM. - In one example,
NCE 134 may implement one or more types of noise samplers 310 for sampling training data 330. In one example, NCE 134 is specified for training whole sentence RNN LM 110 by sampling entire sentences from training data 330, as opposed to only sampling word samples for speeding up other types of computations, such as softmax computations. - In one example,
training data 330 may include one or more corpora of data for training whole sentence RNN LM 110 to generate an un-normalized probability for an entire sentence. In one example, training data 330 may include a corpus of data including one or more of palindrome (PAL) 350, lexicographically-ordered words (SORT) 352, and expressing dates (DATE) 354. In one example, palindrome 350 may include a 1-million word corpus with a 10-word vocabulary of sequences which read the same backward and forward, including examples such as “the cat ran fast ran cat the”. In one example, lexicographically-ordered words 352 may include a 1-million word corpus with a 15-word vocabulary of sequences of words in alphabetical order, including examples such as “bottle cup haha hello kitten that what”. In one example, expressing dates 354 may include a 7-million word corpus with a 70-word vocabulary of words expressing dates, including examples such as “January first nineteen oh one”. - In one example, based on correct sentences in
sampling training data 330, NCE 134 may generate a sufficient number of samples for unnormalized training of whole sentence RNN LM 110, where whole sentence RNN LM 110 may learn the data distribution with a normalization term implicitly constrained to 1. - In one example,
noise samplers 310 may include one or more back-off n-gram LMs built on training data 330 as noise samplers. In additional or alternate examples, noise samplers 310 may include additional or alternate types of LMs implemented for noise sampling. - In one example,
noise samplers 310 may generate one or more types of noise samples from training data 330, such as, but not limited to, noise sampler model sequences 312 and edit transducer samples 314. In one example, noise sampler 310 may generate noise samples from training data 330 using a single type of sampler or multiple types of samplers. In one example, each of the noise samples generated by noise samplers 310 may represent an incorrect sentence for use by training controller 132 with correct sentences in training data 330 to train whole sentence RNN LM 110. - In one example, noise
sampler model sequences 312 may represent word sequences using a noise sampler model such as an n-gram LM 212 or standard RNN LM 214, by first randomly selecting one sentence from training data 330, such as the reference sentence illustrated at reference numeral 332, and then randomly selecting N positions to introduce a substitution (SUB), an insertion (INS), or deletion (DEL) error. For example, the SUB sampled sentence of “July twenty twentieth nineteen seventy nine” illustrated at reference numeral 340 includes a substitution of “twenty” for “the” from the reference sentence illustrated at reference numeral 332. In addition, for example, the INS sampled sentence of “July the twentieth nineteen ninety seventy nine” illustrated at reference numeral 342 includes an insertion of “ninety” between “nineteen” and “seventy” from the reference sentence illustrated at reference numeral 332. In addition, for example, the DEL sampled sentence of “July the twentieth * seventy nine” illustrated at reference numeral 344 includes a deletion of “nineteen” from the reference sentence illustrated at reference numeral 332. - In one example, edit
transducer samples 314 may include word sequences generated from training data 330 using a random (RAND) noise sampler model. For example, from a reference sentence from expressing dates 354 in training data 330 of “July the twentieth nineteen seventy nine” as illustrated at reference numeral 332, noise samplers 310 may generate noise sampler model sequences 312 of “July the twenty fifth of September two-thousand eighteen” as illustrated at reference numeral 334. In one example, the RAND noise sampler model may randomly select one sentence from the training data, and then randomly select N positions to introduce an insertion, substitution, or deletion error into the sentence. The probability of a word to be inserted or substituted with is assigned by the noise sampler model based on the n-gram history at the position being considered to ensure that each noisy sentence, with errors, has an edit distance of at most N words from the original sentence. In one example, a separate noise score may be assigned to each sentence in edit transducer samples 314 by noise samplers 310, where the noise score is the sum of all n-gram scores in the sentence. - In the example, sampling from noise
sampler model sequences 312 may limit the length of sentences, based on the length of sentence handled by the noise sampler model. For example, n-gram LM 212 based noise sampler model sequences may be limited to shorter sentences. For the types of errors that may be encountered in speech recognition tasks, however, the additional length provided byedit transducer samples 314 may allow for covering a larger noise space and avoid reducing generalization over the types of errors that may be encountered in speech recognition tasks. -
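The substitution, insertion, and deletion perturbations described above can be sketched as follows; drawing replacement words uniformly from the vocabulary is a simplification, since the specification assigns those probabilities from an n-gram noise sampler model based on the local history:

```python
import random

def make_noise_sentence(sentence, vocabulary, n_errors, rng):
    """Corrupt a reference sentence with n_errors randomly placed
    substitution (SUB), insertion (INS), or deletion (DEL) errors, keeping
    the word-level edit distance from the original at most n_errors."""
    words = sentence.split()
    for _ in range(n_errors):
        op = rng.choice(["SUB", "INS", "DEL"])
        pos = rng.randrange(len(words))
        if op == "SUB":
            words[pos] = rng.choice(vocabulary)
        elif op == "INS":
            words.insert(pos, rng.choice(vocabulary))
        elif op == "DEL" and len(words) > 1:
            del words[pos]
    return " ".join(words)

rng = random.Random(7)
vocab = ["july", "the", "twentieth", "nineteen", "seventy", "nine", "ninety"]
noisy = make_noise_sentence("july the twentieth nineteen seventy nine",
                            vocab, 1, rng)
```

Each corrupted sentence serves as one incorrect training example paired against the correct reference sentence.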
FIG. 4 illustrates a block diagram of a training sequence for training a whole sentence RNN LM using correct sentences in training data and incorrect sentences generated from the training data through noise contrastive estimation. - In one example, training data used to train whole
sentence RNN LM 110 may include a correct sentence 412, from training data 330, and at least one incorrect sentence 414, generated by noise samplers 310 from training data 330. In one example, training controller 132 may perform a feed forward pass of both correct sentence 412 and incorrect sentence 414 through RNN 416 to train whole sentence RNN LM 110. For example, RNN 416 receives inputs w1, w2, . . . , wT for correct sentence 412 and inputs v1, v2, . . . , vT for incorrect sentence 414. In one example, noise samplers 310 may generate N incorrect sentences based on correct sentence 412 and feed each of the N incorrect sentences forward. In one example, RNN 416 may represent one or more layers implemented within RNN LSTM architecture 130. -
RNN 416 may sequentially update layers based on the inputs, learning to distinguish inputs w1, w2, . . . , wT for correct sentence 412 from inputs v1, v2, . . . , vT for incorrect sentence 414, training whole sentence RNN LM 110 to classify correct sentence 412 from incorrect sentence 414, with outputs from a hidden layer for the entire sentence illustrated by h1, h2, . . . , hT 418. An NN scorer 420 receives h1, h2, . . . , hT 418 as inputs and is trained to score a single value s 422 for the entire sentence, where s is an unnormalized probability of the entire sentence. An NN 424 receives s and determines an output of "1" if the input is a probability indicating the entire sentence is correct and an output of "0" if the input is a probability indicating the entire sentence is not correct. - In one example,
training controller 132 may pass a next correct training sentence 412 and next incorrect sentence 414 through whole sentence RNN LM 110 and NN 424 for each selection of training sentences selected to train whole sentence RNN LM 110. -
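A minimal numpy sketch of this forward pass: a plain tanh recurrence stands in for the LSTM layers of RNN 416, a linear map over the hidden states stands in for NN scorer 420, and a sigmoid threshold stands in for NN 424. All weights, sizes, and word ids are illustrative assumptions, so the scores only demonstrate the data flow, not trained behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 12, 4, 6                        # vocab, embedding, hidden sizes (illustrative)
emb = rng.normal(scale=0.5, size=(V, E))
W_xh = rng.normal(scale=0.5, size=(E, H))
W_hh = rng.normal(scale=0.5, size=(H, H))
w_score = rng.normal(scale=0.5, size=H)   # stand-in for the NN scorer weights

def sentence_score(word_ids):
    """Feed forward pass of a whole sentence: h_1..h_T from the recurrence,
    then a single unnormalized score s for the entire sentence."""
    h, states = np.zeros(H), []
    for w in word_ids:
        h = np.tanh(emb[w] @ W_xh + h @ W_hh)   # h_t from w_t and h_{t-1}
        states.append(h)
    return float(np.mean(states, axis=0) @ w_score)  # scorer over h_1..h_T

def classify(s):
    """Final NN: sigmoid(s) thresholded -> 1 (correct sentence) or 0 (incorrect)."""
    return 1 if 1.0 / (1.0 + np.exp(-s)) > 0.5 else 0

s_correct = sentence_score([1, 2, 3, 4])   # w_1..w_T
s_noisy = sentence_score([1, 9, 3, 4])     # v_1..v_T with one substituted word
```

Training with the NCE objective would adjust the weights so that classify pushes correct sentences toward 1 and noisy sentences toward 0.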
FIG. 5 illustrates a block diagram of a testing sequence for testing a whole sentence RNN language model using entire sentences. - In one example, in testing whole
sentence RNN LM 110, a tester may input a word sequence 112 into whole sentence RNN LM 110, as illustrated by inputs w1, w2, . . . , wT 512. In one example, RNN 416 receives the inputs for an entire sentence of w1, w2, . . . , wT 512, which results in output from a hidden layer for the entire sentence illustrated by h1, h2, . . . , hT 418. NN scorer 420 receives h1, h2, . . . , hT 518 as inputs and scores a single value s 522 for the entire sentence, where s is an unnormalized probability of the entire sentence, based on the training of whole sentence RNN LM 110 for correct sentence 412. In the example, depending on the type of testing performed, single value s 522 may be further evaluated to determine whether the probability of the entire sentence matches an expected result. -
FIG. 6 illustrates a block diagram of one example of a performance evaluation of the accuracy of sequence identification tasks performed in an NLP system implementing a whole sentence RNN LM. - In one example, for evaluating the performance of the classification accuracy for sequence identification tasks by an NLP system implementing whole
sentence RNN LM 110, initially, a percentage of the generated data in a training set, such as 10% of the generated data in a corpus of expressing dates 354, may be applied as a test set 602. In one example, a training set sentence may include "July the twentieth nineteen eighty" as illustrated at reference numeral 606. - In one example, for testing,
multiple imposter sentences 604 are generated for each training set sentence by substituting one word, such as applied by the SUB task in noise sampler model sequences 312. In one example, an imposter sentence may include "July first twentieth nineteen eighty", where the word "the" from the training set sentence has been substituted with the word "first", as illustrated at reference numeral 608. - Next, whole
sentence RNN LM 110 may determine scores for each of the sentences. For example, whole sentence RNN LM 110 may assign a score 612 of "0.085" to the training set sentence illustrated at reference numeral 606 and a score 614 of "0.01" to the imposter sentence illustrated at reference numeral 608. - In a next step, a binary
linear classifier 620 may be trained to classify the scores output by whole sentence RNN LM 110 into two classes. For example, binary linear classifier 620 may be trained to classify scores by using a linear boundary 626 to divide the linear space between a first class 622, which represents an incorrect sentence, and a second class 624, which represents a correct sentence. The performance of an NLP system in performing sequential classification tasks may be evaluated by the classification accuracy, assessed by binary linear classifier 620, of classifying imposter sentences in first class 622 and test data sentences in second class 624. -
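With one-dimensional sentence scores, binary linear classifier 620's linear boundary reduces to a single threshold. A sketch using the example scores above (0.085 for the true sentence, 0.01 for the imposter); placing the boundary at the midpoint of the class means is an assumption for illustration, not the patent's training procedure:

```python
def fit_linear_boundary(correct_scores, imposter_scores):
    """Fit a 1-D linear boundary: the midpoint between the two class means."""
    mean_correct = sum(correct_scores) / len(correct_scores)
    mean_imposter = sum(imposter_scores) / len(imposter_scores)
    return (mean_correct + mean_imposter) / 2.0

def classify(score, boundary):
    """Second class (correct sentence) above the boundary, first class below."""
    return "correct" if score > boundary else "incorrect"

boundary = fit_linear_boundary([0.085], [0.01])   # scores from the example above
```

Classification accuracy is then the fraction of true and imposter sentences that land on the correct side of the boundary.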
FIG. 7 illustrates a block diagram of one example of a one layer bidirectional LSTM (BiLSTM) configuration of a whole sentence RNN language model. - In one example, in a one
layer BiLSTM 700, an LSTM layer 730 may be loaded once from beginning to end and once from end to beginning, which may increase the speed at which the BiLSTM learns a sequential task in comparison with a one directional LSTM. For example, BiLSTM 700 may receive each of inputs w1, w2, . . . , wT 710 at an embedding layer 720, with an embedding node for each word w. In one example, each word is loaded through the embedding layer to two LSTMs within LSTM layer 730, one at the beginning of a loop and one at the end of a loop. In one example, the first and last LSTM outputs from LSTM layer 730 may feed forward to a concatenation layer 740. In one example, concatenation layer 740 may represent a layer of NN scorer 420. Concatenation layer 740 may concatenate the outputs, providing double the number of outputs to a next fully connected (FC) layer 742. FC 742 obtains the final score of the sentence. In one example, BiLSTM 700 may include additional or alternate sizes of embedding layer 720 and LSTM layer 730, such as an embedding size of two hundred in embedding layer 720, with seven hundred hidden LSTM units in LSTM layer 730. While in the example concatenation layer 740 is illustrated receiving the first and last LSTM outputs from LSTM layer 730 and concatenating the outputs, in additional or alternate examples, concatenation layer 740 may receive additional LSTM outputs, and concatenation layer 740 may be replaced by an alternative NN scoring layer that applies one or more scoring functions to multiple outputs from LSTM layer 730. -
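A numpy sketch of the concatenation scoring path: two simple tanh recurrences stand in for the forward and backward LSTMs, their final states are concatenated (doubling the width), and an FC layer produces the sentence score. Sizes and weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
E, H = 4, 5                                       # embedding and hidden sizes (illustrative)
W_fwd = rng.normal(scale=0.5, size=(E + H, H))    # beginning-to-end pass (LSTM stand-in)
W_bwd = rng.normal(scale=0.5, size=(E + H, H))    # end-to-beginning pass
w_fc = rng.normal(scale=0.5, size=2 * H)          # FC layer after concatenation

def last_state(W, embeddings):
    """Run one direction over the sentence and return its final hidden state."""
    h = np.zeros(H)
    for x in embeddings:
        h = np.tanh(np.concatenate([x, h]) @ W)
    return h

def bilstm_score(embeddings):
    """Concatenate the final forward and backward states (double the width),
    then apply the FC layer to obtain the final sentence score."""
    h_fwd = last_state(W_fwd, embeddings)
    h_bwd = last_state(W_bwd, embeddings[::-1])   # reversed word order
    return float(np.concatenate([h_fwd, h_bwd]) @ w_fc)

score = bilstm_score(rng.normal(size=(6, E)))     # a six-word sentence
```

Because each direction summarizes the whole sentence from one end, the concatenated vector gives the FC layer both a left-to-right and a right-to-left view before scoring.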
FIG. 8 illustrates a block diagram of an example of the classification accuracy of an n-gram LM compared with a whole sentence RNN LM implemented in NLP systems for performing sequence identification tasks. - In one example, a table 806 illustrates the sequence identification task classification error rates for a
test set 804, which may be determined by binary linear classifier 620 in FIG. 6. In one example, test set 804 may include the corpus of training data 330 including one or more of palindrome (PAL) 350, lexicographically-ordered words (SORT) 352, and expressing dates (DATE) 354. In one example, for testing, a percentage of test set 804 may be selected and imposter sentences generated for each of the selected sentences from test set 804 using each of the noise sampler types, including a SUB task, an INS task, a DEL task, and a RAND task, as applied in FIG. 6. - In one example, table 806 illustrates the classification error rates for n-
gram LM 212, set to a 4 word length, and whole sentence RNN LM 110, as trained on BiLSTM 700 with an embedding size of 200 and 700 hidden units, trained with training data 330 using stochastic gradient descent and the NCE loss function with a mini-batch size of 512. In one example, for each epoch, a set of 20 noise samples was generated by NCE 134 per data point. In one example, during training, the learning rate may be adjusted using an annealing strategy, where the learning rate may be halved if the heldout loss was worse than in a previous iteration. - In the example, the classification accuracy of whole
sentence RNN LM 110 for sequence identification tasks for imposter sentences generated by the SUB task, the INS task, and the DEL task is, on average, above 99%. In comparison, the classification accuracy for n-gram LM 212 for sequence identification tasks for imposter sentences is below 99%. In the example, each model's accuracy is evaluated on its ability to classify the true sentences from the imposter sentences. In one example, the difference in classification accuracy between whole sentence RNN LM 110 and n-gram LM 212 may arise because whole sentence RNN LM 110 does not need to make the conditional independence assumptions that are inherent in locally-conditional models like n-gram LM 212. -
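The annealing strategy mentioned for training (halve the learning rate whenever heldout loss worsens) can be sketched directly; the initial rate and loss values here are invented for illustration:

```python
def anneal_schedule(initial_lr, heldout_losses):
    """Return the learning rate used after each epoch: halved whenever the
    heldout loss is worse than the previous iteration's."""
    lr, schedule = initial_lr, [initial_lr]
    for prev, cur in zip(heldout_losses, heldout_losses[1:]):
        if cur > prev:        # heldout loss got worse
            lr /= 2.0
        schedule.append(lr)
    return schedule

lrs = anneal_schedule(0.1, [2.0, 1.5, 1.6, 1.2])
```

Here the rate drops from 0.1 to 0.05 at the third epoch, where the heldout loss rose from 1.5 to 1.6, and stays there while the loss keeps improving.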
FIG. 9 illustrates a block diagram of one example of a one-layer unidirectional LSTM configuration of a whole sentence RNN language model. - In one example, a one-layer
unidirectional LSTM 900, anLSTM layer 930 is loaded from left to right. For example,unidirectional LSTM 900 may receive each of inputs w1, w2, . . . ,w T 910 at an embeddinglayer 920, with an embedding node for each word w. In one example, each word is loaded through the embedding layer to an LSTM withinLSTM layer 930, and LSTM loads words to a next LSTM withinLSTM layer 930. In one example, each LSTM may feed forward outputs to amean pooling layer 940. In one example, mean poolinglayer 940 may represent a layer ofNN scorer 420.Mean pooling layer 940 may pool the outputs over hidden states at each time step into a mean value passed to anext layer FC 942, which obtains the final score of the sentence. In one example,unidirectional LSTM 900 may include additional or alternate sizes of embeddinglayer 920 andLSTM layer 930. While in the example, mean poolinglayer 940 is illustrated receiving all the LSTM outputs fromLSTM layer 930 and taking a mean function of the outputs, in additional or alternate examples,mean pooling layer 940 may receive only a selection of LSTM outputs and in additional or alternate examples,mean pooling layer 940 may be replaced by an alternative NN scoring layer that applies one or more scoring functions to multiple outputs fromLSTM layer 930. -
FIG. 10 illustrates a block diagram of one example of the word error rate of an n-gram LM compared with a whole sentence RNN LM implemented by an NLP system for speech recognition tasks, applied on a unidirectional LSTM. - In one example, as illustrated in a table 1010, for a speech recognition application, a test set may include a Hub5 Switchboard-2000 benchmark task (SWB) and an in-house conversation interaction task (CI). In one example, each test set may represent a set of data with a duration of 1.5 hours, consisting of accented data covering spoken interaction in concierge and other similar application domains. In one example, for the speech recognition application, the evaluation may be performed using the best scoring paths for 100 N-best lists.
- In one example, as illustrated in table 1010, whole
sentence RNN LM 110 may be trained for the SWB test set on unidirectional LSTM 900, including a projection layer of 512 embedding nodes in embedding layer 920 and hidden LSTM units in LSTM layer 930. In addition, in one example, as illustrated in table 1010, whole sentence RNN LM 110 may be trained for the CI test set on unidirectional LSTM 900, including a projection layer of 256 embedding nodes in embedding layer 920 and hidden LSTM units in LSTM layer 930. - In one example, an error rate for performing speech recognition on an NLP system implementing whole
sentence RNN LM 110 trained on unidirectional LSTM 900 for a SWB test is 6.3%, which is lower than the error rate of 6.9% when n-gram LM 212 is implemented as the LM. In addition, in one example, an error rate for performing speech recognition on an NLP system implementing whole sentence RNN LM 110 trained on unidirectional LSTM 900 for a CI test is 8.3%, which is lower than the error rate of 8.5% when n-gram LM 212 is implemented as the LM. In the examples, whole sentence RNN LM 110 is able to capture sufficient long-term context and correct more errors, improving the downstream performance of natural language processing applications. - For example, for a reference sentence of "actually we were looking at the Saturn S L two", a speech recognition system implementing n-
gram LM 212 may allow multiple errors in the output "actually we were looking at the Saturday I sell to", while implementing whole sentence RNN LM 110 may allow a single error in the output "actually we were looking at the Saturn S L too", where the n-gram LM predicted output includes a higher error rate than the whole sentence RNN LM predicted output. In another example, for a reference sentence of "could you send some soda to room three four five", a speech recognition system implementing n-gram LM 212 may allow errors in the output "could you send some sort of to room three four five", while implementing whole sentence RNN LM 110 may correctly output "could you send some soda to room three four five". -
FIG. 11 illustrates a block diagram of one example of a computer system in which one embodiment of the invention may be implemented. The present invention may be performed in a variety of systems and combinations of systems, made up of functional components, such as the functional components described with reference to a computer system 1100, and may be communicatively connected to a network, such as network 1102. -
Computer system 1100 includes a bus 1122 or other communication device for communicating information within computer system 1100, and at least one hardware processing device, such as processor 1112, coupled to bus 1122 for processing information. Bus 1122 preferably includes low-latency and higher latency paths that are connected by bridges and adapters and controlled within computer system 1100 by multiple bus controllers. When implemented as a server or node, computer system 1100 may include multiple processors designed to improve network servicing power. -
Processor 1112 may be at least one general-purpose processor that, during normal operation, processes data under the control of software 1150, which may include at least one of application software, an operating system, middleware, and other code and computer executable programs accessible from a dynamic storage device such as random access memory (RAM) 1114, a static storage device such as Read Only Memory (ROM) 1116, a data storage device such as mass storage device 1118, or other data storage medium. Software 1150 may include, but is not limited to, code, applications, protocols, interfaces, and processes for controlling one or more systems within a network including, but not limited to, an adapter, a switch, a server, a cluster system, and a grid environment. -
Computer system 1100 may communicate with a remote computer, such as server 1140, or a remote client. In one example, server 1140 may be connected to computer system 1100 through any type of network, such as network 1102, through a communication interface, such as network interface 1132, or over a network link that may be connected, for example, to network 1102. - In the example, multiple systems within a network environment may be communicatively connected via
network 1102, which is the medium used to provide communications links between the various devices and computer systems communicatively connected. Network 1102 may include permanent connections, such as wire or fiber optic cables, and temporary connections made through telephone connections and wireless transmission connections, for example, and may include routers, switches, gateways, and other hardware to enable a communication channel between the systems connected via network 1102. Network 1102 may represent one or more of packet-switching based networks, telephony based networks, broadcast television networks, local area and wide area networks, public networks, and restricted networks. -
Network 1102 and the systems communicatively connected to computer 1100 via network 1102 may implement one or more layers of one or more types of network protocol stacks, which may include one or more of a physical layer, a link layer, a network layer, a transport layer, a presentation layer, and an application layer. For example, network 1102 may implement one or more of the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol stack or an Open Systems Interconnection (OSI) protocol stack. In addition, for example, network 1102 may represent the worldwide collection of networks and gateways that use the TCP/IP suite of protocols to communicate with one another. Network 1102 may implement a secure HTTP protocol layer or other security protocol for securing communications between systems. - In the example,
network interface 1132 includes an adapter 1134 for connecting computer system 1100 to network 1102 through a link and for communicatively connecting computer system 1100 to server 1140 or other computing systems via network 1102. Although not depicted, network interface 1132 may include additional software, such as device drivers, additional hardware, and other controllers that enable communication. When implemented as a server, computer system 1100 may include multiple communication interfaces accessible via multiple peripheral component interconnect (PCI) bus bridges connected to an input/output controller, for example. In this manner, computer system 1100 allows connections to multiple clients via multiple separate ports, and each port may also support multiple connections to multiple clients. - In one embodiment, the operations performed by
processor 1112 may control the operations of the flowcharts of FIGS. 12-13 and other operations described herein. Operations performed by processor 1112 may be requested by software 1150 or other code, or the steps of one embodiment of the invention might be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components. In one embodiment, one or more components of computer system 1100, or other components, which may be integrated into one or more components of computer system 1100, may contain hardwired logic for performing the operations of the flowcharts in FIGS. 12-13. - In addition,
computer system 1100 may include multiple peripheral components that facilitate input and output. These peripheral components are connected to multiple controllers, adapters, and expansion slots, such as input/output (I/O) interface 1126, coupled to one of the multiple levels of bus 1122. For example, input device 1124 may include, for example, a microphone, a video capture device, an image scanning system, a keyboard, a mouse, or other input peripheral device, communicatively enabled on bus 1122 via I/O interface 1126 controlling inputs. In addition, for example, output device 1120, communicatively enabled on bus 1122 via I/O interface 1126 for controlling outputs, may include, for example, one or more graphical display devices, audio speakers, and tactile detectable output interfaces, but may also include other output interfaces. In alternate embodiments of the present invention, additional or alternate input and output peripheral components may be added. - With respect to
FIG. 11, the present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention. - The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
- Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
- Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
- Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
- These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
- The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
- The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
- Those of ordinary skill in the art will appreciate that the hardware depicted in
FIG. 11 may vary. Furthermore, those of ordinary skill in the art will appreciate that the depicted example is not meant to imply architectural limitations with respect to the present invention. -
FIG. 12 illustrates a high level logic flowchart of a process and computer program for training a whole sentence RNN LM on an RNN LSTM architecture. - In one example, the process and program start at
block 1200 and thereafter proceed to block 1202. Block 1202 illustrates selecting one correct sentence from the training data. Next, block 1204 illustrates creating N incorrect sentences by applying noise samplers. Thereafter, block 1206 illustrates applying a feed forward pass for each of the N+1 sentences through the RNN layer, to an NN scorer for generating a single value for each entire sentence, and to an additional NN layer for identifying whether the single value probability score is correct or not correct. Next, block 1208 illustrates training the model to classify the correct sentence from the others, and the process ends. -
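The four blocks above can be sketched as a training loop. The ToyScorer below (a bag-of-words scorer with an ad hoc update rule) is a deliberately simplified stand-in for the whole sentence RNN LM and its NCE objective; only the control flow mirrors the flowchart.

```python
import random

class ToyScorer:
    """Stand-in for the whole sentence RNN LM: scores a sentence as a sum
    of per-word weights (the real model uses an RNN and an NN scorer)."""
    def __init__(self):
        self.w = {}
    def score(self, sentence):
        return sum(self.w.get(word, 0.0) for word in sentence)
    def update(self, correct, incorrect, lr=0.1):
        # Push the correct sentence's score up and the noisy scores down
        for word in correct:
            self.w[word] = self.w.get(word, 0.0) + lr
        for sent in incorrect:
            for word in sent:
                self.w[word] = self.w.get(word, 0.0) - lr / len(incorrect)

def sub_sampler(sentence, rng):
    """Noise sampler: substitute one word (one of the block 1204 samplers)."""
    words = list(sentence)
    words[rng.randrange(len(words))] = "noise"
    return words

def train_epoch(model, training_data, sampler, n_noise=4, seed=0):
    rng = random.Random(seed)
    for correct in training_data:                                    # block 1202
        incorrect = [sampler(correct, rng) for _ in range(n_noise)]  # block 1204
        _scores = [model.score(s) for s in [correct] + incorrect]    # block 1206
        model.update(correct, incorrect)                             # block 1208

model = ToyScorer()
data = ["july the twentieth nineteen eighty".split()]
train_epoch(model, data, sub_sampler)
```

After one epoch this toy scorer already ranks the correct sentence above any single-substitution imposter, which is the classification behavior block 1208 trains for.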
FIG. 13 illustrates a high level logic flowchart of a process and computer program product for testing an NLP system function implementing a whole sentence RNN LM on an RNN LSTM architecture. - In one example, the process and computer program start at
block 1300 and thereafter proceed to block 1302. Block 1302 illustrates selecting a test set from 10% of the generated data. Next, block 1304 illustrates generating imposter sentences by substituting one word in the selected test set sentences. Thereafter, block 1306 illustrates assigning scores to each test set sentence and its imposter sentences by running each sentence through the model. Next, block 1308 illustrates evaluating performance by the classification accuracy of the scores as determined by a trained binary linear classifier. - The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, occur substantially concurrently, or the blocks may sometimes occur in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. -
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. -
- The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the one or more embodiments of the invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
- While the invention has been particularly shown and described with reference to one or more embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
Claims (20)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/954,399 US10431210B1 (en) | 2018-04-16 | 2018-04-16 | Implementing a whole sentence recurrent neural network language model for natural language processing |
CN201910298712.9A CN110389996B (en) | 2018-04-16 | 2019-04-15 | Implementing a full sentence recurrent neural network language model for natural language processing |
US16/549,893 US10692488B2 (en) | 2018-04-16 | 2019-08-23 | Implementing a whole sentence recurrent neural network language model for natural language processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/954,399 US10431210B1 (en) | 2018-04-16 | 2018-04-16 | Implementing a whole sentence recurrent neural network language model for natural language processing |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/549,893 Continuation US10692488B2 (en) | 2018-04-16 | 2019-08-23 | Implementing a whole sentence recurrent neural network language model for natural language processing |
Publications (2)
Publication Number | Publication Date |
---|---|
US10431210B1 US10431210B1 (en) | 2019-10-01 |
US20190318732A1 true US20190318732A1 (en) | 2019-10-17 |
Family
ID=68063926
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/954,399 Active US10431210B1 (en) | 2018-04-16 | 2018-04-16 | Implementing a whole sentence recurrent neural network language model for natural language processing |
US16/549,893 Active US10692488B2 (en) | 2018-04-16 | 2019-08-23 | Implementing a whole sentence recurrent neural network language model for natural language processing |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/549,893 Active US10692488B2 (en) | 2018-04-16 | 2019-08-23 | Implementing a whole sentence recurrent neural network language model for natural language processing |
Country Status (2)
Country | Link |
---|---|
US (2) | US10431210B1 (en) |
CN (1) | CN110389996B (en) |
Cited By (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112149406A (en) * | 2020-09-25 | 2020-12-29 | 中国电子科技集团公司第十五研究所 | Chinese text error correction method and system |
US11289073B2 (en) * | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US11538469B2 (en) | 2017-05-12 | 2022-12-27 | Apple Inc. | Low-latency intelligent automated assistant |
US11557310B2 (en) | 2013-02-07 | 2023-01-17 | Apple Inc. | Voice trigger for a digital assistant |
US11630525B2 (en) | 2018-06-01 | 2023-04-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US11675491B2 (en) | 2019-05-06 | 2023-06-13 | Apple Inc. | User configurable task triggers |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
US11699448B2 (en) | 2014-05-30 | 2023-07-11 | Apple Inc. | Intelligent assistant for home automation |
US11705130B2 (en) | 2019-05-06 | 2023-07-18 | Apple Inc. | Spoken notifications |
US11783815B2 (en) | 2019-03-18 | 2023-10-10 | Apple Inc. | Multimodality in digital assistant systems |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11809886B2 (en) | 2015-11-06 | 2023-11-07 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11838579B2 (en) | 2014-06-30 | 2023-12-05 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US11837237B2 (en) | 2017-05-12 | 2023-12-05 | Apple Inc. | User-specific acoustic models |
US11893992B2 (en) | 2018-09-28 | 2024-02-06 | Apple Inc. | Multi-modal inputs for voice commands |
US11900936B2 (en) | 2008-10-02 | 2024-02-13 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11907436B2 (en) | 2018-05-07 | 2024-02-20 | Apple Inc. | Raise to speak |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US11954405B2 (en) | 2015-09-08 | 2024-04-09 | Apple Inc. | Zero latency digital assistant |
US11979836B2 (en) | 2007-04-03 | 2024-05-07 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US12001933B2 (en) | 2015-05-15 | 2024-06-04 | Apple Inc. | Virtual assistant in a communication session |
US12014118B2 (en) | 2017-05-15 | 2024-06-18 | Apple Inc. | Multi-modal interfaces having selection disambiguation and text modification capability |
US12026197B2 (en) | 2017-05-16 | 2024-07-02 | Apple Inc. | Intelligent automated assistant for media exploration |
US12051413B2 (en) | 2015-09-30 | 2024-07-30 | Apple Inc. | Intelligent device identification |
US12067985B2 (en) | 2018-06-01 | 2024-08-20 | Apple Inc. | Virtual assistant operations in multi-device environments |
US12118999B2 (en) | 2014-05-30 | 2024-10-15 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
Families Citing this family (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10839792B2 (en) * | 2019-02-05 | 2020-11-17 | International Business Machines Corporation | Recognition of out-of-vocabulary in direct acoustics-to-word speech recognition using acoustic word embedding |
JP7298192B2 (en) * | 2019-03-01 | 2023-06-27 | 日本電信電話株式会社 | Generation device, generation method and program |
US11636346B2 (en) * | 2019-05-06 | 2023-04-25 | Brown University | Recurrent neural circuits |
US11163620B2 (en) * | 2019-05-20 | 2021-11-02 | Fujitsu Limited | Predicting API endpoint descriptions from API documentation |
CA3078749A1 (en) * | 2019-05-22 | 2020-11-22 | Element Ai Inc. | Neural network execution block using fully connected layers |
CN114365142B (en) * | 2019-10-31 | 2024-10-18 | 微软技术许可有限责任公司 | Determining a state of a content characteristic of an electronic communication |
CN111090981B (en) * | 2019-12-06 | 2022-04-15 | 中国人民解放军战略支援部队信息工程大学 | Method and system for building Chinese text automatic sentence-breaking and punctuation generation model based on bidirectional long-time and short-time memory network |
US11544458B2 (en) * | 2020-01-17 | 2023-01-03 | Apple Inc. | Automatic grammar detection and correction |
US20230343333A1 (en) | 2020-08-24 | 2023-10-26 | Unlikely Artificial Intelligence Limited | A computer implemented method for the automated analysis or use of data
CN112487785A (en) * | 2020-12-14 | 2021-03-12 | 北京声智科技有限公司 | RNN-based language model training method and related device |
CN117043859A (en) * | 2021-03-24 | 2023-11-10 | 谷歌有限责任公司 | Lookup-table recurrent language model
CN113298365B (en) * | 2021-05-12 | 2023-12-01 | 北京信息科技大学 | Cultural additional value assessment method based on LSTM |
US20220382973A1 (en) * | 2021-05-28 | 2022-12-01 | Microsoft Technology Licensing, Llc | Word Prediction Using Alternative N-gram Contexts |
US11966428B2 (en) * | 2021-07-01 | 2024-04-23 | Microsoft Technology Licensing, Llc | Resource-efficient sequence generation with dual-level contrastive learning |
US20230008868A1 (en) * | 2021-07-08 | 2023-01-12 | Nippon Telegraph And Telephone Corporation | User authentication device, user authentication method, and user authentication computer program |
US12073180B2 (en) | 2021-08-24 | 2024-08-27 | Unlikely Artificial Intelligence Limited | Computer implemented methods for the automated analysis or use of data, including use of a large language model |
US11989527B2 (en) | 2021-08-24 | 2024-05-21 | Unlikely Artificial Intelligence Limited | Computer implemented methods for the automated analysis or use of data, including use of a large language model |
US11989507B2 (en) | 2021-08-24 | 2024-05-21 | Unlikely Artificial Intelligence Limited | Computer implemented methods for the automated analysis or use of data, including use of a large language model |
US12067362B2 (en) | 2021-08-24 | 2024-08-20 | Unlikely Artificial Intelligence Limited | Computer implemented methods for the automated analysis or use of data, including use of a large language model |
US11977854B2 (en) | 2021-08-24 | 2024-05-07 | Unlikely Artificial Intelligence Limited | Computer implemented methods for the automated analysis or use of data, including use of a large language model |
US20230147585A1 (en) * | 2021-11-11 | 2023-05-11 | International Business Machines Corporation | Dynamically enhancing supervised learning |
US11790678B1 (en) * | 2022-03-30 | 2023-10-17 | Cometgaze Limited | Method for identifying entity data in a data set |
CN116992942B (en) * | 2023-09-26 | 2024-02-02 | 苏州元脑智能科技有限公司 | Natural language model optimization method, device, natural language model, equipment and medium |
CN117578428A (en) * | 2023-11-22 | 2024-02-20 | 广西电网有限责任公司 | Wind turbine power recursive prediction method and device, computer equipment, and storage medium
Family Cites Families (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
ES2213921T3 | 1998-09-01 | 2004-09-01 | Swisscom Ag | Neural network and its application for speech recognition
US8700403B2 (en) | 2005-11-03 | 2014-04-15 | Robert Bosch Gmbh | Unified treatment of data-sparseness and data-overfitting in maximum entropy modeling |
US8775341B1 (en) * | 2010-10-26 | 2014-07-08 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
US11495213B2 (en) | 2012-07-23 | 2022-11-08 | University Of Southern California | Noise speed-ups in hidden markov models with applications to speech recognition |
US9020806B2 (en) * | 2012-11-30 | 2015-04-28 | Microsoft Technology Licensing, Llc | Generating sentence completion questions |
US20150032449A1 (en) | 2013-07-26 | 2015-01-29 | Nuance Communications, Inc. | Method and Apparatus for Using Convolutional Neural Networks in Speech Recognition |
US10867597B2 (en) | 2013-09-02 | 2020-12-15 | Microsoft Technology Licensing, Llc | Assignment of semantic labels to a sequence of words using neural network architectures |
US20150095017A1 (en) | 2013-09-27 | 2015-04-02 | Google Inc. | System and method for learning word embeddings using neural language models |
US9412365B2 (en) | 2014-03-24 | 2016-08-09 | Google Inc. | Enhanced maximum entropy models |
US20160034814A1 (en) | 2014-08-01 | 2016-02-04 | University Of Southern California | Noise-boosted back propagation and deep learning neural networks |
US9836671B2 (en) * | 2015-08-28 | 2017-12-05 | Microsoft Technology Licensing, Llc | Discovery of semantic similarities between images and text |
US10606846B2 (en) * | 2015-10-16 | 2020-03-31 | Baidu Usa Llc | Systems and methods for human inspired simple question answering (HISQA) |
US9807473B2 (en) * | 2015-11-20 | 2017-10-31 | Microsoft Technology Licensing, Llc | Jointly modeling embedding and translation to bridge video and language |
US9922647B2 (en) | 2016-01-29 | 2018-03-20 | International Business Machines Corporation | Approach to reducing the response time of a speech interface |
US10019438B2 (en) | 2016-03-18 | 2018-07-10 | International Business Machines Corporation | External word embedding neural network language models |
US9984772B2 (en) * | 2016-04-07 | 2018-05-29 | Siemens Healthcare Gmbh | Image analytics question answering |
CN106126596B (en) * | 2016-06-20 | 2019-08-23 | 中国科学院自动化研究所 | A question answering method based on a hierarchical memory network
CN106126507B (en) * | 2016-06-22 | 2019-08-09 | 哈尔滨工业大学深圳研究生院 | A deep neural machine translation method and system based on character encoding
JP6727610B2 (en) * | 2016-09-05 | 2020-07-22 | 国立研究開発法人情報通信研究機構 | Context analysis device and computer program therefor |
JP6847386B2 (en) | 2016-09-09 | 2021-03-24 | インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation | Neural network regularization |
US11182665B2 (en) | 2016-09-21 | 2021-11-23 | International Business Machines Corporation | Recurrent neural network processing pooling operation |
CN106782518A (en) * | 2016-11-25 | 2017-05-31 | 深圳市唯特视科技有限公司 | A speech recognition method based on a hierarchical recurrent neural network language model
US10372821B2 (en) * | 2017-03-17 | 2019-08-06 | Adobe Inc. | Identification of reading order text segments with a probabilistic language model |
CA3022998A1 (en) * | 2017-11-02 | 2019-05-02 | Royal Bank Of Canada | Method and device for generative adversarial network training |
- 2018
  - 2018-04-16 US US15/954,399 patent/US10431210B1/en active Active
- 2019
  - 2019-04-15 CN CN201910298712.9A patent/CN110389996B/en active Active
  - 2019-08-23 US US16/549,893 patent/US10692488B2/en active Active
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11979836B2 (en) | 2007-04-03 | 2024-05-07 | Apple Inc. | Method and system for operating a multi-function portable electronic device using voice-activation |
US11900936B2 (en) | 2008-10-02 | 2024-02-13 | Apple Inc. | Electronic devices with voice command and contextual data processing capabilities |
US11862186B2 (en) | 2013-02-07 | 2024-01-02 | Apple Inc. | Voice trigger for a digital assistant |
US12009007B2 (en) | 2013-02-07 | 2024-06-11 | Apple Inc. | Voice trigger for a digital assistant |
US11557310B2 (en) | 2013-02-07 | 2023-01-17 | Apple Inc. | Voice trigger for a digital assistant |
US11699448B2 (en) | 2014-05-30 | 2023-07-11 | Apple Inc. | Intelligent assistant for home automation |
US12118999B2 (en) | 2014-05-30 | 2024-10-15 | Apple Inc. | Reducing the need for manual start/end-pointing and trigger phrases |
US12067990B2 (en) | 2014-05-30 | 2024-08-20 | Apple Inc. | Intelligent assistant for home automation |
US11838579B2 (en) | 2014-06-30 | 2023-12-05 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US12001933B2 (en) | 2015-05-15 | 2024-06-04 | Apple Inc. | Virtual assistant in a communication session |
US11954405B2 (en) | 2015-09-08 | 2024-04-09 | Apple Inc. | Zero latency digital assistant |
US12051413B2 (en) | 2015-09-30 | 2024-07-30 | Apple Inc. | Intelligent device identification |
US11809886B2 (en) | 2015-11-06 | 2023-11-07 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US11467802B2 (en) | 2017-05-11 | 2022-10-11 | Apple Inc. | Maintaining privacy of personal information |
US11538469B2 (en) | 2017-05-12 | 2022-12-27 | Apple Inc. | Low-latency intelligent automated assistant |
US11837237B2 (en) | 2017-05-12 | 2023-12-05 | Apple Inc. | User-specific acoustic models |
US11862151B2 (en) | 2017-05-12 | 2024-01-02 | Apple Inc. | Low-latency intelligent automated assistant |
US12014118B2 (en) | 2017-05-15 | 2024-06-18 | Apple Inc. | Multi-modal interfaces having selection disambiguation and text modification capability |
US12026197B2 (en) | 2017-05-16 | 2024-07-02 | Apple Inc. | Intelligent automated assistant for media exploration |
US11907436B2 (en) | 2018-05-07 | 2024-02-20 | Apple Inc. | Raise to speak |
US11630525B2 (en) | 2018-06-01 | 2023-04-18 | Apple Inc. | Attention aware virtual assistant dismissal |
US12067985B2 (en) | 2018-06-01 | 2024-08-20 | Apple Inc. | Virtual assistant operations in multi-device environments |
US12061752B2 (en) | 2018-06-01 | 2024-08-13 | Apple Inc. | Attention aware virtual assistant dismissal |
US11893992B2 (en) | 2018-09-28 | 2024-02-06 | Apple Inc. | Multi-modal inputs for voice commands |
US12136419B2 (en) | 2019-03-18 | 2024-11-05 | Apple Inc. | Multimodality in digital assistant systems |
US11783815B2 (en) | 2019-03-18 | 2023-10-10 | Apple Inc. | Multimodality in digital assistant systems |
US11675491B2 (en) | 2019-05-06 | 2023-06-13 | Apple Inc. | User configurable task triggers |
US11705130B2 (en) | 2019-05-06 | 2023-07-18 | Apple Inc. | Spoken notifications |
US11289073B2 (en) * | 2019-05-31 | 2022-03-29 | Apple Inc. | Device text to speech |
US11790914B2 (en) | 2019-06-01 | 2023-10-17 | Apple Inc. | Methods and user interfaces for voice-based control of electronic devices |
US11914848B2 (en) | 2020-05-11 | 2024-02-27 | Apple Inc. | Providing relevant data items based on context |
US11838734B2 (en) | 2020-07-20 | 2023-12-05 | Apple Inc. | Multi-device audio adjustment coordination |
US11750962B2 (en) | 2020-07-21 | 2023-09-05 | Apple Inc. | User identification using headphones |
US11696060B2 (en) | 2020-07-21 | 2023-07-04 | Apple Inc. | User identification using headphones |
CN112149406A (en) * | 2020-09-25 | 2020-12-29 | 中国电子科技集团公司第十五研究所 | Chinese text error correction method and system |
Also Published As
Publication number | Publication date |
---|---|
US20200013393A1 (en) | 2020-01-09 |
CN110389996B (en) | 2023-07-11 |
US10692488B2 (en) | 2020-06-23 |
US10431210B1 (en) | 2019-10-01 |
CN110389996A (en) | 2019-10-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10692488B2 (en) | Implementing a whole sentence recurrent neural network language model for natural language processing | |
US11107473B2 (en) | Approach to reducing the response time of a speech interface | |
US20220083743A1 (en) | Enhanced attention mechanisms | |
US20210117797A1 (en) | Training multiple neural networks with different accuracy | |
US11929064B2 (en) | End-to-end streaming keyword spotting | |
US8275615B2 (en) | Model weighting, selection and hypotheses combination for automatic speech recognition and machine translation | |
US20170270100A1 (en) | External Word Embedding Neural Network Language Models | |
CN110444203B (en) | Voice recognition method and device and electronic equipment | |
US20160035344A1 (en) | Identifying the language of a spoken utterance | |
US10929754B2 (en) | Unified endpointer using multitask and multidomain learning | |
US11024298B2 (en) | Methods and apparatus for speech recognition using a garbage model | |
CN114766052A (en) | Emotion detection in audio interaction | |
US11645460B2 (en) | Punctuation and capitalization of speech recognition transcripts | |
CN113574545A (en) | Training data modification for training models | |
US20200365146A1 (en) | Dialog device, dialog method, and dialog computer program | |
US20070067171A1 (en) | Updating hidden conditional random field model parameters after processing individual training samples | |
JP6810580B2 (en) | Language model learning device and its program | |
Huang et al. | Whole sentence neural language models | |
US20210049324A1 (en) | Apparatus, method, and program for utilizing language model | |
US8438029B1 (en) | Confidence tying for unsupervised synthetic speech adaptation | |
CN114171006A (en) | Audio processing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK |
Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HUANG, YINGHUI;SETHY, ABHINAV;AUDHKHASI, KARTIK;AND OTHERS;SIGNING DATES FROM 20180416 TO 20180418;REEL/FRAME:045615/0542 |
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |