US20080312921A1 - Speech recognition utilizing multitude of speech features - Google Patents
- Publication number
- US20080312921A1 (US 12/195,123)
- Authority
- US
- United States
- Prior art keywords
- speech
- features
- speech recognition
- log
- multitude
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/085—Methods for reducing search complexity, pruning
Definitions
- the present invention relates generally to a speech recognition system, and more particularly, to a speech recognition system that utilizes a multitude of speech features with a log-linear model.
- Speech recognition systems are used to identify word sequences from an unknown speech utterance.
- speech features such as cepstra and delta cepstra features are extracted from the unknown utterance by a feature extractor to characterize the unknown utterance.
- a search is then done to compare the extracted features of the unknown utterance to models of speech units (such as phrases, words, syllables, phonemes, sub-phones, etc.) to compute the scores or probabilities of different word sequence hypotheses.
- the search space is restricted by pruning out unlikely hypotheses.
- The word sequence associated with the highest score, likelihood, or probability is recognized as the unknown utterance.
- a language model that determines the relative likelihood of different word sequences is also used in the calculation of the overall score of the word sequence hypotheses.
- the speech recognition models may be used to model speech as a sequence of acoustic features, or observations produced by an unobservable “true” state sequence of sub-phones, phonemes, syllables, words, phrases, and the like.
- Model parameters output from the training operation are often estimated to maximize the likelihood of the training observations.
- the optimum set of parameters for speech recognition is determined by maximizing the likelihood on the training data.
- the speech recognition system determines the word sequence with the maximum posterior probability given the observed speech signal to recognize the unknown speech utterance.
- the best word sequence hypothesis is determined through the search process that considers the scores of all possible hypotheses within the search space.
- a speech recognition system is provided.
- the combination of a log-linear model with a multitude of speech features is provided to recognize unknown speech utterances.
- the speech recognition system models the posterior probability of a hypothesis, that is, the conditional probability of a sequence of linguistic units given the observed speech signal and possibly other information, using a log-linear model.
- the posterior model captures the probability of the sequence of linguistic units given the observed speech features and the parameters of the posterior model.
- The posterior model may be determined using the probability of the word sequence hypotheses given a multitude of speech features. That is, in accordance with these exemplary aspects, the probability of a word sequence with timing information and labels, given a multitude of speech features, is used to determine the posterior model.
- the speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features.
- log-linear models are used wherein parameters may be trained with sparse or incomplete training data.
- FIG. 1 shows an exemplary speech processing system embodying the exemplary aspects of the present invention.
- FIG. 2 shows an exemplary speech recognition system embodying the exemplary aspects of the present invention.
- FIG. 3 shows an exemplary speech processor embodying the exemplary aspects of the present invention.
- FIG. 4 shows an exemplary decoder embodying the exemplary aspects of the present invention.
- FIG. 5 shows a flowchart for data training in accordance with the exemplary aspects of the present invention.
- FIG. 6 shows a flowchart for speech recognition in accordance with the exemplary aspects of the present invention.
- When referring to the figures (FIGS. 1-6), like structures and elements shown throughout are indicated with like reference numerals.
- In FIG. 1, an exemplary speech processing system 1000 embodying the exemplary aspects of the present invention is shown. It is initially noted that the speech processing system 1000 of FIG. 1 is presented for illustration purposes only, and is representative of countless configurations in which the exemplary aspects of the present invention may be implemented. Thus, the present invention should not be considered limited to the system configuration shown in the figure.
- The speech processing system 1000 includes a telephone system 210 , a voice transport system 220 , a voice input device 230 , and a server (speech recognition system) 300 .
- Terminals 110 - 120 are connected to telephone system 210 via telephone network 215 and terminals 140 - 150 are connected to voice transport system 220 via data network 225 .
- telephone system 210 , voice transport system 220 , and voice input device 230 are connected to speech recognition system 300 .
- the speech recognition system 300 is also connected to a speech database 310 .
- speech is sent from a remote user over network 215 or 225 through one of terminals 110 - 150 , or directly from voice input device 230 .
- terminals 110 - 150 run a variety of speech recognition and terminal applications.
- the speech recognition system 300 receives the input speech and provides the speech recognition results to the inputting terminal or device.
- the speech recognition system 300 may include or may be connected to a speech database 310 which includes training data, speech models, meta-data, speech data and their true transcription, language and pronunciation models, application specific data, speaker information, various types of models and parameters, and the like.
- the speech recognition system 300 then provides the optimal word sequence as the recognition output or it may provide a lattice of word sequence hypotheses with corresponding confidence scores.
- Lattices may have a plurality of embodiments, including a summary of a set of hypotheses by a graph, which may have a complex topology. It should be appreciated that if the graph contains loops, the set of hypotheses may be infinite.
- the speech processing system 1000 may be any system known in the art for speech processing.
- the speech processing system 1000 may be configured and may include various topologies and protocols known to those skilled in the art.
- Although FIG. 1 shows only 2 terminals and one voice input device, the various exemplary aspects of the present invention are not limited to any particular number of terminals and input devices.
- any number of terminals and input devices may be applied in the present invention.
- FIG. 2 shows an exemplary speech recognition system 300 embodying the exemplary aspects of the present invention.
- the speech recognition system 300 includes a speech processor 320 , a storage device 340 , an input device 360 and an output device 380 , all connected by bus 395 .
- the processor 320 of speech recognition system 300 receives the incoming speech data comprising unknown utterance, meta-data, such as caller ID, speaker gender, channel conditions, and the like, from a user at a terminal 110 - 150 or voice input device 230 through the input device 360 .
- the speech processor 320 then performs the speech recognition based on the appropriate models stored in the storage device 340 , or received from the database 310 through the input device 360 .
- the speech processor 320 then routes the recognition results to the user at the requesting terminal 110 - 150 or voice input device 230 or a computer agent (that may perform actions appropriate to what the user said) through output device 380 .
- Although FIG. 2 shows a particular form of speech recognition system, it should be understood that other layouts are possible and that the various aspects of the invention are not limited to such a layout.
- the speech processor 320 may provide recognition results based on data stored in memory 340 or the database 310 .
- the various exemplary aspects of the present invention are not limited to such layout.
- FIG. 3 shows an exemplary speech processor 320 embodying the exemplary aspects of the present invention.
- The speech processor 320 includes a decoder 322 which uses a log-linear model of the posterior probability of linguistic units relevant to speech recognition to recognize the unknown utterance. That is, from the probabilities determined, the decoder 322 determines the optimal word sequence that has the highest probability, and outputs the word sequence as the recognized output.
- the decoder may prune the lattice of possible hypotheses to restrict the search space and reduce computation time.
- the decoder 322 is further connected to a training storage 325 which stores speech data and their true transcriptions for training, and a model storage 327 that stores model parameters obtained from the training operation.
- FIG. 4 shows the decoder of FIG. 3 in further detail.
- The decoder 322 includes a feature extractor 3222 , a log-linear function 3224 , and a search device 3226 .
- training data is input to the decoder 322 along with the true word transcription from the training storage 325 , where the model parameters are generated and output to the model storage 327 , to be used during the speech recognition operation.
- unknown speech data is input to the decoder 322 along with the model parameters stored in the model storage 327 during the training operation, and the optimal word sequence is output.
- Training data is input to the feature extractor 3222 along with the meta-data and the truth from the training storage 325 . The truth can consist of the true transcriptions, which are typically words but can also be other linguistic units such as phrases, syllables, phonemes, acoustic phonetic features, sub-phones, and the like, and possibly, but not necessarily, time alignments for matching the linguistic units in the true transcription with the corresponding segments of speech. That is, the training operation is performed to determine the maximum likelihood of the truth.
- the feature extractor 3222 extracts a multitude of features from the input data using a multitude of extracting elements.
- The features may advantageously be asynchronous, overlapping, statistically non-independent, and the like, in accordance with the various exemplary aspects of this invention.
- the extracting elements include, but are not limited to, direct matching element, synchronous phonetic element, acoustic phonetic element, linguistic semantic pragmatic features element, and the like.
- the exemplary direct matching element may compute a dynamic time warping score against various reference speech segments in the database.
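The direct matching score can be illustrated with a minimal dynamic time warping sketch. It assumes that both the input segment and the reference segment are sequences of equal-dimension feature vectors and that the local cost is Euclidean distance; the function name `dtw_distance` is hypothetical and is not taken from the patent.

```python
import math

def dtw_distance(query, reference):
    """Cumulative dynamic time warping cost between two feature sequences."""
    n, m = len(query), len(reference)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning query[:i] with reference[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(query[i - 1], reference[j - 1])  # assumed local cost
            # Allowed moves: advance the query, advance the reference, or both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# A time-stretched copy of the same contour aligns at zero cost.
stretched_cost = dtw_distance([(0.0,), (1.0,), (2.0,)], [(0.0,), (1.0,), (1.0,), (2.0,)])
```

A lower cost means a closer match, so a system using such scores as features might negate or otherwise rescale them before combining them with other scores.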
- Synchronous phonetic features can be derived from traditional features like mel cepstra features.
- Acoustic phonetic features can be asynchronous features that include linguistic distinctive features such as voicing, place of articulation, and the like.
- features can also include higher level information extracted from a particular word sequence hypothesis, for example, from a semantic or syntactic parse tree, the pragmatic or semantic coherence, and the like.
- Features can also be meta-data such as speaker information, speaking rate, channel condition, and the like.
- the multitude of extracted features are then provided to a log-linear function 3224 , which, using the parameters of the log-linear model, can compute the posterior probability of a hypothesized linguistic unit or sequence, given the extracted features and possibly a particular time alignment of the linguistic units to speech data.
- The correct word sequence is known; for example, the correct sequence is created by humans transcribing the speech.
- The true time alignment of any particular unit sequence to the speech may or may not be known.
- the trainer uses the extracted features, the correct word sequence, or linguistic unit sequence, with possibly time alignments to the speech, and optimizes the parameters of the log-linear model.
- The log-linear output may be provided to the search device 3226 which can refine and provide a better linguistic unit sequence choice and a more accurate time alignment of the linguistic unit sequence to the speech.
- This new alignment may then be looped back to the feature extractor 3222 as FEEDBACK to repeat the process for a second time to optimize the model parameters.
- the initial time alignment may be bootstrapped by human annotation or by hidden Markov model technology.
- The model parameters corresponding to the maximum likelihood are determined as the training model parameters, and are sent to the model storage 327 , where they are stored for the subsequent speech recognition operations.
- the log linear models are trained using any one of several algorithms, including improved iterative scaling, iterative scaling, preconditioned conjugate gradient, and the like.
- the training results in optimizing the parameters of the model in terms of some criterion such as maximum likelihood or maximum entropy subject to some constraints.
- The training is performed by a trainer (not shown) that uses the features provided by the feature extractor, the correct linguistic unit sequence and the corresponding time alignment to the speech.
- Preprocessing may be performed by a state-of-the-art hidden Markov model recognition system to extract the features and to align the target unit sequences.
- The hidden Markov model may be used to align the speech frames to optimal sub-phone state sequences and to determine the top ranked Gaussians. That is, within the hidden Markov model, the Gaussian probability models of traditional features, such as mel cepstra features, that are the best match to the speech frame are pre-determined.
- sub-phone state sequences and the ranked Gaussian data are features used to train the log linear model.
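As a rough illustration of what "top ranked Gaussians" could mean for a single speech frame, the sketch below ranks diagonal-covariance Gaussians by their log-likelihood of the frame. The diagonal-covariance form, the function names, and the toy values are assumptions made for illustration only.

```python
import math

def diag_gaussian_loglik(frame, mean, var):
    """Log-likelihood of a frame under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )

def top_ranked_gaussians(frame, gaussians, n=2):
    """Return the indices of the n Gaussians that best match the frame."""
    scored = sorted(
        ((diag_gaussian_loglik(frame, mean, var), idx)
         for idx, (mean, var) in enumerate(gaussians)),
        reverse=True,
    )
    return [idx for _, idx in scored[:n]]

# Three hypothetical one-dimensional Gaussians given as (mean, variance).
gaussians = [([0.0], [1.0]), ([5.0], [1.0]), ([1.0], [1.0])]
ranked = top_ranked_gaussians([0.9], gaussians, n=2)
```

The returned identities (here, indices) are the kind of discrete value that could then serve as input features to a log-linear model.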
- speech data to be recognized is input to the feature extractor 3222 along with the meta-data, and possibly a lattice that comprises the current search space of the search device 3226 .
- This lattice may be pre-generated by well known technology based on hidden Markov models, or may be generated on a previous round of recognition.
- the lattice is a compact representation of the current set of scores/probabilities of various possible hypotheses considered within the search space.
- The feature extractor 3222 then extracts a multitude of features from the input data using a multitude of extracting elements. It should be appreciated that the features may be asynchronous, overlapping, statistically non-independent, and the like, in accordance with the various exemplary aspects of this invention.
- the extracting elements include, but are not limited to, direct matching element, synchronous phonetic element, acoustic phonetic element, linguistic semantic pragmatic features element, and the like.
- the multitude of extracted features is then provided to a log-linear function 3224 .
- the search device 3226 is provided to determine the optimal word sequence of all possible word sequences. In an exemplary embodiment, the search device 3226 limits the search to the most promising candidates by pruning out unlikely word sequences.
- the search device 3226 consults the log-linear function 3224 about the likelihood of entire or partial word or other unit sequences.
- The search space considered by the search device 3226 may be represented as a lattice that is a compact representation of the hypotheses under active consideration, along with the scores/probabilities. Such a lattice may be an input to the search device, constraining the search space, or an output after work has been done by the search device 3226 to update the probabilities in the lattice or to prune out unlikely paths.
- the search device 3226 may also advantageously combine the probabilities/scores from the log-linear function 3224 with probabilities/scores from other models such as language model, hidden Markov model, and the like in a non-log-linear fashion such as linear interpolation after dynamic range compensation.
- language model and hidden Markov model information may also be considered features that are combined in the log-linear function 3224 .
- the output of the search device 3226 is an optimal word sequence with the highest posterior probability among all the hypotheses in the search space.
- The search device 3226 may also output a highly pruned lattice of highly likely hypotheses, of which an N-best list may be an example, that may be utilized by a computer agent to take further action.
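Two simple ways to restrict a hypothesis set are an N-best cut and a beam relative to the best score. The sketch below assumes hypotheses are held as `(word_sequence, log_score)` pairs; that representation, the function names, and the toy scores are illustrative and are not specified by the patent.

```python
import heapq

def n_best(hypotheses, n):
    """Keep the n highest-scoring (word_sequence, log_score) pairs, best first."""
    return heapq.nlargest(n, hypotheses, key=lambda h: h[1])

def beam_prune(hypotheses, beam):
    """Drop hypotheses whose log score is more than `beam` below the best one."""
    best = max(score for _, score in hypotheses)
    return [h for h in hypotheses if h[1] >= best - beam]

hyps = [
    (("recognize", "speech"), -1.5),
    (("wreck", "a", "nice", "beach"), -4.0),
    (("recognize", "beach"), -1.7),
]
top_two = n_best(hyps, 2)
within_beam = beam_prune(hyps, 1.0)
```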
- the search device 3226 may also output a lattice with updated scores and possibly alignments that can be fed back into the feature extractor 3222 and log-linear function 3224 to refine the scores/probabilities. It should be appreciated that, in accordance with the various exemplary embodiments of this invention, this last step may be optional.
- a single-pass decoding or multiple-pass decoding may be applied, where a lattice, or list of top hypotheses, may be generated in the first pass using a crude model and may be looped back and rescored using the more refined model in a subsequent pass.
- the probability of each of the word sequences in the lattice is evaluated.
- the probability of each specific word sequence may be related to the probability of the best alignment of its constituent sub-phone state sequence. It should be appreciated that the optimally aligned state sequence may be found in any variety of alignment process in accordance with the various embodiments of this invention, and that this invention is not limited to any particular alignment.
- Selecting the word sequence with the highest probability is done using the new model to perform word recognition.
- the probabilities from various models may be combined heuristically with the probability from the log linear model of the various exemplary embodiments of this invention.
- Multiple scores may be combined, including the traditional hidden Markov model likelihood score and the language model score, through linear interpolation after dynamic range compensation, with the probability score from the log linear model of the various exemplary embodiments of this invention.
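The text does not define "dynamic range compensation". The sketch below assumes one plausible reading: each model's scores over the same hypothesis list are rescaled to the range [0, 1] before being linearly interpolated. Both that interpretation and the names `compensate_range` and `interpolate` are assumptions, not the patent's stated method.

```python
def compensate_range(scores):
    """Assumed compensation: min-max rescale one model's scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)  # no spread: every hypothesis is treated alike
    return [(s - lo) / (hi - lo) for s in scores]

def interpolate(score_lists, weights):
    """Linearly interpolate several models' scores, hypothesis by hypothesis."""
    normalized = [compensate_range(scores) for scores in score_lists]
    count = len(score_lists[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, normalized))
        for i in range(count)
    ]

# Hypothetical scores for three hypotheses from two models on very different scales.
hmm_log_likelihoods = [-100.0, -50.0, 0.0]
log_linear_scores = [1.0, 3.0, 2.0]
combined = interpolate([hmm_log_likelihoods, log_linear_scores], [0.5, 0.5])
```

Without some rescaling of this kind, the model with the widest numeric range would dominate the interpolated score regardless of the interpolation weights.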
- the search device 3226 consults the log-linear function 3224 repeatedly in determining the scores/probabilities of different sequences.
- the lattice is consulted by the search device 3226 to determine what hypothesis to consider.
- Each path in the lattice corresponds to a word sequence and has an associated probability stored in the lattice.
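A lattice of this kind can be sketched as a directed acyclic graph whose edges carry a word and a log probability. The dictionary layout, the node numbering, and the function name `best_path` below are hypothetical; the sketch assumes the graph has no loops and that a path's score is the sum of its edge log probabilities.

```python
def best_path(lattice, start, end):
    """Return (log_prob, words) of the highest-scoring path from start to end.

    lattice maps a node to a list of (next_node, word, log_prob) edges.
    """
    memo = {}

    def solve(node):
        if node == end:
            return 0.0, []
        if node not in memo:
            best = (float("-inf"), [])
            for nxt, word, logp in lattice.get(node, []):
                score, words = solve(nxt)
                if logp + score > best[0]:
                    best = (logp + score, [word] + words)
            memo[node] = best  # each node is solved once
        return memo[node]

    return solve(start)

toy_lattice = {
    0: [(1, "recognize", -1.0), (1, "wreck a nice", -2.0)],
    1: [(2, "speech", -0.5), (2, "beach", -0.7)],
}
score, words = best_path(toy_lattice, 0, 2)
```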
- the log linear models are determined based on the posterior probability of a hypothesis given a multitude of speech features.
- the log linear model allows for the potential combination of multiple features in a unified fashion. For example, asynchronous and overlapping features may be incorporated formally.
- The posterior probability may be represented as the probability of a sequence associated with a hypothesis given a sequence of acoustic observations:
- $$P(H_j \mid o_1^T) = P(w_1^k \mid o_1^T) = \prod_{i=1}^{k} P(w_i \mid w_1^{i-1}, o_1^T) \qquad (1)$$
- where:
- $i$ is the index pointing to the $i$th word (or unit),
- $k$ is the number of words (units) in the hypothesis,
- $T$ is the length of the speech signal (e.g. number of frames),
- $w_1^k$ is the sequence of words associated with the hypothesis $H_j$, and
- $o_1^T$ is the sequence of acoustic observations.
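- The conditional probabilities may be represented by a maximum entropy log-linear model:
- $$P(w_i \mid w_1^{i-1}, o_1^T) = \frac{\exp\left(\sum_j \lambda_j f_j(w_i, w_1^{i-1}, o_1^T)\right)}{Z(w_1^{i-1}, o_1^T)} \qquad (2)$$
- where $\lambda_j$ are the parameters of the log-linear model, $f_j$ are the features, and $Z$ is the normalization factor.
- The normalization factor $Z$ ensures that Equation 2 is a true probability (will sum up to 1).
- The normalization factors are a function of the conditioned variables.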
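A log-linear conditional probability of this form can be sketched as follows. The candidate words, the two toy feature functions, and the weights are hypothetical; the only structure taken from the text is that the score is a weighted sum of features and that the normalization depends only on the conditioning variables.

```python
import math

def log_linear_posterior(candidates, history, observations, features, weights):
    """Return P(w | history, observations) for every candidate w."""
    # Unnormalized log score: sum_j lambda_j * f_j(w, history, observations)
    scores = {
        w: sum(lam * f(w, history, observations) for lam, f in zip(weights, features))
        for w in candidates
    }
    shift = max(scores.values())  # subtract the max for numerical stability
    # The normalizer sums over candidates, so it depends only on the conditioning.
    z = sum(math.exp(s - shift) for s in scores.values())
    return {w: math.exp(s - shift) / z for w, s in scores.items()}

# Hypothetical features: an acoustic-phonetic cue and a word-history cue.
def f_voiced(w, history, obs):
    return 1.0 if obs["voiced"] and w == "zoo" else 0.0

def f_after_the(w, history, obs):
    return 1.0 if history[-1:] == ["the"] and w in ("zoo", "sue") else 0.0

posterior = log_linear_posterior(
    ["zoo", "sue"], ["the"], {"voiced": True},
    [f_voiced, f_after_the], [math.log(2.0), 0.5],
)
```

Because the second feature fires for both candidates, it cancels in the normalization, and only the voicing feature separates the two words.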
- the speech recognition system shown in FIGS. 1-4 models the posterior probability of linguistic units relevant to speech recognition using a log-linear model.
- the posterior model captures the probability of the linguistic unit given the observed speech features and the parameters of the posterior model.
- the posterior model may be used to determine the probability of the word sequence hypotheses given a multitude of speech features.
- The sequence $w_1^k$ need not be a word sequence, but can also be a sequence of phrases, syllables, phonemes, sub-phone units, and the like associated with the spoken sentence.
- The model of the various aspects of the present invention may therefore apply at different levels of linguistic hierarchy, and the features $f_j$ may include many possibilities, including: synchronous and asynchronous, disjoint and overlapping, correlated and uncorrelated, segmental and suprasegmental, acoustic phonetic, hierarchical linguistic, meta-data, higher level knowledge, and the like.
- the speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features.
- A feature may be defined as a function $f$ with the following properties:
- $$f_{b,w}^{\alpha}(c_i, w_i) = \begin{cases} \alpha & \text{if } w_i = w \text{ and } b(c_i) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
- where:
- $c_i$ denotes everything the probability is conditioned on, which may include context and observations,
- $b$ is a binary function expressing some property of the conditioned event,
- $w$ is the target (or predicted) state/unit such as a word, and
- $\alpha$ is the weight of the function.
- That is, a feature is a computable function that is conditioned upon context and observation, and that may be thought of as firing or becoming active for a specific context/observation and a specific prediction, for example, $w_i$.
- The weight of the function $\alpha$ may be equal to 1 or 0, or may be real-valued.
- The weight $\alpha$ may be related to the confidence of whether the property was detected in the speech signal, or the importance of that property.
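The feature definition above can be sketched as a small factory that binds a binary property, a target unit, and a weight. The helper name `make_feature`, the `voicing_confidence` key, and the 0.5 threshold are hypothetical choices made for illustration.

```python
def make_feature(b, w, alpha=1.0):
    """Build f(c_i, w_i) that returns alpha when w_i == w and b(c_i) holds, else 0."""
    def f(c_i, w_i):
        return alpha if (w_i == w and b(c_i)) else 0.0
    return f

# Hypothetical binary property: voicing was detected with enough confidence.
def is_voiced(context):
    return context.get("voicing_confidence", 0.0) > 0.5

# The weight 0.8 stands in for the confidence or importance of the property.
f_voiced_zoo = make_feature(is_voiced, "zoo", alpha=0.8)

fired = f_voiced_zoo({"voicing_confidence": 0.9}, "zoo")
wrong_word = f_voiced_zoo({"voicing_confidence": 0.9}, "sue")
no_evidence = f_voiced_zoo({}, "zoo")
```

The feature fires only when both the property and the predicted unit match, which is what lets asynchronous or overlapping cues each contribute independently to the sum in the exponent.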
- the lattice output from the decoder 322 may consist of more than one score.
- scores may be obtained of the top predetermined number of matches.
- other data may be used by the search device 3226 , including such information as the hidden Markov model scores obtained from a hidden Markov model decoder and scores for different match levels of Dynamic Time Warping, such as word vs syllable vs allophone.
- An exemplary method of combining the different scores is to use a log-linear model and then train the parameters of the log-linear model.
- The log-linear model for the posterior probability of a path $H_i$ may be given by the exponent of the sum of a linear combination of the different scores:
- $$P(H_i) = \frac{1}{Z} \exp\left(\sum_{w \in H_i} \sum_j \lambda_j F_{wj}\right) \qquad (4)$$
- where $\lambda_j$ are the weights of the linear combination and $F_{wj}$ is the $j$th score feature for the segment spanned by word $w$. For example, if the top 10 Dynamic Time Warping scores and the hidden Markov score obtained by various well known Dynamic Time Warping and hidden Markov model technologies (not explicitly shown in the figures) are returned, then there will be 11 score features for each word in the lattice.
- $Z$ is the normalization constant given by the sum over all paths ($H_{1 \ldots 3}$) of the exponential term:
- $$Z = \sum_{i} \exp\left(\sum_{w \in H_i} \sum_j \lambda_j F_{wj}\right) \qquad (5)$$
- The normalization ensures that Equation (4) is a true probability, that is, it sums to 1.
- The parameters $\lambda_j$ may be estimated by maximizing the likelihood of the correct path, that is, maximizing the probability of the hypothesis over all the training data.
- The weight parameters $\lambda_j$ can have dependencies themselves. For example, they could be a function of the length of the word or of the number of training samples for that word/syllable/phone/the like.
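The path posterior and its maximum likelihood training can be sketched together. In the sketch below each path is a list of per-word score-feature vectors, and the weights are fitted by plain gradient ascent on the log probability of the correct path; the optimizer, the learning rate, and the toy lattice are assumptions, since the text names other training algorithms.

```python
import math

def path_score(path, lam):
    """Sum over words in the path of the weighted score features."""
    return sum(l * f for F_w in path for l, f in zip(lam, F_w))

def path_posteriors(paths, lam):
    """Posterior of every path, normalized over all paths in the lattice."""
    totals = [path_score(p, lam) for p in paths]
    shift = max(totals)  # for numerical stability
    exps = [math.exp(t - shift) for t in totals]
    z = sum(exps)
    return [e / z for e in exps]

def feature_totals(path, dim):
    return [sum(F_w[j] for F_w in path) for j in range(dim)]

def log_likelihood_gradient(paths, correct, lam):
    """Gradient of log P(correct path): observed minus expected feature totals."""
    dim = len(lam)
    post = path_posteriors(paths, lam)
    observed = feature_totals(paths[correct], dim)
    expected = [
        sum(p * feature_totals(path, dim)[j] for p, path in zip(post, paths))
        for j in range(dim)
    ]
    return [o - e for o, e in zip(observed, expected)]

# Toy lattice: two paths of two words each, two score features per word.
paths = [[[1.0, 0.2], [0.8, 0.1]], [[0.3, 0.9], [0.2, 0.7]]]
lam = [0.0, 0.0]
initial = path_posteriors(paths, lam)
for _ in range(100):
    grad = log_likelihood_gradient(paths, correct=0, lam=lam)
    lam = [l + 0.1 * g for l, g in zip(lam, grad)]
trained = path_posteriors(paths, lam)
```

With all weights at zero every path is equally likely; training raises the weight of the score feature that favors the correct path and lowers the other.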
- Equation (4) may further be generalized to having an exponent which is a weighted sum of general features, each of which is a function of the path $H_i$ and the acoustic observation sequence $o_1^T$.
- Such features may include non-verbal information, such as whether test and training sequences are from the same gender, same speaker, same noise condition, same phonetic context, etc.
- The individual word scores $F_{wj}$ may themselves be taken to be posterior word probabilities from a log-linear model.
- the log-linear models may be calculated quite tractably even using lots of features. Examples of features are Dynamic Time Warping, hidden Markov model, and the like.
- Log-linear models are used to make the best use of any given set of detected features, without the use of assumptions about features that are not present. That is, in contrast to other models such as hidden Markov models, which require using the same set of features in training and testing operations, the log-linear models make no assumptions about unobserved features, so that if some feature is not observable, due to noise masking for example, the log-linear model will make the best use of the other available features.
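One simple way to picture this behavior is a scorer in which a feature that cannot be observed simply contributes nothing to the exponent. The convention that a feature function returns `None` when its evidence is missing, and the toy pitch and energy features, are assumptions made for this sketch.

```python
import math

def posterior_from_available(candidates, observations, features, weights):
    """Log-linear posterior that sums only over features that could be computed."""
    scores = {}
    for w in candidates:
        total = 0.0
        for lam, f in zip(weights, features):
            value = f(w, observations)
            if value is not None:  # unobserved feature: contributes nothing
                total += lam * value
        scores[w] = total
    shift = max(scores.values())
    z = sum(math.exp(s - shift) for s in scores.values())
    return {w: math.exp(s - shift) / z for w, s in scores.items()}

# Hypothetical features; each reports None when its evidence was masked.
def f_pitch(w, obs):
    if "pitch" not in obs:
        return None
    return 1.0 if (obs["pitch"] > 150.0) == (w == "high") else 0.0

def f_energy(w, obs):
    if "energy" not in obs:
        return None
    return 1.0 if (obs["energy"] > 0.5) == (w == "high") else 0.0

# The pitch evidence is missing (for example, masked by noise); energy is still used.
masked = posterior_from_available(
    ["high", "low"], {"energy": 0.9}, [f_pitch, f_energy], [1.0, 1.0]
)
```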
- the speech recognition system may make full use of the known models by training the known models with the log linear model, to obtain the first lattice, alignment, or decoding using the known models to combine with the log linear model of this invention.
- log-linear model is provided that utilizes among many possible features, the identities of the Gaussians that are the best match to traditional short time spectral features, in a traditional Gaussian mixture model comprising weighted combinations of Gaussian distributions of spectral features such as mel cepstra features, widely used in hidden Markov models, and matching of speech segments to a large corpus of training data.
- Advantages, such as not requiring all features used in training to appear in testing/recognition operations, may be obtained. That is, with models other than log linear models, if features used for training do not appear in testing, a “mismatched condition” is obtained and performance is poor. Accordingly, usage of models other than a log linear model often results in failure if some features used in training are obscured by noise and are not present in the test data.
- FIG. 5 shows a flowchart of a method for data training according to the various exemplary aspects of the present invention.
- control proceeds to step 5100 , where training data and meta-data are input to the decoder.
- This data contains the speech data typically collected and stored beforehand in the training storage, including the truth stored.
- meta data may include such information as speaker gender or identity, recording channel, personal profile of speaker, and the like.
- the truth may generally consist of the true word sequence transcription created by human transcribers.
- a model is input to the decoder. This model is a general model stored beforehand in the model storage.
- a prestored lattice is input. Control then proceeds to step 5400 .
- In step 5400 , a multitude of features are extracted and a search is performed. These features include those derived from traditional spectral features such as mel cepstra and time derivatives; acoustic phonetic or articulatory distinctive features such as voicing, place of articulation, and the like; scores from dynamic time warping matches to speech segments; higher level information extracted from a particular word sequence hypothesis, for example, from a semantic or syntactic parse tree, the pragmatic or semantic coherence, and the like; and speaking rate, channel condition, and the like. It should also be appreciated that some of the features extracted in this step may include log-linear or other models which will be updated in this process.
- A lattice with scores, objective functions, and auxiliary statistics are determined using a log-linear function according to the various exemplary embodiments of this invention.
- a plurality of objective functions are calculated in this step due to the fact that a plurality of models are being trained in this process, that is, the log linear model giving the overall score as well as any other models used for feature extraction.
- the top level objective function is total posterior likelihood, which is to be maximized.
- the auxiliary statistics calculated in this step may include gradient functions, and other statistics required for optimization using an auxiliary function technique.
- step 5500 it is determined if the objective functions are close enough to optimal. It should be appreciated that there are a plurality of tests for optimality, including thresholds on increase of objective functions or gradients. If optimality has not been reached, control continues to step 5600 , where the models are updated and then control returns to step 5200 . In step 5600 , the models are updated using the auxiliary statistics. It is to be appreciated that there are a plurality of methods for updating the models, including but not limited to quasi-Newton gradient search, generalized iterative scaling, and extended Baum-Welch, and expectation maximization.
- step 5400 efficient implementations may only update a subset of parameters in an iteration, and thus, in step 5400 , only a restricted calculation need be performed. This restriction may include only updating a single feature extractor.
- step 5700 If optimality has been reached, control continues to step 5700 , where the model parameters are output. Then, in step 5800 , the process ends.
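- The iterate-until-optimal loop of steps 5400 through 5700 can be illustrated with a minimal sketch. It is not the implementation described here: it assumes plain gradient ascent as the update rule (the text lists several alternatives), a small fixed candidate set per training example in place of a lattice, and an optimality test based on a threshold on the increase of the objective. The names `posterior`, `train_log_linear`, `lr`, `tol`, and `max_iter` are hypothetical.

```python
import numpy as np

def posterior(lam, feats):
    """Log-linear posterior over candidates; feats has shape (n_candidates, n_features)."""
    scores = feats @ lam
    scores = scores - scores.max()      # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()

def train_log_linear(examples, n_features, lr=0.5, tol=1e-6, max_iter=1000):
    """examples: list of (feats, truth_index) pairs. Returns (lam, objective)."""
    lam = np.zeros(n_features)
    prev_obj = -np.inf
    obj = prev_obj
    for _ in range(max_iter):
        obj = 0.0
        grad = np.zeros(n_features)
        for feats, truth in examples:
            p = posterior(lam, feats)
            obj += np.log(p[truth])             # log posterior of the truth
            grad += feats[truth] - p @ feats    # observed minus expected features
        if obj - prev_obj < tol:                # optimality test (assumed threshold)
            break
        prev_obj = obj
        lam = lam + lr * grad / len(examples)   # model update (assumed gradient ascent)
    return lam, obj

# Two training examples, each with two candidates; candidate 0 is the truth.
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
lam, obj = train_log_linear([(feats, 0), (feats, 0)], n_features=2)
p_truth = posterior(lam, feats)[0]
```

Here the gradient doubles as the auxiliary statistic: it is the difference between the feature values of the truth and their expectation under the current model.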
- FIG. 6 shows a flowchart of a method for speech recognition according to the various exemplary aspects of the present invention.
- Control proceeds to step 6100, where test data is input to the decoder.
- This test data is received from a user at a remote terminal via a telephone or data network, or at a voice input device.
- This data may also include meta-data such as speaker gender or identity, recording channel, personal profile of the speaker, and the like.
- In step 6200, the model is input. This model is stored in the model storage 327 during the training operation. Then, in step 6300, a prestored hypothesis lattice is input. Control then continues to step 6400.
- In step 6400, a multitude of features are extracted and a search is performed using a log-linear model of these features. These features include those derived from traditional spectral features. It should also be appreciated that some of the features extracted in this step may be determined using log-linear or other models.
- In this step, different unit sequence hypotheses along with their corresponding time alignments are explored, and the probabilities of partial and whole sequences are determined. It should be appreciated that the search in this step is constrained by the previously input lattice. The pruned combined results determine an updated lattice with scores. It should be appreciated that a particular embodiment of this updated lattice may be a single most likely hypothesis.
- In step 6500, it is determined whether another pass is needed. If another pass is needed, then control returns to step 6200. It should be appreciated that the features and models used in subsequent passes may vary.
- The lattice output in step 6400 may be used as the input lattice in step 6300. If no additional pass is needed, control continues to step 6600, where the optimal word sequence is output. That is, the word sequence corresponding to the hypothesis in the lattice having the highest score is output. It should be appreciated that in an alternative embodiment, the lattice is output. Control then continues to step 6700, where the process ends.
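- The multiple-pass flow of steps 6200 through 6600 can be sketched as follows. This is an illustrative sketch only: the lattice is simplified to a dictionary mapping each hypothesis (a tuple of words) to a score, which corresponds to an N-best list rather than a general graph, and the names `rescoring_pass`, `recognize`, `score_fn`, and `beam` are hypothetical.

```python
def rescoring_pass(lattice, score_fn, beam):
    """Rescore every hypothesis in the lattice and keep the `beam` best."""
    rescored = {hyp: score_fn(hyp) for hyp in lattice}
    kept = sorted(rescored, key=rescored.get, reverse=True)[:beam]
    return {hyp: rescored[hyp] for hyp in kept}

def recognize(initial_lattice, passes):
    """passes: list of (score_fn, beam) pairs, crude model first."""
    lattice = dict(initial_lattice)
    for score_fn, beam in passes:
        # The output lattice of one pass constrains the search of the next.
        lattice = rescoring_pass(lattice, score_fn, beam)
    best = max(lattice, key=lattice.get)    # highest-scoring hypothesis
    return best, lattice

initial = {("a", "b"): 0.0, ("a", "c"): 0.0, ("d",): 0.0}
crude = lambda hyp: float(len(hyp))                      # toy first-pass model
refined = lambda hyp: 1.0 if hyp[-1] == "c" else 0.0     # toy second-pass model
best, final_lattice = recognize(initial, [(crude, 2), (refined, 2)])
```

The first pass prunes the one-word hypothesis, and the second pass rescores only the survivors, so `best` is the hypothesis preferred by the refined model within the pruned search space.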
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
- Machine Translation (AREA)
Abstract
In a speech recognition system, the combination of a log-linear model with a multitude of speech features is provided to recognize unknown speech utterances. The speech recognition system models the posterior probability of linguistic units relevant to speech recognition using a log-linear model. The posterior model captures the probability of the linguistic unit given the observed speech features and the parameters of the posterior model. The posterior model may be determined using the probability of the word sequence hypotheses given a multitude of speech features. Log-linear models are used with features derived from sparse or incomplete data. The speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features. Not all features used in training need to appear in testing/recognition.
Description
- The present application is a continuation of parent application Ser. No. 10/724,536 filed on Nov. 28, 2003.
- The present invention relates generally to a speech recognition system, and more particularly, to a speech recognition system that utilizes a multitude of speech features with a log-linear model.
- Speech recognition systems are used to identify word sequences from unknown speech utterance. In an exemplary speech recognition system, speech features such as cepstra and delta cepstra features are extracted from the unknown utterance by a feature extractor to characterize the unknown utterance. A search is then done to compare the extracted features of the unknown utterance to models of speech units (such as phrases, words, syllables, phonemes, sub-phones, etc.) to compute the scores or probabilities of different word sequence hypotheses. Typically the search space is restricted by pruning out unlikely hypotheses. The word sequence associated with the highest score or likelihood, or probability, is recognized as the unknown utterance. In addition to the acoustic model, a language model that determines the relative likelihood of different word sequences is also used in the calculation of the overall score of the word sequence hypotheses.
- Through a training operation, the parameters for the speech recognition models are determined. The speech recognition models may be used to model speech as a sequence of acoustic features, or observations produced by an unobservable “true” state sequence of sub-phones, phonemes, syllables, words, phrases, and the like. Model parameters output from the training operation are often estimated to maximize the likelihood of the training observations. The optimum set of parameters for speech recognition is determined by maximizing the likelihood on the training data. The speech recognition system determines the word sequence with the maximum posterior probability given the observed speech signal to recognize the unknown speech utterance. The best word sequence hypothesis is determined through the search process that considers the scores of all possible hypotheses within the search space.
- In accordance with the exemplary aspects of this invention, a speech recognition system is provided.
- In accordance with the various exemplary aspects of this invention, the combination of a log-linear model with a multitude of speech features is provided to recognize unknown speech utterances.
- In accordance with various exemplary aspects of this invention, the speech recognition system models the posterior probability of a hypothesis, that is, the conditional probability of a sequence of linguistic units given the observed speech signal and possibly other information, using a log-linear model.
- In accordance with these exemplary aspects, the posterior model captures the probability of the sequence of linguistic units given the observed speech features and the parameters of the posterior model.
- In accordance with these exemplary aspects of this invention, the posterior model may be determined using the probability of the word sequence hypotheses given a multitude of speech features. That is, in accordance with these exemplary aspects, the probability of a word sequence with timing information and labels, given a multitude of speech features, is used to determine the posterior model.
- In accordance with the various exemplary aspects of this invention, the speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features.
- In accordance with the various exemplary aspects of this invention, log-linear models are used wherein parameters may be trained with sparse or incomplete training data.
- In accordance with the various exemplary aspects of this invention, not all features used in training need to appear in testing/recognition.
- FIG. 1 shows an exemplary speech processing system embodying the exemplary aspects of the present invention.
- FIG. 2 shows an exemplary speech recognition system embodying the exemplary aspects of the present invention.
- FIG. 3 shows an exemplary speech processor embodying the exemplary aspects of the present invention.
- FIG. 4 shows an exemplary decoder embodying the exemplary aspects of the present invention.
- FIG. 5 shows a flowchart for data training in accordance with the exemplary aspects of the present invention.
- FIG. 6 shows a flowchart for speech recognition in accordance with the exemplary aspects of the present invention.
- The following description details how exemplary aspects of the present invention are employed. Throughout the description of the invention, reference is made to FIGS. 1-6. When referring to the figures, like structures and elements shown throughout are indicated with like reference numerals.
- In FIG. 1, an exemplary speech processing system 1000 embodying the exemplary aspects of the present invention is shown. It is initially noted that the speech processing system 1000 of FIG. 1 is presented for illustration purposes only, and is representative of countless configurations in which the exemplary aspects of the present invention may be implemented. Thus, the present invention should not be considered limited to the system configuration shown in the figure.
- As shown in
FIG. 1, the speech processing system 1000 includes a telephone system 210, a voice transport system 220, a voice input device 230, and a server 300. Terminals 110-120 are connected to telephone system 210 via telephone network 215 and terminals 140-150 are connected to voice transport system 220 via data network 225. As shown in FIG. 1, telephone system 210, voice transport system 220, and voice input device 230 are connected to speech recognition system 300. The speech recognition system 300 is also connected to a speech database 310.
- In operation, speech is sent from a remote user over network voice input device 230. In response to the input speech, terminals 110-150 run a variety of speech recognition and terminal applications.
- The speech recognition system 300 receives the input speech and provides the speech recognition results to the inputting terminal or device.
- The speech recognition system 300 may include or may be connected to a speech database 310 which includes training data, speech models, meta-data, speech data and their true transcription, language and pronunciation models, application specific data, speaker information, various types of models and parameters, and the like. The speech recognition system 300 then provides the optimal word sequence as the recognition output or it may provide a lattice of word sequence hypotheses with corresponding confidence scores. In accordance with the various exemplary aspects of this invention, lattices may have a plurality of embodiments including a summary of a set of hypotheses by a graph which may have complex topology. It should be appreciated that if the graph contains loops, the set of hypotheses may be infinite.
- As discussed above, though the exemplary embodiment above describes speech processing system 1000 in a particular embodiment, the speech processing system 1000 may be any system known in the art for speech processing. Thus, it is contemplated that the speech processing system 1000 may be configured and may include various topologies and protocols known to those skilled in the art.
- For example, it is to be appreciated that though FIG. 1 only shows 2 terminals and one voice input device, the various exemplary aspects of the present invention are not limited to any particular number of terminals and input devices. Thus, it is contemplated that any number of terminals and input devices may be applied in the present invention.
- FIG. 2 shows an exemplary speech recognition system 300 embodying the exemplary aspects of the present invention. As shown in FIG. 2, the speech recognition system 300 includes a speech processor 320, a storage device 340, an input device 360 and an output device 380, all connected by bus 395.
- In operation, the
processor 320 of speech recognition system 300 receives the incoming speech data comprising an unknown utterance and meta-data, such as caller ID, speaker gender, channel conditions, and the like, from a user at a terminal 110-150 or voice input device 230 through the input device 360. The speech processor 320 then performs the speech recognition based on the appropriate models stored in the storage device 340, or received from the database 310 through the input device 360. The speech processor 320 then routes the recognition results to the user at the requesting terminal 110-150 or voice input device 230 or a computer agent (that may perform actions appropriate to what the user said) through output device 380.
- Although FIG. 2 shows a particular form of speech recognition system, it should be understood that other layouts are possible and that the various aspects of the invention are not limited to such layout.
- In the above exemplary embodiment, the speech processor 320 may provide recognition results based on data stored in memory 340 or the database 310. However, it is to be appreciated that the various exemplary aspects of the present invention are not limited to such layout.
- FIG. 3 shows an exemplary speech processor 320 embodying the exemplary aspects of the present invention. As shown in FIG. 3, the speech processor 320 includes a decoder 322 which utilizes the posterior probability of linguistic units relevant to speech recognition using a log-linear model to provide the recognition of the unknown utterance. That is, from the probabilities determined, the decoder 322 determines the optimal word sequence that has the highest probability, and outputs the word sequence as the recognized output. The decoder may prune the lattice of possible hypotheses to restrict the search space and reduce computation time.
- The decoder 322 is further connected to a training storage 325 which stores speech data and their true transcriptions for training, and a model storage 327 that stores model parameters obtained from the training operation.
- FIG. 4 shows the decoder of FIG. 3 in further detail. As shown in FIG. 4, the decoder 322 includes a features extractor 3222, a log-linear function 3224, and a search device 3226.
- In operation, during the training operation, training data is input to the decoder 322 along with the true word transcription from the training storage 325, where the model parameters are generated and output to the model storage 327, to be used during the speech recognition operation. During the speech recognition operation, unknown speech data is input to the decoder 322 along with the model parameters stored in the model storage 327 during the training operation, and the optimal word sequence is output.
- As shown in FIGS. 3-4, during the training operation, training data is input to the feature extractor 3222 along with the meta-data, and the truth from the truth element 325 which can consist of the true transcriptions, which are typically words, but can also be other linguistic units like phrases, syllables, phonemes, acoustic phonetic features, sub-phones, and the like, and possibly but not necessarily time alignments for matching the linguistic units in the true transcription with the corresponding segments of speech. That is, the training operation is performed to determine the maximum likelihood of truth. The feature extractor 3222 extracts a multitude of features from the input data using a multitude of extracting elements. It should be appreciated that the features may be advantageously asynchronous, overlapping, statistically non-independent, and the like, in accordance with the various exemplary aspects of this invention. The extracting elements include, but are not limited to, direct matching element, synchronous phonetic element, acoustic phonetic element, linguistic semantic pragmatic features element, and the like.
- For example, the exemplary direct matching element may compute a dynamic time warping score against various reference speech segments in the database. Synchronous phonetic features can be derived from traditional features like mel cepstra features. Acoustic phonetic features can be asynchronous features that include linguistic distinctive features such as voicing, place of articulation, and the like.
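- As an illustration of the kind of score a direct matching element might compute, the following is a minimal sketch of the textbook dynamic time warping recurrence. It is an assumption for illustration, not the matching procedure of this system: it uses Euclidean frame distance, no path constraints, and no length normalization, and the name `dtw_distance` is hypothetical.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two frame sequences.

    x, y: sequences of frames; each frame is a scalar or a feature vector.
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])   # local frame distance
            # Extend the cheapest of the three admissible predecessor paths.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])   # time-stretched copy
different = dtw_distance([1, 2, 3], [1, 2, 4])
```

A lower distance indicates a closer match, so a feature derived from it would typically use the negated or otherwise transformed value as the score.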
- It should be appreciated that, in accordance with the various exemplary embodiments of this invention, none of these feature extractors need to be perfectly accurate. Features can also include higher level information extracted from a particular word sequence hypothesis, for example, from a semantic or syntactic parse tree, the pragmatic or semantic coherence, and the like. Features can also be meta-data such as speaker information, speaking rate, channel condition, and the like.
- The multitude of extracted features are then provided to a log-linear function 3224, which, using the parameters of the log-linear model, can compute the posterior probability of a hypothesized linguistic unit or sequence, given the extracted features and possibly a particular time alignment of the linguistic units to speech data.
- During the training process, the correct word sequence is known; for example, the correct sequence is created by humans transcribing the speech. However, there may be multiple valid choices of linguistic units, for example, phonemes, that make up the word sequence due to pronunciation variants and the like. All the valid sequences may be compactly represented as a lattice. In addition, the true time alignment of any particular unit sequence to the speech may or may not be known. The trainer (not shown in diagram) uses the extracted features, the correct word sequence, or linguistic unit sequence, with possibly time alignments to the speech, and optimizes the parameters of the log-linear model.
- Thus, during training, the log-linear output may be provided to the search device 3226 which can refine and provide a better linguistic unit sequence choice and a more accurate time alignment of the linguistic unit sequence to the speech. This new alignment may then be looped back to the feature extractor 3222 as FEEDBACK to repeat the process for a second time to optimize the model parameters. It should be appreciated that the initial time alignment may be bootstrapped by human annotation or by hidden Markov model technology. Thus, the model parameters corresponding to the maximum likelihood are determined as the training model parameters, and are sent to the model data element 327, where they are stored for the subsequent speech recognition operations.
- In various exemplary embodiments of the present invention, the log linear models are trained using any one of several algorithms, including improved iterative scaling, iterative scaling, preconditioned conjugate gradient, and the like. The training results in optimizing the parameters of the model in terms of some criterion such as maximum likelihood or maximum entropy subject to some constraints. The training is performed by a trainer (not shown) that uses the features provided by the features extractor, the correct linguistic unit sequence and the corresponding time alignment to the speech.
- In an exemplary embodiment, preprocessing is performed by a state-of-the-art hidden Markov model recognition system (not shown in figures) to extract the features and to align the target unit sequences. For example, the hidden Markov model may be used to align the speech frames to optimal sub-phone state sequences, and determine the top ranked Gaussians. That is, within the hidden Markov model, the Gaussian probability models of traditional features such as mel cepstra features that are the best match to the speech frame are pre-determined. In this exemplary embodiment, sub-phone state sequences and the ranked Gaussian data are features used to train the log linear model.
- It should be understood that this exemplary embodiment is only one specific implementation, and that many other embodiments of training using log linear models may be used in the various aspects of this invention.
- During the speech recognition operation, speech data to be recognized is input to the feature extractor 3222 along with the meta-data, and possibly a lattice that comprises the current search space of the search device 3226. This lattice may be pre-generated by well known technology based on hidden Markov models, or may be generated on a previous round of recognition. The lattice is a compact representation of the current set of scores/probabilities of various possible hypotheses considered within the search space. The feature extractor 3222 then extracts a multitude of features from the input data using a multitude of extracting elements. It should be appreciated that the features may be asynchronous, overlapping, statistically non-independent, and the like, in accordance with the various exemplary aspects of this invention. The extracting elements include, but are not limited to, direct matching element, synchronous phonetic element, acoustic phonetic element, linguistic semantic pragmatic features element, and the like. The multitude of extracted features is then provided to a log-linear function 3224.
- The search device 3226 is provided to determine the optimal word sequence of all possible word sequences. In an exemplary embodiment, the search device 3226 limits the search to the most promising candidates by pruning out unlikely word sequences. The search device 3226 consults the log-linear function 3224 about the likelihood of entire or partial word or other unit sequences. The search space considered by the search device 3226 may be represented as a lattice that is a compact representation of the hypotheses under active consideration, along with the scores/probabilities. Such a lattice may be an input to the search device, constraining the search space, or an output after work has been done by the search device 3226 to update the probabilities in the lattice or prune out unlikely paths. The search device 3226 may also advantageously combine the probabilities/scores from the log-linear function 3224 with probabilities/scores from other models such as language model, hidden Markov model, and the like in a non-log-linear fashion such as linear interpolation after dynamic range compensation. However, language model and hidden Markov model information may also be considered features that are combined in the log-linear function 3224.
- The output of the search device 3226 is an optimal word sequence with the highest posterior probability among all the hypotheses in the search space. The search device 3226 may also output a highly pruned lattice, of which an N-best list may be an example, of highly likely hypotheses that may be utilized by a computer agent to take further action. The search device 3226 may also output a lattice with updated scores and possibly alignments that can be fed back into the feature extractor 3222 and log-linear function 3224 to refine the scores/probabilities. It should be appreciated that, in accordance with the various exemplary embodiments of this invention, this last step may be optional.
- As discussed in the above exemplary embodiments, in the speech recognition system of the exemplary aspects of this invention, there are many possible word sequences in the search space consisting theoretically of any sequence of words in the vocabulary, so that an efficient search operation is performed by the decoder 322 to obtain the optimal word sequence. It should be appreciated that, as shown by the feedback loop in FIG. 4, a single-pass decoding or multiple-pass decoding may be applied, where a lattice, or list of top hypotheses, may be generated in the first pass using a crude model and may be looped back and rescored using the more refined model in a subsequent pass.
- In the multiple-pass decoding, the probability of each of the word sequences in the lattice is evaluated. The probability of each specific word sequence may be related to the probability of the best alignment of its constituent sub-phone state sequence. It should be appreciated that the optimally aligned state sequence may be found in any variety of alignment process in accordance with the various embodiments of this invention, and that this invention is not limited to any particular alignment.
- Word recognition is then performed by selecting the word sequence with the highest probability under the new model.
- It should be appreciated that, in accordance with the various exemplary embodiments of this invention, the probabilities from various models may be combined heuristically with the probability from the log linear model of the various exemplary embodiments of this invention. In particular, a multiple of scores may be combined, including the traditional hidden Markov model likelihood score, and the language model score, through linear interpolation after dynamic range compensation, with the probability score from the log linear model of the various exemplary embodiments of this invention.
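- The heuristic combination described above can be sketched as follows. The text does not define dynamic range compensation, so this sketch assumes one simple form, min-max rescaling of each model's scores to the range 0 to 1 across the hypotheses, before a weighted linear interpolation. The names `compress_range` and `combine_scores`, the weights, and the score values are hypothetical.

```python
import numpy as np

def compress_range(scores):
    """Min-max rescale scores to [0, 1] (an assumed form of range compensation)."""
    s = np.asarray(scores, dtype=float)
    span = s.max() - s.min()
    return np.zeros_like(s) if span == 0 else (s - s.min()) / span

def combine_scores(score_sets, weights):
    """Linearly interpolate several per-hypothesis score arrays."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # interpolation weights sum to 1
    stacked = np.stack([compress_range(s) for s in score_sets])
    return w @ stacked

# Hypothetical scores for three hypotheses from three models.
hmm_loglik = [-1200.0, -1250.0, -1300.0]   # hidden Markov model log-likelihoods
lm_logprob = [-20.0, -10.0, -30.0]         # language model log-probabilities
loglinear = [0.5, 0.4, 0.1]                # log-linear model posteriors
combined = combine_scores([hmm_loglik, lm_logprob, loglinear], [1.0, 1.0, 2.0])
best_index = int(np.argmax(combined))
```

Rescaling first keeps the model with the widest numeric range, here the hidden Markov model log-likelihoods, from dominating the interpolation.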
- In accordance with the various exemplary embodiments of this invention, the search device 3226 consults the log-linear function 3224 repeatedly in determining the scores/probabilities of different sequences. The lattice is consulted by the search device 3226 to determine what hypothesis to consider. Each path in the lattice corresponds to a word sequence and has an associated probability stored in the lattice.
- In the above-described exemplary embodiments of the present invention, the log linear models are determined based on the posterior probability of a hypothesis given a multitude of speech features. The log linear model allows for the potential combination of multiple features in a unified fashion. For example, asynchronous and overlapping features may be incorporated formally.
- As a simple example, the posterior probability may be represented as the probability of a sequence associated with a hypothesis given a sequence of acoustic observations:
- $$P(H_j \mid o_1^T) = P(w_1^k \mid o_1^T) = \prod_{i=1}^{k} P(w_i \mid w_1^{i-1}, o_1^T) \qquad (1)$$
- where:
- $H_j$ is the jth hypothesis, which contains a word (or other linguistic unit) sequence $w_1^k = w_1 w_2 \ldots w_k$,
- $i$ is the index pointing to the ith word (or unit),
- $k$ is the number of words (units) in the hypothesis,
- $T$ is the length of the speech signal (e.g., number of frames),
- $w_1^k$ is the sequence of words associated with the hypothesis $H_j$, and
- $o_1^T$ is the sequence of acoustic observations.
- In the above equation (1), the conditional probabilities may be represented by a maximum entropy log-linear model:
- $$P(w_i \mid w_1^{i-1}, o_1^T) = \frac{\exp\left(\sum_j \lambda_j f_j(w_i, w_1^{i-1}, o_1^T)\right)}{Z(w_1^{i-1}, o_1^T)} \qquad (2)$$
- where:
- $\lambda_j$ are the parameters of the log-linear model,
- $f_j$ are the multitude of features extracted, and
- $Z$ is the normalization factor that ensures that Equation (2) is a true probability (sums to 1). The normalization factor is a function of the conditioned variables.
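- The role of the normalization factor can be illustrated with a minimal sketch that evaluates a log-linear conditional probability over a small candidate set. The candidate words, the two toy features, and the weights are hypothetical, as are the names `log_linear_prob`, `features`, and `lam`; a real system would sum over the units permitted by the search space.

```python
import math

def log_linear_prob(candidates, features, lam, context):
    """Return {w: P(w | context)} under a log-linear model.

    features: list of functions f(w, context) -> float
    lam:      list of weights, one per feature
    """
    scores = {w: sum(l * f(w, context) for l, f in zip(lam, features))
              for w in candidates}
    m = max(scores.values())                # shift by the max for numerical stability
    z = sum(math.exp(s - m) for s in scores.values())   # normalization factor (shifted)
    return {w: math.exp(s - m) / z for w, s in scores.items()}

# Two toy features: an acoustic cue and a word-history cue.
features = [
    lambda w, c: 1.0 if c.get("voiced") and w == "bat" else 0.0,
    lambda w, c: 1.0 if c.get("prev") == "the" and w in ("bat", "pat") else 0.0,
]
lam = [2.0, 1.0]
probs = log_linear_prob(["bat", "pat", "at"], features, lam,
                        {"voiced": True, "prev": "the"})
```

Because the normalization runs over the candidates for a given context, it depends only on the conditioned variables, as stated above. If a cue is missing from the context, the corresponding feature simply returns 0 and the remaining features still yield a valid distribution.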
- As shown in the above exemplary embodiment, in accordance with various exemplary aspects of this invention, the speech recognition system shown in FIGS. 1-4 models the posterior probability of linguistic units relevant to speech recognition using a log-linear model. As shown above, the posterior model captures the probability of the linguistic unit given the observed speech features and the parameters of the posterior model. Thus, the posterior model may be used to determine the probability of the word sequence hypotheses given a multitude of speech features.
- It should be appreciated that the above representation is just an example, and that, according to the various aspects of the present invention, myriad variations may be applied. For example, the sequence $w_1^k$ need not be a word sequence, but can also be a sequence of phrases, syllables, phonemes, sub-phone units, and the like associated with the spoken sentence. Further, it is to be appreciated that the model of the various aspects of the present invention may therefore apply at different levels of linguistic hierarchy, and that the features $f_j$ may include many possibilities, including: synchronous and asynchronous, disjoint and overlapping, correlated and uncorrelated, segmental and suprasegmental, acoustic phonetic, hierarchical linguistic, meta-data, higher level knowledge, and the like.
- By modeling in accordance with the various exemplary aspects of this invention, the speech features that are utilized may include asynchronous, overlapping, and statistically non-independent speech features.
- In the various aspects of the present invention, a feature may be defined as a function f with the following properties:
- $$f_{b,w}^{\alpha}(c_i, w_i) = \begin{cases} \alpha & \text{if } b(c_i) = 1 \text{ and } w_i = w \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$
- where:
- $c_i$ denotes everything the probability is conditioned on, which may include context and observations,
- $b$ is a binary function expressing some property of the conditioned event, and $w$ is the target (or predicted) state/unit such as a word, and
- $\alpha$ is the weight of the function.
- That is, a feature is a computable function that is conditioned upon context and observation, and that may be thought of as firing or becoming active for a specific context/observation and a specific prediction, for example, $w_i$.
- It should be appreciated that the weight of the function α may be equal to 1 or 0, or may be real-valued. For example, in an exemplary embodiment, the weight α may be related to the confidence of whether the property was detected in the speech signal, or the importance of that property.
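- A feature of this kind can be sketched as a small closure. This is an illustration under assumptions, not a definitive implementation: the context is modeled as a dictionary, the voicing detector and its confidence value are invented for the example, and the name `make_feature` is hypothetical.

```python
def make_feature(b, w, alpha=1.0):
    """Build a feature that fires with weight alpha when b holds and w is predicted.

    b:     binary property of the conditioning context c_i
    w:     the target (predicted) unit the feature is tied to
    alpha: weight, e.g. the confidence that the property was detected
    """
    def f(c_i, w_i):
        return alpha if (b(c_i) and w_i == w) else 0.0
    return f

# Fires for the prediction "bat" when the (hypothetical) voicing detector is active.
voiced_bat = make_feature(lambda c: c["voicing"] > 0.5, "bat", alpha=0.8)
fired = voiced_bat({"voicing": 0.9}, "bat")
wrong_word = voiced_bat({"voicing": 0.9}, "pat")
not_voiced = voiced_bat({"voicing": 0.1}, "bat")
```

Setting `alpha` to 1 gives a purely binary feature, while a real-valued `alpha` lets the detector's confidence scale the feature's contribution.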
- In accordance with various exemplary aspects of this invention, the lattice output from the decoder 322 may consist of more than one score. For example, scores may be obtained of the top predetermined number of matches. In addition, other data may be used by the search device 3226, including such information as the hidden Markov model scores obtained from a hidden Markov model decoder and scores for different match levels of Dynamic Time Warping, such as word vs syllable vs allophone.
- An exemplary method of combining the different scores is to use a log-linear model and then train the parameters of the log-linear model.
- For example, the log-linear model for the posterior probability of a path Hi may be given by the exponent of the sum of a linear combination of the different scores:
- $$P(H_i \mid o_1^T) = \frac{1}{Z} \exp\left(\sum_{w \in H_i} \sum_j \alpha_j F_{wj}\right) \qquad (4)$$
- where:
- $F_{wj}$ is the jth score feature for the segment spanned by word $w$. For example, if the top 10 Dynamic Time Warping scores and the hidden Markov score obtained by various well known Dynamic Time Warping and hidden Markov model technologies (not explicitly shown in the figures) are returned, then there will be 11 score features for each word in the lattice.
- $Z$ is the normalization constant given by the sum over all paths (H1 . . . 3) of the exponential term:
- $$Z = \sum_{i'} \exp\left(\sum_{w \in H_{i'}} \sum_j \alpha_j F_{wj}\right)$$
- which is needed to ensure that Equation (4) is a true probability, that is, sums to 1.
- For the lattice generated on training data, the parameters αj may be estimated by maximizing the likelihood of the correct path, that is, maximizing the probability of the hypothesis over all the training data.
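- The path posterior of Equation (4) can be illustrated with a minimal sketch. It assumes the lattice has already been expanded into an explicit list of paths, each given as a matrix with one row of score features per word, which is practical only for small lattices. The name `path_posteriors` and the numeric values are hypothetical.

```python
import numpy as np

def path_posteriors(paths, alpha):
    """Posterior of each lattice path from its per-word score features.

    paths: list of arrays, each of shape (n_words_in_path, n_score_features)
    alpha: weight per score feature
    """
    alpha = np.asarray(alpha, dtype=float)
    # Exponent for a path: sum over its words of the weighted score features.
    exponents = np.array([float((np.asarray(F, dtype=float) @ alpha).sum())
                          for F in paths])
    exponents = exponents - exponents.max()    # numerical stability
    unnorm = np.exp(exponents)
    return unnorm / unnorm.sum()               # divide by the sum over all paths

paths = [
    np.array([[1.0, 0.5], [0.2, 0.1]]),   # two-word path, two score features per word
    np.array([[0.1, 0.1]]),               # one-word path
]
post = path_posteriors(paths, alpha=[1.0, 2.0])
```

Training the weights then amounts to adjusting `alpha` so that the posterior of the correct path is as large as possible over the training data.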
- It should be appreciated that the above embodiment is merely an exemplary embodiment, and that the above equation (4) may be revised by adding syllable and allophone features since a hierarchical segmentation is available. The weight parameters $\alpha_j$ can have dependencies themselves. For example, they could be a function of the length of the word or of the number of training samples for that word, syllable, phone, or the like.
- It should further be appreciated that equation (4) may further be generalized to having an exponent which is a weighted sum of general features, each of which is a function of the path $H_i$ and the acoustic observation sequence $o_1^T$.
- Further, it should be appreciated that other features representing “non-verbal information” (such as whether test and training sequences are from the same gender, same speaker, same noise condition, same phonetic context, etc.) may also be included in this framework, and that the various exemplary aspects of this invention are not limited to the above described embodiments.
- In other exemplary embodiments, the individual word scores Fwj may themselves be taken to be posterior word probabilities from a log-linear model. The log-linear models may be calculated quite tractably even using lots of features. Examples of features are Dynamic Time Warping, hidden Markov model, and the like.
- In accordance with the exemplary aspects of the present invention, log-linear models are used to make the best use of any given set of detected features, without the use of assumptions about features that are not present. That is, in contrast to other models such as hidden Markov models, which require using the same set of features in training and testing operations, the log-linear models make no assumptions about unobserved features, so that if some feature is not observable due to noise masking, for example, the log-linear model will make the best use of the other available features.
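- By way of illustration only, the handling of an unobserved feature may be sketched as follows. The sketch assumes that a feature which was not detected is simply left out of the weighted sum, which is one possible reading of the above paragraph rather than a confirmed detail of the described system. The feature names, weights, and candidate units are hypothetical.

```python
import math

def log_linear_posteriors(candidates, alpha):
    """Posterior over candidate units using only the features actually observed.

    candidates: dict mapping unit label -> dict of feature name -> value;
                an unobserved feature is simply absent from the inner dict.
    alpha:      dict mapping feature name -> weight.
    """
    log_scores = {
        unit: sum(alpha[name] * value for name, value in feats.items())
        for unit, feats in candidates.items()
    }
    top = max(log_scores.values())
    exps = {unit: math.exp(s - top) for unit, s in log_scores.items()}
    z = sum(exps.values())
    return {unit: e / z for unit, e in exps.items()}

alpha = {"hmm": 1.0, "dtw": 0.5, "voicing": 2.0}  # hypothetical weights

# All three features observed.
clean = log_linear_posteriors(
    {"yes": {"hmm": 2.0, "dtw": 1.0, "voicing": 1.0},
     "no":  {"hmm": 1.0, "dtw": 2.0, "voicing": 0.0}},
    alpha,
)

# The voicing feature is masked by noise and therefore omitted.
noisy = log_linear_posteriors(
    {"yes": {"hmm": 2.0, "dtw": 1.0},
     "no":  {"hmm": 1.0, "dtw": 2.0}},
    alpha,
)
```

In both cases the result is a proper distribution over the candidates; with the masked feature removed, the model is less confident but still ranks the candidates using the remaining features.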
- In accordance with the exemplary aspects of this invention, the speech recognition system may make full use of the known models by training the known models with the log linear model, to obtain the first lattice, alignment, or decoding using the known models to combine with the log linear model of this invention.
- In accordance with various exemplary embodiments of this invention, a log-linear model is provided that utilizes, among many possible features, the identities of the Gaussians that are the best match to traditional short-time spectral features in a traditional Gaussian mixture model (comprising weighted combinations of Gaussian distributions of spectral features such as mel cepstra features, widely used in hidden Markov models), and the matching of speech segments to a large corpus of training data.
- In accordance with the various exemplary aspects of this invention, advantages may be obtained, such as not requiring all features used in training to appear in testing/recognition operations. That is, with models other than log-linear models, if features used for training do not appear in testing, a “mismatched condition” results and performance is poor. Accordingly, the use of models other than a log-linear model often results in failure if some features used in training are obscured by noise and are not present in the test data.
-
FIG. 5 shows a flowchart of a method for data training according to the various exemplary aspects of the present invention. Beginning at step 5000, control proceeds to step 5100, where training data and meta-data are input to the decoder. This data contains the speech data, typically collected and stored beforehand in the training storage, including the stored truth. It should be appreciated that meta-data may include such information as speaker gender or identity, recording channel, personal profile of the speaker, and the like. The truth may generally consist of the true word sequence transcription created by human transcribers. Next, in step 5200, a model is input to the decoder. This model is a general model stored beforehand in the model storage. Then, in step 5300, a prestored lattice is input. Control then proceeds to step 5400.
- In step 5400, a multitude of features are extracted and a search is performed. These features include those derived from traditional spectral features such as mel cepstra and time derivatives; acoustic-phonetic or articulatory distinctive features such as voicing, place of articulation, and the like; scores from dynamic time warping matches to speech segments; higher-level information extracted from a particular word sequence hypothesis, for example, from a semantic or syntactic parse tree, the pragmatic or semantic coherence, and the like; speaking rate and channel condition; and the like. It should also be appreciated that some of the features extracted in this step may include log-linear or other models which will be updated in this process.
- In this step, a lattice with scores, objective functions, and auxiliary statistics are determined using a log-linear function according to the various exemplary embodiments of this invention. It should be appreciated that a plurality of objective functions are calculated in this step because a plurality of models are being trained in this process, that is, the log-linear model giving the overall score as well as any other models used for feature extraction. The top-level objective function is the total posterior likelihood, which is to be maximized. It should be appreciated that there may be a plurality of types of objective functions for feature extractors. In various exemplary embodiments, these types of objective functions include posterior likelihood, direct likelihood, distance, and the like.
- In this step, different unit sequence hypotheses consistent with the true word sequence transcription, along with their corresponding time alignments are explored and the probabilities of partial and whole sequences are determined. The pruned combined results determine an updated lattice with scores.
- It should be appreciated that, in accordance with the various exemplary aspects of this invention, the auxiliary statistics calculated in this step may include gradient functions, and other statistics required for optimization using an auxiliary function technique.
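- By way of illustration only, the posterior-likelihood objective and its gradient statistics for the path-level log-linear model may be sketched as follows. The sketch assumes a small, fully enumerated set of paths and uses plain gradient ascent as a simple stand-in for the update methods of the described system; the function names, learning rate, tolerance, and data are hypothetical.

```python
import math

def feature_totals(path):
    """Sum each score feature over the words of one path."""
    return [sum(column) for column in zip(*path)]

def objective_and_gradient(paths, correct, alpha):
    """Log posterior of the correct path and its gradient with respect to alpha."""
    totals = [feature_totals(p) for p in paths]
    scores = [sum(a * t for a, t in zip(alpha, tot)) for tot in totals]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    posteriors = [math.exp(s - log_z) for s in scores]
    objective = scores[correct] - log_z
    # Gradient: feature totals of the correct path minus their model expectation.
    gradient = [
        totals[correct][j] - sum(p * tot[j] for p, tot in zip(posteriors, totals))
        for j in range(len(alpha))
    ]
    return objective, gradient

def train(paths, correct, alpha, rate=0.5, tol=1e-6, max_iters=1000):
    """Gradient ascent that stops when the objective gain falls below tol."""
    objective, gradient = objective_and_gradient(paths, correct, alpha)
    for _ in range(max_iters):
        candidate = [a + rate * g for a, g in zip(alpha, gradient)]
        new_objective, new_gradient = objective_and_gradient(paths, correct, candidate)
        if new_objective - objective < tol:
            break
        alpha, objective, gradient = candidate, new_objective, new_gradient
    return alpha, objective

# Hypothetical training lattice; path 0 is taken to be the correct transcription.
paths = [
    [(0.9, 0.8), (0.7, 0.6)],
    [(0.4, 0.5), (0.3, 0.2)],
    [(0.1, 0.2)],
]
obj0, grad0 = objective_and_gradient(paths, 0, [0.0, 0.0])
alpha_trained, obj_trained = train(paths, 0, [0.0, 0.0])
```

With all weights at zero the three paths are equally likely, so the initial objective is log(1/3); each accepted update raises the log posterior of the correct path.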
- Next, in step 5500, it is determined whether the objective functions are close enough to optimal. It should be appreciated that there are a plurality of tests for optimality, including thresholds on the increase of the objective functions or on the gradients. If optimality has not been reached, control continues to step 5600, where the models are updated, and then control returns to step 5200. In step 5600, the models are updated using the auxiliary statistics. It is to be appreciated that there are a plurality of methods for updating the models, including but not limited to quasi-Newton gradient search, generalized iterative scaling, extended Baum-Welch, and expectation maximization.
- It should also be appreciated that efficient implementations may update only a subset of parameters in an iteration, and thus, in step 5400, only a restricted calculation need be performed. This restriction may include updating only a single feature extractor.
- If optimality has been reached, control continues to step 5700, where the model parameters are output. Then, in step 5800, the process ends.
-
FIG. 6 shows a flowchart of a method for speech recognition according to the various exemplary aspects of the present invention. Beginning at step 6000, control proceeds to step 6100, where test data is input to the decoder. In accordance with the various exemplary embodiments of this invention, this test data is received from a user at a remote terminal via a telephone or data network, or at a voice input device. This data may also include meta-data such as speaker gender or identity, recording channel, personal profile of the speaker, and the like. Next, in step 6200, the model is input. This model is stored in the model storage 327 during the training operation. Then, in step 6300, a prestored hypothesis lattice is input. Control then continues to step 6400.
- In step 6400, a multitude of features are extracted and a search is performed using a log-linear model of these features. These features include those derived from traditional spectral features. It should also be appreciated that some of the features extracted in this step may be determined using log-linear or other models.
- In this step, different unit sequence hypotheses, along with their corresponding time alignments, are explored and the probabilities of partial and whole sequences are determined. It should be appreciated that the search in this step is constrained by the previously input lattice. The pruned combined results determine an updated lattice with scores. It should be appreciated that a particular embodiment of this updated lattice may be a single most likely hypothesis.
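- By way of illustration only, reducing a scored lattice to a single most likely hypothesis may be sketched as follows. The sketch assumes the lattice is an acyclic graph whose arcs carry a word and a log score, with nodes numbered so that every arc runs from a lower to a higher node number; the function name, arc layout, words, and scores are hypothetical.

```python
def best_path(arcs, start, end):
    """Return (score, words) of the highest-scoring path from start to end.

    arcs: list of (from_node, to_node, word, log_score) tuples; every arc
          goes from a lower-numbered node to a higher-numbered node.
    """
    best = {start: (0.0, [])}
    # Visiting arcs by increasing source node guarantees that the best score
    # of a node is final before any arc leaving that node is considered.
    for src, dst, word, score in sorted(arcs, key=lambda arc: arc[0]):
        if src not in best:
            continue  # source node is not reachable from the start node
        candidate = (best[src][0] + score, best[src][1] + [word])
        if dst not in best or candidate[0] > best[dst][0]:
            best[dst] = candidate
    return best[end]

# Hypothetical two-segment lattice with competing word hypotheses.
arcs = [
    (0, 1, "recognize", -1.0),
    (0, 1, "wreck a nice", -2.5),
    (1, 2, "speech", -0.5),
    (1, 2, "beach", -1.5),
]
score, words = best_path(arcs, 0, 2)
```

Here the selected hypothesis is the word sequence "recognize speech" with a total log score of -1.5.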
- Next, in step 6500, it is determined whether another pass is needed. If another pass is needed, then control returns to step 6200. It should be appreciated that the features and models used in subsequent passes may vary. The lattice output in step 6400 may be used as the input lattice in step 6300. Otherwise, no additional pass is needed, and control continues to step 6600, where the optimal word sequence is output. That is, the word sequence corresponding to the hypothesis in the lattice having the highest score is output. It should be appreciated that in an alternative embodiment, the lattice is output. Control then continues to step 6700, where the process ends.
- The foregoing description of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and other modifications and variations may be possible in light of the above teachings. Thus, the embodiments disclosed were chosen and described in order to best explain the principles of the invention and its practical application to enable others skilled in the art to best utilize the invention in various embodiments and modifications as are suited to the particular use. It is intended that the appended claims be construed to include other alternative embodiments of the invention except insofar as limited by the prior art.
Claims (15)
1. A speech recognition system, comprising:
a features extractor that extracts a multitude of speech features directly from input speech;
a log-linear function that receives the multitude of speech features obtained from the input speech and determines a posterior probability of each of a plurality of hypothesized linguistic units given the extracted multitude of speech features, and
a search device that analyzes the posterior probabilities determined by the log-linear function to determine a recognized output of unknown utterances.
2. The speech recognition system of claim 1 , wherein the log linear function models the posterior probability using a log linear model.
3. The speech recognition system of claim 1 , wherein the speech features comprise at least one of asynchronous, overlapping, and statistically non-independent speech features.
4. The speech recognition system of claim 1 , wherein at least one of the speech features extracted is derived from incomplete data.
5. The speech recognition system of claim 1 , further comprising a loopback.
6. The speech recognition system of claim 1 , wherein the features are extracted using direct matching between test data and training data.
7. The speech recognition system of claim 1 , wherein the features are extracted using Gaussian model identities at each time frame.
8. A speech recognition method, comprising:
extracting a multitude of speech features directly from input speech;
using a log linear function for determining a posterior probability of each of a plurality of hypothesized linguistic units given the extracted multitude of speech features, and
determining a recognized output of unknown utterances using the posterior probabilities.
9. The speech recognition method of claim 8 , wherein the log linear function models the posterior probability using a log linear model.
10. The speech recognition method of claim 8 , wherein the speech features comprise at least one of asynchronous, overlapping, and statistically non-independent speech features.
11. The speech recognition method of claim 8 , wherein at least one of the speech features extracted is derived from incomplete data.
12. The speech recognition method of claim 8 , further comprising a step of loopback.
13. The speech recognition method of claim 8 , wherein the features are extracted using direct matching between test data and training data.
14. The speech recognition method of claim 8 , wherein the extracting of a multitude of speech features comprises using Gaussian model identities at each time frame to identify and extract features.
15. A program storage device storing a program of instructions executable by a machine for performing a method of speech recognition, the method comprising:
extracting a multitude of speech features directly from input speech;
using a log linear function for determining a posterior probability of each of a plurality of hypothesized linguistic units given the extracted multitude of speech features, and
determining a recognized output of unknown utterances using the posterior probabilities.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/195,123 US20080312921A1 (en) | 2003-11-28 | 2008-08-20 | Speech recognition utilizing multitude of speech features |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/724,536 US7464031B2 (en) | 2003-11-28 | 2003-11-28 | Speech recognition utilizing multitude of speech features |
US12/195,123 US20080312921A1 (en) | 2003-11-28 | 2008-08-20 | Speech recognition utilizing multitude of speech features |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/724,536 Continuation US7464031B2 (en) | 2003-11-28 | 2003-11-28 | Speech recognition utilizing multitude of speech features |
Publications (1)
Publication Number | Publication Date |
---|---|
US20080312921A1 (en) | 2008-12-18 |
Family
ID=34620090
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/724,536 Expired - Fee Related US7464031B2 (en) | 2003-11-28 | 2003-11-28 | Speech recognition utilizing multitude of speech features |
US12/195,123 Abandoned US20080312921A1 (en) | 2003-11-28 | 2008-08-20 | Speech recognition utilizing multitude of speech features |
Country Status (3)
Country | Link |
---|---|
US (2) | US7464031B2 (en) |
JP (1) | JP4195428B2 (en) |
CN (1) | CN1296886C (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090099841A1 (en) * | 2007-10-04 | 2009-04-16 | Kubushiki Kaisha Toshiba | Automatic speech recognition method and apparatus |
US20100030560A1 (en) * | 2006-03-23 | 2010-02-04 | Nec Corporation | Speech recognition system, speech recognition method, and speech recognition program |
US20120078621A1 (en) * | 2010-09-24 | 2012-03-29 | International Business Machines Corporation | Sparse representation features for speech recognition |
CN104462071A (en) * | 2013-09-19 | 2015-03-25 | 株式会社东芝 | SPEECH TRANSLATION APPARATUS and SPEECH TRANSLATION METHOD |
CN108415898A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | The word figure of deep learning language model beats again a point method and system |
WO2022102937A1 (en) * | 2020-11-12 | 2022-05-19 | Samsung Electronics Co., Ltd. | Methods and systems for predicting non-default actions against unstructured utterances |
US11373671B2 (en) | 2018-09-12 | 2022-06-28 | Shenzhen Shokz Co., Ltd. | Signal processing device having multiple acoustic-electric transducers |
Families Citing this family (42)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7899671B2 (en) * | 2004-02-05 | 2011-03-01 | Avaya, Inc. | Recognition results postprocessor for use in voice recognition systems |
US7392187B2 (en) * | 2004-09-20 | 2008-06-24 | Educational Testing Service | Method and system for the automatic generation of speech features for scoring high entropy speech |
US7840404B2 (en) * | 2004-09-20 | 2010-11-23 | Educational Testing Service | Method and system for using automatic generation of speech features to provide diagnostic feedback |
US7809568B2 (en) * | 2005-11-08 | 2010-10-05 | Microsoft Corporation | Indexing and searching speech with text meta-data |
US7831428B2 (en) * | 2005-11-09 | 2010-11-09 | Microsoft Corporation | Speech index pruning |
US7831425B2 (en) * | 2005-12-15 | 2010-11-09 | Microsoft Corporation | Time-anchored posterior indexing of speech |
US8214213B1 (en) * | 2006-04-27 | 2012-07-03 | At&T Intellectual Property Ii, L.P. | Speech recognition based on pronunciation modeling |
US8214208B2 (en) * | 2006-09-28 | 2012-07-03 | Reqall, Inc. | Method and system for sharing portable voice profiles |
US7788094B2 (en) * | 2007-01-29 | 2010-08-31 | Robert Bosch Gmbh | Apparatus, method and system for maximum entropy modeling for uncertain observations |
US7813929B2 (en) * | 2007-03-30 | 2010-10-12 | Nuance Communications, Inc. | Automatic editing using probabilistic word substitution models |
US20090099847A1 (en) * | 2007-10-10 | 2009-04-16 | Microsoft Corporation | Template constrained posterior probability |
US7933847B2 (en) * | 2007-10-17 | 2011-04-26 | Microsoft Corporation | Limited-memory quasi-newton optimization algorithm for L1-regularized objectives |
US8296141B2 (en) * | 2008-11-19 | 2012-10-23 | At&T Intellectual Property I, L.P. | System and method for discriminative pronunciation modeling for voice search |
US9484019B2 (en) | 2008-11-19 | 2016-11-01 | At&T Intellectual Property I, L.P. | System and method for discriminative pronunciation modeling for voice search |
US8401852B2 (en) * | 2009-11-30 | 2013-03-19 | Microsoft Corporation | Utilizing features generated from phonic units in speech recognition |
WO2012023450A1 (en) * | 2010-08-19 | 2012-02-23 | 日本電気株式会社 | Text processing system, text processing method, and text processing program |
US8630860B1 (en) * | 2011-03-03 | 2014-01-14 | Nuance Communications, Inc. | Speaker and call characteristic sensitive open voice search |
US8727991B2 (en) | 2011-08-29 | 2014-05-20 | Salutron, Inc. | Probabilistic segmental model for doppler ultrasound heart rate monitoring |
US8909512B2 (en) * | 2011-11-01 | 2014-12-09 | Google Inc. | Enhanced stability prediction for incrementally generated speech recognition hypotheses based on an age of a hypothesis |
CN102376305B (en) * | 2011-11-29 | 2013-06-19 | 安徽科大讯飞信息科技股份有限公司 | Speech recognition method and system |
US9324323B1 (en) | 2012-01-13 | 2016-04-26 | Google Inc. | Speech recognition using topic-specific language models |
US8775177B1 (en) * | 2012-03-08 | 2014-07-08 | Google Inc. | Speech recognition process |
CN102810135B (en) * | 2012-09-17 | 2015-12-16 | 顾泰来 | A kind of Medicine prescription auxiliary process system |
US9697827B1 (en) * | 2012-12-11 | 2017-07-04 | Amazon Technologies, Inc. | Error reduction in speech processing |
US9653070B2 (en) | 2012-12-31 | 2017-05-16 | Intel Corporation | Flexible architecture for acoustic signal processing engine |
CN105378830A (en) * | 2013-05-31 | 2016-03-02 | 朗桑有限公司 | Processing of audio data |
CN103337241B (en) * | 2013-06-09 | 2015-06-24 | 北京云知声信息技术有限公司 | Voice recognition method and device |
US9529901B2 (en) * | 2013-11-18 | 2016-12-27 | Oracle International Corporation | Hierarchical linguistic tags for documents |
US9842592B2 (en) * | 2014-02-12 | 2017-12-12 | Google Inc. | Language models using non-linguistic context |
KR20170034227A (en) * | 2015-09-18 | 2017-03-28 | 삼성전자주식회사 | Apparatus and method for speech recognition, apparatus and method for learning transformation parameter |
CN106683677B (en) | 2015-11-06 | 2021-11-12 | 阿里巴巴集团控股有限公司 | Voice recognition method and device |
US10832664B2 (en) | 2016-08-19 | 2020-11-10 | Google Llc | Automated speech recognition using language models that selectively use domain-specific model components |
JP6585022B2 (en) | 2016-11-11 | 2019-10-02 | 株式会社東芝 | Speech recognition apparatus, speech recognition method and program |
US10347245B2 (en) * | 2016-12-23 | 2019-07-09 | Soundhound, Inc. | Natural language grammar enablement by speech characterization |
US20180330718A1 (en) * | 2017-05-11 | 2018-11-15 | Mitsubishi Electric Research Laboratories, Inc. | System and Method for End-to-End speech recognition |
US10607601B2 (en) * | 2017-05-11 | 2020-03-31 | International Business Machines Corporation | Speech recognition by selecting and refining hot words |
US10672388B2 (en) * | 2017-12-15 | 2020-06-02 | Mitsubishi Electric Research Laboratories, Inc. | Method and apparatus for open-vocabulary end-to-end speech recognition |
JP7120064B2 (en) * | 2019-02-08 | 2022-08-17 | 日本電信電話株式会社 | Language model score calculation device, language model creation device, methods thereof, program, and recording medium |
CN110853669B (en) * | 2019-11-08 | 2023-05-16 | 腾讯科技(深圳)有限公司 | Audio identification method, device and equipment |
US11250872B2 (en) * | 2019-12-14 | 2022-02-15 | International Business Machines Corporation | Using closed captions as parallel training data for customization of closed captioning systems |
US11074926B1 (en) | 2020-01-07 | 2021-07-27 | International Business Machines Corporation | Trending and context fatigue compensation in a voice signal |
CN113657461A (en) * | 2021-07-28 | 2021-11-16 | 北京宝兰德软件股份有限公司 | Log anomaly detection method, system, device and medium based on text classification |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6304841B1 (en) * | 1993-10-28 | 2001-10-16 | International Business Machines Corporation | Automatic construction of conditional exponential models from elementary features |
US6456969B1 (en) * | 1997-12-12 | 2002-09-24 | U.S. Philips Corporation | Method of determining model-specific factors for pattern recognition, in particular for speech patterns |
US20030023438A1 (en) * | 2001-04-20 | 2003-01-30 | Hauke Schramm | Method and system for the training of parameters of a pattern recognition system, each parameter being associated with exactly one realization variant of a pattern from an inventory |
US6687690B2 (en) * | 2001-06-14 | 2004-02-03 | International Business Machines Corporation | Employing a combined function for exception exploration in multidimensional data |
US7010486B2 (en) * | 2001-02-13 | 2006-03-07 | Koninklijke Philips Electronics, N.V. | Speech recognition system, training arrangement and method of calculating iteration values for free parameters of a maximum-entropy speech model |
US7054810B2 (en) * | 2000-10-06 | 2006-05-30 | International Business Machines Corporation | Feature vector-based apparatus and method for robust pattern recognition |
US7324927B2 (en) * | 2003-07-03 | 2008-01-29 | Robert Bosch Gmbh | Fast feature selection method and system for maximum entropy modeling |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0756595A (en) | 1993-08-19 | 1995-03-03 | Hitachi Ltd | Voice recognition device |
US5790754A (en) * | 1994-10-21 | 1998-08-04 | Sensory Circuits, Inc. | Speech recognition apparatus for consumer electronic applications |
CN1141696C (en) * | 2000-03-31 | 2004-03-10 | 清华大学 | Non-particular human speech recognition and prompt method based on special speech recognition chip |
JP2002251592A (en) * | 2001-02-22 | 2002-09-06 | Toshiba Corp | Learning method for pattern recognition dictionary |
JP3919475B2 (en) | 2001-07-10 | 2007-05-23 | シャープ株式会社 | Speaker feature extraction apparatus, speaker feature extraction method, speech recognition apparatus, and program recording medium |
Filing Date | Country | Application Number | Publication | Status |
---|---|---|---|---|
2003-11-28 | US | US10/724,536 | US7464031B2 (en) | Expired - Fee Related |
2004-07-28 | CN | CNB2004100586870A | CN1296886C (en) | Expired - Fee Related |
2004-09-17 | JP | JP2004270823A | JP4195428B2 (en) | Expired - Fee Related |
2008-08-20 | US | US12/195,123 | US20080312921A1 (en) | Abandoned |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100030560A1 (en) * | 2006-03-23 | 2010-02-04 | Nec Corporation | Speech recognition system, speech recognition method, and speech recognition program |
US8781837B2 (en) * | 2006-03-23 | 2014-07-15 | Nec Corporation | Speech recognition system and method for plural applications |
US20090099841A1 (en) * | 2007-10-04 | 2009-04-16 | Kubushiki Kaisha Toshiba | Automatic speech recognition method and apparatus |
US8311825B2 (en) * | 2007-10-04 | 2012-11-13 | Kabushiki Kaisha Toshiba | Automatic speech recognition method and apparatus |
US20120078621A1 (en) * | 2010-09-24 | 2012-03-29 | International Business Machines Corporation | Sparse representation features for speech recognition |
US8484023B2 (en) * | 2010-09-24 | 2013-07-09 | Nuance Communications, Inc. | Sparse representation features for speech recognition |
CN104462071A (en) * | 2013-09-19 | 2015-03-25 | 株式会社东芝 | SPEECH TRANSLATION APPARATUS and SPEECH TRANSLATION METHOD |
CN108415898A (en) * | 2018-01-19 | 2018-08-17 | 苏州思必驰信息科技有限公司 | The word figure of deep learning language model beats again a point method and system |
US11373671B2 (en) | 2018-09-12 | 2022-06-28 | Shenzhen Shokz Co., Ltd. | Signal processing device having multiple acoustic-electric transducers |
US11875815B2 (en) | 2018-09-12 | 2024-01-16 | Shenzhen Shokz Co., Ltd. | Signal processing device having multiple acoustic-electric transducers |
WO2022102937A1 (en) * | 2020-11-12 | 2022-05-19 | Samsung Electronics Co., Ltd. | Methods and systems for predicting non-default actions against unstructured utterances |
US11705111B2 (en) | 2020-11-12 | 2023-07-18 | Samsung Electronics Co., Ltd. | Methods and systems for predicting non-default actions against unstructured utterances |
Also Published As
Publication number | Publication date |
---|---|
JP2005165272A (en) | 2005-06-23 |
CN1296886C (en) | 2007-01-24 |
US7464031B2 (en) | 2008-12-09 |
CN1622196A (en) | 2005-06-01 |
US20050119885A1 (en) | 2005-06-02 |
JP4195428B2 (en) | 2008-12-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |