CN103000174A - Feature compensation method based on rapid noise estimation in speech recognition system

Info

Publication number: CN103000174A
Application number: CN2012104869360A
Authority: CN
Inventors: 吕勇
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2012-11-26
Filing date: 2012-11-26
Publication date: 2013-03-27
Anticipated expiration: 2032-11-26
Also published as: CN103000174B

Abstract

The invention discloses a feature compensation method based on fast noise estimation in a speech recognition system. In this method, noise parameter estimation is separated from clean speech estimation within feature compensation, and the two are carried out with different Gaussian mixture models (GMMs). A GMM with fewer Gaussian units is used to extract the noise parameters from the noisy test speech; another GMM with more Gaussian units is combined with the estimated single-Gaussian noise model to obtain a noisy GMM matched to the current test environment; finally, the noisy GMM is used to calculate the posterior probabilities of the noisy test speech, and the clean speech feature vector is estimated from the noisy test speech by the minimum mean square error (MMSE) method. The method reduces the amount of computation while preserving the accuracy of the clean speech estimate.

Description

Feature compensation method based on rapid noise estimation in speech recognition system

Technical Field

The invention relates to a feature compensation method based on fast noise estimation in a speech recognition system, and in particular to a feature compensation method that rapidly estimates the noise parameters with a Gaussian mixture model containing fewer Gaussian units and estimates the clean speech feature vectors from the noisy test speech with a Gaussian mixture model containing more Gaussian units. It belongs to the technical field of speech recognition.

Background

At present, speech recognition systems achieve good performance in ideal laboratory environments. In practical environments, however, background noise and channel distortion are often unavoidable. They can cause the feature vectors extracted in the application environment to be severely mismatched with the pre-trained acoustic model, so that the performance of the recognizer degrades sharply or the recognizer fails completely. Research on environment compensation for speech recognition, which reduces the influence of environmental mismatch and improves recognition performance in real environments, is therefore of great significance.

In general, environment compensation techniques can be divided into front-end feature compensation and back-end model compensation. Feature compensation compensates the speech features of the test environment so that they match the acoustic model of the training environment. Model compensation adjusts the acoustic model of the training environment so that it matches the test environment, and the test speech is then recognized directly. Compared with back-end model compensation, front-end feature compensation requires less computation, is flexible to implement, and is independent of the back-end recognizer, so it has a wider range of application.

In practical applications, it is difficult to ensure that every test utterance contains enough silence frames for estimating the noise parameters. To track changes in the environment in time, the noise parameters often have to be extracted from the noisy test speech itself. However, the transformation between the training environment and the test environment is nonlinear, and the noise parameters have no closed-form solution. The vector Taylor series (VTS) is an effective noise-robust technique that approximates well the nonlinear environment transformation caused by noise. However, VTS-based noise parameter estimation involves many matrix operations, and its amount of computation is proportional to the number of Gaussian units in the speech model. In feature compensation, the speech model used for noise estimation is also used to estimate the clean speech feature vectors, and to describe the distribution of speech fully and guarantee the accuracy of the clean speech estimate, that model must contain enough Gaussian units. As a result, VTS-based feature compensation is computationally expensive and is difficult to run in real time on a standalone terminal such as an embedded system.

Disclosure of Invention

Purpose of the invention: in view of the problems and shortcomings of the prior art, the invention provides a feature compensation method based on fast noise estimation in a speech recognition system.

Technical scheme: a feature compensation method based on fast noise estimation in a speech recognition system, whose main characteristic is that noise parameter estimation and clean speech estimation within feature compensation are separated and are carried out with different Gaussian mixture models (GMMs). A GMM with fewer Gaussian units is used to extract the noise parameters from the noisy test speech; another GMM with more Gaussian units is combined with the estimated single-Gaussian noise model to obtain a noisy GMM matched to the current test environment; finally, the noisy GMM is used to calculate the posterior probabilities of the noisy test speech, and the clean speech feature vectors are estimated from the noisy test speech by the minimum mean square error (MMSE) method.

The feature compensation method based on fast noise estimation in a speech recognition system comprises a training stage and a testing stage.

The training stage comprises the following steps:

(1) extracting clean speech feature vectors from the clean training speech, using Mel-frequency cepstral coefficients (MFCCs) as the feature parameters of the speech;

(2) performing GMM training with the MFCCs of all training speech to generate two GMMs: the first GMM contains fewer Gaussian units and is used for noise estimation; the second GMM contains more Gaussian units and is used for model combination and clean speech estimation;

(3) performing acoustic model training with the training speech of each basic speech unit to generate a hidden Markov model (HMM) for each basic speech unit.

The testing stage comprises the following steps:

(4) extracting the noisy speech MFCCs from the noisy test speech;

(5) extracting the noise parameters, including the Gaussian mean vector and covariance matrix of the noise, from the noisy speech MFCCs using the first GMM;

(6) transforming the means and variances of the second GMM with the estimated noise parameters, calculating the posterior probabilities of the noisy test speech, and estimating the clean speech MFCCs by the MMSE method;

(7) performing acoustic decoding on the clean speech MFCCs with the HMMs of the speech units to obtain the recognition result.

Advantageous effects: compared with the prior art, the feature compensation method based on fast noise estimation in a speech recognition system provided by the invention separates noise parameter estimation from clean speech estimation within feature compensation and implements them with different speech models, so that the amount of computation is reduced while the accuracy of the clean speech estimate is preserved.

Drawings

FIG. 1 is a feature compensation framework based on fast noise estimation according to an embodiment of the present invention;

FIG. 2 is a block diagram of a speech recognition system based on fast noise estimation according to an embodiment of the present invention.

Detailed Description

The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.

As shown in FIG. 1, a Gaussian mixture model with fewer Gaussian units, GMM1, is used to extract the noise parameters from the noisy test speech; another Gaussian mixture model with more Gaussian units, GMM2, is combined with the estimated single-Gaussian noise model to obtain a noisy GMM matched to the current test environment; finally, the noisy GMM is used to calculate the posterior probabilities of the noisy test speech, and the clean speech feature vector is estimated from the noisy test speech by the minimum mean square error method.

As shown in FIG. 2, the feature compensation method based on fast noise estimation comprises a training stage and a testing stage. The training stage performs GMM training and HMM training; the testing stage performs noise parameter estimation and clean speech estimation.

1. GMM training:

GMMs are used to model the distribution of speech, and two GMMs are generated from all of the training speech: GMM1 and GMM2. GMM1 contains fewer Gaussian units and is used for noise estimation; GMM2 contains more Gaussian units and is used for clean speech estimation. The covariance matrices of both GMM1 and GMM2 are diagonal.
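As an illustration of training two diagonal-covariance GMMs of different sizes on the same features, the following is a minimal EM sketch in NumPy. The function name `train_diag_gmm`, the iteration count, the random initialization and the variance floor are hypothetical choices made for this example; the text specifies neither the training algorithm nor the number of Gaussian units in GMM1 and GMM2.

```python
import numpy as np

def train_diag_gmm(X, n_units, n_iter=20, seed=0):
    """Fit a diagonal-covariance GMM to X (frames x dims) with plain EM.

    Returns (weights, means, variances) with shapes (M,), (M, D), (M, D).
    """
    rng = np.random.default_rng(seed)
    T, D = X.shape
    # Hypothetical initialization: random frames as means, global variance.
    mu = X[rng.choice(T, n_units, replace=False)].copy()
    var = np.tile(X.var(axis=0), (n_units, 1))
    w = np.full(n_units, 1.0 / n_units)
    for _ in range(n_iter):
        # E-step: log of weight times diagonal Gaussian density, shape (T, M).
        log_p = (np.log(w)
                 - 0.5 * np.sum(np.log(2.0 * np.pi * var), axis=1)
                 - 0.5 * np.sum((X[:, None, :] - mu) ** 2 / var, axis=2))
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and diagonal variances.
        occ = gamma.sum(axis=0) + 1e-10
        w = occ / T
        mu = gamma.T @ X / occ[:, None]
        var = gamma.T @ (X ** 2) / occ[:, None] - mu ** 2
        var = np.maximum(var, 1e-3)  # variance floor (assumed value)
    return w, mu, var
```

Calling the function twice on the same static MFCC matrix, once with a small `n_units` and once with a larger one, would give stand-ins for GMM1 and GMM2.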

2. HMM training:

The invention models each basic speech unit of the recognizer with a continuous-density HMM, and the HMM of each basic speech unit is generated from the training speech of that unit. The number of HMMs depends on the number of speech units. The covariance matrices of all HMMs are also diagonal.

3. Noise parameter estimation:

In the cepstral domain, the relationship between the noisy speech feature vector $y$ and the clean speech feature vector $x$ can be expressed as:

$$y = x + C \log\left(1 + \exp\left(C^{-1}(n - x)\right)\right) \tag{1}$$

where $n$ is the additive noise cepstral feature vector, and $C$ and $C^{-1}$ are the discrete cosine transform (DCT) matrix and its inverse, respectively. Expanding equation (1) with a first-order VTS around the mean $\mu_x$ of $x$ and the initial mean $\mu_{n0}$ of $n$ yields:

$$y = \mu_x + C \log\left(1 + \exp\left(C^{-1}(\mu_{n0} - \mu_x)\right)\right) + (I - U)(x - \mu_x) + U(n - \mu_{n0}) \tag{2}$$

where $I$ denotes the identity matrix and $U$ is:

$$U = C\,\mathrm{diag}\left(\frac{\exp\left(C^{-1}(\mu_{n0} - \mu_x)\right)}{1 + \exp\left(C^{-1}(\mu_{n0} - \mu_x)\right)}\right) C^{-1} \tag{4}$$

In equation (4), $\mathrm{diag}(\cdot)$ denotes the diagonal matrix whose diagonal elements are the elements of the vector in parentheses.
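The mismatch function of equation (1) and the matrix U of equation (4) can be sketched in NumPy as below, together with the observation that U is the Jacobian of y with respect to n at the expansion point. The orthonormal square DCT-II matrix (so that the inverse is the transpose) and the function names `dct_matrix`, `mismatch` and `jacobian_U` are assumptions made for this example, not details taken from the text.

```python
import numpy as np

def dct_matrix(dim):
    """Orthonormal DCT-II matrix (assumed form; its inverse is its transpose)."""
    k = np.arange(dim)[:, None]
    j = np.arange(dim)[None, :]
    C = np.sqrt(2.0 / dim) * np.cos(np.pi * k * (2 * j + 1) / (2 * dim))
    C[0, :] /= np.sqrt(2.0)
    return C

def mismatch(x, n, C, C_inv):
    """Equation (1): y = x + C log(1 + exp(C^-1 (n - x)))."""
    return x + C @ np.log1p(np.exp(C_inv @ (n - x)))

def jacobian_U(mu_x, mu_n0, C, C_inv):
    """Equation (4): U = C diag(e / (1 + e)) C^-1, e = exp(C^-1 (mu_n0 - mu_x))."""
    e = np.exp(C_inv @ (mu_n0 - mu_x))
    return C @ np.diag(e / (1.0 + e)) @ C_inv

def numeric_jacobian_wrt_n(mu_x, mu_n0, C, C_inv, eps=1e-6):
    """Central finite differences of equation (1) with respect to n."""
    cols = []
    for unit in np.eye(len(mu_n0)):
        hi = mismatch(mu_x, mu_n0 + eps * unit, C, C_inv)
        lo = mismatch(mu_x, mu_n0 - eps * unit, C, C_inv)
        cols.append((hi - lo) / (2.0 * eps))
    return np.column_stack(cols)
```

Comparing `jacobian_U` with `numeric_jacobian_wrt_n` at the same point is a quick sanity check that an implementation of equation (4) matches equation (1); the Jacobian with respect to x is then I - U.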

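Taking the mean and the variance of both sides of equation (2) gives:

$$\mu_y = \mu_x + C \log\left(1 + \exp\left(C^{-1}(\mu_{n0} - \mu_x)\right)\right) + U(\mu_n - \mu_{n0}) \tag{5}$$

$$\Sigma_y = (I - U)\,\Sigma_x\,(I - U)^T + U\,\Sigma_n\,U^T \tag{6}$$

where $\mu_y$, $\mu_x$ and $\mu_n$ are the mean vectors of the noisy speech $y$, the clean speech $x$ and the additive noise $n$, respectively, and $\Sigma_y$, $\Sigma_x$ and $\Sigma_n$ are their covariance matrices.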
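The step from the first-order expansion to the mean and covariance of y can be spelled out as follows. This is a sketch under assumptions the text does not state explicitly: the expansion has the linear form shown in the first line, x and n are statistically independent, and U is treated as a constant evaluated at the expansion point.

```latex
\begin{aligned}
y &= \mu_x + C\log\!\left(1+\exp\!\left(C^{-1}(\mu_{n0}-\mu_x)\right)\right)
     + (I-U)(x-\mu_x) + U(n-\mu_{n0}) \\
\mathrm{E}[x-\mu_x] &= 0, \qquad \mathrm{E}[n-\mu_{n0}] = \mu_n-\mu_{n0} \\
\mu_y &= \mu_x + C\log\!\left(1+\exp\!\left(C^{-1}(\mu_{n0}-\mu_x)\right)\right)
     + U(\mu_n-\mu_{n0}) \\
y-\mu_y &= (I-U)(x-\mu_x) + U(n-\mu_n) \\
\Sigma_y &= \mathrm{E}\!\left[(y-\mu_y)(y-\mu_y)^T\right]
     = (I-U)\,\Sigma_x\,(I-U)^T + U\,\Sigma_n\,U^T
\end{aligned}
```

The cross terms between x and n vanish in the last line because of the independence assumption, which is why the covariance contains only the two terms of equation (6).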

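For the m-th Gaussian unit of GMM1, equations (5) and (6) become:

$$\mu_{y,m} = \mu_{x,m} + C \log\left(1 + \exp\left(C^{-1}(\mu_{n0} - \mu_{x,m})\right)\right) + U_m(\mu_n - \mu_{n0}) \tag{7}$$

$$\sigma_{y,m} = (V_m \cdot V_m)\,\sigma_{x,m} + (U_m \cdot U_m)\,\sigma_n \tag{8}$$

where $V_m = I - U_m$, and $\sigma_{y,m}$, $\sigma_{x,m}$ and $\sigma_n$ are the vectors of diagonal elements of $\Sigma_{y,m}$, $\Sigma_{x,m}$ and $\Sigma_n$, respectively.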
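The per-unit transformation of GMM means and diagonal variances can be sketched in NumPy as follows. The variance line follows equation (8), reading the dot as an element-wise product; the mean line uses the first-order VTS expansion of equation (1) about the initial noise mean, whose exact form should be treated as an assumption here. The function name `combine_with_noise` and the array layout are hypothetical.

```python
import numpy as np

def combine_with_noise(mu_x, var_x, mu_n, var_n, mu_n0, C, C_inv):
    """Transform clean GMM parameters into noisy ones, unit by unit.

    mu_x, var_x: (M, D) clean means and diagonal variances.
    mu_n, var_n: (D,) noise mean and diagonal variance.
    mu_n0: (D,) expansion point for the noise mean.
    Returns (mu_y, var_y), both of shape (M, D).
    """
    M, D = mu_x.shape
    I = np.eye(D)
    mu_y = np.empty((M, D))
    var_y = np.empty((M, D))
    for m in range(M):
        e = np.exp(C_inv @ (mu_n0 - mu_x[m]))
        U = C @ np.diag(e / (1.0 + e)) @ C_inv      # equation (4) for unit m
        V = I - U
        g = C @ np.log1p(e)                          # C log(1 + exp(...))
        # Mean: first-order expansion about mu_n0 (assumed form).
        mu_y[m] = mu_x[m] + g + U @ (mu_n - mu_n0)
        # Variance: equation (8), with element-wise squares of V and U.
        var_y[m] = (V * V) @ var_x[m] + (U * U) @ var_n
    return mu_y, var_y
```

Two limiting cases make the behaviour easy to check by hand: when the noise is far below the speech level, U tends to zero and the noisy parameters equal the clean ones; when the noise dominates, U tends to the identity and the noisy parameters equal the noise parameters.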

Substituting equations (7) and (8) into the auxiliary function gives the maximum likelihood estimates of the noise parameters $\mu_n$ and $\sigma_n$. [Equation (9), the estimate of $\mu_n$, is missing from the source text.] The estimate of $\sigma_n$ is:

$$\sigma_n = \left[\sum_{m=1}^{M_1}\sum_{t=1}^{T} \gamma_m(t)\, G_m\, (U_m \cdot U_m)\right]^{-1} \left[\sum_{m=1}^{M_1}\sum_{t=1}^{T} \gamma_m(t)\, G_m \left((y_t - \mu_{y,m}) \cdot (y_t - \mu_{y,m}) - (V_m \cdot V_m)\,\sigma_{x,m}\right)\right] \tag{10}$$

where $M_1$ is the number of Gaussian units in GMM1; $\gamma_m(t) = P(k_t = m \mid y_t, \lambda)$ is the posterior probability that the noisy speech feature vector $y_t$ of the t-th frame belongs to the m-th Gaussian unit of GMM1, given the prior parameters $\lambda$ of GMM1; and $G_m$ is given by:

$$G_m = (U_m^T \cdot U_m^T)\,\mathrm{diag}\left[\left((V_m \cdot V_m)\,\sigma_{x,m} + (U_m \cdot U_m)\,\sigma_{n0}\right)^{-2}\right] \tag{11}$$

In equation (11), $\sigma_{n0}$ is the initial value of $\sigma_n$.
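One re-estimation pass over the noise parameters can be sketched in NumPy as follows. The variance update implements equations (10) and (11), reading the dot as an element-wise product. The closed form for the noise mean is not reproduced in the text above, so the mean update below is an assumption: it is the usual result of setting the derivative of the auxiliary function to zero for a linearized mean of the form mu_x + g + U (mu_n - mu_n0). The names `linearize` and `update_noise` are hypothetical.

```python
import numpy as np

def linearize(mu_x_m, mu_n0, C, C_inv):
    """Return U (eq. 4), V = I - U and g = C log(1 + exp(C^-1 (mu_n0 - mu_x_m)))."""
    e = np.exp(C_inv @ (mu_n0 - mu_x_m))
    U = C @ np.diag(e / (1.0 + e)) @ C_inv
    return U, np.eye(len(mu_x_m)) - U, C @ np.log1p(e)

def update_noise(y, gamma, mu_x, var_x, mu_n0, var_n0, C, C_inv):
    """One re-estimation of the noise mean and diagonal variance.

    y: (T, D) noisy frames; gamma: (T, M) posteriors under the small GMM;
    mu_x, var_x: (M, D) clean parameters; mu_n0, var_n0: (D,) initial noise.
    """
    M, D = mu_x.shape
    lin = [linearize(mu_x[m], mu_n0, C, C_inv) for m in range(M)]
    occ = gamma.sum(axis=0)  # soft frame count per Gaussian unit

    # Mean update: ASSUMED closed form, not taken from the text.
    A = np.zeros((D, D))
    b = np.zeros(D)
    for m, (U, V, g) in enumerate(lin):
        var_y = (V * V) @ var_x[m] + (U * U) @ var_n0   # eq. (8) at sigma_n0
        W = U.T / var_y                                  # U^T Sigma_y^-1
        A += occ[m] * (W @ U)
        b += W @ (gamma[:, m] @ (y - mu_x[m] - g))
    mu_n = mu_n0 + np.linalg.solve(A, b)

    # Variance update: equations (10) and (11).
    A = np.zeros((D, D))
    b = np.zeros(D)
    for m, (U, V, g) in enumerate(lin):
        var_y0 = (V * V) @ var_x[m] + (U * U) @ var_n0
        G = (U.T * U.T) / var_y0 ** 2                    # eq. (11)
        d = y - (mu_x[m] + g + U @ (mu_n - mu_n0))       # y_t - mu_{y,m}
        b += G @ (gamma[:, m] @ (d * d) - occ[m] * ((V * V) @ var_x[m]))
        A += occ[m] * (G @ (U * U))
    var_n = np.linalg.solve(A, b)
    return mu_n, var_n
```

Because every sum runs over the Gaussian units of the model that supplies the posteriors, the cost of this pass grows with the number of units, which is the motivation for running it on the smaller GMM. In practice the pass would typically be iterated, with the posteriors recomputed from the updated noisy parameters.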

4. Clean speech estimation:

After the noise parameters $\mu_n$ and $\sigma_n$ have been estimated, the means and variances of GMM2 are first transformed using equations (7) and (8); $\mu_{y,m}$ and $\sigma_{y,m}$ then denote the noisy speech mean and variance of the m-th Gaussian unit of GMM2. The noisy speech parameters $\mu_{y,m}$ and $\sigma_{y,m}$ are then used to calculate the posterior probability $\bar{\gamma}_m(t)$ that the current test speech frame belongs to the m-th Gaussian unit of GMM2.

Finally, the MMSE estimate of the clean speech feature vector is obtained from:

$$\hat{x}_t = E(x_t \mid y_t) \approx y_t - \sum_{m=1}^{M_2} \bar{\gamma}_m(t)\, C \log\left(1 + \exp\left(C^{-1}(\mu_n - \mu_{x,m})\right)\right) \tag{12}$$

where $M_2$ is the number of Gaussian units in GMM2.
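The posterior computation and the MMSE estimate of equation (12) can be sketched in NumPy as follows. The function names `posteriors` and `mmse_clean`, the array shapes, and the use of log-domain normalization for numerical stability are choices made for this example; the noisy means and variances passed to `posteriors` are assumed to have been produced beforehand by the parameter transformation.

```python
import numpy as np

def posteriors(y, weights, mu_y, var_y):
    """Posterior of each Gaussian unit for each frame under a diagonal noisy GMM.

    y: (T, D); weights: (M,); mu_y, var_y: (M, D). Returns (T, M).
    """
    log_p = (np.log(weights)
             - 0.5 * np.sum(np.log(2.0 * np.pi * var_y), axis=1)
             - 0.5 * np.sum((y[:, None, :] - mu_y) ** 2 / var_y, axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)   # avoid underflow
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def mmse_clean(y, gamma, mu_x, mu_n, C, C_inv):
    """Equation (12): subtract the posterior-weighted noise bias from each frame.

    y: (T, D); gamma: (T, M); mu_x: (M, D) clean means; mu_n: (D,).
    """
    # Row m of `bias` is C log(1 + exp(C^-1 (mu_n - mu_{x,m}))).
    bias = np.log1p(np.exp((mu_n - mu_x) @ C_inv.T)) @ C.T
    return y - gamma @ bias
```

Only the bias table depends on the noise estimate, so it can be computed once per utterance and reused for every frame.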

The first-order and second-order dynamic coefficients of the clean speech feature vector are obtained by time-domain differencing of the estimated static coefficients $\hat{x}_t$.
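Time-domain differencing of the estimated static coefficients can be sketched as follows. The text does not give the differencing formula, so this example assumes the common regression-style delta with a window of two frames on each side and edge padding; the function name `deltas` and the window size are hypothetical.

```python
import numpy as np

def deltas(feat, K=2):
    """Regression-style time difference over +/-K frames (assumed formula).

    feat: (T, D) array of coefficients. Returns an array of the same shape.
    Edges are handled by repeating the first and last frames.
    """
    feat = np.asarray(feat, dtype=float)
    T = feat.shape[0]
    padded = np.pad(feat, ((K, K), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    out = np.zeros_like(feat)
    for k in range(1, K + 1):
        # padded[t + K] is frame t, so these slices are x_{t+k} and x_{t-k}.
        out += k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
    return out / denom

def add_dynamic(static):
    """Stack static, first-order and second-order coefficients column-wise."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])
```

Applying `add_dynamic` to the matrix of estimated static coefficients would give the full feature vectors passed to the HMM decoder, so the dynamic coefficients never have to be estimated from the noisy speech directly.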

Claims

1. A feature compensation method based on fast noise estimation in a speech recognition system is characterized by comprising the following steps:

(1) Mel-frequency cepstral coefficients (MFCCs) are adopted as the feature parameters of the speech recognition system, and the aim of feature compensation is to extract clean speech MFCCs from the noisy test speech;

(2) in the training stage, the distribution of speech is modeled with Gaussian mixture models (GMMs), and two GMMs are generated from all of the training speech: a first GMM and a second GMM;

(3) the background noise is modeled with a single Gaussian model, and, in order to track changes in the environment in real time, the mean vector and covariance matrix of the single Gaussian noise model are extracted from the noisy test speech;

(4) the noise parameters, including the Gaussian mean vector and covariance matrix of the noise, are extracted from the noisy test speech MFCCs using the first GMM;

(5) the means and variances of the second GMM are transformed with the estimated noise parameters, that is, the single Gaussian noise model is combined with the second GMM to obtain the noisy speech means and variances of the second GMM;

(6) the posterior probabilities of the noisy test speech are calculated with the noisy speech means and variances of the second GMM, and the clean speech MFCCs are estimated by the minimum mean square error method;

(7) the first-order and second-order dynamic coefficients of the clean speech feature vector are not estimated directly from the noisy test speech, but are obtained by time-domain differencing of the estimated static coefficients.

2. The feature compensation method based on fast noise estimation in a speech recognition system according to claim 1, wherein: the first Gaussian mixture model, used for noise parameter estimation, contains fewer Gaussian units, so that its amount of computation is small and the mean and variance of the noise can be estimated rapidly from the noisy test speech.

3. The feature compensation method based on fast noise estimation in a speech recognition system according to claim 1, wherein: the second Gaussian mixture model, used for clean speech estimation, contains more Gaussian units, so that the distribution of speech can be described fully and an accurate clean speech estimate can be obtained.

4. The feature compensation method based on fast noise estimation in a speech recognition system according to claim 1, wherein: the covariance matrices of the Gaussian mixture models used for both noise parameter estimation and clean speech estimation are diagonal.

5. The feature compensation method based on fast noise estimation in a speech recognition system according to claim 1, wherein: the first GMM and the second GMM model only the static coefficients of the feature vector and do not consider the dynamic coefficients; the noise parameter estimation based on the first GMM and the clean speech estimation based on the second GMM likewise calculate only the static coefficients of the noise and the speech; and the dynamic coefficients of the clean speech feature vector are obtained by time-domain differencing of the estimated static coefficients.