
CN104992708A - Short-time specific audio detection model generating method and short-time specific audio detection method - Google Patents

Short-time specific audio detection model generating method and short-time specific audio detection method Download PDF

Info

Publication number
CN104992708A
CN104992708A (application CN201510236568.8A)
Authority
CN
China
Prior art keywords
model
gaussian
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510236568.8A
Other languages
Chinese (zh)
Other versions
CN104992708B (en)
Inventor
云晓春
颜永红
袁庆升
黄宇飞
任彦
周若华
黄文廷
邹学强
包秀国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, National Computer Network and Information Security Management Center filed Critical Institute of Acoustics CAS
Priority to CN201510236568.8A priority Critical patent/CN104992708B/en
Publication of CN104992708A publication Critical patent/CN104992708A/en
Application granted granted Critical
Publication of CN104992708B publication Critical patent/CN104992708B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Complex Calculations (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to a method for generating a short-time specific audio detection model, comprising: extracting features from training voice data, wherein the training voice data comprises non-specific audio data and specific audio data; training a universal background model with the features of the training voice data; adaptively obtaining a model for one class of specific audio data from the universal background model and the features of that class of specific audio data in the training voice data; and repeating this operation until the models of all classes of specific audio data in the training voice data are obtained. The invention also provides a short-time specific audio detection method which detects specific audio data by model scoring. The method not only alleviates the problem of insufficient training data for specific audio models, but also suppresses the background noise of the input data to a certain extent.

Description

Short-time specific audio detection model generation and detection method
Technical Field
The invention relates to a method for detecting short-time specific audio, in particular to the detection of the short-time specific audio by using a Gaussian mixture model.
Background
Short-time specific audio plays an important role in many fields, especially security. In some situations a certain type of short-time specific audio needs to be detected so that urgent events can be handled in time. For example, in public places, public safety needs to be supervised and accidents such as sudden screams, unexpected explosions or gunshots need to be detected; these short specific sounds must be detected promptly so that the accidents can be handled promptly. In addition, in some relatively important places, short-time specific audio detection can be used to detect abnormal sounds and thus serve as an early warning.
Current short-time specific audio detection methods still face many problems. First, because a short-time specific audio event occurs quickly and lasts only a short time, making full use of the information in the short audio segment is important. Second, short-time specific audio does not occur frequently, so the problem of insufficient training data has to be faced. Third, since the usage scenarios often contain complex background noise, good suppression of background noise is also an important issue for short-time specific audio detection.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing short-time specific audio detection methods, namely insufficient training data and the inability to suppress background noise, and provides a short-time specific audio model generation and detection method based on a Gaussian mixture model.
The invention provides a method for generating the short-time specific audio detection model, which comprises the following steps:
step 101, extracting features of training voice data; wherein the training voice data comprises non-specific audio data and specific audio data;
step 102, training a general background model by using the characteristics of the training voice data obtained in the step 101; the general background model is a Gaussian mixture model, and the expression of the general background model is as follows:
$$p(x\mid\lambda)=\sum_{i=1}^{M}w_i\,p_i(x);$$
w_i represents the weight of the i-th Gaussian, ranging from 0 to 1 and satisfying the normalization condition $\sum_{i=1}^{M}w_i=1$; x represents a frame feature of the training speech segment; λ represents the set of all parameters in the Gaussian mixture model; M represents the total number of Gaussians in the Gaussian mixture model; p_i(x) represents the probability density function of each single Gaussian model, whose expression is as follows:
$$p_i(x)=\frac{1}{(2\pi)^{D/2}\,\lvert\Sigma_i\rvert^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_i)'\,\Sigma_i^{-1}\,(x-\mu_i)\right\};$$
D represents the dimension of the frame feature of the training speech segment; Σ_i represents the covariance matrix of the i-th Gaussian; μ_i represents the mean vector of the i-th Gaussian;
step 103, adaptively obtaining a model of a certain class of specific audio data according to the general background model obtained in step 102 and the features of that class of specific audio data in the training voice data; this operation is repeated until the models of all classes of specific audio data in the training voice data are obtained.
In the above technical solution, in step 101, the features extracted from the training speech data are mel-frequency cepstral coefficients.
In the above technical solution, in step 102, the training of the general background model includes performing parameter estimation on the general background model by the expectation-maximization method, where the parameters to be estimated are of three types: the Gaussian weights w, the Gaussian variances σ² and the Gaussian means μ, where w is the set of the per-Gaussian weights w_i, σ² is the set of the per-Gaussian variances σ_i², and μ is the set of the per-Gaussian means μ_i; i denotes the index of each single Gaussian model; the estimation specifically comprises the following steps:
step 102-1, updating the k-th Gaussian weight w_k:
the update of the k-th Gaussian weight w_k is given by the following equation:
$$w_k=\frac{1}{T}\sum_{t=1}^{T}p(k\mid x_t,\lambda)$$
wherein x_t is the t-th frame feature vector of the input training speech x, a known vector computed in the feature extraction process; λ is a general term for all parameters of the Gaussian mixture model, which are given initial values during initialization at the start of training and are therefore known; T represents the total number of frames of all input training speech, a known value; k represents the index of the k-th single Gaussian model in the Gaussian mixture model; p(k|x_t, λ) represents the posterior probability of the input training speech frame x_t on the k-th Gaussian of the general background model, computed from the input frame x_t and the Gaussian mixture model parameters λ;
step 102-2, updating the k-th Gaussian mean μ_k:
the update of the k-th Gaussian mean μ_k is given by the following equation:
$$\mu_k=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}$$
wherein T, x_t and λ are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ;
step 102-3, updating the k-th Gaussian variance σ_k²:
the update of the k-th Gaussian variance σ_k² is given by the following equation:
$$\sigma_k^2=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t^2}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}-\mu_k^2$$
wherein T, x_t, λ and μ_k are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ.
In the above technical solution, in step 103, adaptively obtaining a model of a specific type of audio data from the general background model obtained in step 102 includes:
step 103-1, first computing, from the feature vectors of the specific audio training data, the posterior probability n_i, the first-order statistic E_i(x) and the second-order statistic E_i(x²) of each speech frame on the general background model; the specific calculation is given by the following formulas:
$$n_i=\sum_{t=1}^{T}\Pr(i\mid x_t)$$
$$E_i(x)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t$$
$$E_i(x^2)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t^2$$
wherein Pr(i|x_t) represents the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the general background model; x_t represents the features of the t-th frame of the input audio x; T represents the total number of frames of the input audio; i represents the index of the i-th single Gaussian in the general background model;
step 103-2, adaptively adjusting the parameters of the general background model using the posterior probability, the first-order statistic and the second-order statistic computed in step 103-1, so as to obtain the weight $\hat{w}_i$, the mean $\hat{\mu}_i$ and the covariance $\hat{\sigma}_i^2$ of the specific audio model; the formulas of the adaptive adjustment are as follows:
$$\hat{w}_i=\left[\alpha_i^{w}\,n_i/T+(1-\alpha_i^{w})\,w_i\right]\gamma$$
$$\hat{\mu}_i=\alpha_i^{m}\,E_i(x)+(1-\alpha_i^{m})\,\mu_i$$
$$\hat{\sigma}_i^2=\alpha_i^{v}\,E_i(x^2)+(1-\alpha_i^{v})\left(\sigma_i^2+\mu_i^2\right)-\hat{\mu}_i^2$$
wherein α_i^v, α_i^m and α_i^w are the variance, mean and weight adaptation coefficients, respectively; T represents the total number of frames of the specific audio training data, and γ is a normalization parameter ensuring that the adapted weights sum to 1, i.e. $\sum_i \hat{w}_i = 1$; w_i represents the weight of the i-th Gaussian of the general background model; μ_i represents the mean of the i-th Gaussian of the general background model; σ_i² represents the covariance of the i-th Gaussian of the general background model; and $\hat{\mu}_i$ represents the adaptively obtained mean of the i-th Gaussian of the specific audio model.
The invention also provides a short-time specific audio detection method, which comprises the following steps:
step 201, performing feature extraction on the input test voice;
step 202, inputting the test voice features extracted in step 201 into the general background model obtained by the short-time specific audio detection model generating method, and calculating the score of the test voice on the general background model;
step 203, inputting the test voice features extracted in step 201 into the mixed Gaussian model of each type of specific audio obtained by the short-time specific audio detection model generation method, and calculating the score of the test voice on the mixed Gaussian model of each type of specific audio;
and step 204, computing the differences between the score of the test voice on the general background model obtained in step 202 and the scores of the test voice on the Gaussian mixture models of the various specific audios obtained in step 203, and comparing the differences with a threshold so as to judge which specific audio the test voice belongs to; if several model scores fall within the threshold range, the maximum rule is used for the decision, and the specific audio represented by the model with the largest score is selected as the final judgment result for the test voice.
In the above technical solution, in step 202, calculating the score of the test speech on the general background model includes: selecting the N Gaussians of the general background model with the largest posterior probabilities, computing the sum of these N probabilities, and recording the indices of these N Gaussians.
In the above technical solution, in step 203, calculating the score of the test speech on the Gaussian mixture model of each class of specific audio includes: using the N Gaussian indices of the general background model recorded in step 202, correspondingly computing the sum of the posterior probabilities of those N Gaussians in the Gaussian mixture model of the specific audio, and using this value as the score of the test voice on the Gaussian mixture model of that class of specific audio.
In the above technical solution, in step 201, the features extracted from the test speech are mel-frequency cepstral coefficients.
The invention has the advantages that:
the method of the invention not only can well overcome the problem of insufficient training data of the short-time specific audio model, but also can well inhibit the background noise to a certain extent.
Drawings
FIG. 1 is a basic block diagram of training for a generic background model in a short-time specific audio detection model generation method;
FIG. 2 is a basic block diagram of training for a specific audio model in a short-time specific audio detection model generation method;
fig. 3 is a flow chart of a short-time specific audio detection method.
Detailed Description
The embodiment of the present invention will now be described in further detail with reference to fig. 1 and 2.
The short-time specific audio detection method comprises two stages, wherein the first stage is to train a model by using training voice data, and the second stage is to detect test voice by using the trained model.
First, model training phase
Step 101, extracting features of training voice data, wherein the extracted features are mel-frequency cepstral coefficients (MFCC features), and the features comprise energy values and first-order and second-order differences;
in one embodiment, the mel-frequency cepstral coefficients are extracted with a frame length of 20 ms and a frame shift of 10 ms, and comprise an energy value together with first-order and second-order differences; the feature dimension is 60 in total;
the training speech data should include a large amount of non-audio-specific data as well as a certain amount of audio-specific data.
Step 102, training a universal background model (UBM model) by using the characteristics of the training voice data obtained in step 101, namely the Mel cepstrum coefficient;
referring to the training schematic diagram of the general background model given in fig. 1, the general background model is as shown in formula (1):
$$p(x\mid\lambda)=\sum_{i=1}^{M}w_i\,p_i(x)\qquad(1)$$
In formula (1), w_i represents the weight of the i-th Gaussian, ranging from 0 to 1 and satisfying the normalization condition $\sum_{i=1}^{M}w_i=1$; x represents a frame feature of the training speech segment; λ represents the set of all parameters in the Gaussian mixture model; M represents the total number of Gaussians in the Gaussian mixture model.
In formula (1), p_i(x) represents the probability density function of each single Gaussian model, whose specific expression is given by formula (2):
$$p_i(x)=\frac{1}{(2\pi)^{D/2}\,\lvert\Sigma_i\rvert^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_i)'\,\Sigma_i^{-1}\,(x-\mu_i)\right\}\qquad(2)$$
wherein p_i(x) is characterized by several parameters: D represents the dimension of the frame feature of the training speech segment, determined by the feature dimension chosen in the feature extraction process; Σ_i represents the covariance matrix of the i-th Gaussian; μ_i represents the mean vector of the i-th Gaussian.
The above is the specific representation of the general background model. The Gaussian mixture model uses a linear weighted sum of several single Gaussians to fit the probability density function of the features of speech produced by a general speaker. The general Gaussian mixture model can therefore represent the distribution of speakers' sounds, and hence their acoustic characteristics, well.
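To make formulas (1) and (2) concrete, the sketch below evaluates the mixture density and the per-Gaussian posteriors p(k|x_t, λ) frame by frame. It assumes diagonal covariance matrices (a common choice for universal background models, though the patent does not state it) and uses numpy; all names are illustrative.

```python
import numpy as np

def gmm_posteriors(X, w, mu, var):
    """X: (T, D) frame features; w: (M,) weights; mu: (M, D) means; var: (M, D) diagonal variances.
    Returns the per-frame likelihood p(x_t | lambda) of formula (1) and the posteriors p(k | x_t, lambda)."""
    T, D = X.shape
    # log of formula (2) for every (frame, Gaussian) pair, with diagonal covariance
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(var), axis=1))          # (M,)
    diff = X[:, None, :] - mu[None, :, :]                                            # (T, M, D)
    log_pi = log_norm[None, :] - 0.5 * np.sum(diff ** 2 / var[None, :, :], axis=2)   # (T, M)
    log_weighted = np.log(w)[None, :] + log_pi                                       # log of w_i * p_i(x_t)
    log_px = np.logaddexp.reduce(log_weighted, axis=1)                               # log of formula (1)
    posteriors = np.exp(log_weighted - log_px[:, None])                              # p(k | x_t, lambda)
    return np.exp(log_px), posteriors
```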
On this basis, training the general background model with the features of the training voice data means estimating its parameters with the expectation-maximization method.
After parameter estimation, the general background model is obtained. The model is in essence a Gaussian mixture model, and its parameters are of three types: the Gaussian weights w, the Gaussian variances σ² and the Gaussian means μ, where w is the set of the per-Gaussian weights w_i, σ² is the set of the per-Gaussian variances σ_i², and μ is the set of the per-Gaussian means μ_i; i denotes the index of each single Gaussian model. The three sets of parameters obtained are determined uniquely by the training data.
The specific parameter estimation process is as follows:
Step 102-1, updating the k-th Gaussian weight w_k:
the update of the k-th Gaussian weight w_k is given by formula (3):
$$w_k=\frac{1}{T}\sum_{t=1}^{T}p(k\mid x_t,\lambda)\qquad(3)$$
wherein x_t is the t-th frame feature vector of the input training speech x, a known vector computed in the feature extraction process; λ has the same meaning as in formula (1), a general term for all parameters of the Gaussian mixture model, which are given initial values during initialization at the start of training and are therefore known; T represents the total number of frames of all input training speech, a known value; k represents the index of the k-th single Gaussian model in the Gaussian mixture model; p(k|x_t, λ) represents the posterior probability of the input training speech frame x_t on the k-th Gaussian of the general background model, computed from the input frame x_t and the Gaussian mixture model parameters λ.
Step 102-2, updating the k-th Gaussian mean μ_k:
the update of the k-th Gaussian mean μ_k is given by formula (4):
$$\mu_k=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}\qquad(4)$$
wherein each parameter has the same meaning as in formula (3); T, x_t and λ are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ.
Step 102-3, updating the k-th Gaussian variance σ_k²:
the update of the k-th Gaussian variance σ_k² is given by formula (5):
$$\sigma_k^2=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t^2}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}-\mu_k^2\qquad(5)$$
wherein each parameter has the same meaning as in formulas (3) and (4); T, x_t, λ and μ_k are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ.
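The three re-estimation formulas (3), (4) and (5) can be applied together in one expectation-maximization pass. The sketch below is an illustrative implementation only, reusing the gmm_posteriors helper from the earlier sketch and keeping the diagonal-covariance assumption; the variance floor is a practical safeguard not mentioned in the patent.

```python
import numpy as np

def em_iteration(X, w, mu, var, var_floor=1e-3):
    """One EM update of the UBM parameters following formulas (3)-(5).
    X: (T, D); w: (M,); mu: (M, D); var: (M, D) diagonal variances."""
    T = X.shape[0]
    _, post = gmm_posteriors(X, w, mu, var)                      # p(k | x_t, lambda), shape (T, M)
    nk = post.sum(axis=0)                                        # sum over t of p(k | x_t, lambda)
    w_new = nk / T                                               # formula (3)
    mu_new = (post.T @ X) / nk[:, None]                          # formula (4)
    var_new = (post.T @ (X ** 2)) / nk[:, None] - mu_new ** 2    # formula (5)
    var_new = np.maximum(var_new, var_floor)                     # keep variances positive
    return w_new, mu_new, var_new
```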
Step 103, in order to obtain a model for each class of specific audio, a portion of speech of that class of specific audio is first needed as model training speech. If specific audio data of that class is hard to obtain, the specific audio data of that class used when training the general background model can be reused; if new specific audio data of that class is available, the new data is used as the training data. Regardless of how much training data there is, each class of specific audio obtains a corresponding specific audio model.
In this step, as shown in fig. 2, a small amount of specific audio training data of a certain class and a Bayesian adaptation algorithm are used to obtain the specific audio model of that class from the general background model in an adaptive manner; the specific adaptation process is as follows:
Step 103-1, first calculating the posterior probability, first-order statistic and second-order statistic of each speech frame on the general background model from the feature vectors of the specific audio training data; the specific calculation is given by formulas (6), (7) and (8):
$$n_i=\sum_{t=1}^{T}\Pr(i\mid x_t)\qquad(6)$$
$$E_i(x)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t\qquad(7)$$
$$E_i(x^2)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t^2\qquad(8)$$
wherein Pr(i|x_t) represents the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the general background model; x_t represents the features of the t-th frame of the input audio x; T represents the total number of frames of the input audio; i represents the index of the i-th single Gaussian in the general background model.
Since the specific audio data used for adaptation is different when each specific audio model is trained, the posterior probabilities and the first-order and second-order statistics computed for training each specific audio model are also different.
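Formulas (6), (7) and (8) are the zeroth-, first- and second-order statistics of the adaptation data on the general background model. A brief sketch under the same assumptions as the earlier code, again reusing gmm_posteriors:

```python
import numpy as np

def adaptation_statistics(X, w, mu, var):
    """Compute n_i, E_i(x), E_i(x^2) of formulas (6)-(8) for adaptation data X of shape (T, D)."""
    _, post = gmm_posteriors(X, w, mu, var)      # Pr(i | x_t), shape (T, M)
    n = post.sum(axis=0)                         # formula (6), shape (M,)
    Ex = (post.T @ X) / n[:, None]               # formula (7), shape (M, D)
    Ex2 = (post.T @ (X ** 2)) / n[:, None]       # formula (8), shape (M, D)
    return n, Ex, Ex2
```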
Step 103-2, adaptively adjusting the parameters of the general background model using the posterior probability, the first-order statistic and the second-order statistic computed in step 103-1, to obtain the weight $\hat{w}_i$, the mean $\hat{\mu}_i$ and the covariance $\hat{\sigma}_i^2$ of the specific audio model. Because the specific audio model is also in essence a Gaussian mixture model, once the weights $\hat{w}_i$, means $\hat{\mu}_i$ and covariances $\hat{\sigma}_i^2$ of the specific audio model are determined, the Gaussian mixture model of that specific audio is fully characterized.
The specific adaptation formulas are given by (9), (10) and (11):
$$\hat{w}_i=\left[\alpha_i^{w}\,n_i/T+(1-\alpha_i^{w})\,w_i\right]\gamma\qquad(9)$$
$$\hat{\mu}_i=\alpha_i^{m}\,E_i(x)+(1-\alpha_i^{m})\,\mu_i\qquad(10)$$
$$\hat{\sigma}_i^2=\alpha_i^{v}\,E_i(x^2)+(1-\alpha_i^{v})\left(\sigma_i^2+\mu_i^2\right)-\hat{\mu}_i^2\qquad(11)$$
wherein α_i^v, α_i^m and α_i^w are the variance, mean and weight adaptation coefficients, respectively; n_i, E_i(x) and E_i(x²) are the posterior probability, the first-order statistic and the second-order statistic of the specific audio training data computed with formulas (6), (7) and (8); in formula (9), T represents the total number of frames of the specific audio training data, and γ is a normalization parameter ensuring that the adapted weights sum to 1, i.e. $\sum_i \hat{w}_i = 1$; w_i represents the weight of the i-th Gaussian of the general background model; in formula (10), μ_i represents the mean of the i-th Gaussian of the general background model; in formula (11), σ_i² represents the covariance of the i-th Gaussian of the general background model, μ_i represents the mean of the i-th Gaussian of the general background model, and $\hat{\mu}_i$ represents the adaptively obtained mean of the i-th Gaussian of the specific audio model.
After the above calculation process, the class-specific audio model is obtained.
As noted in step 103-1, since the adaptation data of each specific audio model is different, the computed posterior probabilities and first-order and second-order statistics are different, and therefore the specific audio models finally obtained after the calculation of step 103-2 are also different.
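The adaptation formulas (9), (10) and (11) mix these statistics with the parameters of the general background model. The sketch below is illustrative: it assumes the common data-dependent coefficient α_i = n_i / (n_i + r) with a relevance factor r and uses the same coefficient for the weight, mean and variance updates, details the patent leaves unspecified.

```python
import numpy as np

def map_adapt(w, mu, var, n, Ex, Ex2, T, relevance=16.0):
    """Adapt a specific audio model from the UBM following formulas (9)-(11).
    w, mu, var: UBM parameters; n, Ex, Ex2: statistics from formulas (6)-(8); T: adaptation frame count."""
    alpha = n / (n + relevance)                                   # assumed adaptation coefficient
    w_hat = alpha * n / T + (1.0 - alpha) * w                     # formula (9), before normalization
    w_hat = w_hat / w_hat.sum()                                   # gamma: make the weights sum to 1
    mu_hat = alpha[:, None] * Ex + (1.0 - alpha[:, None]) * mu    # formula (10)
    var_hat = (alpha[:, None] * Ex2
               + (1.0 - alpha[:, None]) * (var + mu ** 2)
               - mu_hat ** 2)                                     # formula (11)
    return w_hat, mu_hat, var_hat
```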
Second, testing stage
Referring to fig. 3, the testing phase includes the following steps:
step 201, performing feature extraction on the input test voice;
the features extracted in this step are of the same type as those extracted in step 101, and are all mel-frequency cepstrum coefficients, for example;
step 202, inputting the test voice features extracted in step 201 into the general background model trained in step 102, and calculating the score of the test voice on the general background model.
As can be seen from the foregoing description, the general background model is essentially a Gaussian mixture model, and the score of the test speech on the general background model is the sum of the Gaussian posterior probabilities. As a preferred implementation, in order to speed up the score calculation, the posterior probabilities of all Gaussians are not computed; instead, the N Gaussians with the largest posterior probabilities are selected, the sum of these N probabilities is computed, and the indices of these N Gaussians are recorded.
Step 203, inputting the test voice features extracted in step 201 into the gaussian mixture models of the respective specific audio obtained in step 103, and calculating scores of the test voice on the gaussian mixture models of each specific audio, wherein if there are M specific audio models, the total number of the finally obtained scores is M.
The specific way of calculating the score of the test speech on the Gaussian mixture model of each specific audio is still to compute the sum of the posterior probabilities of the test speech on each Gaussian of that specific audio model. In order to speed up the calculation, in a preferred implementation the sum of the posterior probabilities of the corresponding N Gaussians in the Gaussian mixture model of the specific audio is computed using the N Gaussian indices of the general background model recorded in step 202, and this value is used as the score of the test speech on that Gaussian mixture model, as in the sketch below.
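A sketch of this fast scoring scheme (illustrative only; the choice of N and the model container are assumptions): the N Gaussians with the largest posterior probabilities are picked per frame on the general background model, their posterior sum accumulated over frames is the background score, and the same Gaussian indices are reused to score every specific audio model. It relies on the gmm_posteriors helper defined earlier.

```python
import numpy as np

def topn_scores(X, ubm, specific_models, N=5):
    """Score test features X (T, D) on the UBM and on each specific audio model
    using the top-N UBM Gaussians per frame (steps 202-203).
    ubm and each value of specific_models are (w, mu, var) tuples."""
    w, mu, var = ubm
    _, post = gmm_posteriors(X, w, mu, var)                  # UBM posteriors, shape (T, M)
    top_idx = np.argsort(post, axis=1)[:, -N:]               # indices of the N largest posteriors per frame
    ubm_score = np.take_along_axis(post, top_idx, axis=1).sum()
    model_scores = {}
    for name, (ws, mus, vs) in specific_models.items():
        _, post_s = gmm_posteriors(X, ws, mus, vs)           # posteriors on the specific audio model
        model_scores[name] = np.take_along_axis(post_s, top_idx, axis=1).sum()
    return ubm_score, model_scores
```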
Step 204, the differences between the score of the test voice on the general background model obtained in step 202 and the scores of the test voice on the Gaussian mixture models of the respective specific audios obtained in step 203 are computed and compared with a threshold so as to judge which specific audio the test voice belongs to. If several model scores fall within the threshold range, the maximum rule is used: the model scores within the threshold range are compared, and the specific audio represented by the model with the largest score is selected as the final judgment result for the test voice.
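The decision rule of step 204 can then be written as the following sketch; treating the threshold as a lower bound on the score difference is an assumption, since the patent only speaks of scores falling within a threshold range.

```python
def decide(ubm_score, model_scores, threshold):
    """Step 204: compare each (specific-model score minus UBM score) against a threshold;
    among the classes that pass, pick the one with the largest difference (maximum rule)."""
    diffs = {name: s - ubm_score for name, s in model_scores.items()}
    candidates = {name: d for name, d in diffs.items() if d > threshold}
    if not candidates:
        return None                               # no specific audio detected
    return max(candidates, key=candidates.get)    # class with the largest score difference
```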
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A method for generating a short-time specific audio detection model, comprising:
step 101, extracting features of training voice data; wherein the training voice data comprises non-specific audio data and specific audio data;
step 102, training a general background model by using the characteristics of the training voice data obtained in the step 101; the general background model is a Gaussian mixture model, and the expression of the general background model is as follows:
$$p(x\mid\lambda)=\sum_{i=1}^{M}w_i\,p_i(x);$$
w_i represents the weight of the i-th Gaussian, ranging from 0 to 1 and satisfying the normalization condition $\sum_{i=1}^{M}w_i=1$; x represents a frame feature of the training speech segment; λ represents the set of all parameters in the Gaussian mixture model; p_i(x) represents the probability density function of each single Gaussian model, whose expression is as follows:
$$p_i(x)=\frac{1}{(2\pi)^{D/2}\,\lvert\Sigma_i\rvert^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_i)'\,\Sigma_i^{-1}\,(x-\mu_i)\right\};$$
D represents the dimension of the frame feature of the training speech segment; Σ_i represents the covariance matrix of the i-th Gaussian; μ_i represents the mean vector of the i-th Gaussian;
step 103, adaptively obtaining a model of a certain class of specific audio data according to the general background model obtained in step 102 and the features of that class of specific audio data in the training voice data; this operation is repeated until the models of all classes of specific audio data in the training voice data are obtained.
2. The method of generating a short-time specific audio detection model according to claim 1, wherein the features extracted from the training speech data in step 101 are mel-frequency cepstral coefficients.
3. The method of claim 1, wherein the training of the general background model in step 102 comprises performing parameter estimation on the general background model by the expectation-maximization method, and the parameters to be estimated are of three types: the Gaussian weights w, the Gaussian variances σ² and the Gaussian means μ, where w is the set of the per-Gaussian weights w_i, σ² is the set of the per-Gaussian variances σ_i², and μ is the set of the per-Gaussian means μ_i; i denotes the index of each single Gaussian model; the training specifically comprises the following steps:
step 102-1, updating the k-th Gaussian weight w_k:
the update of the k-th Gaussian weight w_k is given by the following equation:
$$w_k=\frac{1}{T}\sum_{t=1}^{T}p(k\mid x_t,\lambda)$$
wherein x_t is the t-th frame feature vector of the input training speech x, a known vector computed in the feature extraction process; λ is a general term for all parameters of the Gaussian mixture model, which are given initial values during initialization at the start of training and are therefore known; T represents the total number of frames of all input training speech, a known value; k represents the index of the k-th single Gaussian model in the Gaussian mixture model; p(k|x_t, λ) represents the posterior probability of the input training speech frame x_t on the k-th Gaussian of the general background model, computed from the input frame x_t and the Gaussian mixture model parameters λ;
step 102-2, updating the k-th Gaussian mean μ_k:
the update of the k-th Gaussian mean μ_k is given by the following equation:
$$\mu_k=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}$$
wherein T, x_t and λ are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ;
step 102-3, updating the k-th Gaussian variance σ_k²:
the update of the k-th Gaussian variance σ_k² is given by the following equation:
$$\sigma_k^2=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t^2}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}-\mu_k^2$$
wherein T, x_t, λ and μ_k are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ.
4. The method for generating a short-time specific audio detection model according to claim 1, wherein adaptively obtaining a model of a specific type of audio data in step 103 according to the generic background model obtained in step 102 comprises:
step 103-1, first computing, from the feature vectors of the specific audio training data, the posterior probability n_i, the first-order statistic E_i(x) and the second-order statistic E_i(x²) of each speech frame on the general background model; the specific calculation is given by the following formulas:
$$n_i=\sum_{t=1}^{T}\Pr(i\mid x_t)$$
$$E_i(x)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t$$
$$E_i(x^2)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t^2$$
wherein Pr(i|x_t) represents the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the general background model; x_t represents the features of the t-th frame of the input audio x; T represents the total number of frames of the input audio; i represents the index of the i-th single Gaussian in the general background model;
step 103-2, adaptively adjusting the parameters of the general background model using the posterior probability, the first-order statistic and the second-order statistic computed in step 103-1, so as to obtain the weight $\hat{w}_i$, the mean $\hat{\mu}_i$ and the covariance $\hat{\sigma}_i^2$ of the specific audio model; the formulas of the adaptive adjustment are as follows:
$$\hat{w}_i=\left[\alpha_i^{w}\,n_i/T+(1-\alpha_i^{w})\,w_i\right]\gamma$$
$$\hat{\mu}_i=\alpha_i^{m}\,E_i(x)+(1-\alpha_i^{m})\,\mu_i$$
$$\hat{\sigma}_i^2=\alpha_i^{v}\,E_i(x^2)+(1-\alpha_i^{v})\left(\sigma_i^2+\mu_i^2\right)-\hat{\mu}_i^2$$
wherein α_i^v, α_i^m and α_i^w are the variance, mean and weight adaptation coefficients, respectively; T represents the total number of frames of the specific audio training data, and γ is a normalization parameter ensuring that the adapted weights sum to 1, i.e. $\sum_i \hat{w}_i = 1$; w_i represents the weight of the i-th Gaussian of the general background model; μ_i represents the mean of the i-th Gaussian of the general background model; σ_i² represents the covariance of the i-th Gaussian of the general background model; and $\hat{\mu}_i$ represents the adaptively obtained mean of the i-th Gaussian of the specific audio model.
5. A method of short-time specific audio detection, comprising:
step 201, performing feature extraction on the input test voice;
step 202, inputting the test voice features extracted in step 201 into the general background model obtained by the short-time specific audio detection model generation method according to any one of claims 1 to 4, and calculating the score of the test voice on the general background model;
step 203, inputting the test voice features extracted in step 201 into the mixed gaussian model of each type of specific audio obtained by the short-time specific audio detection model generation method according to any one of claims 1 to 4, and calculating the score of the test voice on the mixed gaussian model of each type of specific audio;
and step 204, computing the differences between the score of the test voice on the general background model obtained in step 202 and the scores of the test voice on the Gaussian mixture models of the various specific audios obtained in step 203, and comparing the differences with a threshold so as to judge which specific audio the test voice belongs to; if several model scores fall within the threshold range, the maximum rule is used for the decision, and the specific audio represented by the model with the largest score is selected as the final judgment result for the test voice.
6. The short-time specific audio detection method according to claim 5, wherein calculating the score of the test speech on the general background model in step 202 comprises: selecting the N Gaussians of the general background model with the largest posterior probabilities, computing the sum of these N probabilities, and recording the indices of these N Gaussians.
7. The short-time specific audio detection method according to claim 6, wherein calculating the score of the test speech on the Gaussian mixture model of each class of specific audio in step 203 comprises: using the N Gaussian indices of the general background model recorded in step 202, correspondingly computing the sum of the posterior probabilities of those N Gaussians in the Gaussian mixture model of the specific audio, and using this value as the score of the test voice on the Gaussian mixture model of that class of specific audio.
8. The short-time specific audio detection method according to claim 5, wherein in step 201 the features extracted from the test speech are mel-frequency cepstral coefficients.
CN201510236568.8A 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method Expired - Fee Related CN104992708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510236568.8A CN104992708B (en) 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510236568.8A CN104992708B (en) 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method

Publications (2)

Publication Number Publication Date
CN104992708A true CN104992708A (en) 2015-10-21
CN104992708B CN104992708B (en) 2018-07-24

Family

ID=54304511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510236568.8A Expired - Fee Related CN104992708B (en) 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method

Country Status (1)

Country Link
CN (1) CN104992708B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106251861A (en) * 2016-08-05 2016-12-21 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110135492A (en) * 2019-05-13 2019-08-16 山东大学 Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models
CN113888777A (en) * 2021-09-08 2022-01-04 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning
CN115331689A (en) * 2022-08-11 2022-11-11 北京声智科技有限公司 Training method, device, equipment, storage medium and product of voice noise reduction model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509546A (en) * 2011-11-11 2012-06-20 北京声迅电子股份有限公司 Noise reduction and abnormal sound detection method applied to rail transit
CN102623009A (en) * 2012-03-02 2012-08-01 安徽科大讯飞信息技术股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
CN103198605A (en) * 2013-03-11 2013-07-10 成都百威讯科技有限责任公司 Indoor emergent abnormal event alarm system
CN103226951A (en) * 2013-04-19 2013-07-31 清华大学 Speaker verification system creation method based on model sequence adaptive technique
CN103366738A (en) * 2012-04-01 2013-10-23 佳能株式会社 Methods and devices for generating sound classifier and detecting abnormal sound, and monitoring system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509546A (en) * 2011-11-11 2012-06-20 北京声迅电子股份有限公司 Noise reduction and abnormal sound detection method applied to rail transit
CN102623009A (en) * 2012-03-02 2012-08-01 安徽科大讯飞信息技术股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
CN103366738A (en) * 2012-04-01 2013-10-23 佳能株式会社 Methods and devices for generating sound classifier and detecting abnormal sound, and monitoring system
CN103198605A (en) * 2013-03-11 2013-07-10 成都百威讯科技有限责任公司 Indoor emergent abnormal event alarm system
CN103226951A (en) * 2013-04-19 2013-07-31 清华大学 Speaker verification system creation method based on model sequence adaptive technique

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗森林; 王坤; 谢尔曼; 潘丽敏; 李金玉: "High-precision recognition method for specific audio events fusing GMM and SVM", Journal of Beijing Institute of Technology *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106251861A (en) * 2016-08-05 2016-12-21 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN106251861B (en) * 2016-08-05 2019-04-23 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium
WO2018166187A1 (en) * 2017-03-13 2018-09-20 平安科技(深圳)有限公司 Server, identity verification method and system, and a computer-readable storage medium
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110135492A (en) * 2019-05-13 2019-08-16 山东大学 Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models
CN113888777A (en) * 2021-09-08 2022-01-04 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning
CN113888777B (en) * 2021-09-08 2023-08-18 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning
CN115331689A (en) * 2022-08-11 2022-11-11 北京声智科技有限公司 Training method, device, equipment, storage medium and product of voice noise reduction model

Also Published As

Publication number Publication date
CN104992708B (en) 2018-07-24

Similar Documents

Publication Publication Date Title
CN104992708B (en) Short-time specific audio detection model generation and detection method
CN104835498B (en) Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter
Xu et al. Deep sparse rectifier neural networks for speech denoising
US9754608B2 (en) Noise estimation apparatus, noise estimation method, noise estimation program, and recording medium
CN104900235A (en) Voiceprint recognition method based on pitch period mixed characteristic parameters
CN107452389A (en) A kind of general monophonic real-time noise-reducing method
CN106941008A (en) It is a kind of that blind checking method is distorted based on Jing Yin section of heterologous audio splicing
Ghalehjegh et al. Deep bottleneck features for i-vector based text-independent speaker verification
KR100682909B1 (en) Method and apparatus for recognizing speech
Smolenski et al. Usable speech processing: A filterless approach in the presence of interference
Fauve et al. Influence of task duration in text-independent speaker verification.
WO2012105385A1 (en) Sound segment classification device, sound segment classification method, and sound segment classification program
Hong et al. Modified-prior PLDA and score calibration for duration mismatch compensation in speaker recognition system.
Yamamoto et al. Denoising autoencoder-based speaker feature restoration for utterances of short duration.
KR100784456B1 (en) Voice Enhancement System using GMM
Wang et al. F0 estimation in noisy speech based on long-term harmonic feature analysis combined with neural network classification
Soni et al. Effectiveness of ideal ratio mask for non-intrusive quality assessment of noise suppressed speech
Shokri et al. A robust keyword spotting system for Persian conversational telephone speech using feature and score normalization and ARMA filter
Nicolson et al. Bidirectional Long-Short Term Memory Network-based Estimation of Reliable Spectral Component Locations.
Islam et al. Neural-Response-Based Text-Dependent speaker identification under noisy conditions
Wang et al. An ideal Wiener filter correction-based cIRM speech enhancement method using deep neural networks with skip connections
Chen et al. Truth-to-estimate ratio mask: A post-processing method for speech enhancement direct at low signal-to-noise ratios
Preti et al. Confidence measure based unsupervised target model adaptation for speaker verification.
JPH04296799A (en) Voice recognition device
Kenny et al. Development of the primary CRIM system for the NIST 2008 speaker recognition evaluation.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180724