
CN104992708A - Short-time specific audio detection model generating method and short-time specific audio detection method - Google Patents

Short-time specific audio detection model generating method and short-time specific audio detection method Download PDF

Info

Publication number
CN104992708A
CN104992708A (application CN201510236568.8A)
Authority
CN
China
Prior art keywords
model
gaussian
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510236568.8A
Other languages
Chinese (zh)
Other versions
CN104992708B (en)
Inventor
云晓春
颜永红
袁庆升
黄宇飞
任彦
周若华
黄文廷
邹学强
包秀国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS, National Computer Network and Information Security Management Center filed Critical Institute of Acoustics CAS
Priority to CN201510236568.8A priority Critical patent/CN104992708B/en
Publication of CN104992708A publication Critical patent/CN104992708A/en
Application granted granted Critical
Publication of CN104992708B publication Critical patent/CN104992708B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Complex Calculations (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention relates to a method for generating a short-time specific audio detection model, comprising: extracting features from training voice data, wherein the training voice data comprises non-specific audio data and specific audio data; training a universal background model with the features of the training voice data; adaptively obtaining a model for one class of specific audio data from the universal background model and the features of that class of specific audio data in the training voice data; and repeating this operation until the models of all classes of specific audio data in the training voice data are obtained. The invention also provides a short-time specific audio detection method which detects specific audio data by model scoring. The method not only alleviates the problem of insufficient training data for specific audio models, but also suppresses the background noise of the input data to a certain extent.

Description

Short-time specific audio detection model generation and detection method
Technical Field
The invention relates to a method for detecting short-time specific audio, in particular to the detection of the short-time specific audio by using a Gaussian mixture model.
Background
Short-time specific audio plays an important role in many fields, especially security. In some situations a certain type of short-time specific audio needs to be detected so that urgent events can be handled in time. For example, in public places, public safety needs to be supervised and accidents such as sudden screams, unexpected explosions or gunshots need to be detected; these short specific sounds must be detected promptly so that the accidents can be handled promptly. In addition, in some relatively important places, short-time specific audio detection can be used to detect abnormal sounds and thus serve as an early warning.
Current short-time specific audio detection methods still face many problems. First, because a short-time specific audio event occurs quickly and lasts only a short time, making full use of the information in the short audio segment is important. Second, short-time specific audio does not occur frequently, so the problem of insufficient training data has to be faced. Third, since the usage scenarios often contain complex background noise, good suppression of background noise is also an important issue for short-time specific audio detection.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing short-time specific audio detection methods, namely insufficient training data and the inability to suppress background noise, and provides a short-time specific audio model generation and detection method based on a Gaussian mixture model.
The invention provides a method for generating the short-time specific audio detection model, which comprises the following steps:
step 101, extracting features of training voice data; wherein the training voice data comprises non-specific audio data and specific audio data;
step 102, training a general background model by using the characteristics of the training voice data obtained in the step 101; the general background model is a Gaussian mixture model, and the expression of the general background model is as follows:
$$p(x\mid\lambda)=\sum_{i=1}^{M}w_i\,p_i(x);$$
w_i represents the weight of the i-th Gaussian, ranging from 0 to 1 and satisfying the normalization condition $\sum_{i=1}^{M}w_i=1$; x represents a frame feature of the training speech segment; λ represents the set of all parameters in the Gaussian mixture model; M represents the total number of Gaussians in the Gaussian mixture model; p_i(x) represents the probability density function of each single Gaussian model, whose expression is as follows:
$$p_i(x)=\frac{1}{(2\pi)^{D/2}\,\lvert\Sigma_i\rvert^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_i)'\,\Sigma_i^{-1}\,(x-\mu_i)\right\};$$
D represents the dimension of the frame feature of the training speech segment; Σ_i represents the covariance matrix of the i-th Gaussian; μ_i represents the mean vector of the i-th Gaussian;
step 103, adaptively obtaining a model of a certain class of specific audio data according to the general background model obtained in step 102 and the features of that class of specific audio data in the training voice data; this operation is repeated until the models of all classes of specific audio data in the training voice data are obtained.
In the above technical solution, in step 101, the features extracted from the training speech data are mel-frequency cepstral coefficients.
In the above technical solution, in step 102, the training of the general background model includes performing parameter estimation on the general background model by the expectation-maximization method, where the parameters to be estimated are of three types: the Gaussian weights w, the Gaussian variances σ² and the Gaussian means μ, where w is the set of the per-Gaussian weights w_i, σ² is the set of the per-Gaussian variances σ_i², and μ is the set of the per-Gaussian means μ_i; i denotes the index of each single Gaussian model; the estimation specifically comprises the following steps:
step 102-1, updating the k-th Gaussian weight w_k:
the update of the k-th Gaussian weight w_k is given by the following equation:
$$w_k=\frac{1}{T}\sum_{t=1}^{T}p(k\mid x_t,\lambda)$$
wherein x_t is the t-th frame feature vector of the input training speech x, a known vector computed in the feature extraction process; λ is a general term for all parameters of the Gaussian mixture model, which are given initial values during initialization at the start of training and are therefore known; T represents the total number of frames of all input training speech, a known value; k represents the index of the k-th single Gaussian model in the Gaussian mixture model; p(k|x_t, λ) represents the posterior probability of the input training speech frame x_t on the k-th Gaussian of the general background model, computed from the input frame x_t and the Gaussian mixture model parameters λ;
step 102-2, updating the k-th Gaussian mean μ_k:
the update of the k-th Gaussian mean μ_k is given by the following equation:
$$\mu_k=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}$$
wherein T, x_t and λ are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ;
step 102-3, updating the k-th Gaussian variance σ_k²:
the update of the k-th Gaussian variance σ_k² is given by the following equation:
$$\sigma_k^2=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t^2}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}-\mu_k^2$$
wherein T, x_t, λ and μ_k are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ.
In the above technical solution, in step 103, adaptively obtaining a model of a specific type of audio data from the general background model obtained in step 102 includes:
step 103-1, first computing, from the feature vectors of the specific audio training data, the posterior probability n_i, the first-order statistic E_i(x) and the second-order statistic E_i(x²) of each speech frame on the general background model; the specific calculation is given by the following formulas:
$$n_i=\sum_{t=1}^{T}\Pr(i\mid x_t)$$
$$E_i(x)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t$$
$$E_i(x^2)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t^2$$
wherein Pr(i|x_t) represents the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the general background model; x_t represents the features of the t-th frame of the input audio x; T represents the total number of frames of the input audio; i represents the index of the i-th single Gaussian in the general background model;
step 103-2, adaptively adjusting the parameters of the general background model using the posterior probability, the first-order statistic and the second-order statistic computed in step 103-1, so as to obtain the weight $\hat{w}_i$, the mean $\hat{\mu}_i$ and the covariance $\hat{\sigma}_i^2$ of the specific audio model; the formulas of the adaptive adjustment are as follows:
$$\hat{w}_i=\left[\alpha_i^{w}\,n_i/T+(1-\alpha_i^{w})\,w_i\right]\gamma$$
$$\hat{\mu}_i=\alpha_i^{m}\,E_i(x)+(1-\alpha_i^{m})\,\mu_i$$
$$\hat{\sigma}_i^2=\alpha_i^{v}\,E_i(x^2)+(1-\alpha_i^{v})\left(\sigma_i^2+\mu_i^2\right)-\hat{\mu}_i^2$$
wherein α_i^v, α_i^m and α_i^w are the variance, mean and weight adaptation coefficients, respectively; T represents the total number of frames of the specific audio training data, and γ is a normalization parameter ensuring that the adapted weights sum to 1, i.e. $\sum_i \hat{w}_i = 1$; w_i represents the weight of the i-th Gaussian of the general background model; μ_i represents the mean of the i-th Gaussian of the general background model; σ_i² represents the covariance of the i-th Gaussian of the general background model; and $\hat{\mu}_i$ represents the adaptively obtained mean of the i-th Gaussian of the specific audio model.
The invention also provides a short-time specific audio detection method, which comprises the following steps:
step 201, performing feature extraction on the input test voice;
step 202, inputting the test voice features extracted in step 201 into the general background model obtained by the short-time specific audio detection model generating method, and calculating the score of the test voice on the general background model;
step 203, inputting the test voice features extracted in step 201 into the mixed Gaussian model of each type of specific audio obtained by the short-time specific audio detection model generation method, and calculating the score of the test voice on the mixed Gaussian model of each type of specific audio;
and step 204, computing the differences between the score of the test voice on the general background model obtained in step 202 and the scores of the test voice on the Gaussian mixture models of the various specific audios obtained in step 203, and comparing the differences with a threshold so as to judge which specific audio the test voice belongs to; if several model scores fall within the threshold range, the maximum rule is used for the decision, and the specific audio represented by the model with the largest score is selected as the final judgment result for the test voice.
In the above technical solution, in step 202, calculating the score of the test speech on the general background model includes: selecting the N Gaussians of the general background model with the largest posterior probabilities, computing the sum of these N probabilities, and recording the indices of these N Gaussians.
In the above technical solution, in step 203, calculating the score of the test speech on the Gaussian mixture model of each class of specific audio includes: using the N Gaussian indices of the general background model recorded in step 202, correspondingly computing the sum of the posterior probabilities of those N Gaussians in the Gaussian mixture model of the specific audio, and using this value as the score of the test voice on the Gaussian mixture model of that class of specific audio.
In the above technical solution, in step 201, the features extracted from the test speech are mel-frequency cepstral coefficients.
The invention has the advantages that:
the method of the invention not only can well overcome the problem of insufficient training data of the short-time specific audio model, but also can well inhibit the background noise to a certain extent.
Drawings
FIG. 1 is a basic block diagram of training for a generic background model in a short-time specific audio detection model generation method;
FIG. 2 is a basic block diagram of training for a specific audio model in a short-time specific audio detection model generation method;
fig. 3 is a flow chart of a short-time specific audio detection method.
Detailed Description
The embodiment of the present invention will now be described in further detail with reference to fig. 1 and 2.
The short-time specific audio detection method comprises two stages, wherein the first stage is to train a model by using training voice data, and the second stage is to detect test voice by using the trained model.
First, model training phase
Step 101, extracting features of training voice data, wherein the extracted features are mel-frequency cepstral coefficients (MFCC features), and the features comprise energy values and first-order and second-order differences;
in one embodiment, the mel-frequency cepstral coefficients are extracted with a frame length of 20 ms and a frame shift of 10 ms, and comprise an energy value together with first-order and second-order differences; the feature dimension is 60 in total;
the training speech data should include a large amount of non-audio-specific data as well as a certain amount of audio-specific data.
Step 102, training a universal background model (UBM model) by using the characteristics of the training voice data obtained in step 101, namely the Mel cepstrum coefficient;
referring to the training schematic diagram of the general background model given in fig. 1, the general background model is as shown in formula (1):
$$p(x\mid\lambda)=\sum_{i=1}^{M}w_i\,p_i(x)\qquad(1)$$
In formula (1), w_i represents the weight of the i-th Gaussian, ranging from 0 to 1 and satisfying the normalization condition $\sum_{i=1}^{M}w_i=1$; x represents a frame feature of the training speech segment; λ represents the set of all parameters in the Gaussian mixture model; M represents the total number of Gaussians in the Gaussian mixture model.
In formula (1), p_i(x) represents the probability density function of each single Gaussian model, whose specific expression is given by formula (2):
$$p_i(x)=\frac{1}{(2\pi)^{D/2}\,\lvert\Sigma_i\rvert^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_i)'\,\Sigma_i^{-1}\,(x-\mu_i)\right\}\qquad(2)$$
wherein p_i(x) is characterized by several parameters: D represents the dimension of the frame feature of the training speech segment, determined by the feature dimension chosen in the feature extraction process; Σ_i represents the covariance matrix of the i-th Gaussian; μ_i represents the mean vector of the i-th Gaussian.
The above is the specific representation of the general background model. The Gaussian mixture model uses a linear weighted sum of several single Gaussians to fit the probability density function of the features of speech produced by a general speaker. The general Gaussian mixture model can therefore represent the distribution of speakers' sounds, and hence their acoustic characteristics, well.
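To make formulas (1) and (2) concrete, the sketch below evaluates the mixture density and the per-Gaussian posteriors p(k|x_t, λ) frame by frame. It assumes diagonal covariance matrices (a common choice for universal background models, though the patent does not state it) and uses numpy; all names are illustrative.

```python
import numpy as np

def gmm_posteriors(X, w, mu, var):
    """X: (T, D) frame features; w: (M,) weights; mu: (M, D) means; var: (M, D) diagonal variances.
    Returns the per-frame likelihood p(x_t | lambda) of formula (1) and the posteriors p(k | x_t, lambda)."""
    T, D = X.shape
    # log of formula (2) for every (frame, Gaussian) pair, with diagonal covariance
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(var), axis=1))          # (M,)
    diff = X[:, None, :] - mu[None, :, :]                                            # (T, M, D)
    log_pi = log_norm[None, :] - 0.5 * np.sum(diff ** 2 / var[None, :, :], axis=2)   # (T, M)
    log_weighted = np.log(w)[None, :] + log_pi                                       # log of w_i * p_i(x_t)
    log_px = np.logaddexp.reduce(log_weighted, axis=1)                               # log of formula (1)
    posteriors = np.exp(log_weighted - log_px[:, None])                              # p(k | x_t, lambda)
    return np.exp(log_px), posteriors
```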
On this basis, training the general background model with the features of the training voice data means estimating its parameters with the expectation-maximization method.
After parameter estimation, the general background model is obtained. The model is in essence a Gaussian mixture model, and its parameters are of three types: the Gaussian weights w, the Gaussian variances σ² and the Gaussian means μ, where w is the set of the per-Gaussian weights w_i, σ² is the set of the per-Gaussian variances σ_i², and μ is the set of the per-Gaussian means μ_i; i denotes the index of each single Gaussian model. The three sets of parameters obtained are determined uniquely by the training data.
The specific parameter estimation process is as follows:
Step 102-1, updating the k-th Gaussian weight w_k:
the update of the k-th Gaussian weight w_k is given by formula (3):
$$w_k=\frac{1}{T}\sum_{t=1}^{T}p(k\mid x_t,\lambda)\qquad(3)$$
wherein x_t is the t-th frame feature vector of the input training speech x, a known vector computed in the feature extraction process; λ has the same meaning as in formula (1), a general term for all parameters of the Gaussian mixture model, which are given initial values during initialization at the start of training and are therefore known; T represents the total number of frames of all input training speech, a known value; k represents the index of the k-th single Gaussian model in the Gaussian mixture model; p(k|x_t, λ) represents the posterior probability of the input training speech frame x_t on the k-th Gaussian of the general background model, computed from the input frame x_t and the Gaussian mixture model parameters λ.
Step 102-2, updating the k-th Gaussian mean μ_k:
the update of the k-th Gaussian mean μ_k is given by formula (4):
$$\mu_k=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}\qquad(4)$$
wherein each parameter has the same meaning as in formula (3); T, x_t and λ are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ.
Step 102-3, updating the k-th Gaussian variance σ_k²:
the update of the k-th Gaussian variance σ_k² is given by formula (5):
$$\sigma_k^2=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t^2}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}-\mu_k^2\qquad(5)$$
wherein each parameter has the same meaning as in formulas (3) and (4); T, x_t, λ and μ_k are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ.
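The three re-estimation formulas (3), (4) and (5) can be applied together in one expectation-maximization pass. The sketch below is an illustrative implementation only, reusing the gmm_posteriors helper from the earlier sketch and keeping the diagonal-covariance assumption; the variance floor is a practical safeguard not mentioned in the patent.

```python
import numpy as np

def em_iteration(X, w, mu, var, var_floor=1e-3):
    """One EM update of the UBM parameters following formulas (3)-(5).
    X: (T, D); w: (M,); mu: (M, D); var: (M, D) diagonal variances."""
    T = X.shape[0]
    _, post = gmm_posteriors(X, w, mu, var)                      # p(k | x_t, lambda), shape (T, M)
    nk = post.sum(axis=0)                                        # sum over t of p(k | x_t, lambda)
    w_new = nk / T                                               # formula (3)
    mu_new = (post.T @ X) / nk[:, None]                          # formula (4)
    var_new = (post.T @ (X ** 2)) / nk[:, None] - mu_new ** 2    # formula (5)
    var_new = np.maximum(var_new, var_floor)                     # keep variances positive
    return w_new, mu_new, var_new
```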
Step 103, in order to obtain a model for each class of specific audio, a portion of speech of that class of specific audio is first needed as model training speech. If specific audio data of that class is hard to obtain, the specific audio data of that class used when training the general background model can be reused; if new specific audio data of that class is available, the new data is used as the training data. Regardless of how much training data there is, each class of specific audio obtains a corresponding specific audio model.
In this step, as shown in fig. 2, a small amount of specific audio training data of a certain class and a Bayesian adaptation algorithm are used to obtain the specific audio model of that class from the general background model in an adaptive manner; the specific adaptation process is as follows:
Step 103-1, first calculating the posterior probability, first-order statistic and second-order statistic of each speech frame on the general background model from the feature vectors of the specific audio training data; the specific calculation is given by formulas (6), (7) and (8):
$$n_i=\sum_{t=1}^{T}\Pr(i\mid x_t)\qquad(6)$$
$$E_i(x)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t\qquad(7)$$
$$E_i(x^2)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t^2\qquad(8)$$
wherein Pr(i|x_t) represents the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the general background model; x_t represents the features of the t-th frame of the input audio x; T represents the total number of frames of the input audio; i represents the index of the i-th single Gaussian in the general background model.
Since the specific audio data used for adaptation is different when each specific audio model is trained, the posterior probabilities and the first-order and second-order statistics computed for training each specific audio model are also different.
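Formulas (6), (7) and (8) are the zeroth-, first- and second-order statistics of the adaptation data on the general background model. A brief sketch under the same assumptions as the earlier code, again reusing gmm_posteriors:

```python
import numpy as np

def adaptation_statistics(X, w, mu, var):
    """Compute n_i, E_i(x), E_i(x^2) of formulas (6)-(8) for adaptation data X of shape (T, D)."""
    _, post = gmm_posteriors(X, w, mu, var)      # Pr(i | x_t), shape (T, M)
    n = post.sum(axis=0)                         # formula (6), shape (M,)
    Ex = (post.T @ X) / n[:, None]               # formula (7), shape (M, D)
    Ex2 = (post.T @ (X ** 2)) / n[:, None]       # formula (8), shape (M, D)
    return n, Ex, Ex2
```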
Step 103-2, adaptively adjusting the parameters of the general background model using the posterior probability, the first-order statistic and the second-order statistic computed in step 103-1, to obtain the weight $\hat{w}_i$, the mean $\hat{\mu}_i$ and the covariance $\hat{\sigma}_i^2$ of the specific audio model. Because the specific audio model is also in essence a Gaussian mixture model, once the weights $\hat{w}_i$, means $\hat{\mu}_i$ and covariances $\hat{\sigma}_i^2$ of the specific audio model are determined, the Gaussian mixture model of that specific audio is fully characterized.
The specific adaptation formulas are given by (9), (10) and (11):
$$\hat{w}_i=\left[\alpha_i^{w}\,n_i/T+(1-\alpha_i^{w})\,w_i\right]\gamma\qquad(9)$$
$$\hat{\mu}_i=\alpha_i^{m}\,E_i(x)+(1-\alpha_i^{m})\,\mu_i\qquad(10)$$
$$\hat{\sigma}_i^2=\alpha_i^{v}\,E_i(x^2)+(1-\alpha_i^{v})\left(\sigma_i^2+\mu_i^2\right)-\hat{\mu}_i^2\qquad(11)$$
wherein α_i^v, α_i^m and α_i^w are the variance, mean and weight adaptation coefficients, respectively; n_i, E_i(x) and E_i(x²) are the posterior probability, the first-order statistic and the second-order statistic of the specific audio training data computed with formulas (6), (7) and (8); in formula (9), T represents the total number of frames of the specific audio training data, and γ is a normalization parameter ensuring that the adapted weights sum to 1, i.e. $\sum_i \hat{w}_i = 1$; w_i represents the weight of the i-th Gaussian of the general background model; in formula (10), μ_i represents the mean of the i-th Gaussian of the general background model; in formula (11), σ_i² represents the covariance of the i-th Gaussian of the general background model, μ_i represents the mean of the i-th Gaussian of the general background model, and $\hat{\mu}_i$ represents the adaptively obtained mean of the i-th Gaussian of the specific audio model.
After the above calculation process, the class-specific audio model is obtained.
As noted in step 103-1, since the adaptation data of each specific audio model is different, the computed posterior probabilities and first-order and second-order statistics are different, and therefore the specific audio models finally obtained after the calculation of step 103-2 are also different.
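The adaptation formulas (9), (10) and (11) mix these statistics with the parameters of the general background model. The sketch below is illustrative: it assumes the common data-dependent coefficient α_i = n_i / (n_i + r) with a relevance factor r and uses the same coefficient for the weight, mean and variance updates, details the patent leaves unspecified.

```python
import numpy as np

def map_adapt(w, mu, var, n, Ex, Ex2, T, relevance=16.0):
    """Adapt a specific audio model from the UBM following formulas (9)-(11).
    w, mu, var: UBM parameters; n, Ex, Ex2: statistics from formulas (6)-(8); T: adaptation frame count."""
    alpha = n / (n + relevance)                                   # assumed adaptation coefficient
    w_hat = alpha * n / T + (1.0 - alpha) * w                     # formula (9), before normalization
    w_hat = w_hat / w_hat.sum()                                   # gamma: make the weights sum to 1
    mu_hat = alpha[:, None] * Ex + (1.0 - alpha[:, None]) * mu    # formula (10)
    var_hat = (alpha[:, None] * Ex2
               + (1.0 - alpha[:, None]) * (var + mu ** 2)
               - mu_hat ** 2)                                     # formula (11)
    return w_hat, mu_hat, var_hat
```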
Second, testing stage
Referring to fig. 3, the testing phase includes the following steps:
step 201, performing feature extraction on the input test voice;
the features extracted in this step are of the same type as those extracted in step 101, and are all mel-frequency cepstrum coefficients, for example;
step 202, inputting the test voice features extracted in step 201 into the general background model trained in step 102, and calculating the score of the test voice on the general background model.
As can be seen from the foregoing description, the general background model is essentially a Gaussian mixture model, and the score of the test speech on the general background model is the sum of the Gaussian posterior probabilities. As a preferred implementation, in order to speed up the score calculation, the posterior probabilities of all Gaussians are not computed; instead, the N Gaussians with the largest posterior probabilities are selected, the sum of these N probabilities is computed, and the indices of these N Gaussians are recorded.
Step 203, inputting the test voice features extracted in step 201 into the gaussian mixture models of the respective specific audio obtained in step 103, and calculating scores of the test voice on the gaussian mixture models of each specific audio, wherein if there are M specific audio models, the total number of the finally obtained scores is M.
The specific way of calculating the score of the test speech on the Gaussian mixture model of each specific audio is still to compute the sum of the posterior probabilities of the test speech on each Gaussian of that specific audio model. In order to speed up the calculation, in a preferred implementation the sum of the posterior probabilities of the corresponding N Gaussians in the Gaussian mixture model of the specific audio is computed using the N Gaussian indices of the general background model recorded in step 202, and this value is used as the score of the test speech on that Gaussian mixture model, as in the sketch below.
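A sketch of this fast scoring scheme (illustrative only; the choice of N and the model container are assumptions): the N Gaussians with the largest posterior probabilities are picked per frame on the general background model, their posterior sum accumulated over frames is the background score, and the same Gaussian indices are reused to score every specific audio model. It relies on the gmm_posteriors helper defined earlier.

```python
import numpy as np

def topn_scores(X, ubm, specific_models, N=5):
    """Score test features X (T, D) on the UBM and on each specific audio model
    using the top-N UBM Gaussians per frame (steps 202-203).
    ubm and each value of specific_models are (w, mu, var) tuples."""
    w, mu, var = ubm
    _, post = gmm_posteriors(X, w, mu, var)                  # UBM posteriors, shape (T, M)
    top_idx = np.argsort(post, axis=1)[:, -N:]               # indices of the N largest posteriors per frame
    ubm_score = np.take_along_axis(post, top_idx, axis=1).sum()
    model_scores = {}
    for name, (ws, mus, vs) in specific_models.items():
        _, post_s = gmm_posteriors(X, ws, mus, vs)           # posteriors on the specific audio model
        model_scores[name] = np.take_along_axis(post_s, top_idx, axis=1).sum()
    return ubm_score, model_scores
```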
Step 204, the differences between the score of the test voice on the general background model obtained in step 202 and the scores of the test voice on the Gaussian mixture models of the respective specific audios obtained in step 203 are computed and compared with a threshold so as to judge which specific audio the test voice belongs to. If several model scores fall within the threshold range, the maximum rule is used: the model scores within the threshold range are compared, and the specific audio represented by the model with the largest score is selected as the final judgment result for the test voice.
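The decision rule of step 204 can then be written as the following sketch; treating the threshold as a lower bound on the score difference is an assumption, since the patent only speaks of scores falling within a threshold range.

```python
def decide(ubm_score, model_scores, threshold):
    """Step 204: compare each (specific-model score minus UBM score) against a threshold;
    among the classes that pass, pick the one with the largest difference (maximum rule)."""
    diffs = {name: s - ubm_score for name, s in model_scores.items()}
    candidates = {name: d for name, d in diffs.items() if d > threshold}
    if not candidates:
        return None                               # no specific audio detected
    return max(candidates, key=candidates.get)    # class with the largest score difference
```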
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A method for generating a short-time specific audio detection model, comprising:
step 101, extracting features of training voice data; wherein the training voice data comprises non-specific audio data and specific audio data;
step 102, training a general background model by using the characteristics of the training voice data obtained in the step 101; the general background model is a Gaussian mixture model, and the expression of the general background model is as follows:
$$p(x\mid\lambda)=\sum_{i=1}^{M}w_i\,p_i(x);$$
w_i represents the weight of the i-th Gaussian, ranging from 0 to 1 and satisfying the normalization condition $\sum_{i=1}^{M}w_i=1$; x represents a frame feature of the training speech segment; λ represents the set of all parameters in the Gaussian mixture model; p_i(x) represents the probability density function of each single Gaussian model, whose expression is as follows:
$$p_i(x)=\frac{1}{(2\pi)^{D/2}\,\lvert\Sigma_i\rvert^{1/2}}\exp\left\{-\frac{1}{2}(x-\mu_i)'\,\Sigma_i^{-1}\,(x-\mu_i)\right\};$$
D represents the dimension of the frame feature of the training speech segment; Σ_i represents the covariance matrix of the i-th Gaussian; μ_i represents the mean vector of the i-th Gaussian;
step 103, adaptively obtaining a model of a certain class of specific audio data according to the general background model obtained in step 102 and the features of that class of specific audio data in the training voice data; this operation is repeated until the models of all classes of specific audio data in the training voice data are obtained.
2. The method of generating a short-time specific audio detection model according to claim 1, wherein the features extracted from the training speech data in step 101 are mel-frequency cepstral coefficients.
3. The method of claim 1, wherein the training of the general background model in step 102 comprises performing parameter estimation on the general background model by the expectation-maximization method, and the parameters to be estimated are of three types: the Gaussian weights w, the Gaussian variances σ² and the Gaussian means μ, where w is the set of the per-Gaussian weights w_i, σ² is the set of the per-Gaussian variances σ_i², and μ is the set of the per-Gaussian means μ_i; i denotes the index of each single Gaussian model; the training specifically comprises the following steps:
step 102-1, updating the k-th Gaussian weight w_k:
the update of the k-th Gaussian weight w_k is given by the following equation:
$$w_k=\frac{1}{T}\sum_{t=1}^{T}p(k\mid x_t,\lambda)$$
wherein x_t is the t-th frame feature vector of the input training speech x, a known vector computed in the feature extraction process; λ is a general term for all parameters of the Gaussian mixture model, which are given initial values during initialization at the start of training and are therefore known; T represents the total number of frames of all input training speech, a known value; k represents the index of the k-th single Gaussian model in the Gaussian mixture model; p(k|x_t, λ) represents the posterior probability of the input training speech frame x_t on the k-th Gaussian of the general background model, computed from the input frame x_t and the Gaussian mixture model parameters λ;
step 102-2, updating the k-th Gaussian mean μ_k:
the update of the k-th Gaussian mean μ_k is given by the following equation:
$$\mu_k=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}$$
wherein T, x_t and λ are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ;
step 102-3, updating the k-th Gaussian variance σ_k²:
the update of the k-th Gaussian variance σ_k² is given by the following equation:
$$\sigma_k^2=\frac{\sum_{t=1}^{T}p(k\mid x_t,\lambda)\,x_t^2}{\sum_{t=1}^{T}p(k\mid x_t,\lambda)}-\mu_k^2$$
wherein T, x_t, λ and μ_k are known variables, and p(k|x_t, λ) is computed from the input frame x_t and the Gaussian mixture model parameters λ.
4. The method for generating a short-time specific audio detection model according to claim 1, wherein adaptively obtaining a model of a specific type of audio data in step 103 according to the generic background model obtained in step 102 comprises:
step 103-1, first computing, from the feature vectors of the specific audio training data, the posterior probability n_i, the first-order statistic E_i(x) and the second-order statistic E_i(x²) of each speech frame on the general background model; the specific calculation is given by the following formulas:
$$n_i=\sum_{t=1}^{T}\Pr(i\mid x_t)$$
$$E_i(x)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t$$
$$E_i(x^2)=\frac{1}{n_i}\sum_{t=1}^{T}\Pr(i\mid x_t)\,x_t^2$$
wherein Pr(i|x_t) represents the posterior probability of the t-th frame of the input audio x on the i-th Gaussian of the general background model; x_t represents the features of the t-th frame of the input audio x; T represents the total number of frames of the input audio; i represents the index of the i-th single Gaussian in the general background model;
step 103-2, adaptively adjusting the parameters of the general background model using the posterior probability, the first-order statistic and the second-order statistic computed in step 103-1, so as to obtain the weight $\hat{w}_i$, the mean $\hat{\mu}_i$ and the covariance $\hat{\sigma}_i^2$ of the specific audio model; the formulas of the adaptive adjustment are as follows:
$$\hat{w}_i=\left[\alpha_i^{w}\,n_i/T+(1-\alpha_i^{w})\,w_i\right]\gamma$$
$$\hat{\mu}_i=\alpha_i^{m}\,E_i(x)+(1-\alpha_i^{m})\,\mu_i$$
$$\hat{\sigma}_i^2=\alpha_i^{v}\,E_i(x^2)+(1-\alpha_i^{v})\left(\sigma_i^2+\mu_i^2\right)-\hat{\mu}_i^2$$
wherein α_i^v, α_i^m and α_i^w are the variance, mean and weight adaptation coefficients, respectively; T represents the total number of frames of the specific audio training data, and γ is a normalization parameter ensuring that the adapted weights sum to 1, i.e. $\sum_i \hat{w}_i = 1$; w_i represents the weight of the i-th Gaussian of the general background model; μ_i represents the mean of the i-th Gaussian of the general background model; σ_i² represents the covariance of the i-th Gaussian of the general background model; and $\hat{\mu}_i$ represents the adaptively obtained mean of the i-th Gaussian of the specific audio model.
5. A method of short-time specific audio detection, comprising:
step 201, performing feature extraction on the input test voice;
step 202, inputting the test voice features extracted in step 201 into the general background model obtained by the short-time specific audio detection model generation method according to any one of claims 1 to 4, and calculating the score of the test voice on the general background model;
step 203, inputting the test voice features extracted in step 201 into the mixed gaussian model of each type of specific audio obtained by the short-time specific audio detection model generation method according to any one of claims 1 to 4, and calculating the score of the test voice on the mixed gaussian model of each type of specific audio;
and step 204, computing the differences between the score of the test voice on the general background model obtained in step 202 and the scores of the test voice on the Gaussian mixture models of the various specific audios obtained in step 203, and comparing the differences with a threshold so as to judge which specific audio the test voice belongs to; if several model scores fall within the threshold range, the maximum rule is used for the decision, and the specific audio represented by the model with the largest score is selected as the final judgment result for the test voice.
6. The short-time specific audio detection method according to claim 5, wherein calculating the score of the test speech on the general background model in step 202 comprises: selecting the N Gaussians of the general background model with the largest posterior probabilities, computing the sum of these N probabilities, and recording the indices of these N Gaussians.
7. The short-time specific audio detection method according to claim 6, wherein calculating the score of the test speech on the Gaussian mixture model of each class of specific audio in step 203 comprises: using the N Gaussian indices of the general background model recorded in step 202, correspondingly computing the sum of the posterior probabilities of those N Gaussians in the Gaussian mixture model of the specific audio, and using this value as the score of the test voice on the Gaussian mixture model of that class of specific audio.
8. The short-time specific audio detection method according to claim 5, wherein in step 201 the features extracted from the test speech are mel-frequency cepstral coefficients.
CN201510236568.8A 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method Expired - Fee Related CN104992708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510236568.8A CN104992708B (en) 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510236568.8A CN104992708B (en) 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method

Publications (2)

Publication Number Publication Date
CN104992708A true CN104992708A (en) 2015-10-21
CN104992708B CN104992708B (en) 2018-07-24

Family

ID=54304511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510236568.8A Expired - Fee Related CN104992708B (en) 2015-05-11 2015-05-11 Short-time specific audio detection model generation and detection method

Country Status (1)

Country Link
CN (1) CN104992708B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106251861A (en) * 2016-08-05 2016-12-21 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110135492A (en) * 2019-05-13 2019-08-16 山东大学 Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models
CN113888777A (en) * 2021-09-08 2022-01-04 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning
CN115331689A (en) * 2022-08-11 2022-11-11 北京声智科技有限公司 Training method, device, equipment, storage medium and product of voice noise reduction model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509546A (en) * 2011-11-11 2012-06-20 北京声迅电子股份有限公司 Noise reduction and abnormal sound detection method applied to rail transit
CN102623009A (en) * 2012-03-02 2012-08-01 安徽科大讯飞信息技术股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
CN103198605A (en) * 2013-03-11 2013-07-10 成都百威讯科技有限责任公司 Indoor emergent abnormal event alarm system
CN103226951A (en) * 2013-04-19 2013-07-31 清华大学 Speaker verification system creation method based on model sequence adaptive technique
CN103366738A (en) * 2012-04-01 2013-10-23 佳能株式会社 Methods and devices for generating sound classifier and detecting abnormal sound, and monitoring system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509546A (en) * 2011-11-11 2012-06-20 北京声迅电子股份有限公司 Noise reduction and abnormal sound detection method applied to rail transit
CN102623009A (en) * 2012-03-02 2012-08-01 安徽科大讯飞信息技术股份有限公司 Abnormal emotion automatic detection and extraction method and system on basis of short-time analysis
CN103366738A (en) * 2012-04-01 2013-10-23 佳能株式会社 Methods and devices for generating sound classifier and detecting abnormal sound, and monitoring system
CN103198605A (en) * 2013-03-11 2013-07-10 成都百威讯科技有限责任公司 Indoor emergent abnormal event alarm system
CN103226951A (en) * 2013-04-19 2013-07-31 清华大学 Speaker verification system creation method based on model sequence adaptive technique

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗森林; 王坤; 谢尔曼; 潘丽敏; 李金玉: "High-precision recognition method for specific audio events fusing GMM and SVM", Journal of Beijing Institute of Technology *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106251861A (en) * 2016-08-05 2016-12-21 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN106251861B (en) * 2016-08-05 2019-04-23 重庆大学 A kind of abnormal sound in public places detection method based on scene modeling
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium
WO2018166187A1 (en) * 2017-03-13 2018-09-20 平安科技(深圳)有限公司 Server, identity verification method and system, and a computer-readable storage medium
CN108305616A (en) * 2018-01-16 2018-07-20 国家计算机网络与信息安全管理中心 A kind of audio scene recognition method and device based on long feature extraction in short-term
CN110135492A (en) * 2019-05-13 2019-08-16 山东大学 Equipment fault diagnosis and method for detecting abnormality and system based on more Gauss models
CN113888777A (en) * 2021-09-08 2022-01-04 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning
CN113888777B (en) * 2021-09-08 2023-08-18 南京金盾公共安全技术研究院有限公司 Voiceprint unlocking method and device based on cloud machine learning
CN115331689A (en) * 2022-08-11 2022-11-11 北京声智科技有限公司 Training method, device, equipment, storage medium and product of voice noise reduction model

Also Published As

Publication number Publication date
CN104992708B (en) 2018-07-24

Similar Documents

Publication Publication Date Title
CN104992708B (en) Short-time specific audio detection model generation and detection method
CN104835498B (en) Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter
Xu et al. Deep sparse rectifier neural networks for speech denoising
US9754608B2 (en) Noise estimation apparatus, noise estimation method, noise estimation program, and recording medium
CN104900235A (en) Voiceprint recognition method based on pitch period mixed characteristic parameters
CN107452389A (en) A kind of general monophonic real-time noise-reducing method
CN106941008A (en) It is a kind of that blind checking method is distorted based on Jing Yin section of heterologous audio splicing
Ghalehjegh et al. Deep bottleneck features for i-vector based text-independent speaker verification
KR100682909B1 (en) Method and apparatus for recognizing speech
Smolenski et al. Usable speech processing: A filterless approach in the presence of interference
Fauve et al. Influence of task duration in text-independent speaker verification.
WO2012105385A1 (en) Sound segment classification device, sound segment classification method, and sound segment classification program
Hong et al. Modified-prior PLDA and score calibration for duration mismatch compensation in speaker recognition system.
Yamamoto et al. Denoising autoencoder-based speaker feature restoration for utterances of short duration.
KR100784456B1 (en) Voice Enhancement System using GMM
Wang et al. F0 estimation in noisy speech based on long-term harmonic feature analysis combined with neural network classification
Soni et al. Effectiveness of ideal ratio mask for non-intrusive quality assessment of noise suppressed speech
Shokri et al. A robust keyword spotting system for Persian conversational telephone speech using feature and score normalization and ARMA filter
Nicolson et al. Bidirectional Long-Short Term Memory Network-based Estimation of Reliable Spectral Component Locations.
Islam et al. Neural-Response-Based Text-Dependent speaker identification under noisy conditions
Wang et al. An ideal Wiener filter correction-based cIRM speech enhancement method using deep neural networks with skip connections
Chen et al. Truth-to-estimate ratio mask: A post-processing method for speech enhancement direct at low signal-to-noise ratios
Preti et al. Confidence measure based unsupervised target model adaptation for speaker verification.
JPH04296799A (en) Voice recognition device
Kenny et al. Development of the primary CRIM system for the NIST 2008 speaker recognition evaluation.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180724