CN108986788A - Noise-robust acoustic modeling method based on posterior knowledge supervision - Google Patents
- Publication number
- CN108986788A (application CN201810576451.8A / CN201810576451A)
- Authority
- CN
- China
- Prior art keywords
- model
- training
- supervision
- feature
- teacher
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0638—Interactive procedures
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Probability & Statistics with Applications (AREA)
- Quality & Reliability (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a noise-robust acoustic modeling method based on posterior knowledge supervision, belonging to the field of voice human-computer interaction. The method comprises: obtaining the posterior probability distribution of clean speech by training a teacher model; supervising the training of a student model with the posterior probability distribution of the clean speech as the reference, so that the student model approaches the posterior probability distribution of the teacher model as closely as possible; wherein the teacher model is a model trained on clean speech, and the student model is a model trained on noisy speech. An acoustic model built with the exemplary modeling method of the invention has stronger environmental robustness and shows superior anti-noise performance.
Description
Technical field
The invention belongs to the technical field of voice human-computer interaction, and specifically relates to a noise-robust acoustic modeling method based on posterior knowledge supervision.
Background art
In recent years, with the continuous development of technologies such as speech recognition, natural language processing and deep learning, and with the deepening of market demand, the research, development and application of voice interaction products have increasingly become a new hot spot. On the other hand, owing to the complexity of practical application scenarios, voice interaction systems typically operate in low signal-to-noise-ratio environments. Because of their insufficient resistance to noise interference, such systems often suffer from low speech recognition accuracy or confused human-computer interaction, giving the served user a poor interactive experience and greatly limiting the market application and promotion of voice interaction products.
Related studies show that whether the speech acoustic model can extract complete phoneme information from noisy speech is the key to the noise robustness of a voice interaction system. The shortcomings of acoustic models in noise robustness are mainly caused by the mismatch between training data and test data introduced by environmental noise during model construction; that is, the purpose of improving noise robustness is to reduce or eliminate the influence of such factors to the greatest extent. To date, many scholars in the field of speech recognition have carried out extensive research on the noise robustness of acoustic models and proposed a variety of improvement strategies, among which four kinds of methods work best in application: feature compensation, model compensation, robust feature extraction and speech enhancement.
Feature compensation and model compensation are noise robustness methods that optimize the acoustic model through adaptive algorithms. For example, Leggetter et al. use the maximum likelihood linear regression (MLLR) algorithm for model adaptation; Tran et al. apply adaptive processing, through a linear decomposition network, to the input data of acoustic model training based on deep neural networks (DNN), enabling the acoustic model to better match the data structure of noisy speech and improving model robustness.
Robust feature extraction refers to extracting noise-insensitive characteristic parameters from the corpus and constructing feature sequences with strong anti-noise capability, so as to improve the noise robustness of the acoustic model. Cepstral mean normalization (CMN) and mean-variance normalization (MVN) are the two most common robust feature extraction methods. In addition, some scholars combine perceptual linear prediction (PLP) features with relative spectral (RASTA) filtering to strengthen the robustness of the acoustic model against additive noise and linear filtering. Liu Changzheng et al. adopt a supervised-learning approach that uses MFCC features as the input of a CNN network to extract higher-level speech features; experiments show that these features have good temporal invariance in noisy environments.
For speech enhancement, the most common approach at present is spectral subtraction combined with noise updating: under the assumption that the noise information is known, the noise spectrum of the corpus is estimated and the estimated noise spectrum is subtracted from the noisy speech spectrum to obtain a clean spectrum of the corpus, from which the clean features of the noisy speech are extracted for acoustic model training. Furthermore, Xu et al. propose combining spectral subtraction with a DNN network: the spectral-subtraction-processed features and the noise estimation parameters are fed into the DNN network as basic samples, and the deep acoustic model obtained through noise-independent training achieves better anti-noise performance than spectral subtraction alone.
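For orientation only (this sketch is not part of the patent; the variable names, the use of the leading frames as a noise estimate and the spectral floor value are all assumptions), the basic magnitude-domain spectral subtraction idea described above can be written as:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from a noisy magnitude
    spectrogram, with a small spectral floor to keep the result non-negative."""
    clean_mag = noisy_mag - noise_mag[None, :]
    return np.maximum(clean_mag, floor * noisy_mag)

# Example: estimate the noise spectrum from the first 10 frames,
# assumed here to contain noise only.
rng = np.random.default_rng(0)
noisy_mag = np.abs(rng.normal(size=(200, 257)))   # (frames, frequency bins)
noise_mag = noisy_mag[:10].mean(axis=0)
clean_mag = spectral_subtraction(noisy_mag, noise_mag)
```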
Although the above four kinds of methods can effectively improve the environmental robustness of acoustic models, two problems remain in theory and in application. First, the above methods merely supervise the noise reduction of noisy speech with clean speech, or fit noisy speech to clean speech to reduce the difference between the two; they neither mine the tacit knowledge of clean speech sufficiently nor refine its information adequately. Second, in the above four classes of methods the acoustic feature extraction module and the subsequent training and recognition process are independent of each other, and the inner link between model building and feature representation is not considered, so that the objective of model training deviates from the overall performance indicator of the system; moreover, the extracted speech features contain partially redundant information which usually has no noise robustness, so that the whole acoustic network often fails to achieve optimal performance.
Therefore, how to improve the noise robustness of voice interaction systems is an urgent problem at this stage.
Summary of the invention
In view of the above problems in the prior art, the purpose of the present invention is to provide a noise-robust acoustic modeling method based on posterior knowledge supervision, which can improve the noise robustness of acoustic models.
The technical scheme adopted by the invention is as follows:
Provided is a noise-robust acoustic modeling method based on posterior knowledge supervision, comprising:
obtaining the posterior probability distribution of clean speech by training a teacher model;
supervising the training of a student model with the posterior probability distribution of the clean speech as the reference, so that the student model approaches the posterior probability distribution of the teacher model as closely as possible;
wherein the teacher model is a model trained on clean speech, and the student model is a model trained on noisy speech.
Further, the training of the teacher model comprises:
extracting features X_t from the clean speech;
performing frame-by-frame forced alignment on the windowed features X_t and obtaining the hard label of each frame of speech data; the said windowing, i.e. framing and window application, usually divides the speech data into frames according to preset parameters, and applying a window facilitates subsequent feature alignment;
marking, on the basis of the forced alignment, the start and end points of each hard label in the time dimension;
feeding the start-end point annotation information and the hard label data, as supervision information, into a DNN module for acoustic model training.
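As an illustrative sketch of this framing, windowing and feature-extraction step (the patent does not fix a particular front end; the 25 ms frame length, 10 ms shift, Hann window and 13-dimensional MFCCs below are assumptions), the features X_t could be computed with librosa as:

```python
import librosa

# Load a clean training utterance (the path is a placeholder).
y, sr = librosa.load("clean_utterance.wav", sr=16000)

# Framing and windowing: 25 ms frames (400 samples), 10 ms shift (160 samples),
# Hann window, followed by 13-dimensional MFCC extraction per frame.
feats_xt = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160, window="hann"
).T  # shape: (num_frames, 13)
```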
Further, the frame-by-frame forced alignment of the windowed features is carried out by a GMM-HMM module.
Further, the acoustic model training comprises:
taking the features X_t as the model input and the hard phoneme label data as the supervision information, and obtaining the triphone posterior probability distribution of the data frame by frame using the forward algorithm.
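A minimal sketch of how the frame-level hard labels used as supervision can be represented (the state inventory size and the alignment values are purely illustrative):

```python
import numpy as np

# Tri-phone state index of each frame, as produced by forced alignment.
num_states = 6
alignment = np.array([2, 2, 2, 5, 5, 1])

# Hard labels: one 0-1 vector per frame with a single 1 at the aligned state.
hard_labels = np.eye(num_states)[alignment]   # shape: (num_frames, num_states)
```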
Further, the training of the student model comprises:
extracting preliminary features X_s from the noisy speech;
performing parallel alignment between the extracted phoneme features X_s and the soft labels of the teacher model, so as to obtain the soft labels of the student model;
extracting high-level features on the basis of the preliminarily extracted acoustic features and reducing the dimensionality of the high-level features, so as to extract feature sequences capable of characterizing the invariance of the noisy speech;
inputting the high-level features into the DNN module for acoustic model training.
Further, the high-level features are extracted by the locally connected and down-sampling (pooling) modules of a CNN network.
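The patent does not specify concrete layer sizes or kernel shapes; purely as a sketch, a CNN-DNN student of the kind described above (local connections and pooling for high-level feature extraction and dimensionality reduction, followed by DNN layers classifying triphone states) could be written in PyTorch as:

```python
import torch.nn as nn

class StudentCnnDnn(nn.Module):
    """Illustrative CNN-DNN student; every dimension below is an assumption."""

    def __init__(self, feat_dim=40, context=11, num_states=2000):
        super().__init__()
        # CNN part: local connections + pooling extract higher-level,
        # noise-invariant features and reduce dimensionality.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # down-sampling / dimensionality reduction
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        cnn_out = 64 * (context // 4) * (feat_dim // 4)
        # DNN part: classification over the tri-phone state set.
        self.dnn = nn.Sequential(
            nn.Linear(cnn_out, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_states),
        )

    def forward(self, x):                 # x: (batch, 1, context, feat_dim)
        h = self.cnn(x)
        return self.dnn(h.flatten(1))     # logits over tri-phone states
```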
Further, the training process of the neural network module uses relative entropy minimization as the optimization criterion.
Further, the difference between the posterior probability distributions of the teacher model and the student model is quantified by relative entropy.
Further, the relative entropy of the teacher model and the student model is:

$$D_{KL}(P_t \| Q_s) = \sum_i P_t(ph_i \mid X_t)\,\log\frac{P_t(ph_i \mid X_t)}{Q_s(ph_i \mid X_s)}$$

wherein P_t is the posterior probability distribution of the teacher model, Q_s is the posterior probability distribution of the student model, i denotes the index in the triphone state set, ph_i is the i-th state in the triphone state set, X_t denotes the clean speech features used for training the teacher model, X_s denotes the noisy speech features used for training the student model, P_t(ph_i | X_t) denotes the posterior probability that feature X_t is recognized as the i-th triphone state, and Q_s(ph_i | X_s) denotes the posterior probability that feature X_s is recognized as the i-th triphone state.
Further, the posterior probability distribution relative entropy of the teacher model and the student model (with the term that depends only on the teacher model dropped) is:

$$-\sum_i P_t(ph_i \mid X_t)\,\log Q_s(ph_i \mid X_s)$$
Compared with the prior art, the invention has the following beneficial effects:
1. The exemplary noise-robust acoustic modeling method based on posterior knowledge supervision of the present invention takes the model trained on clean speech as the teacher model and the model trained on noisy speech as the student model, and supervises the training of the student model with knowledge distilled from the posterior probability distribution of the teacher model, thereby indirectly meeting the requirement of improving the environmental robustness of the acoustic model.
2. The exemplary noise-robust acoustic modeling method based on posterior knowledge supervision of the present invention uses an acoustic model training network structure combining a CNN (convolutional neural network) with a DNN (deep neural network), in which the CNN module is used to extract invariance features of the noisy speech and the DNN is used for acoustic modeling; the parameters of the whole network are adjusted and optimized jointly across the CNN and DNN modules. The constructed model was verified and compared on the CHiME dataset for speech recognition performance under different signal-to-noise ratios, and the test results show that the model has stronger environmental robustness and superior anti-noise performance.
3. In the exemplary noise-robust acoustic modeling method based on posterior knowledge supervision of the present invention, the CNN-DNN student model adds a convolutional neural network module to extract high-level speech features and can therefore capture the temporal invariance of noisy speech better than a plain DNN model. In addition, the down-sampling (pooling) layers inside the CNN reject redundant information in the speech features and achieve feature dimensionality reduction, which improves the efficiency of model training while improving the noise robustness of the acoustic model.
4. Compared with the traditional standard cross-entropy (CE) minimization criterion, the exemplary noise-robust acoustic modeling method based on posterior knowledge supervision of the present invention replaces the 0-1 vectors (hard labels) with probability vectors (soft labels); the soft labels are a deep refinement of the posterior probability distribution and contain richer useful information, which is more conducive to the modeling of a robust acoustic model.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent upon reading the detailed description of non-restrictive embodiments made with reference to the following drawings:
Fig. 1 is the flow chart of an embodiment of the present invention;
Fig. 2 is the flow chart of teacher model training in an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of the GMM-HMM module;
Fig. 4 is the flow chart of student model training in an embodiment of the present invention.
Specific embodiment
The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention and are not a limitation of the invention. It should also be noted that, for convenience of description, only the parts relevant to the invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
As shown in Fig. 1, an embodiment of the present invention provides a noise-robust acoustic modeling method based on posterior knowledge supervision, comprising:
S1: obtaining the posterior probability distribution of clean speech by training a teacher model;
S2: supervising the training of a student model with the posterior probability distribution of the clean speech as the reference, so that the student model approaches the posterior probability distribution of the teacher model as closely as possible;
wherein the teacher model is a model trained on clean speech, and the student model is a model trained on noisy speech.
The difference between the posterior probability distributions of the two models is quantified in this embodiment using the KL divergence (relative entropy). For an acoustic model, the physical meaning of the KL divergence is the average number of extra bits needed, over the same underlying speech space, to encode each phoneme feature corresponding to the probability distribution P(x) when it is encoded with the probability distribution Q(x) instead. In this embodiment, let P_t be the posterior probability distribution of the teacher model and Q_s the posterior probability distribution of the student model; Q_s is equivalent to an approximate estimate of the posterior probability distribution P_t, so the relative entropy of the two may be expressed as:

$$D_{KL}(P_t \| Q_s) = \sum_i P_t(ph_i \mid X_t)\,\log\frac{P_t(ph_i \mid X_t)}{Q_s(ph_i \mid X_s)} \qquad (1)$$
wherein i denotes the index in the triphone state set, ph_i is the i-th state in the triphone state set, X_t denotes the clean speech features used for training the teacher model, X_s denotes the noisy speech features used for training the student model, P_t(ph_i | X_t) denotes the posterior probability that feature X_t is recognized as the i-th triphone state, and Q_s(ph_i | X_s) denotes the posterior probability that feature X_s is recognized as the i-th triphone state. By rearrangement the formula can be simplified to the following form:

$$D_{KL}(P_t \| Q_s) = \sum_i P_t(ph_i \mid X_t)\,\log P_t(ph_i \mid X_t) - \sum_i P_t(ph_i \mid X_t)\,\log Q_s(ph_i \mid X_s) \qquad (2)$$
It can be observed that the first term, $\sum_i P_t(ph_i \mid X_t)\,\log P_t(ph_i \mid X_t)$, is unrelated to the modeling process of the student model and can be ignored during the actual supervised training; therefore the posterior probability distribution relative entropy of the two models can be expressed as:

$$-\sum_i P_t(ph_i \mid X_t)\,\log Q_s(ph_i \mid X_s) \qquad (3)$$
In form the above formula resembles the calculation of the standard cross entropy (CE). The difference is that the standard cross entropy analyses the difference between the empirical probability distribution of the training data and the posterior probability distribution of the model, and the empirical probability distribution is usually described by 0-1 vectors (hard labels), whereas the relative entropy of the teacher model and the student model compares the posterior probability distributions of the two models, which amounts to substituting the "hard labels" with "soft labels".
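A small numeric illustration of formulas (1)-(3) (the probability values are invented for the example): the teacher-entropy term is constant with respect to the student, so minimising the relative entropy over the student is equivalent to minimising the cross entropy against the soft labels.

```python
import numpy as np

p_t = np.array([0.2, 0.15, 0.3, 0.1, 0.1, 0.15])  # teacher posterior P_t (soft label)
q_s = np.array([0.1, 0.25, 0.25, 0.2, 0.1, 0.1])  # student posterior Q_s, same frame

kl = np.sum(p_t * np.log(p_t / q_s))              # formula (1): relative entropy
teacher_term = np.sum(p_t * np.log(p_t))          # independent of the student model
soft_ce = -np.sum(p_t * np.log(q_s))              # formula (3): the part optimised

assert np.isclose(kl, teacher_term + soft_ce)     # the decomposition of formula (2)
```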
The teacher model is built as a hybrid model based on a GMM-HMM and a neural network; the training steps are shown in Fig. 2:
First, features X_t are extracted from the clean speech. The GMM-HMM module performs frame-by-frame forced alignment on the windowed features X_t and obtains the hard label of each frame of speech data, i.e. a 0-1 decision is made on the triphone state of each frame: the observation probability is set to 1 for the phoneme state the frame belongs to and to 0 otherwise, yielding the triphone state observation probability distribution of each frame of data, e.g. [1 1 0 1 0 0]. On the basis of the forced alignment, the start and end points of each hard label in the time dimension are marked, and this annotation information together with the hard label data is fed into the neural network module as supervision information for acoustic model training. The structure of the GMM-HMM module is shown in Fig. 3. The above-mentioned windowing, i.e. framing and window application, usually divides the speech data into frames according to preset parameters, and applying a window facilitates subsequent feature alignment.
The neural network module is trained with the features X_t as the model input and the hard phoneme label data as the supervision information, and the triphone posterior probability distribution of the data is obtained frame by frame using the forward algorithm; these posteriors constitute the soft labels. The difference between a hard label and a soft label is that a soft label refers to the triphone state posterior probability distribution of each frame of data rather than a simple 0-1 decision; the resulting soft label of each frame has a form similar to [0.2 0.15 0.3 0.1 0.1 0.1], in which each value indicates the posterior probability that the frame belongs to a particular triphone state.
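Purely as a sketch (network sizes, feature dimension and state inventory are assumptions, and a real teacher would be trained over the whole clean corpus), the hard-label training of the teacher network and the subsequent extraction of frame-level soft labels could look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_states = 39, 2000
teacher_dnn = nn.Sequential(
    nn.Linear(feat_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, num_states),
)

# One utterance of clean features X_t and its forced-alignment hard labels.
feats_xt = torch.randn(300, feat_dim)              # (frames, feat_dim)
hard_align = torch.randint(0, num_states, (300,))  # tri-phone state per frame

# Training step with hard 0-1 labels as supervision (standard cross entropy).
loss = F.cross_entropy(teacher_dnn(feats_xt), hard_align)
loss.backward()

# After training, the per-frame posterior P_t(ph_i | X_t) serves as the soft label.
with torch.no_grad():
    soft_labels = F.softmax(teacher_dnn(feats_xt), dim=-1)  # (frames, num_states)
```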
The student model is built by combining a CNN with a DNN network; the basic training process of the student model is shown in Fig. 4:
First, preliminary features X_s are extracted from the noisy speech, and the extracted phoneme features X_s are aligned in parallel with the soft labels of the teacher model so as to obtain the soft labels of the student model. On the basis of this preliminary feature extraction, high-level features are extracted on top of the preliminarily extracted acoustic features such as MFCC and FBANK by means of the locally connected and down-sampling modules of the CNN network, and the dimensionality of the high-level features is reduced, so as to extract feature sequences capable of characterizing the invariance of the noisy speech. On the other hand, considering that DNN networks have strong classification capability and have surpassed conventional models such as GMMs in acoustic modeling performance, the high-level features are finally fed into the DNN layers for acoustic modeling. The training process of the whole model network uses relative entropy minimization (formula (3)) as the optimization criterion. The above-mentioned dimensionality reduction of the high-level features means that the feature maps are reduced in dimensionality by pooling layers and the important features are condensed through local summarization.
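A sketch of one student update under the relative-entropy criterion of formula (3) (the student network is the CNN-DNN outline sketched earlier; the optimiser and batch layout are assumptions):

```python
import torch.nn.functional as F

def distillation_step(student, optimizer, noisy_batch, soft_batch):
    """One student update: minimise the relative entropy between the teacher
    posteriors (soft labels) and the student posteriors on noisy speech.

    noisy_batch : (batch, 1, context, feat_dim) windows of noisy features X_s
    soft_batch  : (batch, num_states) teacher posteriors P_t for the centre frames
    """
    optimizer.zero_grad()
    log_q_s = F.log_softmax(student(noisy_batch), dim=-1)
    # kl_div with log-probability inputs and probability targets equals
    # formula (3) up to the constant teacher-entropy term.
    loss = F.kl_div(log_q_s, soft_batch, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()
```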
The noise-robust acoustic modeling method based on posterior knowledge supervision of this embodiment, in a manner similar to a teacher instructing pupils, guides the training of the student model by using the posterior probability distribution (soft labels) of the teacher model as the supervision information, and designs a student model based on a CNN-DNN hybrid network that refines the high-level features of the noisy speech, thereby improving the anti-noise performance of the acoustic model. The student model built in this embodiment was verified on the noisy CHiME dataset; the experimental results show that, under three kinds of teacher model supervision, the word error rate of the student model drops on average by 5.21%, 6.35% and 7.83% respectively compared with the baseline model, indicating that the posterior knowledge supervision method proposed herein greatly improves the robustness of the acoustic model.
The above description is only a preferred embodiment of the application and an explanation of the technical principles applied. Those skilled in the art should appreciate that the scope of the invention involved in the application is not limited to technical solutions formed by the specific combination of the above technical features, but also covers, without departing from the inventive concept, other technical solutions formed by any combination of the above technical features or their equivalents, for example technical solutions formed by replacing the above features with (but not limited to) features with similar functions disclosed herein.
Apart from the technical features described in the specification, the remaining technical features are known to those skilled in the art; in order to highlight the innovative features of the invention, those remaining technical features are not described in detail here.
Claims (10)
1. A noise-robust acoustic modeling method based on posterior knowledge supervision, characterized by comprising:
obtaining the posterior probability distribution of clean speech by training a teacher model;
supervising the training of a student model with the posterior probability distribution of the clean speech as the reference, so that the student model approaches the posterior probability distribution of the teacher model as closely as possible;
wherein the teacher model is a model trained on clean speech, and the student model is a model trained on noisy speech.
2. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 1, characterized in that the training of the teacher model comprises:
extracting features X_t from the clean speech;
performing frame-by-frame forced alignment on the windowed features X_t and obtaining the hard label of each frame of speech data;
marking, on the basis of the forced alignment, the start and end points of each hard label in the time dimension;
feeding the start-end point annotation information and the hard label data, as supervision information, into a DNN module for acoustic model training.
3. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 2, characterized in that the frame-by-frame forced alignment of the windowed features is carried out by a GMM-HMM module.
4. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 2, characterized in that the acoustic model training comprises:
taking the features X_t as the model input and the hard phoneme label data as the supervision information, and obtaining the triphone posterior probability distribution of the data frame by frame using the forward algorithm.
5. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 1, characterized in that the training of the student model comprises:
extracting preliminary features X_s from the noisy speech;
performing parallel alignment between the extracted phoneme features X_s and the soft labels of the teacher model, so as to obtain the soft labels of the student model;
extracting high-level features on the basis of the preliminarily extracted acoustic features and reducing the dimensionality of the high-level features, so as to extract feature sequences capable of characterizing the invariance of the noisy speech;
inputting the high-level features into the DNN module for acoustic model training.
6. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 5, characterized in that the high-level features are extracted by the locally connected and down-sampling modules of a CNN network.
7. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 5, characterized in that the training process of the neural network module uses relative entropy minimization as the optimization criterion.
8. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 7, characterized in that the difference between the posterior probability distributions of the teacher model and the student model is quantified by relative entropy.
9. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 8, characterized in that the relative entropy of the teacher model and the student model is:

$$D_{KL}(P_t \| Q_s) = \sum_i P_t(ph_i \mid X_t)\,\log\frac{P_t(ph_i \mid X_t)}{Q_s(ph_i \mid X_s)}$$

wherein P_t is the posterior probability distribution of the teacher model, Q_s is the posterior probability distribution of the student model, i denotes the index in the triphone state set, ph_i is the i-th state in the triphone state set, X_t denotes the clean speech features used for training the teacher model, X_s denotes the noisy speech features used for training the student model, P_t(ph_i | X_t) denotes the posterior probability that feature X_t is recognized as the i-th triphone state, and Q_s(ph_i | X_s) denotes the posterior probability that feature X_s is recognized as the i-th triphone state.
10. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 9, characterized in that the posterior probability distribution relative entropy of the teacher model and the student model is:

$$-\sum_i P_t(ph_i \mid X_t)\,\log Q_s(ph_i \mid X_s)$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576451.8A CN108986788A (en) | 2018-06-06 | 2018-06-06 | Noise-robust acoustic modeling method based on posterior knowledge supervision
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576451.8A CN108986788A (en) | 2018-06-06 | 2018-06-06 | Noise-robust acoustic modeling method based on posterior knowledge supervision
Publications (1)
Publication Number | Publication Date |
---|---|
CN108986788A true CN108986788A (en) | 2018-12-11 |
Family
ID=64540863
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810576451.8A Pending CN108986788A (en) | 2018-06-06 | 2018-06-06 | A kind of noise robust acoustic modeling method based on aposterior knowledge supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108986788A (en) |
- 2018-06-06: Application CN201810576451.8A filed in China; published as CN108986788A; status: Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101710490A (en) * | 2009-11-20 | 2010-05-19 | 安徽科大讯飞信息科技股份有限公司 | Method and device for compensating noise for voice assessment |
US20170263240A1 (en) * | 2012-11-29 | 2017-09-14 | Sony Interactive Entertainment Inc. | Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection |
CN105609100A (en) * | 2014-10-31 | 2016-05-25 | 中国科学院声学研究所 | Acoustic model training and constructing method, acoustic model and speech recognition system |
CN104392718A (en) * | 2014-11-26 | 2015-03-04 | 河海大学 | Robust voice recognition method based on acoustic model array |
CN104992705A (en) * | 2015-05-20 | 2015-10-21 | 普强信息技术(北京)有限公司 | English oral automatic grading method and system |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110246487A (en) * | 2019-06-13 | 2019-09-17 | 苏州思必驰信息科技有限公司 | Optimization method and system for single pass speech recognition modeling |
CN110246487B (en) * | 2019-06-13 | 2021-06-22 | 思必驰科技股份有限公司 | Optimization method and system for single-channel speech recognition model |
CN110610715B (en) * | 2019-07-29 | 2022-02-22 | 西安工程大学 | Noise reduction method based on CNN-DNN hybrid neural network |
CN110610715A (en) * | 2019-07-29 | 2019-12-24 | 西安工程大学 | Noise reduction method based on CNN-DNN hybrid neural network |
CN110634476A (en) * | 2019-10-09 | 2019-12-31 | 深圳大学 | Method and system for rapidly building robust acoustic model |
CN110634476B (en) * | 2019-10-09 | 2022-06-14 | 深圳大学 | Method and system for rapidly building robust acoustic model |
CN111599373A (en) * | 2020-04-07 | 2020-08-28 | 云知声智能科技股份有限公司 | Compression method of noise reduction model |
CN111599373B (en) * | 2020-04-07 | 2023-04-18 | 云知声智能科技股份有限公司 | Compression method of noise reduction model |
US11907845B2 (en) | 2020-08-17 | 2024-02-20 | International Business Machines Corporation | Training teacher machine learning models using lossless and lossy branches |
CN112291424B (en) * | 2020-10-29 | 2021-09-14 | 上海观安信息技术股份有限公司 | Fraud number identification method and device, computer equipment and storage medium |
CN112291424A (en) * | 2020-10-29 | 2021-01-29 | 上海观安信息技术股份有限公司 | Fraud number identification method and device, computer equipment and storage medium |
WO2023279693A1 (en) * | 2021-07-09 | 2023-01-12 | 平安科技(深圳)有限公司 | Knowledge distillation method and apparatus, and terminal device and medium |
CN113380268A (en) * | 2021-08-12 | 2021-09-10 | 北京世纪好未来教育科技有限公司 | Model training method and device and speech signal processing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108986788A (en) | Noise-robust acoustic modeling method based on posterior knowledge supervision | |
CN104036774B (en) | Tibetan dialect recognition methods and system | |
WO2018054361A1 (en) | Environment self-adaptive method of speech recognition, speech recognition device, and household appliance | |
CN113488058B (en) | Voiceprint recognition method based on short voice | |
CN108694949B (en) | Speaker identification method and device based on reordering supervectors and residual error network | |
CN109616105A (en) | A kind of noisy speech recognition methods based on transfer learning | |
CN108806667A (en) | The method for synchronously recognizing of voice and mood based on neural network | |
CN100440315C (en) | Speaker recognition method based on MFCC linear emotion compensation | |
CN110211594B (en) | Speaker identification method based on twin network model and KNN algorithm | |
CN103811009A (en) | Smart phone customer service system based on speech analysis | |
CN103730114A (en) | Mobile equipment voiceprint recognition method based on joint factor analysis model | |
CN101246685A (en) | Pronunciation quality evaluation method of computer auxiliary language learning system | |
CN108922541A (en) | Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model | |
CN104123933A (en) | Self-adaptive non-parallel training based voice conversion method | |
CN109243460A (en) | A method of automatically generating news or interrogation record based on the local dialect | |
JPH075892A (en) | Voice recognition method | |
Marchi et al. | Generalised discriminative transform via curriculum learning for speaker recognition | |
CN107039036A (en) | A kind of high-quality method for distinguishing speek person based on autocoding depth confidence network | |
CN109637526A (en) | The adaptive approach of DNN acoustic model based on personal identification feature | |
KR20190112682A (en) | Data mining apparatus, method and system for speech recognition using the same | |
CN110047504A (en) | Method for distinguishing speek person under identity vector x-vector linear transformation | |
CN100570712C (en) | Based on anchor model space projection ordinal number quick method for identifying speaker relatively | |
CN101178895A (en) | Model self-adapting method based on generating parameter listen-feel error minimize | |
CN118173092A (en) | Online customer service platform based on AI voice interaction | |
CN105845131A (en) | Far-talking voice recognition method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20181211 |