CN108986788A - Noise-robust acoustic modeling method based on posterior knowledge supervision - Google Patents
- Publication number
- CN108986788A (application CN201810576451.8A / CN201810576451A)
- Authority
- CN
- China
- Prior art keywords
- model
- training
- supervision
- feature
- teacher
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0638—Interactive procedures
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Probability & Statistics with Applications (AREA)
- Quality & Reliability (AREA)
- Electrically Operated Instructional Devices (AREA)
Abstract
The invention discloses a noise-robust acoustic modeling method based on posterior knowledge supervision, belonging to the field of voice human-computer interaction. The method comprises: obtaining the posterior probability distribution of clean speech by training a teacher model; supervising the training of a student model with the posterior probability distribution of the clean speech as the reference, so that the student model approaches the posterior probability distribution of the teacher model as closely as possible; wherein the teacher model is a model trained on clean speech, and the student model is a model trained on noisy speech. An acoustic model built with the exemplary modeling method of the invention has stronger environmental robustness and shows superior anti-noise performance.
Description
Technical field
The invention belongs to the technical field of voice human-computer interaction, and specifically relates to a noise-robust acoustic modeling method based on posterior knowledge supervision.
Background art
In recent years, with the continuous development of technologies such as speech recognition, natural language processing and deep learning, and with the deepening of market demand, the research, development and application of voice interaction products have increasingly become a new hot spot. On the other hand, owing to the complexity of practical application scenarios, voice interaction systems typically operate in low signal-to-noise-ratio environments. Because of their insufficient resistance to noise interference, such systems often suffer from low speech recognition accuracy or confused human-computer interaction, giving the served user a poor interactive experience and greatly limiting the market application and promotion of voice interaction products.
Related studies show that whether the speech acoustic model can extract complete phoneme information from noisy speech is the key to the noise robustness of a voice interaction system. The shortcomings of acoustic models in noise robustness are mainly caused by the mismatch between training data and test data introduced by environmental noise during model construction; that is, the purpose of improving noise robustness is to reduce or eliminate the influence of such factors to the greatest extent. To date, many scholars in the field of speech recognition have carried out extensive research on the noise robustness of acoustic models and proposed a variety of improvement strategies, among which four kinds of methods work best in application: feature compensation, model compensation, robust feature extraction and speech enhancement.
Feature compensation and model compensation are noise robustness methods that optimize the acoustic model through adaptive algorithms. For example, Leggetter et al. use the maximum likelihood linear regression (MLLR) algorithm for model adaptation; Tran et al. apply adaptive processing, through a linear decomposition network, to the input data of acoustic model training based on deep neural networks (DNN), enabling the acoustic model to better match the data structure of noisy speech and improving model robustness.
Robust feature extraction refers to extracting noise-insensitive characteristic parameters from the corpus and constructing feature sequences with strong anti-noise capability, so as to improve the noise robustness of the acoustic model. Cepstral mean normalization (CMN) and mean-variance normalization (MVN) are the two most common robust feature extraction methods. In addition, some scholars combine perceptual linear prediction (PLP) features with relative spectral (RASTA) filtering to strengthen the robustness of the acoustic model against additive noise and linear filtering. Liu Changzheng et al. adopt a supervised-learning approach that uses MFCC features as the input of a CNN network to extract higher-level speech features; experiments show that these features have good temporal invariance in noisy environments.
For speech enhancement, the most common approach at present is spectral subtraction combined with noise updating: under the assumption that the noise information is known, the noise spectrum of the corpus is estimated and the estimated noise spectrum is subtracted from the noisy speech spectrum to obtain a clean spectrum of the corpus, from which the clean features of the noisy speech are extracted for acoustic model training. Furthermore, Xu et al. propose combining spectral subtraction with a DNN network: the spectral-subtraction-processed features and the noise estimation parameters are fed into the DNN network as basic samples, and the deep acoustic model obtained through noise-independent training achieves better anti-noise performance than spectral subtraction alone.
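For orientation only (this sketch is not part of the patent; the variable names, the use of the leading frames as a noise estimate and the spectral floor value are all assumptions), the basic magnitude-domain spectral subtraction idea described above can be written as:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from a noisy magnitude
    spectrogram, with a small spectral floor to keep the result non-negative."""
    clean_mag = noisy_mag - noise_mag[None, :]
    return np.maximum(clean_mag, floor * noisy_mag)

# Example: estimate the noise spectrum from the first 10 frames,
# assumed here to contain noise only.
rng = np.random.default_rng(0)
noisy_mag = np.abs(rng.normal(size=(200, 257)))   # (frames, frequency bins)
noise_mag = noisy_mag[:10].mean(axis=0)
clean_mag = spectral_subtraction(noisy_mag, noise_mag)
```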
Although the above four kinds of methods can effectively improve the environmental robustness of acoustic models, two problems remain in theory and in application. First, the above methods merely supervise the noise reduction of noisy speech with clean speech, or fit noisy speech to clean speech to reduce the difference between the two; they neither mine the tacit knowledge of clean speech sufficiently nor refine its information adequately. Second, in the above four classes of methods the acoustic feature extraction module and the subsequent training and recognition process are independent of each other, and the inner link between model building and feature representation is not considered, so that the objective of model training deviates from the overall performance indicator of the system; moreover, the extracted speech features contain partially redundant information which usually has no noise robustness, so that the whole acoustic network often fails to achieve optimal performance.
Therefore, how to improve the noise robustness of voice interaction systems is an urgent problem at this stage.
Summary of the invention
In view of the above problems in the prior art, the purpose of the present invention is to provide a noise-robust acoustic modeling method based on posterior knowledge supervision, which can improve the noise robustness of acoustic models.
The technical scheme adopted by the invention is as follows:
Provided is a noise-robust acoustic modeling method based on posterior knowledge supervision, comprising:
obtaining the posterior probability distribution of clean speech by training a teacher model;
supervising the training of a student model with the posterior probability distribution of the clean speech as the reference, so that the student model approaches the posterior probability distribution of the teacher model as closely as possible;
wherein the teacher model is a model trained on clean speech, and the student model is a model trained on noisy speech.
Further, the training of the teacher model comprises:
extracting features X_t from the clean speech;
performing frame-by-frame forced alignment on the windowed features X_t and obtaining the hard label of each frame of speech data; the said windowing, i.e. framing and window application, usually divides the speech data into frames according to preset parameters, and applying a window facilitates subsequent feature alignment;
marking, on the basis of the forced alignment, the start and end points of each hard label in the time dimension;
feeding the start-end point annotation information and the hard label data, as supervision information, into a DNN module for acoustic model training.
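As an illustrative sketch of this framing, windowing and feature-extraction step (the patent does not fix a particular front end; the 25 ms frame length, 10 ms shift, Hann window and 13-dimensional MFCCs below are assumptions), the features X_t could be computed with librosa as:

```python
import librosa

# Load a clean training utterance (the path is a placeholder).
y, sr = librosa.load("clean_utterance.wav", sr=16000)

# Framing and windowing: 25 ms frames (400 samples), 10 ms shift (160 samples),
# Hann window, followed by 13-dimensional MFCC extraction per frame.
feats_xt = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160, window="hann"
).T  # shape: (num_frames, 13)
```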
Further, the frame-by-frame forced alignment of the windowed features is carried out by a GMM-HMM module.
Further, the acoustic model training comprises:
taking the features X_t as the model input and the hard phoneme label data as the supervision information, and obtaining the triphone posterior probability distribution of the data frame by frame using the forward algorithm.
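A minimal sketch of how the frame-level hard labels used as supervision can be represented (the state inventory size and the alignment values are purely illustrative):

```python
import numpy as np

# Tri-phone state index of each frame, as produced by forced alignment.
num_states = 6
alignment = np.array([2, 2, 2, 5, 5, 1])

# Hard labels: one 0-1 vector per frame with a single 1 at the aligned state.
hard_labels = np.eye(num_states)[alignment]   # shape: (num_frames, num_states)
```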
Further, the training of the student model comprises:
extracting preliminary features X_s from the noisy speech;
performing parallel alignment between the extracted phoneme features X_s and the soft labels of the teacher model, so as to obtain the soft labels of the student model;
extracting high-level features on the basis of the preliminarily extracted acoustic features and reducing the dimensionality of the high-level features, so as to extract feature sequences capable of characterizing the invariance of the noisy speech;
inputting the high-level features into the DNN module for acoustic model training.
Further, the high-level features are extracted by the locally connected and down-sampling (pooling) modules of a CNN network.
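The patent does not specify concrete layer sizes or kernel shapes; purely as a sketch, a CNN-DNN student of the kind described above (local connections and pooling for high-level feature extraction and dimensionality reduction, followed by DNN layers classifying triphone states) could be written in PyTorch as:

```python
import torch.nn as nn

class StudentCnnDnn(nn.Module):
    """Illustrative CNN-DNN student; every dimension below is an assumption."""

    def __init__(self, feat_dim=40, context=11, num_states=2000):
        super().__init__()
        # CNN part: local connections + pooling extract higher-level,
        # noise-invariant features and reduce dimensionality.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # down-sampling / dimensionality reduction
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        cnn_out = 64 * (context // 4) * (feat_dim // 4)
        # DNN part: classification over the tri-phone state set.
        self.dnn = nn.Sequential(
            nn.Linear(cnn_out, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_states),
        )

    def forward(self, x):                 # x: (batch, 1, context, feat_dim)
        h = self.cnn(x)
        return self.dnn(h.flatten(1))     # logits over tri-phone states
```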
Further, the training process of the neural network module uses relative entropy minimization as the optimization criterion.
Further, the difference between the posterior probability distributions of the teacher model and the student model is quantified by relative entropy.
Further, the relative entropy of the teacher model and the student model is:

$$D_{KL}(P_t \| Q_s) = \sum_i P_t(ph_i \mid X_t)\,\log\frac{P_t(ph_i \mid X_t)}{Q_s(ph_i \mid X_s)}$$

wherein P_t is the posterior probability distribution of the teacher model, Q_s is the posterior probability distribution of the student model, i denotes the index in the triphone state set, ph_i is the i-th state in the triphone state set, X_t denotes the clean speech features used for training the teacher model, X_s denotes the noisy speech features used for training the student model, P_t(ph_i | X_t) denotes the posterior probability that feature X_t is recognized as the i-th triphone state, and Q_s(ph_i | X_s) denotes the posterior probability that feature X_s is recognized as the i-th triphone state.
Further, the posterior probability distribution relative entropy of the teacher model and the student model (with the term that depends only on the teacher model dropped) is:

$$-\sum_i P_t(ph_i \mid X_t)\,\log Q_s(ph_i \mid X_s)$$
Compared with the prior art, the invention has the following beneficial effects:
1. The exemplary noise-robust acoustic modeling method based on posterior knowledge supervision of the present invention takes the model trained on clean speech as the teacher model and the model trained on noisy speech as the student model, and supervises the training of the student model with knowledge distilled from the posterior probability distribution of the teacher model, thereby indirectly meeting the requirement of improving the environmental robustness of the acoustic model.
2. The exemplary noise-robust acoustic modeling method based on posterior knowledge supervision of the present invention uses an acoustic model training network structure combining a CNN (convolutional neural network) with a DNN (deep neural network), in which the CNN module is used to extract invariance features of the noisy speech and the DNN is used for acoustic modeling; the parameters of the whole network are adjusted and optimized jointly across the CNN and DNN modules. The constructed model was verified and compared on the CHiME dataset for speech recognition performance under different signal-to-noise ratios, and the test results show that the model has stronger environmental robustness and superior anti-noise performance.
3. In the exemplary noise-robust acoustic modeling method based on posterior knowledge supervision of the present invention, the CNN-DNN student model adds a convolutional neural network module to extract high-level speech features and can therefore capture the temporal invariance of noisy speech better than a plain DNN model. In addition, the down-sampling (pooling) layers inside the CNN reject redundant information in the speech features and achieve feature dimensionality reduction, which improves the efficiency of model training while improving the noise robustness of the acoustic model.
4. Compared with the traditional standard cross-entropy (CE) minimization criterion, the exemplary noise-robust acoustic modeling method based on posterior knowledge supervision of the present invention replaces the 0-1 vectors (hard labels) with probability vectors (soft labels); the soft labels are a deep refinement of the posterior probability distribution and contain richer useful information, which is more conducive to the modeling of a robust acoustic model.
Brief description of the drawings
Other features, objects and advantages of the present application will become more apparent upon reading the detailed description of non-restrictive embodiments made with reference to the following drawings:
Fig. 1 is the flow chart of an embodiment of the present invention;
Fig. 2 is the flow chart of teacher model training in an embodiment of the present invention;
Fig. 3 is a structural schematic diagram of the GMM-HMM module;
Fig. 4 is the flow chart of student model training in an embodiment of the present invention.
Specific embodiment
The application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the related invention and are not a limitation of the invention. It should also be noted that, for convenience of description, only the parts relevant to the invention are shown in the drawings.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
As shown in Fig. 1, an embodiment of the present invention provides a noise-robust acoustic modeling method based on posterior knowledge supervision, comprising:
S1: obtaining the posterior probability distribution of clean speech by training a teacher model;
S2: supervising the training of a student model with the posterior probability distribution of the clean speech as the reference, so that the student model approaches the posterior probability distribution of the teacher model as closely as possible;
wherein the teacher model is a model trained on clean speech, and the student model is a model trained on noisy speech.
The difference between the posterior probability distributions of the two models is quantified in this embodiment using the KL divergence (relative entropy). For an acoustic model, the physical meaning of the KL divergence is the average number of extra bits needed, over the same underlying speech space, to encode each phoneme feature corresponding to the probability distribution P(x) when it is encoded with the probability distribution Q(x) instead. In this embodiment, let P_t be the posterior probability distribution of the teacher model and Q_s the posterior probability distribution of the student model; Q_s is equivalent to an approximate estimate of the posterior probability distribution P_t, so the relative entropy of the two may be expressed as:

$$D_{KL}(P_t \| Q_s) = \sum_i P_t(ph_i \mid X_t)\,\log\frac{P_t(ph_i \mid X_t)}{Q_s(ph_i \mid X_s)} \qquad (1)$$
wherein i denotes the index in the triphone state set, ph_i is the i-th state in the triphone state set, X_t denotes the clean speech features used for training the teacher model, X_s denotes the noisy speech features used for training the student model, P_t(ph_i | X_t) denotes the posterior probability that feature X_t is recognized as the i-th triphone state, and Q_s(ph_i | X_s) denotes the posterior probability that feature X_s is recognized as the i-th triphone state. By rearrangement the formula can be simplified to the following form:

$$D_{KL}(P_t \| Q_s) = \sum_i P_t(ph_i \mid X_t)\,\log P_t(ph_i \mid X_t) - \sum_i P_t(ph_i \mid X_t)\,\log Q_s(ph_i \mid X_s) \qquad (2)$$
It can be observed that the first term, $\sum_i P_t(ph_i \mid X_t)\,\log P_t(ph_i \mid X_t)$, is unrelated to the modeling process of the student model and can be ignored during the actual supervised training; therefore the posterior probability distribution relative entropy of the two models can be expressed as:

$$-\sum_i P_t(ph_i \mid X_t)\,\log Q_s(ph_i \mid X_s) \qquad (3)$$
In form the above formula resembles the calculation of the standard cross entropy (CE). The difference is that the standard cross entropy analyses the difference between the empirical probability distribution of the training data and the posterior probability distribution of the model, and the empirical probability distribution is usually described by 0-1 vectors (hard labels), whereas the relative entropy of the teacher model and the student model compares the posterior probability distributions of the two models, which amounts to substituting the "hard labels" with "soft labels".
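A small numeric illustration of formulas (1)-(3) (the probability values are invented for the example): the teacher-entropy term is constant with respect to the student, so minimising the relative entropy over the student is equivalent to minimising the cross entropy against the soft labels.

```python
import numpy as np

p_t = np.array([0.2, 0.15, 0.3, 0.1, 0.1, 0.15])  # teacher posterior P_t (soft label)
q_s = np.array([0.1, 0.25, 0.25, 0.2, 0.1, 0.1])  # student posterior Q_s, same frame

kl = np.sum(p_t * np.log(p_t / q_s))              # formula (1): relative entropy
teacher_term = np.sum(p_t * np.log(p_t))          # independent of the student model
soft_ce = -np.sum(p_t * np.log(q_s))              # formula (3): the part optimised

assert np.isclose(kl, teacher_term + soft_ce)     # the decomposition of formula (2)
```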
The teacher model is built as a hybrid model based on a GMM-HMM and a neural network; the training steps are shown in Fig. 2:
First, features X_t are extracted from the clean speech. The GMM-HMM module performs frame-by-frame forced alignment on the windowed features X_t and obtains the hard label of each frame of speech data, i.e. a 0-1 decision is made on the triphone state of each frame: the observation probability is set to 1 for the phoneme state the frame belongs to and to 0 otherwise, yielding the triphone state observation probability distribution of each frame of data, e.g. [1 1 0 1 0 0]. On the basis of the forced alignment, the start and end points of each hard label in the time dimension are marked, and this annotation information together with the hard label data is fed into the neural network module as supervision information for acoustic model training. The structure of the GMM-HMM module is shown in Fig. 3. The above-mentioned windowing, i.e. framing and window application, usually divides the speech data into frames according to preset parameters, and applying a window facilitates subsequent feature alignment.
The neural network module is trained with the features X_t as the model input and the hard phoneme label data as the supervision information, and the triphone posterior probability distribution of the data is obtained frame by frame using the forward algorithm; these posteriors constitute the soft labels. The difference between a hard label and a soft label is that a soft label refers to the triphone state posterior probability distribution of each frame of data rather than a simple 0-1 decision; the resulting soft label of each frame has a form similar to [0.2 0.15 0.3 0.1 0.1 0.1], in which each value indicates the posterior probability that the frame belongs to a particular triphone state.
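Purely as a sketch (network sizes, feature dimension and state inventory are assumptions, and a real teacher would be trained over the whole clean corpus), the hard-label training of the teacher network and the subsequent extraction of frame-level soft labels could look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, num_states = 39, 2000
teacher_dnn = nn.Sequential(
    nn.Linear(feat_dim, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, num_states),
)

# One utterance of clean features X_t and its forced-alignment hard labels.
feats_xt = torch.randn(300, feat_dim)              # (frames, feat_dim)
hard_align = torch.randint(0, num_states, (300,))  # tri-phone state per frame

# Training step with hard 0-1 labels as supervision (standard cross entropy).
loss = F.cross_entropy(teacher_dnn(feats_xt), hard_align)
loss.backward()

# After training, the per-frame posterior P_t(ph_i | X_t) serves as the soft label.
with torch.no_grad():
    soft_labels = F.softmax(teacher_dnn(feats_xt), dim=-1)  # (frames, num_states)
```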
The student model is built by combining a CNN with a DNN network; the basic training process of the student model is shown in Fig. 4:
First, preliminary features X_s are extracted from the noisy speech, and the extracted phoneme features X_s are aligned in parallel with the soft labels of the teacher model so as to obtain the soft labels of the student model. On the basis of this preliminary feature extraction, high-level features are extracted on top of the preliminarily extracted acoustic features such as MFCC and FBANK by means of the locally connected and down-sampling modules of the CNN network, and the dimensionality of the high-level features is reduced, so as to extract feature sequences capable of characterizing the invariance of the noisy speech. On the other hand, considering that DNN networks have strong classification capability and have surpassed conventional models such as GMMs in acoustic modeling performance, the high-level features are finally fed into the DNN layers for acoustic modeling. The training process of the whole model network uses relative entropy minimization (formula (3)) as the optimization criterion. The above-mentioned dimensionality reduction of the high-level features means that the feature maps are reduced in dimensionality by pooling layers and the important features are condensed through local summarization.
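A sketch of one student update under the relative-entropy criterion of formula (3) (the student network is the CNN-DNN outline sketched earlier; the optimiser and batch layout are assumptions):

```python
import torch.nn.functional as F

def distillation_step(student, optimizer, noisy_batch, soft_batch):
    """One student update: minimise the relative entropy between the teacher
    posteriors (soft labels) and the student posteriors on noisy speech.

    noisy_batch : (batch, 1, context, feat_dim) windows of noisy features X_s
    soft_batch  : (batch, num_states) teacher posteriors P_t for the centre frames
    """
    optimizer.zero_grad()
    log_q_s = F.log_softmax(student(noisy_batch), dim=-1)
    # kl_div with log-probability inputs and probability targets equals
    # formula (3) up to the constant teacher-entropy term.
    loss = F.kl_div(log_q_s, soft_batch, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()
```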
The noise-robust acoustic modeling method based on posterior knowledge supervision of this embodiment, in a manner similar to a teacher instructing pupils, guides the training of the student model by using the posterior probability distribution (soft labels) of the teacher model as the supervision information, and designs a student model based on a CNN-DNN hybrid network that refines the high-level features of the noisy speech, thereby improving the anti-noise performance of the acoustic model. The student model built in this embodiment was verified on the noisy CHiME dataset; the experimental results show that, under three kinds of teacher model supervision, the word error rate of the student model drops on average by 5.21%, 6.35% and 7.83% respectively compared with the baseline model, indicating that the posterior knowledge supervision method proposed herein greatly improves the robustness of the acoustic model.
The above description is only a preferred embodiment of the application and an explanation of the technical principles applied. Those skilled in the art should appreciate that the scope of the invention involved in the application is not limited to technical solutions formed by the specific combination of the above technical features, but also covers, without departing from the inventive concept, other technical solutions formed by any combination of the above technical features or their equivalents, for example technical solutions formed by replacing the above features with (but not limited to) features with similar functions disclosed herein.
Apart from the technical features described in the specification, the remaining technical features are known to those skilled in the art; in order to highlight the innovative features of the invention, those remaining technical features are not described in detail here.
Claims (10)
1. A noise-robust acoustic modeling method based on posterior knowledge supervision, characterized by comprising:
obtaining the posterior probability distribution of clean speech by training a teacher model;
supervising the training of a student model with the posterior probability distribution of the clean speech as the reference, so that the student model approaches the posterior probability distribution of the teacher model as closely as possible;
wherein the teacher model is a model trained on clean speech, and the student model is a model trained on noisy speech.
2. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 1, characterized in that the training of the teacher model comprises:
extracting features X_t from the clean speech;
performing frame-by-frame forced alignment on the windowed features X_t and obtaining the hard label of each frame of speech data;
marking, on the basis of the forced alignment, the start and end points of each hard label in the time dimension;
feeding the start-end point annotation information and the hard label data, as supervision information, into a DNN module for acoustic model training.
3. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 2, characterized in that the frame-by-frame forced alignment of the windowed features is carried out by a GMM-HMM module.
4. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 2, characterized in that the acoustic model training comprises:
taking the features X_t as the model input and the hard phoneme label data as the supervision information, and obtaining the triphone posterior probability distribution of the data frame by frame using the forward algorithm.
5. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 1, characterized in that the training of the student model comprises:
extracting preliminary features X_s from the noisy speech;
performing parallel alignment between the extracted phoneme features X_s and the soft labels of the teacher model, so as to obtain the soft labels of the student model;
extracting high-level features on the basis of the preliminarily extracted acoustic features and reducing the dimensionality of the high-level features, so as to extract feature sequences capable of characterizing the invariance of the noisy speech;
inputting the high-level features into the DNN module for acoustic model training.
6. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 5, characterized in that the high-level features are extracted by the locally connected and down-sampling modules of a CNN network.
7. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 5, characterized in that the training process of the neural network module uses relative entropy minimization as the optimization criterion.
8. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 7, characterized in that the difference between the posterior probability distributions of the teacher model and the student model is quantified by relative entropy.
9. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 8, characterized in that the relative entropy of the teacher model and the student model is:

$$D_{KL}(P_t \| Q_s) = \sum_i P_t(ph_i \mid X_t)\,\log\frac{P_t(ph_i \mid X_t)}{Q_s(ph_i \mid X_s)}$$

wherein P_t is the posterior probability distribution of the teacher model, Q_s is the posterior probability distribution of the student model, i denotes the index in the triphone state set, ph_i is the i-th state in the triphone state set, X_t denotes the clean speech features used for training the teacher model, X_s denotes the noisy speech features used for training the student model, P_t(ph_i | X_t) denotes the posterior probability that feature X_t is recognized as the i-th triphone state, and Q_s(ph_i | X_s) denotes the posterior probability that feature X_s is recognized as the i-th triphone state.
10. The noise-robust acoustic modeling method based on posterior knowledge supervision according to claim 9, characterized in that the posterior probability distribution relative entropy of the teacher model and the student model is:

$$-\sum_i P_t(ph_i \mid X_t)\,\log Q_s(ph_i \mid X_s)$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576451.8A CN108986788A (en) | 2018-06-06 | 2018-06-06 | Noise-robust acoustic modeling method based on posterior knowledge supervision
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810576451.8A CN108986788A (en) | 2018-06-06 | 2018-06-06 | Noise-robust acoustic modeling method based on posterior knowledge supervision
Publications (1)
Publication Number | Publication Date |
---|---|
CN108986788A true CN108986788A (en) | 2018-12-11 |
Family
ID=64540863
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810576451.8A Pending CN108986788A (en) | 2018-06-06 | 2018-06-06 | A kind of noise robust acoustic modeling method based on aposterior knowledge supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108986788A (en) |
- 2018-06-06: Application CN201810576451.8A filed in China; published as CN108986788A; status: Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101710490A (en) * | 2009-11-20 | 2010-05-19 | 安徽科大讯飞信息科技股份有限公司 | Method and device for compensating noise for voice assessment |
US20170263240A1 (en) * | 2012-11-29 | 2017-09-14 | Sony Interactive Entertainment Inc. | Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection |
CN105609100A (en) * | 2014-10-31 | 2016-05-25 | 中国科学院声学研究所 | Acoustic model training and constructing method, acoustic model and speech recognition system |
CN104392718A (en) * | 2014-11-26 | 2015-03-04 | 河海大学 | Robust voice recognition method based on acoustic model array |
CN104992705A (en) * | 2015-05-20 | 2015-10-21 | 普强信息技术(北京)有限公司 | English oral automatic grading method and system |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110246487A (en) * | 2019-06-13 | 2019-09-17 | 苏州思必驰信息科技有限公司 | Optimization method and system for single pass speech recognition modeling |
CN110246487B (en) * | 2019-06-13 | 2021-06-22 | 思必驰科技股份有限公司 | Optimization method and system for single-channel speech recognition model |
CN110610715B (en) * | 2019-07-29 | 2022-02-22 | 西安工程大学 | Noise reduction method based on CNN-DNN hybrid neural network |
CN110610715A (en) * | 2019-07-29 | 2019-12-24 | 西安工程大学 | Noise reduction method based on CNN-DNN hybrid neural network |
CN110634476A (en) * | 2019-10-09 | 2019-12-31 | 深圳大学 | Method and system for rapidly building robust acoustic model |
CN110634476B (en) * | 2019-10-09 | 2022-06-14 | 深圳大学 | Method and system for rapidly building robust acoustic model |
CN111599373A (en) * | 2020-04-07 | 2020-08-28 | 云知声智能科技股份有限公司 | Compression method of noise reduction model |
CN111599373B (en) * | 2020-04-07 | 2023-04-18 | 云知声智能科技股份有限公司 | Compression method of noise reduction model |
US11907845B2 (en) | 2020-08-17 | 2024-02-20 | International Business Machines Corporation | Training teacher machine learning models using lossless and lossy branches |
CN112291424B (en) * | 2020-10-29 | 2021-09-14 | 上海观安信息技术股份有限公司 | Fraud number identification method and device, computer equipment and storage medium |
CN112291424A (en) * | 2020-10-29 | 2021-01-29 | 上海观安信息技术股份有限公司 | Fraud number identification method and device, computer equipment and storage medium |
WO2023279693A1 (en) * | 2021-07-09 | 2023-01-12 | 平安科技(深圳)有限公司 | Knowledge distillation method and apparatus, and terminal device and medium |
CN113380268A (en) * | 2021-08-12 | 2021-09-10 | 北京世纪好未来教育科技有限公司 | Model training method and device and speech signal processing method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108986788A (en) | Noise-robust acoustic modeling method based on posterior knowledge supervision | |
CN104036774B (en) | Tibetan dialect recognition methods and system | |
WO2018054361A1 (en) | Environment self-adaptive method of speech recognition, speech recognition device, and household appliance | |
CN113488058B (en) | Voiceprint recognition method based on short voice | |
CN108694949B (en) | Speaker identification method and device based on reordering supervectors and residual error network | |
CN109616105A (en) | A kind of noisy speech recognition methods based on transfer learning | |
CN108806667A (en) | The method for synchronously recognizing of voice and mood based on neural network | |
CN100440315C (en) | Speaker recognition method based on MFCC linear emotion compensation | |
CN110211594B (en) | Speaker identification method based on twin network model and KNN algorithm | |
CN103811009A (en) | Smart phone customer service system based on speech analysis | |
CN103730114A (en) | Mobile equipment voiceprint recognition method based on joint factor analysis model | |
CN101246685A (en) | Pronunciation quality evaluation method of computer auxiliary language learning system | |
CN108922541A (en) | Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model | |
CN104123933A (en) | Self-adaptive non-parallel training based voice conversion method | |
CN109243460A (en) | A method of automatically generating news or interrogation record based on the local dialect | |
JPH075892A (en) | Voice recognition method | |
Marchi et al. | Generalised discriminative transform via curriculum learning for speaker recognition | |
CN107039036A (en) | A kind of high-quality method for distinguishing speek person based on autocoding depth confidence network | |
CN109637526A (en) | The adaptive approach of DNN acoustic model based on personal identification feature | |
KR20190112682A (en) | Data mining apparatus, method and system for speech recognition using the same | |
CN110047504A (en) | Method for distinguishing speek person under identity vector x-vector linear transformation | |
CN100570712C (en) | Based on anchor model space projection ordinal number quick method for identifying speaker relatively | |
CN101178895A (en) | Model self-adapting method based on generating parameter listen-feel error minimize | |
CN118173092A (en) | Online customer service platform based on AI voice interaction | |
CN105845131A (en) | Far-talking voice recognition method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20181211 |