CN105957520B - A voice state detection method suitable for echo cancellation systems

Info

Publication number: CN105957520B
Application number: CN201610519040.6A
Authority: CN
Inventors: 王珂; 明萌; 纪红; 李曦; 张鹤立
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2016-07-04
Filing date: 2016-07-04
Publication date: 2019-10-11
Anticipated expiration: 2036-07-04
Also published as: CN105957520A

Abstract

The present invention is a voice state detection method suitable for echo cancellation systems and relates to the technical field of voice interaction over IP-based networks. A support vector machine (SVM) classifier is constructed from noise training samples and speech training samples. The signals to be detected are the near-end and far-end signals after division into blocks. The constructed SVM classifier, which is based on Gaussian mixture models, performs a VAD decision on the current far-end block. If the decision is that no speech is present, filter updating and filtering are stopped and the near-end speech signal is output directly. If the far end is judged to contain speech, a double-talk decision is made: in the double-talk state, filter coefficient updating is stopped and the near-end signal is filtered; otherwise, the filter coefficients are updated and filtering is performed according to the far-end signal. The invention improves the accuracy of voice activity detection, avoids misjudging the state in which both ends are silent as double talk, and prevents the filter from being erroneously updated and from filtering without a reference signal.

Description

A voice state detection method suitable for echo cancellation systems

Technical field

The present invention relates to the technical field of voice interaction over IP-based networks, and in particular to a voice state detection method suitable for echo cancellation systems.

Background art

Echo cancellation is widely used in voice interaction systems over IP-based networks, such as teleconferencing systems, in-vehicle Bluetooth systems, and IP telephony. Its purpose is to eliminate the acoustic echo that forms when sound played by the loudspeaker travels along multiple propagation paths, is picked up by the microphone, and is sent back to the far end of the system. The core idea of echo cancellation is to model the echo path with an adaptive filter and to subtract the estimated echo signal from the signal picked up by the microphone.

Voice state detection plays a crucial role in echo cancellation. Before a speech signal enters the filter, the current voice state must first be determined, and the working state of the filter is then set according to the voice state of the system. Whether the voice state can be determined accurately and quickly has a great influence on the echo cancellation result.

Existing echo cancellation systems typically apply a DTD (Double Talk Detection) algorithm directly to judge whether the system is in the double-talk state, and stop filter coefficient updating during double talk to prevent the filter from diverging under the interference of near-end speech. A common DTD algorithm, the Geigel algorithm, judges whether near-end speech is present by comparing the amplitudes of the near-end and far-end signals: the system is considered to be in the double-talk state when the ratio ξ^(g) of the near-end amplitude to the far-end amplitude exceeds a given value T. That is, when

$$\xi^{(g)} = \frac{|y(k)|}{\max\{|x(k-1)|, \ldots, |x(k-N)|\}} > T$$

near-end speech is considered present and the system is in the double-talk state. Here |y(k)| is the near-end speech amplitude, and max{|x(k-1)|, ..., |x(k-N)|} is the maximum amplitude over the previous N samples of the far-end speech signal. The threshold T is determined by the echo path attenuation and is usually taken as 0.5; N is usually equal to the filter length.
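The Geigel test described above can be sketched in a few lines of Python. This is only an illustration: the function name, the argument layout, and the handling of an all-zero far-end history are assumptions made for this sketch and are not taken from the text.

```python
def geigel_double_talk(near_sample, far_history, threshold=0.5):
    """Return True when the Geigel test declares near-end speech.

    near_sample -- current near-end (microphone) sample y(k)
    far_history -- the N most recent far-end samples x(k-1), ..., x(k-N)
    threshold   -- T, chosen from the expected echo-path attenuation
    """
    far_peak = max(abs(x) for x in far_history)
    if far_peak == 0.0:
        # Assumption of this sketch: with no far-end energy the ratio is
        # undefined, so any non-zero microphone sample counts as near-end speech.
        return abs(near_sample) > 0.0
    return abs(near_sample) / far_peak > threshold


far = [0.8, -0.6, 0.4, 0.1]
print(geigel_double_talk(0.2, far))  # ratio 0.25 -> False (echo only)
print(geigel_double_talk(0.7, far))  # ratio 0.875 -> True (near-end speech)
```

The test looks only at amplitudes, so a near-end talker who is quieter than the echo goes undetected.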

However, this method has the following shortcomings:

1. The Geigel algorithm assumes that near-end speech is much stronger than the far-end echo signal. This does not fully match the actual conditions of echo cancellation, so the algorithm is not very accurate in some cases.

2. Performing DTD directly, without far-end VAD (Voice Activity Detection), may cause the state in which both ends are silent to be misjudged as double talk.

3. Filter coefficient updating is stopped only in the double-talk state. Continuing to filter and update coefficients when far-end speech is absent may cause the filter to diverge and may mistakenly subtract non-existent far-end speech from the near-end signal.

Summary of the invention

To overcome the three problems above, the present invention proposes a voice state detection method that combines VAD and DTD, and designs a new filtering and update strategy based on the detection result, so as to improve detection accuracy, avoid misjudging the voice state, and prevent erroneous filter updating and filtering.

The voice state detection method suitable for echo cancellation systems provided by the invention is implemented in the following steps:

Step 1: Construct a support vector machine (SVM) classifier from noise training samples and speech training samples.

Feature extraction and Gaussian mixture model (GMM) training are performed separately on the noise training samples and the speech training samples, and the corresponding Gaussian supervectors are constructed. The Gaussian supervectors are used to construct the kernel function of the SVM classifier and the SVM models corresponding to the speech signal and the noise signal. The SVM classifier is then built from the constructed kernel function and SVM models.

Step 2: The signals to be detected are the near-end and far-end signals after division into blocks. The constructed GMM-based SVM classifier performs a VAD decision on the current far-end block.

Feature extraction and GMM training are performed on the current far-end block, and a Gaussian supervector is constructed. The supervector of the current far-end block is input to the constructed SVM classifier for a decision. If it is classified as noise, the result is that no speech is present; filter updating and filtering are stopped and the near-end speech signal is output directly. Otherwise the far end contains speech, and the double-talk decision of the next step is made.

Step 3: Determine whether the system is in the double-talk state.

The normalized cross-correlation ξ_XECC of the far-end signal and the error signal is calculated and compared with a set threshold T_XECC. When ξ_XECC < T_XECC, near-end speech is present and the system is in the double-talk state; filter coefficient updating is stopped and the near-end signal is filtered. When ξ_XECC ≥ T_XECC, there is no near-end speech, and the filter coefficients are updated and filtering is performed according to the far-end signal.

The advantages and positive effects of the present invention are:

(1) Voice activity detection is performed on the far-end signal using a support vector machine algorithm based on Gaussian mixture models. This improves the accuracy of voice activity detection and overcomes the inaccurate detection at low signal-to-noise ratio that affects common energy-based voice activity detection methods.

(2) Far-end voice activity detection is performed before double-talk detection, and double-talk detection is carried out only when far-end speech is present, which avoids misjudging the state in which both ends are silent as double talk. A cross-correlation-based double-talk detection algorithm is used, which improves the accuracy of double-talk detection.

(3) Different filtering and update strategies are adopted according to the voice state of the system. Compared with a conventional echo cancellation system, which stops filter coefficient updating only during double talk, the invention also stops filter coefficient updating and filtering when the far end has no speech, which further prevents the filter from being erroneously updated and from filtering without a reference signal.

Brief description of the drawings

Fig. 1 is an overall flow diagram of the voice state detection method suitable for echo cancellation systems according to the invention;

Fig. 2 is a schematic diagram of the two PCM streams used in the simulation of the embodiment of the invention;

Fig. 3 shows the echo cancellation result of the embodiment when only energy-based DTD detection is used;

Fig. 4 shows the echo cancellation result of the embodiment when the method of the invention is used;

Fig. 5 shows the Sipdroid echo cancellation result of the embodiment using the echo cancellation library before improvement;

Fig. 6 shows the Sipdroid echo cancellation result of the embodiment using the improved echo cancellation library.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the drawings and embodiments.

The method of the invention performs VAD on the far-end signal before DTD. When VAD detects that no far-end speech is present, filter coefficient updating and filtering are stopped directly, to prevent filter divergence and erroneous filtering. DTD is carried out only when VAD detects far-end speech, and filter coefficient updating is stopped during double talk. The VAD algorithm used is an SVM (Support Vector Machine) algorithm based on the GMM (Gaussian Mixture Model). It uses GMMs to construct feature supervectors, and the GMM supervectors serve both as the feature input of the SVM and in the construction of its kernel function, giving higher accuracy than common VAD algorithms based on energy or correlation. The DTD algorithm used is based on the cross-correlation of the far-end signal and the error signal, and its accuracy is also higher than that of the common energy-based Geigel algorithm. Combining far-end VAD with DTD improves the accuracy of voice state detection. Adopting different filtering strategies in different voice states prevents filter divergence and erroneous filtering and substantially improves the echo cancellation result.

The steps of the voice state detection method suitable for echo cancellation systems according to the invention are described below with reference to Fig. 1.

Step 1: Construct an SVM classifier from noise training samples and speech training samples, comprising steps S101 to S103.

Step S101: Feature extraction is performed on the noise signal training samples and the speech signal training samples. The features used here are Mel-frequency cepstral coefficients (MFCC). The MFCC extraction process is as follows. Pre-emphasis, blocking, and windowing are applied to the signal, and the spectrum of each windowed block is obtained by the fast Fourier transform (FFT). The spectrum of each block is passed through a Mel-scale filter bank composed of K triangular band-pass filters, numbered 0 to K-1. The logarithm of the output of each band is taken to obtain the log energy of each output, giving K log spectrum values for each block of the speech signal. K is a positive integer, typically 20 to 30. Finally, a cosine transform is applied to the K log spectrum values to obtain the Mel cepstral coefficients. The discrete cosine transform that maps the log spectrum to the cepstral domain is:

$$m_i(l) = \sum_{k=0}^{K-1} S_i(k) \cos\left(\frac{\pi l (k + 0.5)}{K}\right), \quad l = 1, 2, \ldots, L \tag{1}$$

where S_i(k) is the log spectrum obtained after the i-th block passes through band-pass filter number k, K is the number of Mel band-pass filters, m_i(l) is the l-th order MFCC parameter of the i-th speech block, L is the total order of the extracted MFCC, and i, a positive integer, is the index of the block in formula (1).
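The final cosine-transform step of MFCC extraction can be sketched as follows. The sketch assumes the common unnormalized DCT-II form with the band index running from 0 to K-1; the function name and the example values are hypothetical, and the earlier stages (pre-emphasis, windowing, FFT, Mel filter bank) are taken as already done.

```python
import math


def mfcc_from_log_mel(log_mel, num_coeffs):
    """Map K log Mel-band energies S_i(0..K-1) to coefficients m_i(1..L)."""
    num_bands = len(log_mel)
    coeffs = []
    for order in range(1, num_coeffs + 1):
        coeffs.append(sum(
            log_mel[k] * math.cos(math.pi * order * (k + 0.5) / num_bands)
            for k in range(num_bands)
        ))
    return coeffs


# Hypothetical log energies for K = 6 bands; the text suggests K of 20 to 30.
example = [2.0, 1.5, 1.0, 0.8, 0.5, 0.2]
print(mfcc_from_log_mel(example, num_coeffs=3))
```

A flat log spectrum yields zero for every order l ≥ 1, which is one reason cepstral features are insensitive to overall signal level.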

Step S102: Generate the Gaussian supervectors corresponding to the noise signal training samples and the speech signal training samples.

The MFCC parameters of the noise signal training samples and of the speech signal training samples are used to build the Gaussian mixture models corresponding to the noise signal and the speech signal, respectively. A GMM is essentially a multidimensional probability density function. An N-th order Gaussian mixture model g(x) describes the distribution of frame features in the feature space by a linear combination of N single Gaussian distributions. For a given block, g(x) is expressed as:

$$g(x) = \sum_{i=1}^{N} w_i \, p_i(x) \tag{2}$$

where x is the L-dimensional feature vector formed by the MFCC parameters of this block of the training sample, N is the order of the Gaussian mixture model, p_i(x) is the i-th Gaussian component of the mixture, and w_i is the weighting factor of component p_i(x).

p_i(x) is expressed as:

$$p_i(x) = \frac{1}{(2\pi)^{L/2} |\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\} \tag{3}$$

where Σ_i is the covariance matrix of the i-th Gaussian component and μ_i is its mean vector. The parameter set λ of the GMM can therefore be written as:

$$\lambda = (w_i, \mu_i, \Sigma_i), \quad i = 1, 2, \ldots, N \tag{4}$$

The corresponding Gaussian mixture model g(x) can be written as:

$$g(x) = \sum_{i=1}^{N} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i) \tag{5}$$

where $\mathcal{N}(\cdot)$ denotes the Gaussian probability density function.
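As an illustration of how the mixture density is evaluated, the sketch below computes g(x) as a weighted sum of Gaussian components. It assumes diagonal covariance matrices, a common simplification for MFCC features that the text does not state, and all names are hypothetical.

```python
import math


def gaussian_pdf_diag(x, mean, var):
    """Density of a Gaussian with diagonal covariance (var holds the diagonal)."""
    log_det = sum(math.log(v) for v in var)
    maha = sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mean, var))
    return math.exp(-0.5 * (len(x) * math.log(2.0 * math.pi) + log_det + maha))


def gmm_pdf(x, weights, means, variances):
    """Mixture density g(x): weighted sum of the component densities p_i(x)."""
    return sum(
        w * gaussian_pdf_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Two-component mixture over a 2-dimensional feature vector (made-up numbers).
print(gmm_pdf([0.0, 0.0], [0.6, 0.4],
              [[0.0, 0.0], [3.0, 3.0]],
              [[1.0, 1.0], [2.0, 2.0]]))
```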

Building the GMM is in fact the process of estimating its parameters by training. The expectation-maximization (EM) algorithm can be used to update the model parameters. The algorithm has two main steps: the expectation (E) step and the maximization (M) step. The E step uses the current parameter set to compute the expected value of the complete-data likelihood function, and the M step obtains new parameters by maximizing this expectation. The E and M steps are iterated until convergence. The resulting GMMs of speech and noise are denoted g(s) and g(n), where s denotes the speech signal and n the noise signal.
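The E and M steps can be sketched as below for a GMM with diagonal covariances. This is a simplified illustration, not the training procedure of the method itself: the diagonal covariance, the variance floor of 1e-6, the fixed iteration count used in place of a convergence test, and every name are assumptions of the sketch. It also does not guard against a component that ends up with no data.

```python
import math


def em_step(data, weights, means, variances):
    """Run one EM iteration and return the updated (weights, means, variances)."""
    num_comp, dim = len(weights), len(data[0])

    # E step: posterior probability of each component for each feature vector.
    resp = []
    for x in data:
        log_joint = []
        for w, mean, var in zip(weights, means, variances):
            log_pdf = -0.5 * sum(
                math.log(2.0 * math.pi * vj) + (xj - mj) ** 2 / vj
                for xj, mj, vj in zip(x, mean, var)
            )
            log_joint.append(math.log(w) + log_pdf)
        peak = max(log_joint)  # subtract the peak to avoid underflow
        scaled = [math.exp(value - peak) for value in log_joint]
        total = sum(scaled)
        resp.append([s / total for s in scaled])

    # M step: re-estimate weights, means, and variances from the posteriors.
    new_w, new_means, new_vars = [], [], []
    for i in range(num_comp):
        n_i = sum(r[i] for r in resp)
        mean = [sum(r[i] * x[j] for r, x in zip(resp, data)) / n_i
                for j in range(dim)]
        var = [max(sum(r[i] * (x[j] - mean[j]) ** 2
                       for r, x in zip(resp, data)) / n_i, 1e-6)
               for j in range(dim)]
        new_w.append(n_i / len(data))
        new_means.append(mean)
        new_vars.append(var)
    return new_w, new_means, new_vars


def fit_gmm(data, weights, means, variances, iterations=25):
    """Iterate EM a fixed number of times (a convergence test is omitted)."""
    for _ in range(iterations):
        weights, means, variances = em_step(data, weights, means, variances)
    return weights, means, variances
```

In practice the log-likelihood would be monitored between iterations and training stopped once its improvement falls below a tolerance.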

The Gaussian supervectors are constructed from the established Gaussian mixture models. A Gaussian supervector is built from the parameters of a Gaussian mixture model. The GMM Gaussian supervectors m_s and m_n of speech and noise can be expressed as:

$$m_s = [\mu_1^s, \mu_2^s, \ldots, \mu_N^s]^T \tag{6}$$

$$m_n = [\mu_1^n, \mu_2^n, \ldots, \mu_N^n]^T \tag{7}$$

where μ_1^s, ..., μ_N^s are the mean vectors of the Gaussian components of g(s), and μ_1^n, ..., μ_N^n are the mean vectors of the Gaussian components of g(n).
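A supervector built this way is a concatenation of the component mean vectors. The sketch below assumes the means are stacked in component order; the function name and the numbers are hypothetical.

```python
def gmm_supervector(means):
    """Stack the N component mean vectors (each of length L) into one N*L vector."""
    return [value for mean in means for value in mean]


# Three components with 2-dimensional means give a 6-dimensional supervector.
speech_means = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(gmm_supervector(speech_means))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Because every GMM has the same order N and feature dimension L, every supervector has the same length, which is what allows it to be used as a fixed-size SVM input.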

Step S103: Construct the SVM classifier from the constructed Gaussian supervectors. The Gaussian supervectors m_n and m_s corresponding to the noise signal and the speech signal are used to build the SVM models of the noise signal and the speech signal, respectively, and to construct a K-L kernel function. The kernel function is built from the K-L divergence between the two GMM probability distributions.

The kernel function constructed from the GMM supervectors m_n and m_s of noise and speech is denoted K(n, s) (formula (8)).

Once the kernel function, the SVM model of the speech signal, and the SVM model of the noise signal are determined, the SVM classifier is obtained.

Step 2: Use the constructed GMM-based SVM classifier to perform a VAD decision on the current far-end block. The signals to be detected that are input to the SVM classifier are the near-end and far-end signals after division into blocks. A Fourier transform is first applied to convert them to the frequency domain, and the features of each signal block, such as the MFCC and the normalized cross-correlation, are then calculated from the signal spectrum. Step 2 is divided into steps S201 to S203.

Step S201: Extract the MFCC parameters of the current far-end block. The extraction process is the same as in step S101, and the MFCC parameters of the current far-end block are obtained by formula (1).

Step S202: Generate the Gaussian supervector of the current far-end block. A Gaussian mixture model is built from the MFCC parameters of the current far-end block, and the Gaussian supervector of the block is constructed from this model. The supervector is generated in the same way as in step S102, as shown in formulas (6) and (7).

Step S203: The Gaussian supervector of the current far-end block is input to the constructed SVM classifier, and speech/noise classification is performed using the GMM-based SVM algorithm to obtain the VAD decision for the far-end speech. If the block is classified as noise, the result is that no speech is present; filter updating and filtering are stopped and the near-end speech signal is output directly. If the block is classified as speech, the far end contains speech and the double-talk decision of the next step is made.
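The classification in step S203 amounts to evaluating the sign of an SVM decision function on the supervector. The sketch below shows that evaluation in generic form. The dot-product kernel is only a placeholder and is not the K-L kernel of the method, and the support vectors, coefficients, and bias are made-up values that a real system would obtain from training.

```python
def linear_kernel(a, b):
    """Placeholder kernel (plain dot product); NOT the K-L kernel of the method."""
    return sum(x * y for x, y in zip(a, b))


def svm_vad_decision(supervector, support_vectors, signed_alphas, bias,
                     kernel=linear_kernel):
    """Return 'speech' or 'noise' from the sign of the SVM decision function.

    signed_alphas holds alpha_i * y_i, with y_i = +1 for speech, -1 for noise.
    """
    score = bias + sum(
        alpha * kernel(sv, supervector)
        for alpha, sv in zip(signed_alphas, support_vectors)
    )
    return "speech" if score > 0.0 else "noise"


# Hypothetical trained model with one support vector per class.
support = [[1.0, 1.0], [-1.0, -1.0]]
alphas = [1.0, -1.0]
print(svm_vad_decision([0.5, 0.7], support, alphas, bias=0.0))    # speech
print(svm_vad_decision([-0.3, -0.2], support, alphas, bias=0.0))  # noise
```

Passing the kernel as a parameter keeps the decision rule unchanged when a different kernel is substituted.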

Step 3: Determine whether the system is in the double-talk state.

Step S301: Calculate the error signal.

The adaptive filter coefficients model the echo path, so convolving the current far-end block with the adaptive filter coefficients gives the estimated echo signal x^T(n)w(n). The error signal e(n) is the difference between the current near-end block d(n) and the estimated echo signal x^T(n)w(n).

The adaptive filter coefficients are updated continuously by an adaptive algorithm using the error signal and the far-end signal. The update formula of one common algorithm, the LMS algorithm, is:

$$w(n+1) = w(n) + 2\mu\, e(n)\, x(n) \tag{9}$$

where μ is the step size, w(n) is the filter weight vector, e(n) is the error signal, x(n) is the far-end signal, and n denotes the n-th time instant (sample).
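A time-domain sketch of the error calculation and the LMS update is given below. The helper that identifies a known echo path uses synthetic white noise as the far-end signal and no near-end speech; the step size, sample count, and all names are assumptions chosen for the illustration.

```python
import random


def lms_step(weights, far_taps, near_sample, mu):
    """One LMS iteration: w(n+1) = w(n) + 2*mu*e(n)*x(n)."""
    echo_estimate = sum(w * x for w, x in zip(weights, far_taps))  # x^T(n) w(n)
    error = near_sample - echo_estimate                            # e(n)
    updated = [w + 2.0 * mu * error * x for w, x in zip(weights, far_taps)]
    return updated, error


def identify_echo_path(true_path, num_samples=3000, mu=0.05, seed=0):
    """Adapt from zero weights toward a known echo path (far-end single talk)."""
    rng = random.Random(seed)
    weights = [0.0] * len(true_path)
    taps = [0.0] * len(true_path)
    for _ in range(num_samples):
        taps = [rng.uniform(-1.0, 1.0)] + taps[:-1]          # newest sample first
        near = sum(h * x for h, x in zip(true_path, taps))   # echo only
        weights, _error = lms_step(weights, taps, near, mu)
    return weights


print(identify_echo_path([0.5, -0.2]))
```

If near-end speech were added to `near`, the error would no longer reflect only the echo-path mismatch, which is why the update is frozen during double talk.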

Step S302: Calculate the normalized cross-correlation of the far-end signal and the error signal. Because a cross-correlation in the time domain can be converted into a point-wise product in the frequency domain, that is, a bin-by-bin multiplication of the two signal spectra, the normalized cross-correlation can be obtained directly from the far-end signal spectrum X(k) and the error signal spectrum E(k) at low computational cost.

In the frequency-domain expression (formula (10)), ξ_XECC denotes the normalized cross-correlation of the far-end signal and the error signal, and k denotes the frequency bin.
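The equivalence between a time-domain correlation and a point-wise spectral product can be checked numerically. The sketch below verifies it for the zero-lag correlation only, using a direct DFT. It illustrates the underlying identity and is not the normalized statistic ξ_XECC itself; the function names and sample values are made up.

```python
import cmath


def dft(signal):
    """Direct discrete Fourier transform (O(n^2), fine for a short example)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def correlation_time(x, e):
    """Zero-lag cross-correlation computed sample by sample."""
    return sum(a * b for a, b in zip(x, e))


def correlation_freq(x, e):
    """Same quantity from the spectra: (1/n) * sum_k X(k) * conj(E(k))."""
    spec_x, spec_e = dft(x), dft(e)
    return sum((a * b.conjugate()).real for a, b in zip(spec_x, spec_e)) / len(x)


x = [1.0, -2.0, 0.5, 3.0]
e = [0.2, 0.1, -0.4, 0.3]
print(correlation_time(x, e), correlation_freq(x, e))
```

In a frequency-domain echo canceller the spectra X(k) and E(k) are already available, so the product needs no extra transform.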

Step S303: DTD decision. The normalized cross-correlation ξ_XECC of the far-end signal and the error signal is compared with the normalized cross-correlation threshold. When there is no near-end speech, ξ_XECC should be equal to 1; when near-end speech is present, ξ_XECC is less than 1. A constant T_XECC slightly less than 1 can therefore be set as the threshold. T_XECC usually takes a value between 0.9 and 1 and is updated in real time according to the detection results. The update algorithm is chosen according to the actual situation. A good threshold should make both the false-alarm probability and the miss probability relatively small. For example, a constant slightly less than 1 can first be chosen arbitrarily, the near-end speech is then set to 0, the false-alarm and miss probabilities are calculated, and T_XECC is adjusted within a certain range until both probabilities are small.

When the normalized cross-correlation is less than the threshold, that is,

$$\xi_{XECC} < T_{XECC} \tag{11}$$

the system is in the double-talk state: filter coefficient updating is stopped and the near-end signal is filtered directly with the existing filter coefficients. Otherwise near-end speech is absent and only far-end speech is present, so the filter coefficients are updated and filtering is also performed.
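The combined VAD and DTD strategy reduces to a small decision table, sketched below. The function name, the dictionary layout, and the state labels are hypothetical; the inputs are assumed to come from the far-end VAD decision and the normalized cross-correlation calculation.

```python
def select_filter_action(far_end_has_speech, xecc, threshold):
    """Choose the filter behaviour for the current block.

    far_end_has_speech -- result of the far-end VAD decision
    xecc               -- normalized cross-correlation of far-end and error signals
    threshold          -- T_XECC, typically between 0.9 and 1
    """
    if not far_end_has_speech:
        # No reference signal: pass the near-end signal through untouched.
        return {"state": "far-end silent", "update": False, "filter": False}
    if xecc < threshold:
        # Near-end speech present: freeze the coefficients but keep filtering.
        return {"state": "double talk", "update": False, "filter": True}
    return {"state": "far-end single talk", "update": True, "filter": True}


print(select_filter_action(False, 0.50, 0.95)["state"])  # far-end silent
print(select_filter_action(True, 0.80, 0.95)["state"])   # double talk
print(select_filter_action(True, 0.99, 0.95)["state"])   # far-end single talk
```

Checking the VAD result first means the cross-correlation is never consulted when both ends are silent, so that state cannot be mistaken for double talk.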

The voice state detection method proposed by the invention was applied in an actual echo cancellation system comprising two terminals, and the practical call quality was verified using the VoIP software Sipdroid.

The proposed method combining VAD and DTD was first simulated in MATLAB. The speech signals used in the simulation comprise one 30-second far-end speech PCM (Pulse Code Modulation) stream and one corresponding near-end speech PCM stream, both sampled at 8000 Hz. In the echo cancellation system, the filter length is set to 128, the adaptive filtering algorithm is the BFDAF algorithm (that is, the frequency-domain NLMS algorithm), and the voice state detection algorithm is the method proposed by the invention.

Fig. 2 shows the two PCM streams used in the simulation: the far-end signal waveform at the top and the near-end signal waveform below it. The abscissa is time in seconds and the ordinate is amplitude. With the original voice state detection method, that is, using only energy-based DTD detection, the echo cancellation result is as shown in Fig. 3. It can be seen that, with the unmodified VAD, echo cancellation in the first half is fairly good, although a small amount of residual echo remains. The result in the second half is less satisfactory: a considerable part of the original speech is removed, and the signal after echo cancellation is noticeably distorted.

With the voice state detection method proposed by the invention, the echo cancellation result is as shown in Fig. 4. Comparing the PCM streams obtained after echo cancellation before and after the improvement shows that echo cancellation improves markedly once the voice state detection method is improved. The residual echo is removed more thoroughly, and the near-end speech shows almost no distortion.

To further verify the effect of the proposed voice state detection method in an actual echo cancellation system, a corresponding C program was written for the method, and the method was tested using the voice communication software Sipdroid.

Following the steps of the voice state detection method of the invention, the parts of the WebRTC echo cancellation library that perform VAD and DTD were modified, and the library was then called from Sipdroid. Real two-way calls were made with Sipdroid in different environments and recorded, and the speech PCM streams before and after echo cancellation were saved for analysis of the echo cancellation result.

To make the extracted speech streams easier and clearer to inspect, in each test the two callers counted aloud from 1 to 10 in turn. In each environment, multiple call tests were made with the Sipdroid version before the improvement and with the improved version for comparison.

Multiple call tests were first made on the echo cancellation performance of Sipdroid using the echo cancellation library before the improvement, and the PCM streams of the far end, the near end, and the output after echo cancellation were extracted. The test result is shown in Fig. 5, in which only the PCM streams of the counting part are shown. The first PCM stream is the far-end signal, the second is the near-end signal, and the third is the near-end signal after echo cancellation. The echo cancellation result is less than satisfactory: some residual echo remains in the counting part, in the region marked by the dashed box. Most of the other test results are similar.

Multiple call tests were then made in the same way on the echo cancellation performance of Sipdroid using the improved echo cancellation library, and the PCM streams of the far end, the near end, and the output after echo cancellation were extracted. Fig. 6 shows one representative test result. As in Fig. 5, the first PCM stream is the far-end signal, the second is the near-end signal, and the third is the near-end signal after echo cancellation. With the improved voice state detection method of the invention, the echo cancellation result is satisfactory: the residual echo in the counting part is removed more thoroughly, as in the region marked by the dashed box, and the original speech is preserved without being affected. The repeated tests show that the echo cancellation result is affected to some extent by the environment and that its stability needs further improvement. In most cases, however, echo cancellation with the voice state detection method of the invention is clearly better than before the improvement.

Claims

1. A voice state detection method suitable for echo cancellation systems, characterized in that it is implemented in the following steps:

Step 1: Construct a support vector machine (SVM) classifier from noise signal training samples and speech signal training samples;

Feature extraction and Gaussian mixture model (GMM) training are performed separately on the noise signal training samples and the speech signal training samples, and the corresponding Gaussian supervectors are constructed; the Gaussian supervectors are then used to construct the kernel function of the SVM classifier and the SVM models corresponding to the speech signal and the noise signal; the SVM classifier is built from the constructed kernel function and SVM models;

Step 2: The signals to be detected are the near-end and far-end signals after division into blocks, and the constructed SVM classifier performs a VAD decision on the current far-end block; VAD denotes voice activity detection;

Feature extraction and GMM training are performed on the current far-end block, a Gaussian supervector is constructed, and the Gaussian supervector of the current far-end block is input to the constructed SVM classifier for a decision; if the result is noise, indicating that no speech is present, filter updating and filtering are stopped and the near-end speech signal is output directly; otherwise the far end contains speech, and the double-talk decision of the next step is made;

Step 3: Determine whether the system is in the double-talk state;

The normalized cross-correlation ξ_XECC of the far-end signal and the error signal is calculated and compared with a set threshold T_XECC; when ξ_XECC < T_XECC, the system is in the double-talk state, filter coefficient updating is stopped, and the near-end signal is filtered; otherwise, there is no near-end speech, and the filter coefficients are updated and filtering is performed according to the far-end signal.

2. The voice state detection method suitable for echo cancellation systems according to claim 1, characterized in that constructing the SVM classifier in step 1 comprises the following steps:

Step S101: Feature extraction is performed on the noise signal training samples and the speech signal training samples; the features used are Mel-frequency cepstral coefficients (MFCC);

The MFCC extraction process is: pre-emphasis, blocking, and windowing are applied to the signal, and the spectrum of each windowed block is obtained by the fast Fourier transform (FFT); the spectrum of each block is passed through a Mel-scale filter bank composed of K triangular band-pass filters, and the logarithm of the output of each band is taken to obtain the log spectrum; with the K band-pass filters numbered 0 to K-1 and S_i(k) denoting the log spectrum obtained after the i-th block passes through band-pass filter number k, the l-th order MFCC parameter m_i(l) of the i-th block is:

$$m_i(l) = \sum_{k=0}^{K-1} S_i(k) \cos\left(\frac{\pi l (k + 0.5)}{K}\right), \quad l = 1, 2, \ldots, L$$

where L is the total order of the extracted MFCC;

Step S102: Generate the Gaussian supervectors of the noise signal training samples and the speech signal training samples;

The MFCC parameters of the noise signal training samples and of the speech signal training samples are used to build the Gaussian mixture models corresponding to the noise signal and the speech signal, respectively;

For a given block, the N-th order Gaussian mixture model g(x) is expressed as:

$$g(x) = \sum_{i=1}^{N} w_i \, p_i(x), \qquad p_i(x) = \frac{1}{(2\pi)^{L/2} |\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}$$

where x is the L-dimensional feature vector formed by the MFCC parameters of this block of the training sample, p_i(x) is the i-th Gaussian component of the Gaussian mixture model, w_i is the weighting factor of the i-th Gaussian component, Σ_i is the covariance matrix of the i-th Gaussian component, and μ_i is the mean vector of the i-th Gaussian component;

The Gaussian mixture model g(x) is further expressed as $g(x) = \sum_{i=1}^{N} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)$, where $\mathcal{N}(\cdot)$ denotes the Gaussian probability density function;

The parameters of the Gaussian mixture models are updated using the EM algorithm; let the Gaussian mixture model finally obtained for the speech signal training samples be g(s), with component mean vectors μ_1^s, μ_2^s, ..., μ_N^s, where s denotes the speech signal; let the Gaussian mixture model finally obtained for the noise signal training samples be g(n), with component mean vectors μ_1^n, μ_2^n, ..., μ_N^n, where n denotes the noise signal; the Gaussian supervectors m_s and m_n of the speech signal training samples and the noise signal training samples, constructed from the established Gaussian mixture models, are respectively:

$$m_s = [\mu_1^s, \mu_2^s, \ldots, \mu_N^s]^T, \qquad m_n = [\mu_1^n, \mu_2^n, \ldots, \mu_N^n]^T$$

Step S103: Construct the SVM classifier from the constructed Gaussian supervectors;

The Gaussian supervectors m_n and m_s are used to build the SVM models corresponding to the noise signal and the speech signal, respectively;

The kernel function K(n, s) is constructed from the Gaussian supervectors m_n and m_s;

Once the kernel function, the SVM model of the speech signal, and the SVM model of the noise signal are determined, the SVM classifier is obtained.

3. The voice state detection method suitable for echo cancellation systems according to claim 1 or 2, characterized in that, in step 3, the error signal is calculated as follows: the current far-end block is convolved with the adaptive filter coefficients to obtain the estimated echo signal, and the error signal is the difference between the current near-end block and the estimated echo signal.

4. The voice state detection method suitable for echo cancellation systems according to claim 1 or 2, characterized in that, in step 3, the normalized cross-correlation ξ_XECC of the far-end signal and the error signal is calculated in the frequency domain from the far-end signal spectrum X(k) and the error signal spectrum E(k), where k denotes the frequency bin.

5. The voice state detection method suitable for echo cancellation systems according to claim 1 or 2, characterized in that, in step 3, the set threshold T_XECC is a value between 0.9 and 1 and is updated in real time according to the decision results.