CN104810021B

CN104810021B - Preprocessing method and device for far-field recognition

Info

Publication number: CN104810021B
Application number: CN201510236032.6A
Authority: CN
Inventors: 魏建强; 崔玮玮; 宋辉; 王昕�; 姜俊
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2015-05-11
Filing date: 2015-05-11
Publication date: 2017-08-18
Anticipated expiration: 2035-05-11
Also published as: CN104810021A

Abstract

The present invention proposes a preprocessing method and device for far-field recognition. The method includes: performing fixed beamforming on the sound signals to be processed to obtain the beam signals after fixed beamforming; performing acoustic echo cancellation and optimal beam selection on the beam signals after fixed beamforming; and obtaining the preprocessed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection. The method can improve the preprocessing result and, optionally, reduce the computational load when the number of sound signals is large.

Description

Preprocessing method and device for far-field recognition

Technical field

The present invention relates to the technical field of data processing, and in particular to a preprocessing method and device for far-field recognition.

Background art

Far-field recognition, also called long-distance recognition, generally addresses speech recognition in scenarios where the speaker is more than 2 meters from the speech device. To obtain more stable and reliable far-field recognition performance, preprocessing (far-field pickup) for far-field recognition scenarios is especially urgent and important.

In the prior art, the far-field pickup pipeline connects the following stages in series: acoustic echo cancellation (AEC), sound source localization, adaptive beamforming (ABF), single-microphone enhancement, and post-processing.

However, the prior art requires a sound source localization module. The accuracy of that module is itself unsatisfactory, and because it is cascaded with the subsequent ABF it also degrades ABF performance and therefore the preprocessing result. In addition, because AEC is performed first, the computational load is large when the number of sound signals to be processed is large.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, one object of the present invention is to propose a preprocessing method for far-field recognition that can improve the preprocessing result and, optionally, reduce the computational load when the number of sound signals is large.

Another object of the present invention is to propose a preprocessing device for far-field recognition.

To achieve the above objects, an embodiment of the first aspect of the present invention proposes a preprocessing method for far-field recognition, comprising: performing fixed beamforming on the sound signals to be processed to obtain the beam signals after fixed beamforming; performing acoustic echo cancellation and optimal beam selection on the beam signals after fixed beamforming; and obtaining the preprocessed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection.

The preprocessing method for far-field recognition proposed by the embodiment of the first aspect does not require a sound source localization module, so it avoids the poor preprocessing result caused by inaccurate sound source localization and thereby improves the preprocessing result. Moreover, optionally, AEC is performed after fixed beamforming (FBF); because the number of beams after FBF is usually smaller than the number of sound signals to be processed, the computational load can be reduced.

To achieve the above objects, an embodiment of the second aspect of the present invention proposes a preprocessing device for far-field recognition, comprising: a fixed beamforming module, configured to perform fixed beamforming on the sound signals to be processed to obtain the beam signals after fixed beamforming; a processing module, configured to perform acoustic echo cancellation and optimal beam selection on the beam signals after fixed beamforming; and an acquisition module, configured to obtain the preprocessed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection.

The preprocessing device for far-field recognition proposed by the embodiment of the second aspect does not require a sound source localization module, so it avoids the poor preprocessing result caused by inaccurate sound source localization and thereby improves the preprocessing result. Moreover, optionally, AEC is performed after fixed beamforming (FBF); because the number of beams after FBF is usually smaller than the number of sound signals to be processed, the computational load can be reduced.

Additional aspects and advantages of the present invention will be set forth in part in the following description, will in part become apparent from that description, or will be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a schematic flowchart of a preprocessing method for far-field recognition according to one embodiment of the present invention;

Fig. 2 is a schematic flowchart of a preprocessing method for far-field recognition according to another embodiment of the present invention;

Fig. 3 is a schematic flowchart of a preprocessing method for far-field recognition according to another embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a preprocessing device for far-field recognition according to another embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a preprocessing device for far-field recognition according to another embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a preprocessing device for far-field recognition according to another embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar modules or modules having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present invention, and are not to be construed as limiting it. On the contrary, the embodiments of the invention cover all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.

Fig. 1 is a schematic flowchart of a preprocessing method for far-field recognition according to one embodiment of the present invention. The method includes:

S11: Perform fixed beamforming on the sound signals to be processed to obtain the beam signals after fixed beamforming.

Here, a sound signal to be processed may be a microphone signal, that is, a signal picked up by a microphone, which includes the near-end speech signal (a voice control command), room reverberation, various environmental noises, and so on.

In far-field recognition, a microphone array (of directional or omnidirectional microphones) is usually used to improve recognition performance. The sound signals to be processed may therefore be microphone array signals, which comprise multiple channels of microphone signals.

Beamforming techniques include the ABF used in the prior art and also fixed beamforming (FBF).

The spatial beam characteristic of ABF changes adaptively, whereas that of FBF is fixed. A spatial beam characteristic is, for example, the signal gain response in a specific direction.

In FBF processing, optionally, multiple fixed beams are used; each fixed beam covers part of the space, and all the fixed beams together cover the whole space.

Full coverage of the space by the beams ensures that the user's speech can be detected wherever the user is located, which avoids restricting the user's position.

When the number of sound signals to be processed (for example, microphone array signals) is large, the number of fixed beams used by FBF may be smaller than the number of sound signals to be processed, in order to reduce the computational load.

For example, there are 3 fixed beams, each covering a different 120-degree sector of space; or there are 6 fixed beams, each covering a different 60-degree sector of space.
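The following is a minimal sketch of how a set of fixed beams could be built. The patent does not name a specific FBF algorithm; the sketch assumes a narrowband delay-and-sum beamformer, a circular array, and a far-field plane-wave model, and all function names (`circular_array`, `steering_vector`, `fixed_beam_weights`, `apply_fbf`) are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def circular_array(num_mics, radius_m):
    """Hypothetical geometry: microphones evenly spaced on a circle, shape (num_mics, 2)."""
    angles = 2 * np.pi * np.arange(num_mics) / num_mics
    return radius_m * np.column_stack([np.cos(angles), np.sin(angles)])

def steering_vector(mic_xy, angle_rad, freq_hz):
    """Relative phase of a far-field plane wave from `angle_rad` at each microphone."""
    direction = np.array([np.cos(angle_rad), np.sin(angle_rad)])
    # Arrival-time lead of each microphone over the array centre, in seconds.
    advance_s = (mic_xy @ direction) / SPEED_OF_SOUND
    return np.exp(2j * np.pi * freq_hz * advance_s)

def fixed_beam_weights(mic_xy, num_beams, freq_hz):
    """Delay-and-sum weights, shape (num_beams, num_mics); computed once, never adapted."""
    look_angles = 2 * np.pi * np.arange(num_beams) / num_beams  # sector centres
    steering = np.stack([steering_vector(mic_xy, a, freq_hz) for a in look_angles])
    return steering / len(mic_xy)

def apply_fbf(weights, mic_bin):
    """Map one frequency bin of all microphone signals to one value per beam."""
    return weights.conj() @ mic_bin
```

With, say, 8 microphones and 3 beams, `fixed_beam_weights` returns a 3-by-8 matrix, so every later stage handles 3 signals instead of 8. Because the weights never change, the beam response seen by a downstream stage is time-invariant.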

S12: Perform acoustic echo cancellation and optimal beam selection on the beam signals after fixed beamforming.

Here, to eliminate interference signals, a speech recognition interactive system usually includes an acoustic echo cancellation (AEC) module; the AEC module is commonly called the BargeIn function module.

Interference signals are, for example, music or text-to-speech (TTS) signals played by the speech recognition interactive system (hereinafter, the system).

In addition to tracking and learning the acoustic transfer function (ATF) from the system's loudspeaker to the microphone, the AEC module must also learn the time-varying components introduced by the processing modules that precede it. If these changes are faster than the convergence rate of the adaptive filter in the AEC, the AEC module can never learn them adequately, and the interference signals played by the system are not cancelled well.

Because the spatial beam characteristic of ABF changes, and the ABF filter usually changes far faster than the filter of the AEC module, ABF in the prior art cannot be placed before AEC to improve the signal-to-noise ratio. The AEC result, however, depends on the signal-to-noise ratio: the higher the signal-to-noise ratio, the better the result. Since the prior art cannot place ABF before AEC to raise the signal-to-noise ratio, its AEC result suffers, which in turn degrades far-field recognition.

This embodiment uses FBF instead. Because the spatial beam characteristic of FBF is fixed, it is known to the AEC module and does not need to be tracked and learned, so FBF can be placed before AEC. FBF processing raises the signal-to-noise ratio, so placing FBF before AEC improves the AEC result and, in turn, far-field recognition.

On the other hand, when the microphone array signals contain many signals (for example, more than 6), the prior art performs AEC first, so the number of AEC modules needed equals the number of microphone signals and is correspondingly large. In this embodiment, FBF is performed before AEC, so the number of AEC modules needed equals the number of FBF beams. Because the number of FBF beams (for example, 3 or 6) is usually smaller than the number of microphone signals when that number is large, the number of AEC modules needed, and hence the computational load, can be reduced significantly.
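The saving can be made concrete with a small count. The sketch below assumes an 8-microphone array, which is an illustrative figure and not one taken from the patent; `aec_instances` is a hypothetical helper.

```python
def aec_instances(num_mics, num_beams, fbf_first):
    """Number of AEC filters needed: one per signal that reaches the AEC stage."""
    return num_beams if fbf_first else num_mics

# Assumed example: 8 microphones, 3 or 6 fixed beams.
for beams in (3, 6):
    aec_first = aec_instances(num_mics=8, num_beams=beams, fbf_first=False)
    fbf_first = aec_instances(num_mics=8, num_beams=beams, fbf_first=True)
    print(f"{beams} beams: {aec_first} AEC filters with AEC first, {fbf_first} with FBF first")
```

Since each AEC filter runs its own adaptive update on every sample, the total AEC cost scales roughly with this count.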

Optimal beam selection may follow a preset selection criterion. For example, if the preset criterion is the maximum signal-to-noise ratio criterion, the beam with the highest signal-to-noise ratio is selected as the optimal beam.
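A minimal sketch of selection under the maximum signal-to-noise ratio criterion follows. The patent does not say how the signal-to-noise ratio is estimated; the sketch assumes a per-beam noise power estimate is already available (for example, measured during non-speech frames), and the names `snr_db` and `select_best_beam` are hypothetical.

```python
import numpy as np

def snr_db(beam, noise_power):
    """Estimated SNR in dB, given the beam samples and an assumed noise power estimate."""
    noise_power = max(noise_power, 1e-12)
    signal_power = max(np.mean(beam ** 2) - noise_power, 1e-12)
    return 10 * np.log10(signal_power / noise_power)

def select_best_beam(beams, noise_powers):
    """Return (index, samples) of the beam with the highest estimated SNR."""
    snrs = [snr_db(b, n) for b, n in zip(beams, noise_powers)]
    best = int(np.argmax(snrs))
    return best, beams[best]
```

Other criteria, such as highest energy or a voice-activity score, could be swapped in without changing the surrounding pipeline.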

In practice, AEC may be performed first and optimal beam selection second, or optimal beam selection first and AEC second.

S13: Obtain the preprocessed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection.

After acoustic echo cancellation and optimal beam selection, further post-processing may be performed to improve the result.

Once the preprocessed signal for far-field recognition is obtained, it can be input to a recognizer (a far-field recognition engine) for recognition.

This embodiment requires no sound source localization, so it effectively avoids the instability and anomalies in overall system performance that localization errors cause. Selecting the optimal beam signal from the fixed spatial beam signals removes the constraint that conventional methods place on the position of the near-end speaker, so the method adapts seamlessly to a speaker who moves continuously around the room and significantly improves the overall user experience. With fixed beamforming, the spatial beam characteristic does not change over time, a property that the subsequent AEC module can learn well, so the FBF module can be placed before the AEC module. On the one hand, this yields a reference signal with a higher signal-to-noise ratio and effectively improves the convergence rate and performance of the subsequent AEC; on the other hand, because the number of FBF spatial beams is usually smaller than the number of microphones, it effectively reduces the number of times the AEC module is used and lowers the overall computation.

Fig. 2 is a schematic flowchart of a preprocessing method for far-field recognition according to another embodiment of the present invention. The method includes:

S21: Perform fixed beamforming on the microphone array signals.

To improve AEC performance and reduce computation, the microphone array (of directional or omnidirectional microphones) can first be used to divide the whole space into several spatial beam regions (for example, 3 or 6).

Because fixed beamforming (FBF) is used, the beam characteristic does not change over time, and the subsequent AEC module can learn this property well. The FBF module can therefore be placed before the AEC module. On the one hand, FBF processing yields a reference signal with a higher signal-to-noise ratio, which effectively improves the convergence rate and performance of the subsequent AEC; on the other hand, the number of spatial beams formed by the FBF module is usually smaller than the number of microphones, which effectively reduces the number of times the AEC module is used and lowers the overall computation.

S22: Using as many acoustic echo cancellation modules as there are beam signals after fixed beamforming, perform acoustic echo cancellation on each beam signal after fixed beamforming to obtain multiple echo-cancelled beam signals.

The FBF module outputs beam signals for several directions. These signals pass through the AEC modules, which remove the interference signals they contain, such as music and TTS played by the system. The echo-cancelled signals can noticeably improve far-field recognition performance.
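A minimal sketch of per-beam echo cancellation follows. The patent does not specify the adaptive algorithm inside the AEC module; the sketch assumes a time-domain normalized LMS (NLMS) filter, one independent instance per beam, with the loudspeaker playback as the reference. The names `nlms_aec` and `aec_per_beam` and the default parameters are hypothetical.

```python
import numpy as np

def nlms_aec(beam, reference, taps=64, mu=0.5, eps=1e-6):
    """Subtract an adaptive estimate of the echo of `reference` from `beam`."""
    w = np.zeros(taps)    # estimated echo path from loudspeaker to this beam
    buf = np.zeros(taps)  # most recent reference samples, newest first
    out = np.empty(len(beam))
    for n in range(len(beam)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        err = beam[n] - w @ buf                  # echo-cancelled sample
        w += mu * err * buf / (buf @ buf + eps)  # normalized LMS update
        out[n] = err
    return out

def aec_per_beam(beams, reference, **kwargs):
    """Run one independent AEC filter on each fixed beam; `beams` has shape (num_beams, n)."""
    return np.stack([nlms_aec(b, reference, **kwargs) for b in beams])
```

Each filter only has to learn the loudspeaker-to-beam path. Because the fixed beam weights are constant, that path changes only when the room does, which is the property the description relies on when placing FBF ahead of AEC.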

S23: Perform optimal beam selection among the multiple echo-cancelled beam signals to select the optimal beam signal.

After processing by the two modules above, each spatial beam signal has had environmental interference removed as far as possible, including background noise, room reverberation, and music and TTS played by the system. In this step, the embodiment selects the optimal spatial beam signal from the several spatial beam signals according to a given criterion (for example, the maximum signal-to-noise ratio criterion) and uses it as the output of the step. This removes the sound source localization module of conventional solutions, which not only reduces computation effectively but also avoids error propagation, that is, the instability and anomalies in overall system performance caused by localization errors. It also removes the conventional requirement that the near-end speaker's position be relatively fixed, which further improves the user experience: by automatically selecting the best of several fixed beam signals in place of a sound source localization module, the embodiment adapts seamlessly to a speaker who moves continuously around the room.

S24: Perform single-microphone (single-channel) enhancement and post-processing on the beam signal after acoustic echo cancellation and optimal beam selection, and take the resulting signal as the preprocessed signal for far-field recognition.

As in conventional solutions, various single-microphone noise reduction techniques can be used to further remove residual noise, and dedicated post-processing such as gain amplification and dynamic range control (DRC) can be cascaded at the back end, to better improve far-field recognition performance.
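The two post-processing steps named above can be sketched as follows. This is a deliberately simplified illustration, not the patent's implementation: the gain is a fixed dB offset, and the DRC is a static sample-wise compressor with no attack or release smoothing. The names `apply_gain` and `compress`, and the threshold and ratio defaults, are assumptions.

```python
import numpy as np

def apply_gain(signal, gain_db):
    """Amplify by a fixed gain given in decibels."""
    return signal * 10 ** (gain_db / 20)

def compress(signal, threshold=0.5, ratio=4.0):
    """Static compressor: magnitude above `threshold` grows only 1/ratio as fast."""
    mag = np.abs(signal)
    over = mag > threshold
    out = signal.astype(float)  # astype returns a copy, so the input is left untouched
    out[over] = np.sign(signal[over]) * (threshold + (mag[over] - threshold) / ratio)
    return out
```

Applying `compress(apply_gain(x, 6.0))` raises quiet far-field speech while keeping loud passages from clipping; a production DRC would normally work on a smoothed level estimate instead of individual samples.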

In this embodiment, building on the previous embodiment, AEC can be performed before optimal beam selection. In this case there is no need to require the absence of system interference signals, so the method applies to a wider range of scenarios.

Fig. 3 is a schematic flowchart of a preprocessing method for far-field recognition according to another embodiment of the present invention. The method can be applied when no system interference signal is present, and includes:

S31: Perform fixed beamforming on the microphone array signals.

For the fixed beamforming processing, see the corresponding description in the embodiments above; it is not repeated here.

S32: Perform optimal beam selection among the multiple beam signals after fixed beamforming to select one optimal beam signal.

The AEC module contains a dedicated module that detects the talk state. There are broadly three states: near-end speech only, double talk (near-end speech and far-end speech), and far-end signal only, where far-end speech is the music, TTS signal, or the like played by the system. When this dedicated module detects that only the near-end speech signal is currently present, it can be determined that no system interference signal exists, so optimal beam selection can be performed first, for example using the maximum signal-to-noise ratio criterion. For the specific manner of optimal beam selection, see the corresponding description in the embodiments above; it is not repeated here.

S33: Using one acoustic echo cancellation module, perform acoustic echo cancellation on the optimal beam signal.

After optimal beam selection only one signal is output, so AEC can be performed with a single AEC module, which reduces the computational load.

S34: Perform single-microphone enhancement and post-processing on the beam signal after acoustic echo cancellation and optimal beam selection, and take the resulting signal as the preprocessed signal for far-field recognition.

In this embodiment, when no system interference signal is present, optimal beam selection can be performed before AEC, which reduces the number of AEC modules and the computational load.
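The choice between the Fig. 2 ordering and the Fig. 3 ordering can be sketched as a simple dispatch. The sketch assumes the talk-state detector is reduced to a boolean `system_is_playing`, and that the AEC and selection stages are passed in as callables; the function name `preprocess` and its signature are hypothetical.

```python
def preprocess(beams, reference, system_is_playing, aec, select_best):
    """Order AEC and beam selection by talk state; return (signal, AEC runs used)."""
    if system_is_playing:
        # Fig. 2 ordering: one AEC per beam, then pick the best cleaned beam.
        cleaned = [aec(b, reference) for b in beams]
        return select_best(cleaned), len(beams)
    # Fig. 3 ordering: pick the best beam first, then run a single AEC.
    best = select_best(beams)
    return aec(best, reference), 1
```

The second return value makes the trade-off visible: with 3 beams, the first branch runs 3 AEC filters and the second runs 1.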

Fig. 4 is a schematic structural diagram of a preprocessing device for far-field recognition according to another embodiment of the present invention. The device 40 includes:

A fixed beamforming module 41, configured to perform fixed beamforming on the sound signals to be processed to obtain the beam signals after fixed beamforming;

A processing module 42, configured to perform acoustic echo cancellation and optimal beam selection on the beam signals after fixed beamforming;

For example, referring to Fig. 5, when there are multiple beam signals after fixed beamforming, the processing module 42 includes:

Acoustic echo cancellation modules 51, equal in number to the beam signals after fixed beamforming and connected to the fixed beamforming module, configured to perform acoustic echo cancellation on each beam signal after fixed beamforming to obtain multiple echo-cancelled beam signals;

An optimal beam selection module 52, connected to the acoustic echo cancellation modules, configured to perform optimal beam selection among the multiple echo-cancelled beam signals to select the optimal beam signal.

As another example, referring to Fig. 6, when there are multiple beam signals after fixed beamforming and no system interference signal is present, the processing module 42 includes:

An optimal beam selection module 61, connected to the fixed beamforming module, configured to perform optimal beam selection among the multiple beam signals after fixed beamforming to select one optimal beam signal;

One acoustic echo cancellation module 62, connected to the optimal beam selection module, configured to perform acoustic echo cancellation on the optimal beam signal.

An acquisition module 43, configured to obtain the preprocessed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection.

Optionally, the acquisition module 43 is specifically configured to:

Perform single-microphone enhancement and post-processing on the beam signal after acoustic echo cancellation and optimal beam selection, and take the resulting signal as the preprocessed signal for far-field recognition.

It should be noted that, in the description of the present invention, the terms "first", "second", and the like are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, unless otherwise stated, "multiple" means at least two.

Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present invention belong.

It should be understood that the parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the embodiments above, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit with logic gates for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will understand that all or some of the steps of the method embodiments above may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, may each exist physically on their own, or two or more units may be integrated into one module. The integrated module may be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

In this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is to be understood that they are exemplary and are not to be construed as limiting the invention; those of ordinary skill in the art may change, modify, replace, and vary the embodiments within the scope of the invention.

Claims

1. A preprocessing method for far-field recognition, characterized by comprising:

performing fixed beamforming on the sound signals to be processed to obtain the beam signals after fixed beamforming;

performing acoustic echo cancellation and optimal beam selection on the beam signals after fixed beamforming;

obtaining the preprocessed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection.

2. The method according to claim 1, characterized in that the fixed beamforming uses multiple fixed beams, each fixed beam covers part of the space, and all the fixed beams together cover the whole space.

3. The method according to claim 2, characterized in that there are 3 fixed beams, each covering a different 120-degree sector of space; or there are 6 fixed beams, each covering a different 60-degree sector of space.

4. The method according to claim 1, characterized in that the fixed beamforming uses multiple fixed beams, and the number of fixed beams is smaller than the number of sound signals to be processed.

5. The method according to any one of claims 1-4, characterized in that, when there are multiple beam signals after fixed beamforming, said performing acoustic echo cancellation and optimal beam selection on the beam signals after fixed beamforming comprises:

using as many acoustic echo cancellation modules as there are beam signals after fixed beamforming, performing acoustic echo cancellation on each beam signal after fixed beamforming to obtain multiple echo-cancelled beam signals;

performing optimal beam selection among the multiple echo-cancelled beam signals to select the optimal beam signal.

6. The method according to any one of claims 1-4, characterized in that, when there are multiple beam signals after fixed beamforming and no system interference signal is present, said performing acoustic echo cancellation and optimal beam selection on the beam signals after fixed beamforming comprises:

performing optimal beam selection among the multiple beam signals after fixed beamforming to select one optimal beam signal;

using one acoustic echo cancellation module, performing acoustic echo cancellation on the optimal beam signal.

7. The method according to any one of claims 1-4, characterized in that said obtaining the preprocessed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection comprises:

performing single-microphone enhancement and post-processing on the beam signal after acoustic echo cancellation and optimal beam selection, and taking the resulting signal as the preprocessed signal for far-field recognition.

8. A preprocessing device for far-field recognition, characterized by comprising:

a fixed beamforming module, configured to perform fixed beamforming on the sound signals to be processed to obtain the beam signals after fixed beamforming;

a processing module, configured to perform acoustic echo cancellation and optimal beam selection on the beam signals after fixed beamforming;

an acquisition module, configured to obtain the preprocessed signal for far-field recognition from the beam signal after acoustic echo cancellation and optimal beam selection.

9. The device according to claim 8, characterized in that, when there are multiple beam signals after fixed beamforming, the processing module comprises:

acoustic echo cancellation modules, equal in number to the beam signals after fixed beamforming and connected to the fixed beamforming module, configured to perform acoustic echo cancellation on each beam signal after fixed beamforming to obtain multiple echo-cancelled beam signals;

an optimal beam selection module, connected to the acoustic echo cancellation modules, configured to perform optimal beam selection among the multiple echo-cancelled beam signals to select the optimal beam signal.

10. The device according to claim 8, characterized in that, when there are multiple beam signals after fixed beamforming and no system interference signal is present, the processing module comprises:

an optimal beam selection module, connected to the fixed beamforming module, configured to perform optimal beam selection among the multiple beam signals after fixed beamforming to select one optimal beam signal;

one acoustic echo cancellation module, connected to the optimal beam selection module, configured to perform acoustic echo cancellation on the optimal beam signal.

11. The device according to any one of claims 8-10, characterized in that the acquisition module is specifically configured to: