CN115527549A

CN115527549A - Echo residue suppression method and system based on special structure of sound

Info

Publication number: CN115527549A
Application number: CN202211220177.3A
Authority: CN
Inventors: 陈佳路; 项京朋; 邱峰海
Original assignee: Beijing Sound+ Technology Co ltd
Current assignee: Beijing Sound+ Technology Co ltd
Priority date: 2022-09-30
Filing date: 2022-09-30
Publication date: 2022-12-27

Abstract

The application provides an echo residue suppression method based on a special sound structure, comprising: acquiring multi-channel time-domain speech signals with a microphone array and converting them into multi-channel frequency-domain signals; performing optimized echo cancellation on the multi-channel frequency-domain signals to obtain multi-channel data with echo signal residue, and performing beam space scanning; constructing an echo signal residual probability function based on spatial characteristics according to the localization result of the beam space scanning; and performing echo residue suppression on the multi-channel data with echo signal residue, using an echo residue suppression algorithm optimized with the echo signal residual probability function, to obtain a target signal. The method and system can effectively distinguish residual echo from the near-end speech signal and can detect weak near-end speech, which improves the performance of subsequent single-channel speech enhancement and automatic gain control: the echo is further suppressed when a large echo residue remains, and the near-end speech is effectively retained when it is present.

Description

Echo residue suppression method and system based on special sound structure

Technical Field

The application relates to the technical field of voice data processing, in particular to an echo residue suppression method and system based on a special sound structure.

Background

With the wide use of voice interaction devices such as smart speakers and conference systems, these devices pursue portability and miniaturization while still delivering high volume, so the sound-producing unit inevitably sits closer to the microphone. As a result, when a near-end speaker is far from the device, the signal-to-echo ratio of the signals received by the microphone is often very low, which adversely affects subsequent speech enhancement. In the speech enhancement schemes in common use today, echo cancellation is performed first.

The traditional echo cancellation method usually uses the correlation between the reference signal and the received signal to adaptively control the filter update, but its performance is limited; in particular, under duplex conditions, relying on correlation alone does not give satisfactory results. In recent years, researchers at home and abroad have also proposed echo suppression methods based on machine learning, which learn residual echo characteristics through online or offline supervised training in order to suppress echo. However, such methods cannot handle all types of residual noise and can destroy the phase difference characteristics between channel signals, which hinders subsequent processing such as direction-of-arrival estimation and beamforming. Therefore, actual products still commonly use the traditional echo cancellation method based on the least mean squares (LMS) algorithm.

In practical applications, only the loudspeaker signal or only the near-end speech signal is present in the simplex case, so the existing adaptive echo cancellation method based on the LMS algorithm can control the filter coefficient update fairly accurately and achieves a good voice interaction effect. In the duplex case, the filter update rate is often reduced to avoid cancelling the near-end speech, which leaves residual echo, and the ratio of echo residue to near-end speech in the filtered output signal cannot be estimated. This introduces considerable uncertainty into subsequent post-processing such as post-filtering and automatic gain control.

Therefore, the existing echo cancellation method based on the LMS algorithm has certain limitations in practical application.

Disclosure of Invention

To solve the above problems, the present application provides an echo residue suppression method based on a special sound structure that can effectively distinguish residual echo from the near-end speech signal. In particular, when the near-end speech is weak, the method can still detect it, which improves subsequent echo residue suppression and automatic gain control: the echo is further suppressed when a large echo residue remains, and the near-end speech is effectively preserved when it is present. The method therefore has important application value.

The application provides an echo residual suppression method based on a special sound structure, which comprises the following steps:

removing loudspeaker audio components in the multi-channel frequency domain signal according to an optimized echo cancellation technology to obtain multi-channel data with echo signal residues; the echo signal is a loudspeaker audio component in the multi-channel frequency domain signal;

counting the beam output energy of the multi-channel data with echo signal residue in each direction according to a beam space scanning technology; each direction is positioned in a plane formed by array elements of the microphone array;

constructing an echo signal residual probability function based on spatial characteristics, according to the characteristic that the beam output energy peak curve of the echo signal over the directions varies periodically in intensity with angle; the spatial characteristics include that, because the sound path difference from the loudspeaker to each array element of the microphone array is the same, the beam output energy peak curve of the direct sound energy component of the echo signal over the directions is periodic with angle;

and according to an echo residue suppression algorithm optimized by taking the echo signal residue probability function as an auxiliary parameter, carrying out echo residue suppression on the multi-channel data with echo signal residue to obtain a target signal.

In a possible implementation, the optimization of the echo cancellation algorithm includes using the echo signal residual probability function calculated from the current frame of the multi-channel frequency domain signal to optimize the echo cancellation applied to the next frame of the multi-channel frequency domain signal.

In one possible embodiment, the echo residual suppression algorithm comprises an optimized single-channel speech enhancement algorithm,

the echo residual suppression is performed on the multi-channel data with echo signal residual, and comprises,

and according to the optimized single-channel speech enhancement algorithm, carrying out echo residue suppression operation on any single-channel data in the multi-channel data with echo signal residue to obtain a first processing signal.

In another possible embodiment, the echo residual suppression algorithm comprises an optimized single-channel speech enhancement algorithm,

the echo residue suppression of the multi-channel data with echo signal residue includes,

according to the arrival direction estimation and beam forming algorithm, carrying out target direction estimation and beam forming operation on the multi-channel data with the echo signal residue to obtain single-channel data with the echo signal residue;

and according to the optimized single-channel speech enhancement algorithm, carrying out echo residue suppression operation on the single-channel data with echo signal residue to obtain a first processing signal.

In one possible embodiment, the echo-residue suppression algorithm further comprises an optimized automatic gain control algorithm,

the performing echo residue suppression on the multi-channel data with echo signal residue further comprises,

and according to the optimized automatic gain control algorithm, carrying out echo residue suppression on the first processing signal to obtain a target signal.

The application also provides an echo residue suppression system based on a special sound structure, including:

a loudspeaker, positioned above or below the center of a microphone array whose array elements form a uniform circular array, and used for emitting the loudspeaker audio component;

the microphone array, the array element of which forms a uniform circular array, is used for obtaining a multi-channel voice signal which is composed of a near-end speaker voice component and a loudspeaker audio component and converting the multi-channel voice signal into a multi-channel frequency domain signal.

The microphone array is also used for removing the audio frequency component of the loudspeaker in the multi-channel frequency domain signal according to the optimized echo cancellation technology to obtain multi-channel data with echo signal residue; the echo signal is a loudspeaker audio component in a multi-channel frequency domain signal;

counting the beam output energy of multi-channel data with echo signal residue in each direction according to a beam space scanning technology; each direction is positioned in a plane formed by array elements of the microphone array;

and according to an echo residue suppression algorithm optimized by taking the echo signal residue probability function as an auxiliary parameter, performing echo residue suppression on the multi-channel data with echo signal residue to obtain a target signal.

In one possible implementation, the echo residual suppression algorithm comprises an optimized single-channel speech enhancement algorithm,

and according to the optimized single-channel speech enhancement algorithm, performing echo residue suppression operation on the single-channel data with echo signal residue to obtain a first processing signal.

The method exploits the fact that the sound path differences from the loudspeaker sound-producing unit to each array element of the microphone array are the same, whereas the sound path differences from an external near-end signal to each array element are different. Based on these sound path differences, it constructs an echo signal residual probability function that effectively distinguishes echo residue from near-end signals: the function takes a larger value when echo residue is present and a smaller value when the residual echo is small. The function can also be used to help improve echo cancellation performance and subsequent processing such as speech enhancement and automatic gain control, so that the residual echo is further suppressed while the near-end signal is retained.

Drawings

Fig. 1 is a schematic diagram of a pickup scene and a special sound structure provided in an embodiment of the present application;

Fig. 2 is a flowchart of an echo residual suppression method based on a special acoustic structure according to an embodiment of the present application;

Fig. 3 is a schematic diagram of an experimental scenario and a testing apparatus provided in an embodiment of the present application;

Fig. 4 is a spectrogram after echo cancellation of a microphone receiving signal based on a pure echo condition according to an embodiment of the present application;

Fig. 5 is a spectrogram after echo cancellation of a microphone received signal based on pure near-end speech according to an embodiment of the present application;

Fig. 6 is a spectrogram after echo cancellation based on a microphone received signal with coexisting small echo and near-end speech according to an embodiment of the present application;

Fig. 7 is a spectrogram after echo cancellation based on a microphone received signal with a large echo and a near-end voice coexisting according to an embodiment of the present application;

Fig. 8 is a flowchart of an echo residue determination function construction module algorithm based on a special sound structure according to an embodiment of the present application;

Fig. 9 is a flowchart of an echo residual suppression method module algorithm based on a special sound structure according to an embodiment of the present application;

Fig. 10 is a comparison graph of the speech spectrum of a conventional post-processing and integrated processing signal based on a residual echo probability function provided by an embodiment of the present application;

Fig. 11 is a schematic structural diagram of an echo residual suppression system based on a special acoustic structure according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application will be described below with reference to the accompanying drawings.

In the description of the embodiments of the present application, the words "exemplary" and "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described herein as "exemplary" or "for example" is not to be construed as preferred or advantageous over other embodiments or designs. Rather, use of these words is intended to present relevant concepts in a concrete fashion.

In the description of the embodiments of the present application, the term "and/or" describes only an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, B exists alone, or A and B exist at the same time. In addition, the term "plurality" means two or more unless otherwise specified. For example, a plurality of systems refers to two or more systems, and a plurality of screen terminals refers to two or more screen terminals.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.

The method provided by the application is suitable for microphone array and loudspeaker playback systems with a special structure, such as desktop conference systems and smart speakers. Without loss of generality, the application takes a small desktop conference audio device with a uniform circular microphone array as an example, in which the loudspeaker unit must be located above or below the center of the device. The principle for other array types is similar and is not described separately.

Fig. 1 is a schematic diagram of a pickup scene and a special sound structure provided in the present application, showing a typical application scenario and the special sound structure.

The left diagram in Fig. 1 is the pickup scene. The pickup device is a microphone array uniformly distributed on a circle, and the signals received by the array include a target signal, an echo signal, an environmental noise signal, and so on. Specifically, when a near-end speaker talks during a call, the microphone array receives both the target signal from the near-end speaker and the echo signal emitted by the loudspeaker.

The right diagram in Fig. 1 is the special sound structure. A uniform circular microphone array with M elements is placed in three-dimensional space, the center of the array coincides with the origin O, and the distance between each array element and the center is $\delta_1$. Without loss of generality, the M array elements are numbered mic1, mic2, …, micM in anticlockwise order, where the direction of mic1 is set to 0 degrees, the direction of mic2 to 360/M degrees, and so on. The loudspeaker unit S1 is located below the center of the microphone array. The value of M can be chosen arbitrarily; for example, in Fig. 1, M = 4 and the microphone array includes 4 microphone elements.
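The geometry described above can be sketched numerically. The following is a minimal illustration, not part of the application: the function name `uca_positions`, the 6 cm radius, the 3 cm loudspeaker offset, and the talker position are all assumed values chosen only to show that an on-axis loudspeaker is equidistant from every element while an off-axis talker is not.

```python
import numpy as np

def uca_positions(num_mics: int, radius: float) -> np.ndarray:
    """Return (num_mics, 3) coordinates of a uniform circular array in the z = 0 plane.

    mic1 sits at 0 degrees and numbering proceeds anticlockwise in steps of 360/M degrees.
    """
    angles = 2.0 * np.pi * np.arange(num_mics) / num_mics
    return np.stack([radius * np.cos(angles),
                     radius * np.sin(angles),
                     np.zeros(num_mics)], axis=1)

mics = uca_positions(num_mics=4, radius=0.06)   # assumed: M = 4, 6 cm radius
speaker = np.array([0.0, 0.0, -0.03])           # assumed: 3 cm below the array center
talker = np.array([1.0, 0.5, 0.0])              # assumed: an off-axis near-end talker

# Path length from each source to every array element.
speaker_paths = np.linalg.norm(mics - speaker, axis=1)
talker_paths = np.linalg.norm(mics - talker, axis=1)
print("loudspeaker path lengths:", speaker_paths)
print("talker path lengths:     ", talker_paths)
```

The loudspeaker path lengths come out identical for all elements, whereas the talker path lengths differ from element to element; this is the asymmetry the rest of the description relies on.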

In one example, with the rise of intelligent office equipment, and especially the popularization of conference systems, microphone-array-based multi-channel speech enhancement techniques such as echo cancellation, direction-of-arrival estimation, and beamforming are widely used. Echo cancellation uses a reference signal and adaptive filtering to cancel the loudspeaker audio component in the signal received by the microphone, which improves the performance of subsequent processing such as direction-of-arrival estimation and beamforming and thus the quality of voice interaction.

In practical applications, echo cancellation is usually applied to each microphone's received signal first. Direction of arrival (DOA) estimation and beamforming are then applied to the processed signals: a directional beam is formed toward the near-end target to extract the target signal and suppress reverberant sound components from other directions. Finally, subsequent processing such as single-channel speech enhancement, equalization, and automatic gain control yields the final enhanced target speech signal. However, when the echo is strong, the LMS-based echo cancellation method often leaves echo residue, whose size depends not only on the loudspeaker volume but also on factors such as the transfer function and loudspeaker linearity. Large echo residue can seriously degrade voice interaction quality, so effectively judging the size of the echo residue after echo cancellation and distinguishing it from the near-end speech signal has very important application value.

Based on this, the embodiment of the present application provides an echo residue suppression method based on a special structure of a sound system by using the characteristics that the sound path differences of each array element from a loudspeaker sound production unit to a microphone array are the same, and the sound path differences of each array element from an external near-end signal to the microphone array are different, which is suitable for speech and audio signals, and can be applied to a real-time speech communication system and a non-real-time speech signal enhancement technology. The method comprises the following steps:

(1) Perform beam space scanning on the multi-channel data after echo cancellation, and construct an echo signal residual probability function based on spatial characteristics according to the localization result. The function is used to distinguish echo residue from near-end signals: it takes a larger value when echo residue is present and a smaller value when the residual echo is small.

(2) Optimize several processing modules based on the probability function, including echo cancellation, single-channel speech enhancement, and automatic gain control.

The experimental result shows that the technical scheme provided by the application can effectively distinguish echo residues and near-end voice signals, and particularly can effectively detect weak voice signals under the condition of long-distance weak near-end voice, so that the subsequent echo residue suppression and automatic gain control performance is improved, the echo is further suppressed when large echo residues exist, and meanwhile, the function of effectively retaining the near-end voice when the near-end voice exists is realized.

Fig. 2 is a flowchart of an echo residual suppression method based on a special acoustic structure according to the present application.

As shown in fig. 2, the method includes steps S201 to S205 as follows.

In step S201, acquiring a multi-channel speech signal composed of a near-end speaker speech component and a speaker audio component by using a microphone array; converting the multi-channel voice signal into a multi-channel frequency domain signal; array elements of the microphone array form a uniform circular array; the speaker is located above or below the center position of the microphone array.

Based on the special sound structure shown in Fig. 1, the microphone array elements are numbered mic1, mic2, …, micM in anticlockwise order, where the mic1 direction is set to 0 degrees, the mic2 direction to 360/M degrees, and so on. Suppose that the signal $x_i(n)$ received by the $i$-th microphone element is:

$$x_i(n) = s_i(n) + e_i(n) + v_i(n) \tag{1}$$

where $i = 1, 2, \dots, M$, and $s_i(n)$, $e_i(n)$, and $v_i(n)$ denote the target speech signal, the echo signal, and the environmental noise signal received by the $i$-th microphone array element, respectively. The near-end speaker's voice is the target speech signal, and the echo signal is the loudspeaker output signal after various reflections. Applying the short-time Fourier transform (STFT) to the time-domain signal received by the array, $x(t) = [x_1(t), x_2(t), \dots, x_M(t)]$, gives the $k$-th spectral component of the $l$-th frame for an $N_{FFT}$-point FFT:

where $m = 1, 2, \dots, M$ and $X_m(k, l)$ denotes the received signal of the $m$-th microphone element. The first part comprises the direct sound component and the reverberant sound components of the near-end speaker's voice: its steering vector corresponds to the direction $\Omega_d(k)$, and $S_d(k, l)$ is the signal energy corresponding to that direction, where $d = 0$ corresponds to the direct sound component and the other values to reverberant sound components. Similarly, the second part comprises the direct sound component of the loudspeaker audio signal and the reverberant sound components formed by its reflection from other surfaces: its steering vector corresponds to the direction $\Omega_g(k)$, and $E_g(k, l)$ is the echo signal energy corresponding to that direction, where $g = 0$ corresponds to the direct sound component and the other values to reverberant sound components. The third part, $V(k, l)$, represents the noise signals received by all elements of the microphone array, $V_1(k, l), \dots, V_m(k, l), \dots, V_M(k, l)$, where $V_m(k, l)$ denotes the noise signal received by the $m$-th microphone array element. In desktop conference audio applications, because the loudspeaker is very close to the microphone array, the second part is dominated by the direct sound component, which accounts for most of its energy.
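Equation (2) itself is not reproduced in this text. One form consistent with the three parts described above is sketched below; the steering-vector symbol $\mathbf{a}(\cdot)$, the vector notation, and the summation limits $D$ and $G$ are assumed notation rather than the application's own:

```latex
\mathbf{X}(k,l) = \bigl[X_1(k,l), \dots, X_M(k,l)\bigr]^{T}
  = \sum_{d=0}^{D} \mathbf{a}\bigl(\Omega_d(k)\bigr)\, S_d(k,l)
  + \sum_{g=0}^{G} \mathbf{a}\bigl(\Omega_g(k)\bigr)\, E_g(k,l)
  + \mathbf{V}(k,l)
```

Under this reading, the $d = 0$ and $g = 0$ terms are the direct sound components of the near-end speech and of the loudspeaker signal, and the remaining terms of each sum are the reverberant components.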

Based on the special sound structure shown in Fig. 1, suppose the microphone received signals $X_i(k, l)$, $i = 1, 2, \dots, M$, are processed by a classical LMS-based echo cancellation algorithm, and denote the resulting output signals as $Z_i(k, l)$, $i = 1, 2, \dots, M$. Taking the traditional SRP-PHAT method as an example of beam scanning, the specific technical scheme is as follows:

The cross-correlation between the echo-cancelled output signal $Z_1(k, l)$ of the 1st microphone and the echo-cancelled output signal $Z_2(k, l)$ of the 2nd microphone is:

where $\omega = 2\pi f_k$, $f_k$ is the frequency corresponding to the $k$-th frequency bin, and $\tau_{12}$ is the delay between the two output signals, which is determined mainly by the position of the sound source relative to the pair of microphone elements. To handle multiple pairs of output signals, all output signals are paired two by two and the results for all pairs are accumulated to obtain the output power of the steered beamformer:

where $\tau_{nm}$ is the delay between the echo-cancelled output signal of the $n$-th microphone and that of the $m$-th microphone. Following the SRP-PHAT algorithm, the amplitude at each frequency bin is removed and only the phase information is kept; let:

Equations (4) and (5) then give:

In practical applications, it is generally assumed that the sound source (or the projection of the incoming wave direction onto the horizontal plane) lies in the same plane as the microphone array. If the near-end source signal is incident from direction $\theta_i$, then $\tau_{nm}$ is determined by $\theta_i$. When the plane formed by the array elements of the microphone array is scanned over 360 degrees, equation (6) can be rewritten as:
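The pairwise, phase-only beam scan described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's exact equations (3) to (7): the far-field delay model, the speed of sound, and the names `steering_delays` and `srp_phat` are all assumptions introduced here.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s, assumed

def steering_delays(angles_deg, num_mics, radius):
    """Far-field arrival delay (s) of each element relative to the array center.

    Returns shape (num_angles, num_mics); element m sits at azimuth 360*m/M degrees.
    """
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))[:, None]
    mic_az = 2.0 * np.pi * np.arange(num_mics)[None, :] / num_mics
    return -radius * np.cos(theta - mic_az) / SOUND_SPEED

def srp_phat(Z, freqs_hz, angles_deg, radius):
    """Scan all angles for one frame; Z is (num_mics, num_bins) echo-cancelled spectra."""
    num_mics = Z.shape[0]
    Z_hat = Z / np.maximum(np.abs(Z), 1e-12)             # PHAT: keep phase only
    tau = steering_delays(angles_deg, num_mics, radius)  # (num_angles, num_mics)
    omega = 2.0 * np.pi * np.asarray(freqs_hz)           # (num_bins,)
    power = np.zeros(len(angles_deg))
    for n in range(num_mics):                            # accumulate over all pairs
        for m in range(n + 1, num_mics):
            cross = Z_hat[n] * np.conj(Z_hat[m])         # phase-only cross-spectrum
            tau_nm = tau[:, n] - tau[:, m]               # candidate pair delay per angle
            power += np.real(np.exp(1j * np.outer(tau_nm, omega)) @ cross)
    return power

# Usage: a synthetic plane wave from 130 degrees on a 4-element, 6 cm array.
freqs = np.arange(300.0, 4000.0, 31.25)
angles = np.arange(0, 360, 10)
true_tau = steering_delays([130.0], 4, 0.06)[0]
Z = np.exp(-1j * 2.0 * np.pi * freqs[None, :] * true_tau[:, None])
P = srp_phat(Z, freqs, angles, 0.06)
print("estimated direction (degrees):", angles[np.argmax(P)])
```

Summing the real part over unordered pairs is one common way to write the steered response; it differs from a sum over all ordered pairs only by a constant factor and offset.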

To reduce jitter in the beam scanning output, $P(\theta_i, l)$ can be smoothed over time, for example $P_s(\theta_i, l) = \alpha_s P_s(\theta_i, l-1) + (1 - \alpha_s) \cdot P(\theta_i, l)$, where $\alpha_s$ is the smoothing factor; the simulations in this application use $\alpha_s = 0.8$ as an example. The beam output of each frame is then usually normalized:
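The recursive smoothing above, followed by a per-frame normalization, can be sketched as below. The smoothing recursion and $\alpha_s = 0.8$ come from the text; dividing by the per-frame maximum and initializing the recursion with the first frame are assumptions, since the normalization equation is not reproduced here. The function name `smooth_and_normalize` is hypothetical.

```python
import numpy as np

def smooth_and_normalize(P_frames, alpha_s=0.8):
    """P_frames: (num_frames, num_angles) raw beam outputs P(theta_i, l).

    Returns (P_s, P_norm): recursively time-smoothed outputs and their per-frame
    normalized versions. Normalizing by the per-frame maximum is an assumption.
    """
    P_frames = np.asarray(P_frames, dtype=float)
    P_s = np.empty_like(P_frames)
    prev = P_frames[0]                       # assumed initial state: the first frame
    for l, P in enumerate(P_frames):
        prev = alpha_s * prev + (1.0 - alpha_s) * P
        P_s[l] = prev
    peak = np.max(P_s, axis=1, keepdims=True)
    P_norm = P_s / np.maximum(peak, 1e-12)   # guard against division by zero
    return P_s, P_norm

# Usage with two toy frames of three scan angles each.
P_s, P_norm = smooth_and_normalize([[1.0, 2.0, 4.0], [3.0, 2.0, 0.0]])
print(P_s)
print(P_norm)
```

With $\alpha_s = 0.8$ the second frame moves the smoothed output only one fifth of the way toward the new measurement, which is what damps frame-to-frame jitter.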

Based on the special sound structure shown in Fig. 1, the sound path differences from the loudspeaker sound-producing unit to the microphone array elements are the same, so when beam scanning is performed according to equation (7), the beam output repeats periodically at intervals of 360/M degrees. The sound path differences from the near-end speaker to the microphone array elements cannot all be the same, so periodically repeating beam output rarely occurs for near-end speech in practice.
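The periodicity claim can be checked with a small idealized model. The sketch below assumes that the loudspeaker's direct sound reaches every element with the same phase, so that only the steering phases remain in the scanned response; the array size, radius, speed of sound, and the name `response_equal_paths` are assumptions, and reflections and noise are ignored.

```python
import numpy as np

SOUND_SPEED = 343.0   # m/s, assumed
M, RADIUS = 4, 0.06   # assumed: 4 elements on a 6 cm circle

def response_equal_paths(angles_deg, freqs_hz):
    """Phase-only steered response when every element receives the same phase."""
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))[:, None]   # (A, 1)
    mic_az = 2.0 * np.pi * np.arange(M)[None, :] / M                   # (1, M)
    tau = -RADIUS * np.cos(theta - mic_az) / SOUND_SPEED               # (A, M)
    omega = 2.0 * np.pi * np.asarray(freqs_hz)                         # (K,)
    # Identical unit-modulus spectra on all channels leave only the steering phases.
    steer = np.exp(1j * tau[:, :, None] * omega[None, None, :])        # (A, M, K)
    # |sum over elements|^2 equals the pairwise sum plus a constant per bin.
    return np.sum(np.abs(steer.sum(axis=1)) ** 2, axis=1)              # (A,)

angles = np.arange(0, 360, 5)
P = response_equal_paths(angles, np.arange(300.0, 4000.0, 50.0))
shift = (360 // M) // 5      # number of 5-degree grid steps in 360/M degrees
print("repeats every 360/M degrees:", np.allclose(P, np.roll(P, shift)))
```

Rotating the scan angle by 360/M degrees only permutes the elements of a uniform circular array, so the summed response is unchanged; a source with unequal path lengths breaks this symmetry.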

Fig. 3 shows the experimental scenario and testing apparatus of the present application: the conference audio device includes a uniform circular array of 4 microphones with a radius of 6 cm, and the loudspeaker is located directly below the center of the device. Taking actual test data as an example, Figs. 4(a)-(c) to 7(a)-(c) show, in order, the pure echo case, the pure near-end speech case, the case where both are present with a small echo, and the case where both are present with a large echo. In each figure, (a) is the spectrogram of the signal received by the original microphone array, (b) is the spectrogram of the single-channel output signal after echo cancellation, and (c) is the normalized beam output calculated from the corresponding data by equations (7) and (8), with a scanning interval of 10 degrees and a frequency range of 300 Hz to 4000 Hz.

From the above results, the following conclusions can be found:

(1) The results of Figs. 4(a)-(c) show that when only the loudspeaker is sounding, the normalized beam output is clearly periodic with angle: large peaks appear every 90 degrees starting from 0 degrees, and further peaks appear every 90 degrees starting from 45 degrees, because the symmetry of the microphone positions makes the scanned beam output between adjacent microphones symmetrical about the angle midway between the two array elements.

(2) The results of Figs. 5(a)-(c) show that when only the near-end signal is present, the beam output shows the clear peak of a single sound source, and any other weak spurious peaks show no obvious periodic repetition.

(3) When echo residue and a near-end target are present at the same time (the "double talk" case), the results of Figs. 6(a)-(c) show that when the echo residue is small, the beam output still shows a strong single-source peak, and the results of Figs. 7(a)-(c) show that when the echo residue is strong, the beam output still shows clear periodicity. When the echo residue energy is close to the near-end energy, the beam output shows some periodicity, but the peak amplitudes are disturbed by the near-end speech.

In practical applications, the amount of echo residue is usually judged from the correlation between the echo-cancelled output signal and the reference signal, or from the convergence of the filter. These methods all rely on correlation between signals and do not use the spatial characteristics of the loudspeaker and the near-end sound source.

In one example, based on the special sound structure shown in Fig. 1, the application uses two observations: the sound path differences from the loudspeaker to each microphone array element are substantially the same while those from an external near-end signal differ considerably, and the scanned beams behave differently in the situations shown in Figs. 4 to 7. To improve the performance of the adaptive beamforming algorithm, the application proposes constructing an echo residual function from the spatial characteristic results. The function takes a larger value when the echo residue is larger and a smaller value when it is smaller, so it can be used to judge whether echo residue is present and to determine the energy relationship between the echo residue and the near-end speech signal.

As shown in Fig. 8, the algorithm of the echo residue determination function construction module based on a special sound structure provided in the embodiment of the present application is divided into two main steps:

(1) Perform beam scanning with the steered response power with phase transform (SRP-PHAT) beamforming method, and count the beam output energy in each direction;

(2) Construct an echo residue determination function from the beam output result, such that a higher probability function value is obtained when the echo residue is larger and a lower value when it is smaller, thereby providing auxiliary parameters for subsequent processing and further improving the echo residue suppression capability.

In the construction of the echo residue determination function, these two steps correspond to steps S202 to S204.

In step S202, removing the audio component of the speaker in the multi-channel frequency domain signal according to the optimized echo cancellation technique to obtain multi-channel data with echo signal residue; the echo signal is a loudspeaker audio component in the multi-channel frequency domain signal.

The optimization of the echo cancellation algorithm includes using the echo signal residual probability function calculated from the current frame of the multi-channel frequency domain signal to optimize the echo cancellation applied to the next frame.
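How the probability function steers the next frame's echo cancellation is not spelled out at this point. One hypothetical possibility, shown purely as a sketch, is to scale the step size of a per-bin normalized LMS (NLMS) update by the previous frame's probability, so the filter adapts faster when echo residue dominates and freezes when it does not. The function `nlms_update`, the single-tap-per-bin filter, and the linear mapping from probability to step size are all assumptions, not the application's stated method.

```python
import numpy as np

def nlms_update(W, X_ref, D, p_prev, mu_max=0.5, eps=1e-8):
    """One per-bin NLMS step for a single microphone channel.

    W: (K,) complex filter weights, X_ref: (K,) reference (loudspeaker) spectrum,
    D: (K,) microphone spectrum, p_prev: residual probability from the previous frame.
    """
    E = D - W * X_ref                                  # echo-cancelled output
    mu = mu_max * float(np.clip(p_prev, 0.0, 1.0))     # assumed mapping: step size ~ p
    W_new = W + mu * np.conj(X_ref) * E / (np.abs(X_ref) ** 2 + eps)
    return W_new, E

# Usage: a pure-echo frame with a known echo path H (toy values).
X_ref = np.array([1.0 + 1.0j, 2.0, -1.0j, 0.5])
H = np.array([0.3, -0.2j, 0.1 + 0.1j, 0.4])
D = H * X_ref
W0 = np.zeros(4, dtype=complex)
W_fast, _ = nlms_update(W0, X_ref, D, p_prev=1.0, mu_max=1.0)   # high residual probability
W_hold, _ = nlms_update(W0, X_ref, D, p_prev=0.0, mu_max=1.0)   # low residual probability
print(np.round(W_fast, 3), W_hold)
```

With a probability of 1 the toy filter converges to the echo path in a single step, and with a probability of 0 the weights are left untouched, which is the intended protection during near-end speech.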

In step S203, beam output energy of multi-channel data with echo signal residue in each direction is counted according to a beam space scanning technique; each direction is located in the plane formed by the array elements of the microphone array.
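As an illustration of the beam space scanning in step S203, the following is a minimal Python sketch of an SRP-PHAT scan over azimuth for a single frequency-domain frame. It assumes far-field plane-wave propagation, a sound speed of 343 m/s and a 4-element uniform circular array. The names `srp_phat_scan` and `simulate_plane_wave`, the 5 cm radius, the frequency range and the 5-degree grid are all illustrative assumptions, and the division by the maximum is a stand-in normalization rather than a reproduction of equations (7) and (8).

```python
import numpy as np

def srp_phat_scan(X, freqs, mic_xy, angles_deg, c=343.0):
    """Normalized SRP-PHAT power over candidate azimuths for one frame.

    X: (M, K) complex frequency-domain frame, freqs: (K,) bin frequencies in Hz,
    mic_xy: (M, 2) microphone positions in metres within the array plane.
    """
    eps = 1e-12
    Xw = X / (np.abs(X) + eps)                        # PHAT weighting: keep phase only
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
    u = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (A, 2) look directions
    tau = -(u @ mic_xy.T) / c                         # (A, M) far-field arrival delays
    steer = np.exp(2j * np.pi * freqs[None, None, :] * tau[:, :, None])  # delay compensation
    beam = np.sum(steer * Xw[None, :, :], axis=1)     # (A, K) steered sum over microphones
    power = np.sum(np.abs(beam) ** 2, axis=1)         # (A,) energy summed over bins
    return power / (power.max() + eps)                # assumed normalization to [0, 1]

def simulate_plane_wave(freqs, mic_xy, doa_deg, c=343.0, seed=0):
    """Hypothetical test signal: one far-field source observed by the array."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(freqs.size) + 1j * rng.standard_normal(freqs.size)
    u = np.array([np.cos(np.deg2rad(doa_deg)), np.sin(np.deg2rad(doa_deg))])
    tau = -(mic_xy @ u) / c                           # (M,) delays for the true direction
    return s[None, :] * np.exp(-2j * np.pi * freqs[None, :] * tau[:, None])

# 4-element uniform circular array, radius 5 cm (illustrative values)
mic_angles = np.deg2rad([0.0, 90.0, 180.0, 270.0])
mic_xy = 0.05 * np.stack([np.cos(mic_angles), np.sin(mic_angles)], axis=1)
freqs = np.linspace(300.0, 3400.0, 64)
angles = np.arange(0, 360, 5)
frame = simulate_plane_wave(freqs, mic_xy, 60.0)
profile = srp_phat_scan(frame, freqs, mic_xy, angles)
print(angles[int(np.argmax(profile))])                # azimuth with the largest beam energy
```

For a single external source the profile has one dominant peak at the source azimuth; the periodic multi-peak pattern described for the loudspeaker echo arises from the special geometry and is not modelled by this single-source simulation.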

In step S204, an echo signal residual probability function based on spatial characteristics is constructed from the characteristic that the beam output energy peak curve of the echo signal over all directions varies periodically in intensity with angle. The spatial characteristics include the following: because the sound path differences from the loudspeaker to the array elements of the microphone array are the same, the beam output energy peak curve of the direct sound energy component of the echo signal over all directions is periodic in angle.

Based on the special acoustic configuration shown in fig. 1, the present application is described in detail below, taking as an example a microphone array with 4 array elements, i.e. M = 4.

The specific calculation method of the echo signal residual probability function is as follows:

(1) The beam scanning result of the l-th frame is normalized as in equation (8). Let the first possible peak region be $\psi = [\psi_1, \psi_2, \ldots, \psi_M]$, i.e. the angle indices of 0 degrees, 90 degrees, 180 degrees, and 270 degrees. Let the second possible peak region be $\phi = [\phi_1, \phi_2, \ldots, \phi_M]$, i.e. the angle indices of 45 degrees, 135 degrees, 225 degrees, and 315 degrees;

(2) For each of the 4 peak indices of the first peak region, the average of the normalized beam output at the peak-index angle and its 2 adjacent angles is calculated and denoted $M_1(\psi_i, l)$, $i = 1, 2, 3, 4$.

Similarly, for each of the 4 peak indices of the second peak region, the average of the normalized beam output at the peak-index angle and its 2 adjacent angles is calculated and denoted $M_2(\phi_i, l)$, $i = 1, 2, 3, 4$.

(3) When near-end speech and echo residue exist simultaneously, the periodic peak structure that the echo residue produces in the normalized beam output is gradually destroyed and becomes less and less pronounced as the near-end speech energy increases. Using this characteristic, a threshold th0 is set (th0 = 0.5 in this application), and two counts are calculated: the number $\mu_1$ of values $M_1(\psi_i, l)$, $i = 1, 2, 3, 4$, that exceed the threshold, and the number $\mu_2$ of values $M_2(\phi_i, l)$, $i = 1, 2, 3, 4$, that exceed the threshold. Larger $\mu_1$ and $\mu_2$ indicate that these regions conform more closely to the characteristics of a large echo residue. A state judgment Flag(l) is designed accordingly:

(4) According to the result of step (3), Flag(l) = 1 indicates that a very pronounced periodic peak structure is present in the normalized beam output, i.e. the echo residue is very strong. Flag(l) = -1 indicates that almost no periodic peak structure appears at the potential angles, in which case it can be considered that almost no echo residue remains. Flag(l) = 0 indicates that, although a few peaks appear at the potential angles, the periodicity is not strong enough; in this case echo residue and near-end speech can be considered to exist at the same time, and further judgment is required.

(5) When Flag(l) = 0, the two smallest values of $M_1(\psi_i, l)$, $i = 1, 2, 3, 4$, are averaged to obtain $U_1(l)$; similarly, the two smallest values of $M_2(\phi_i, l)$, $i = 1, 2, 3, 4$, are averaged to obtain $U_2(l)$. A larger $U_1(l)$ or $U_2(l)$ indicates more pronounced peaks at the potential angles. The two smallest values are selected in order to avoid, as far as possible, the influence of the near-end signal on the peak judgment at the potential angles.

(6) From Flag(l), $U_1(l)$ and $U_2(l)$, a DOA-based echo residue judgment function $P_{echo}(l)$ is designed; the larger $P_{echo}(l)$ is, the more echo residue there is. The specific calculation formula is:

Using steps (1) to (6) above, the echo residue judgment function $P_{echo}(l)$ can be computed from the normalized beam scanning result. Figs. 4 (d) to 7 (d) show the residual echo judgment functions calculated by equation (12) for the data corresponding to figs. 4 (a) to (c) through 7 (a) to (c). According to the results, the method obtains a higher $P_{echo}(l)$ when only echo is present and a smaller $P_{echo}(l)$ when only the near-end signal is present; when echo residue and the near-end speech signal exist simultaneously, $P_{echo}(l)$ gradually increases as the echo residue increases. The method provided by the application can therefore effectively judge the severity of the residual echo after echo cancellation, and this information helps the optimal design of the various subsequent processing modules.
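The statistics described in steps (2), (3) and (5) can be sketched in Python as follows. This is a minimal sketch assuming the normalized beam output is sampled on a uniform 5-degree grid and that "adjacent angles" means the neighbouring grid points with wrap-around at 0/360 degrees; the function names and the synthetic test profile are hypothetical. The formulas for Flag(l) and $P_{echo}(l)$ (equations (11) and (12)) are not reproduced in this text, so the sketch stops at $\mu_1$, $\mu_2$, $U_1(l)$ and $U_2(l)$.

```python
import numpy as np

def region_means(beam_norm, angles_deg, region_deg):
    """Mean of the normalized beam output at each region angle and its 2 neighbours."""
    beam_norm = np.asarray(beam_norm, dtype=float)
    angles = [int(a) for a in angles_deg]
    n = len(angles)
    means = []
    for a in region_deg:
        i = angles.index(a)
        idx = [(i - 1) % n, i, (i + 1) % n]   # wrap around at 0/360 degrees (assumption)
        means.append(beam_norm[idx].mean())
    return np.array(means)

def peak_statistics(beam_norm, angles_deg, th0=0.5):
    """Return M1, M2, the counts mu1/mu2 above th0, and U1/U2 (mean of two smallest)."""
    psi = [0, 90, 180, 270]      # first possible peak region
    phi = [45, 135, 225, 315]    # second possible peak region
    m1 = region_means(beam_norm, angles_deg, psi)
    m2 = region_means(beam_norm, angles_deg, phi)
    mu1 = int(np.sum(m1 > th0))
    mu2 = int(np.sum(m2 > th0))
    u1 = float(np.sort(m1)[:2].mean())
    u2 = float(np.sort(m2)[:2].mean())
    return m1, m2, mu1, mu2, u1, u2

# Synthetic profile with four periodic peaks at 0, 90, 180 and 270 degrees
angles = np.arange(0, 360, 5)
beam = 0.5 + 0.5 * np.cos(4.0 * np.deg2rad(angles))
m1, m2, mu1, mu2, u1, u2 = peak_statistics(beam, angles)
print(mu1, mu2, round(u1, 3), round(u2, 3))
```

In this synthetic case all four means of the first region exceed th0 while none of the second region does, which is the kind of strongly periodic pattern the text associates with a large echo residue.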

In practical applications, in order to preserve the quality of near-end speech under duplex (double-talk) conditions, the update rate of the filter is often reduced, which leaves more echo residue; conversely, an excessively large update rate removes echo better but may distort the near-end speech. How to reasonably set the update step size of the filter is therefore very important in practice.

The specific analysis is as follows:

Under duplex conditions, the update rate of the filter is often reduced to protect the near-end speech, which leaves echo residue, and the ratio of the echo residue to the near-end speech component in the filtered output signal generally cannot be estimated. To set the filter update step size reasonably under these conditions, the application provides an echo cancellation filter update optimization method based on the echo residue judgment function. The specific method is as follows:

For the m-th microphone, the l-th frame and the k-th frequency bin, let the received signal be $X_m(k,l)$ and the loudspeaker reference signal be $R(k,l)$. The filtered output $Z_m(k,l)$ and the residual signal $E_m(k,l)$ are respectively:

$$E_m(k,l) = X_m(k,l) - Z_m(k,l-1) \tag{14}$$

where $W_m(k,l)$ is the echo cancellation filter coefficient. The classical method updates the echo cancellation filter using the normalized least mean square (NLMS) algorithm:

$$W_m(k,l) = W_m(k,l-1) + 2\mu(k,l)\,E_m(k,l)\,X_m(k,l) \tag{15}$$

To accelerate convergence, a suitable step size $\mu(k,l)$ must be selected. A common approach solves for the step size according to the minimum mean square error criterion:

$$\mu(k,l) = \frac{1}{|X_m(k,l)|^2} \tag{16}$$

The application provides a step size control method based on the echo residue judgment function: the step size is automatically increased when there is more echo residue, speeding up the filter update to achieve better echo suppression, and automatically reduced when there is less echo residue, slowing down the filter update. The optimized echo cancellation filter is as follows:

where $P_{echo}(l-1)$ is the echo residue judgment function calculated for the previous frame. Because the echo cancellation step precedes the SRP-PHAT calculation, and $P_{echo}(l)$ does not change drastically from frame to frame, $P_{echo}(l-1)$ of the previous frame can be used to assist the step size decision. $\gamma$ is a parameter added to prevent the denominator from becoming too small; $\gamma = 0.001$ in this application.
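The effect of such step size control can be illustrated with a small Python sketch. The optimized update formula itself is not reproduced in this text, so the rule used here, a step equal to $P_{echo}(l-1)$ divided by the reference power plus $\gamma$, is only an assumption chosen to match the description (larger step for more echo residue, $\gamma$ guarding the denominator). The sketch also uses the textbook single-tap NLMS form with the conjugated reference signal, which differs in notation from equations (14) to (16); the function names are hypothetical.

```python
def controlled_step(p_echo_prev, ref_power, gamma=0.001):
    """Assumed step-size rule: grows with P_echo(l-1), regularized by gamma."""
    return p_echo_prev / (ref_power + gamma)

def nlms_update(w, x, r, p_echo_prev, gamma=0.001):
    """One single-tap, per-bin update; returns (new coefficient, residual).

    w: current filter coefficient, x: microphone bin, r: loudspeaker reference bin.
    """
    z = w * r                                   # filtered output (echo estimate)
    e = x - z                                   # residual signal
    mu = controlled_step(p_echo_prev, abs(r) ** 2, gamma)
    return w + mu * e * r.conjugate(), e

# Echo-only toy case: the microphone bin is the reference scaled by a fixed path h
h = 0.5 - 0.2j
r = 1.0 + 1.0j
w_fast = 0j
w_slow = 0j
for _ in range(5):
    w_fast, _ = nlms_update(w_fast, h * r, r, p_echo_prev=0.9)   # much echo residue
    w_slow, _ = nlms_update(w_slow, h * r, r, p_echo_prev=0.1)   # little echo residue
print(abs(h - w_fast), abs(h - w_slow))
```

With a large previous-frame $P_{echo}$ the coefficient approaches the echo path within a few frames, while a small value keeps the filter nearly frozen, which is the behaviour the text asks for when near-end speech dominates.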

In one example, subsequent processing of the output after echo cancellation processing is typically required to further extract and enhance the near-end speech signal. Common post-processing modules include speech enhancement and automatic gain control. The speech enhancement technology includes traditional single-channel speech enhancement, multi-channel speech enhancement, and machine learning-based speech enhancement technology. In order to solve the above problems, the present application provides a single-channel noise reduction optimization method based on a residual echo probability function, and an automatic gain control optimization method based on the residual echo probability function.

The above process of optimizing the single-channel speech enhancement algorithm and the automatic gain control algorithm based on the echo signal residual probability function is embodied in step S205 as described below.

In step S205, echo residue suppression is performed on the multi-channel data with echo signal residue according to an echo residue suppression algorithm optimized based on the echo signal residue probability function as an auxiliary parameter to obtain a target signal.

In one example, the echo residual suppression algorithm comprises an optimized single-channel speech enhancement algorithm,

echo residue suppression is performed on multi-channel data with echo signal residue, including,

according to the optimized single-channel voice enhancement algorithm, echo residue suppression operation is carried out on any single-channel data in the multi-channel data with echo signal residue to obtain a first processing signal.

In another example, the echo residual suppression algorithm comprises an optimized single-channel speech enhancement algorithm,

echo residual suppression is performed on multi-channel data with echo signal residual, including,

according to the arrival direction estimation and beam forming algorithm, carrying out target direction estimation and beam forming operation on multi-channel data with echo signal residues to obtain single-channel data with echo signal residues;

In practical application, each channel after echo cancellation can be respectively subjected to subsequent processing, or a multi-channel signal can be subjected to beam forming according to DOA information, and a single-channel signal after beam forming is subjected to subsequent processing.

In the present application, only a certain channel output signal after echo cancellation is taken as an example to perform single-channel speech enhancement optimization processing. The method of single-channel speech enhancement optimization processing based on beamformed output signals is similar and will not be described in detail in this application. The specific analysis is as follows:

Traditional single-channel speech enhancement algorithms, such as classical spectral subtraction, minima-controlled recursive averaging (MCRA) and the optimally modified log-spectral amplitude estimator (OM-LSA), all operate in the frequency domain: the speech presence probability at each frequency bin is calculated, the corresponding gain $G(k,l)$ is derived from it, and the final target speech estimate is obtained:

In the output signal after echo cancellation, the residual echo is often non-stationary and may even have some speech-like structure, so it is easily judged to be the target speech signal and retained, which degrades the processing result.

To solve this problem, the application takes the OM-LSA algorithm as an example and provides an optimization method based on the residual echo probability function, which exploits the ability of this function to distinguish residual echo from the near-end speech signal and thereby improves the speech enhancement result. The specific steps of the OM-LSA algorithm are briefly described here:

Let $H_0$ denote that the target speech is absent and $H_1$ that it is present. The signal after echo processing is $Z(k,l) = S(k,l) + D(k,l)$, where $S(k,l)$ is the target speech component and $D(k,l)$ comprises components such as noise residue and background noise. The estimate of the target speech calculated according to the OM-LSA method can be expressed as:

where $q(k,l) = P(H_0(k,l))$ represents the a priori speech absence probability, which is generally calculated as follows:

$$q(k,l) = 1 - P_{local}(k,l)\,P_{global}(k,l)\,P_{frame}(l) \tag{20}$$

where $P_{local}(k,l)$ represents the local speech presence probability determined by the a priori signal-to-noise ratio $\varepsilon(k,l)$ (the calculation of $\varepsilon(k,l)$ is given by equation (26) below). $P_{local}(k,l)$ is calculated as follows:

where $\varepsilon_{min}$ and the corresponding upper limit are the lower and upper bounds of the a priori signal-to-noise ratio. To reduce variance, $P_{local}(k,l)$ is typically computed from the a priori signal-to-noise ratio averaged over two adjacent frequency bins. $P_{global}(k,l)$ represents the global speech presence probability determined by the a priori signal-to-noise ratio, generally averaged over a larger number of adjacent frequency bins, for example 20 to 30. $P_{frame}(l)$ represents the frame speech presence probability determined by the a priori signal-to-noise ratio, obtained by averaging the a priori signal-to-noise ratio over all frequency bins. The conditional speech presence probability in formula (19) is calculated as follows:

where the gain function for the case in which speech is absent can generally be set to a low fixed value, and $G_{H1}(k,l)$ represents the gain function for the case in which speech is present, calculated as follows:

where the formula involves the a posteriori signal-to-noise ratio and the estimated noise power spectrum; the noise power spectrum is calculated as follows:

where the noise estimate is controlled by a smoothing parameter, and $p'(k,l) = \alpha_p\,p'(k,l-1) + (1-\alpha_p)\,I(k,l)$ is the conditional speech presence probability.

Wherein,

$Z_{min}(k,l) = \min\{Z_{min}(k,l-1),\, Z_s(k,l)\}$ is used to determine the ratio of the current frequency bin to the stationary noise, and $Z_s(k,l) = \alpha_s Z_s(k,l-1) + (1-\alpha_s) Z(k,l)$ is obtained by first-order recursive smoothing. The a priori signal-to-noise ratio appearing in formula (11) is calculated as follows:

The calculation results are substituted into formula (19) to obtain the final speech estimate.
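The two recursions stated explicitly above, the first-order smoothing $Z_s(k,l)$ and the running minimum $Z_{min}(k,l)$, can be sketched in Python as follows. This is a minimal sketch that assumes $Z(k,l)$ is supplied as a power spectrum, initializes both trackers from the first frame, and uses $\alpha_s = 0.9$; none of these choices is stated in the text, and the function name is hypothetical. The returned ratio $Z_s / Z_{min}$ is one plausible reading of "the ratio of the current frequency bin to the stationary noise".

```python
import numpy as np

def track_minimum(power_frames, alpha_s=0.9):
    """Recursive smoothing and minimum tracking over frames.

    power_frames: (L, K) array of power values, one row per frame (assumed |Z(k,l)|^2).
    Returns the final Z_s, the final Z_min, and the per-frame ratio Z_s / Z_min.
    """
    power_frames = np.asarray(power_frames, dtype=float)
    z_s = power_frames[0].copy()          # assumed initialization from the first frame
    z_min = z_s.copy()
    ratios = []
    for frame in power_frames:
        z_s = alpha_s * z_s + (1.0 - alpha_s) * frame   # Z_s(k,l)
        z_min = np.minimum(z_min, z_s)                   # Z_min(k,l)
        ratios.append(z_s / np.maximum(z_min, 1e-12))
    return z_s, z_min, np.array(ratios)

# One frequency bin: 20 frames of stationary noise followed by 20 louder frames
frames = np.concatenate([np.full((20, 1), 1.0), np.full((20, 1), 10.0)])
z_s, z_min, ratios = track_minimum(frames)
print(float(z_min[0]), float(ratios[19, 0]), float(ratios[-1, 0]))
```

The ratio stays near 1 while only stationary noise is present and rises once the louder segment starts, which is what makes it usable as a speech (or residual echo) indicator.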

Analysis of the above process shows that the key factors affecting the accuracy of the result include effective estimation of the noise power spectrum and estimation of the a priori speech absence probability. Existing methods make these judgments mainly from the energy characteristics of the spectral structure and do not use the spatial characteristics of the echo signal. The present application performs the optimization using the residual-echo-based probability function $P_{echo}(l)$ proposed in step S204 and given by formula (12). The optimization mainly includes the following steps:

(1) The conditional speech presence probability is optimized using $P_{echo}(l)$. Formula (25) is optimized first: in frames with larger $P_{echo}(l)$, i.e. frames with larger echo residue, the following optimized expression is adopted:

When the echo is large, and especially when the residual signal has speech-like non-stationary characteristics, the original formula may wrongly decide $H_1$, whereas after the correction of the present application the decision can still be $H_0$.

(2) The a priori speech absence probability formula $q(k,l) = 1 - P_{local}(k,l)\,P_{global}(k,l)\,P_{frame}(l)$ is optimized by modifying the calculation of the frame speech presence probability $P_{frame}(l)$ as follows:

where $k_f$ relates to the number of smoothed frequency bins; here $k_f$ is taken as the frequency bin corresponding to 3500 Hz. In this way, a lower frame speech presence probability is obtained when the echo residue is larger, which effectively improves the estimation of the residual echo power spectrum.

(3) The gain function when speech is absent is normally set to a low fixed value $G_{min}$. This value generally should not be too small, otherwise a wrong decision is likely to distort the target speech; however, this also limits the suppression of residual echo. Therefore, a gradual lower limit setting method is proposed, i.e.

where $G(P_{echo}(l))$ is a piecewise function: when a large echo residue is clearly present, $G(P_{echo}(l))$ can be set to a smaller value $G_{m2}$; when echo is clearly absent, it is set to $G_{m1}$; in other cases, $G(P_{echo}(l))$ decreases as $P_{echo}(l)$ increases. The specific expression is as follows:

where, in this application, $G_{m1} = 0.1$, $\sigma_1 = 0.2$, $G_{m2} = 0.025$, $\sigma_2 = 0.8$.
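A piecewise gain floor of the kind described can be sketched in Python as follows. The exact expression is not reproduced in this text, so the sketch makes three assumptions: the larger floor 0.1 applies below the threshold 0.2, the smaller floor 0.025 applies above the threshold 0.8, and the transition between them is linear. The function name `gain_floor` is hypothetical.

```python
def gain_floor(p_echo, g_m1=0.1, g_m2=0.025, sigma1=0.2, sigma2=0.8):
    """Lower limit of the speech-absent gain as a function of P_echo(l).

    Assumed shape: constant g_m1 below sigma1 (clearly no echo), constant g_m2
    above sigma2 (clearly large echo residue), linear decrease in between.
    """
    if p_echo <= sigma1:
        return g_m1
    if p_echo >= sigma2:
        return g_m2
    t = (p_echo - sigma1) / (sigma2 - sigma1)   # 0 at sigma1, 1 at sigma2
    return g_m1 + t * (g_m2 - g_m1)

for p in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(p, gain_floor(p))
```

Any monotonically decreasing transition would satisfy the description; the linear form is used here only because it is the simplest choice.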

By optimizing the OM-LSA algorithm in the above way, the residual noise can be further suppressed under the condition of large echo residue, so that the suppression performance of the residual echo is improved.

In summary, the above single-channel speech enhancement operation is performed on the multi-channel data with echo signal residue, and the resulting target speech estimate is taken as the first processed signal.

In one example, the echo residual suppression algorithm further comprises an optimized automatic gain control algorithm,

performing echo residual suppression on the multi-channel data with echo signal residual, further comprising,

and according to the optimized automatic gain control algorithm, performing echo residue suppression on the first processing signal to obtain a target signal. The specific analysis is as follows:

After the post-filtering processing, automatic gain control is usually required to keep the level of the signal output to the loudspeaker within a preset reasonable interval, so as to avoid the level fluctuating between too loud and too quiet. Conventional automatic gain control determines the gain by judging whether the current frame is a speech signal or a noise signal. However, when the output signal contains echo residue, and especially when the echo is strong, it is difficult to distinguish the echo residue from a weak near-end speech signal using the signal spectrum characteristics alone, so the echo residue may be wrongly judged to be a weak speech signal and amplified.

In order to solve the above problems, the present application provides an automatic gain control optimization method based on a residual echo probability function, which performs automatic gain control processing on an output signal enhanced by echo cancellation and subsequent processing, and specifically includes the following steps:

(1) The target speech estimate is processed by the inverse short-time Fourier transform (ISTFT) to obtain a time domain signal. The time domain signal is divided into frames, with every 512 samples forming 1 frame at a 16 kHz sampling rate; each frame of the time domain signal is denoted sfrm, and its average energy averEnergy is calculated;

(2) The a priori gain frontGain is calculated as follows:

if (frontGain*max(abs(sfrm)) >= ref)
    frontGain = ref*frontGain/(frontGain*max(abs(sfrm)));
else
    frontGain = frontGain;
end

where ref is a fixed reference value, here taken as 0.3, and abs() takes the absolute value of each sample of the time domain signal.

(3) The posterior gain postGain is calculated as follows:

if (P_echo(l) < 0.3)
    postGain = 0.975*frontGain;
elseif (P_frame(l) < 0.3)
    postGain = frontGain;
elseif (frontGain*averEnergy > 0.5)
    postGain = 0.975*frontGain;
else
    postGain = 1.025*frontGain;
end

where P_echo(l) is the echo residue judgment function of the current frame estimated using equation (12), and P_frame(l) is the speech presence probability of the current frame estimated using formula (28).

(4) The output signal is calculated as yfrm = postGain*sfrm;

(5) Steps (1) to (4) are repeated for each new frame of the input signal.

In summary, the above-described automatic gain control processing operation is performed on the first processed signal to obtain each frame signal yfrm, and the frame signals are integrated to obtain y (t) as the target signal.
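The per-frame gain logic of steps (2) to (4) can be transcribed into Python as below. This is a minimal sketch: the comparisons and constants are taken as written in the text, while the definition of the average energy as the mean of squared samples, the function name `agc_frame`, and the choice to return the posterior gain to the caller are assumptions. The text does not state how the gain is carried from one frame to the next, so that is left to the caller.

```python
import numpy as np

def agc_frame(sfrm, front_gain, p_echo, p_frame, ref=0.3):
    """Process one time-domain frame; returns (yfrm, postGain)."""
    sfrm = np.asarray(sfrm, dtype=float)
    peak = float(np.max(np.abs(sfrm)))
    aver_energy = float(np.mean(sfrm ** 2))   # assumed definition of averEnergy

    # Step (2): limit the a priori gain so that the scaled peak does not exceed ref.
    # ref*frontGain/(frontGain*peak) simplifies to ref/peak.
    if front_gain * peak >= ref:
        front_gain = ref / peak

    # Step (3): posterior gain, thresholds reproduced as written in the text.
    if p_echo < 0.3:
        post_gain = 0.975 * front_gain
    elif p_frame < 0.3:
        post_gain = front_gain
    elif front_gain * aver_energy > 0.5:
        post_gain = 0.975 * front_gain
    else:
        post_gain = 1.025 * front_gain

    # Step (4): apply the gain to the frame.
    return post_gain * sfrm, post_gain

# 512-sample frame at 16 kHz, as in step (1); the amplitude is illustrative
quiet = 0.1 * np.ones(512)
yfrm, gain = agc_frame(quiet, front_gain=1.0, p_echo=0.5, p_frame=0.9)
print(gain, float(np.max(np.abs(yfrm))))
```

Concatenating the returned frames in order gives the time-domain output referred to as y(t).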

Fig. 9 is a flowchart of a module algorithm of an echo residual suppression method based on a special sound structure according to an embodiment of the present application, and modularly describes the technical solutions of step S201 to step S205 in fig. 2, and the specific steps are as follows:

(1) For a voice interaction system with the special sound structure, the received signal is divided into frames and converted into a frequency domain signal using the STFT, and optimized echo cancellation is performed on the received signal of each microphone using the LMS algorithm controlled by the residual echo probability function provided in step S202;

(2) Performing beam scanning on the two-dimensional plane by using an SRP-PHAT algorithm, and processing by using formulas (7) to (8) to obtain a normalized beam output result;

(3) Calculating the echo residue judgment function $P_{echo}(l)$ from the normalized beam output result using equations (9) to (12);

(4) Using the echo residue judgment function $P_{echo}(l)$ to optimize the single-channel noise reduction method in the subsequent processing module, optimizing the conditional speech presence probability, the a priori speech absence probability and the gain function for the speech-absent case using formulas (27), (28) and (29) respectively;

(5) Using the echo residue judgment function $P_{echo}(l)$ to optimize the automatic gain control method in the subsequent processing module, and processing the signal obtained after single-channel noise reduction in step (4) to obtain the final output signal.

Fig. 10 shows spectrograms of the signal processed by conventional processing, i.e. OM-LSA and ordinary automatic gain control without the $P_{echo}(l)$ control strategy, and by the comprehensive processing based on the residual echo probability function provided by the embodiment of the present application. The comprehensive method includes the several methods of auxiliary control based on the residual echo probability function proposed in steps S201 to S205.

The test uses an actual desktop conference system built with the special structure shown in fig. 1, whose microphone array contains 4 array elements; the signal-to-echo ratio is -40 dB.

In fig. 10, from top to bottom: (a) the original received signal, (b) the loudspeaker reference signal, (c) the single-channel output after echo cancellation, (d) the output of the traditional subsequent processing method, and (e) the output of the comprehensive method. The results show that when only a large echo is present (0 s-2 s, 5.8 s-6.2 s, 9.0 s-9.3 s), more residue remains in the output of the traditional subsequent processing method. When near-end speech and echo residue are present simultaneously (2.8 s-5.8 s, 6.2 s-9.0 s), the output of the traditional subsequent processing method still contains obvious echo residue in the middle and low frequencies, whereas the echo residue of the method provided by the application is clearly smaller. In summary, the method provided by the present application can significantly improve residual echo suppression performance.

Fig. 11 is a schematic structural diagram of an echo residual suppression system 1100 based on a special acoustic structure according to an embodiment of the present application, including:

The loudspeaker 1101, positioned above or below the centre of the microphone array whose array elements form a uniform circular array, is used to emit the loudspeaker audio component.

The microphone array 1102, whose array elements form a uniform circular array, is used to acquire a multi-channel speech signal composed of the near-end speaker's speech component and the loudspeaker audio component, and to convert the multi-channel speech signal into a multi-channel frequency domain signal.

The microphone array is also used for removing the audio frequency components of the loudspeaker in the multi-channel frequency domain signals according to the optimized echo cancellation technology to obtain multi-channel data with echo signal residues; the echo signal is a loudspeaker audio component in the multi-channel frequency domain signal;

according to the beam space scanning technology, the beam output energy of multi-channel data with echo signal residues in all directions is counted; each direction is positioned in a plane formed by array elements of the microphone array;

according to the characteristic that a wave beam output energy peak curve of an echo signal in each direction shows periodic intensity change along with an angle, constructing an echo signal residual probability function based on spatial characteristics; the spatial characteristics comprise that beam output energy peak curves of direct sound energy components of echo signals in all directions are periodic along with angles due to the fact that all array elements of a microphone array reached by a loudspeaker have the same sound path difference;

according to the optimized single-channel speech enhancement algorithm, echo residue suppression operation is carried out on any single-channel data in the multi-channel data with echo signal residue to obtain a first processing signal.

according to the arrival direction estimation and beam forming algorithm, carrying out target direction estimation and beam forming operation on the multi-channel data with echo signal residue to obtain single-channel data with echo signal residue;

and according to the optimized automatic gain control algorithm, performing echo residue suppression on the first processing signal to obtain a target signal.

The application provides an echo residue suppression method based on a special sound structure, together with a method for distinguishing echo residue from the near-end signal. By designing the positional relationship between a special microphone array and the loudspeaker sounding unit, the sound path differences from the sounding unit to the microphone array elements are made the same, while the sound path differences from other external sound sources to the array elements differ. Using these characteristics, an echo signal residual probability function based on spatial characteristics is constructed, so that echo residue and the near-end signal can be effectively distinguished: the function takes a larger value when the echo residue is large and a smaller value when the echo residue is small. This judgment function can be used to help improve echo cancellation performance and to assist subsequent processing such as speech enhancement and automatic gain control, so that the residual echo is further suppressed while the near-end signal is preserved.

In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are performed in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.

It is to be understood that the various numerical references referred to in the embodiments of the present application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of the present application. It should be understood that, in the embodiment of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiment of the present application.

The above embodiments further describe in detail the objects, technical solutions and advantages of the present application. It should be understood that the above are only specific embodiments of the present application and are not intended to limit its scope; any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present application shall fall within the scope of protection of the present application.

Claims

1. An echo residual suppression method based on a special sound structure comprises the following steps:

acquiring a multi-channel voice signal consisting of the voice of a near-end speaker and the audio component of a loudspeaker by using a microphone array; converting the multi-channel voice signal into a multi-channel frequency domain signal; array elements of the microphone array form a uniform circular array; the loudspeaker is positioned above or below the center position of the microphone array;

according to the characteristic that a wave beam output energy peak curve of an echo signal in each direction shows periodic intensity change along with an angle, constructing an echo signal residual probability function based on spatial characteristics; the spatial characteristics comprise that the wave beam output energy peak curves of the direct sound energy components of echo signals in all directions are periodic along with angles due to the fact that the sound path differences of all array elements of the microphone array reached by the loudspeaker are the same;

2. The method of claim 1, wherein the optimizing of the optimized echo cancellation algorithm comprises optimizing an echo cancellation algorithm for a next frame signal of the multi-channel frequency domain signal using a residual probability function of the echo signal calculated for a current frame signal of the multi-channel frequency domain signal.

3. The method of claim 1, wherein the echo residual suppression algorithm comprises an optimized single channel speech enhancement algorithm,

4. The method of claim 1, wherein the echo residual suppression algorithm comprises an optimized single channel speech enhancement algorithm,

5. The method according to claim 3 or 4, wherein said echo residual suppression algorithm further comprises an optimized automatic gain control algorithm,

6. An echo residual suppression system based on a special structure of sound, comprising:

7. The system of claim 6, wherein the optimization of the optimized echo cancellation algorithm comprises optimizing an echo cancellation algorithm for a next frame signal of the multi-channel frequency domain signal using a residual probability function of the echo signal calculated from a current frame signal of the multi-channel frequency domain signal.

8. The system of claim 6, wherein the echo suppression algorithm comprises an optimized single channel speech enhancement algorithm,

and according to the optimized single-channel speech enhancement algorithm, carrying out echo residual suppression operation on any single-channel data in the multi-channel data with echo signal residues to obtain a first processing signal.

9. The system of claim 6, the echo residual suppression algorithm comprising an optimized single channel speech enhancement algorithm,

10. The system according to claim 8 or 9, wherein said echo residual suppression algorithm further comprises an optimized automatic gain control algorithm,

the echo residual suppression is performed on the multi-channel data with echo signal residual, and the method further comprises,