
CN112466325A - Sound source positioning method and apparatus, and computer storage medium - Google Patents


Info

Publication number
CN112466325A
CN112466325A (application CN202011340094.9A; granted as CN112466325B)
Authority
CN
China
Prior art keywords
sound source, signal, frequency domain signal, voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011340094.9A
Other languages
Chinese (zh)
Other versions
CN112466325B (en)
Inventor
陈喆
胡宁宁
曹冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202011340094.9A priority Critical patent/CN112466325B/en
Publication of CN112466325A publication Critical patent/CN112466325A/en
Application granted granted Critical
Publication of CN112466325B publication Critical patent/CN112466325B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/20 - Position of source determined by a plurality of spaced direction-finders
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of the application disclose a sound source positioning method and apparatus, and a computer storage medium. The method includes: acquiring a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound collection module and a second sound collection module, respectively; performing frequency domain conversion processing on the first and second voice signals to obtain a first frequency domain signal and a second frequency domain signal; if the first and second frequency domain signals include voice signals, determining them as sound source characteristic signals; and determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signals, and a preset angle calculation model, where the preset positioning model is used to determine probability values corresponding to different azimuth angles.

Description

Sound source positioning method and apparatus, and computer storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a sound source localization method and apparatus, and a computer storage medium.
Background
With the rise of intelligent voice, the microphone array pickup technology has gradually developed into a popular technology in the voice recognition processing process. The sound source positioning method based on the microphone array is widely applied to video conferences, voice enhancement, intelligent robots, intelligent homes, vehicle-mounted communication equipment and the like. For example, in a video conference system, sound source localization can realize that a camera is aimed at a speaker in real time; when the method is applied to a hearing-aid device, sound source position information can be provided for a hearing-impaired person.
However, sound source positioning methods in the related art cannot balance positioning accuracy against computational load: identifying a sound source target accurately enough to guarantee positioning precision typically demands a large amount of computation and high computational complexity, which makes these methods unsuitable for smaller electronic devices.
Disclosure of Invention
The embodiments of the application provide a sound source positioning method and apparatus, and a computer storage medium, which can reduce computational complexity while ensuring positioning accuracy, thereby achieving a good balance between positioning precision and computational load.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a sound source localization method, where the method includes:
respectively acquiring a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound acquisition module and a second sound acquisition module;
respectively carrying out frequency domain conversion processing on the first voice signal and the second voice signal to obtain a first frequency domain signal and a second frequency domain signal;
determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals if the first frequency domain signal and the second frequency domain signal comprise voice signals;
determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.
In a second aspect, an embodiment of the present application provides a sound source localization apparatus, which includes an acquisition unit, a conversion unit, and a determination unit,
the acquisition unit is used for respectively acquiring a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound acquisition module and a second sound acquisition module;
the conversion unit is configured to perform frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain a first frequency domain signal and a second frequency domain signal;
the determining unit is configured to determine the first frequency-domain signal and the second frequency-domain signal as sound source feature signals if the first frequency-domain signal and the second frequency-domain signal include speech signals; determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.
In a third aspect, an embodiment of the present application provides a sound source positioning device, which includes a processor, and a memory storing instructions executable by the processor, and when the instructions are executed by the processor, the sound source positioning device implements the sound source positioning method described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a program is stored, which, when executed by a processor, implements the sound source localization method as described above.
The embodiments of the application provide a sound source positioning method and apparatus, and a computer storage medium. The sound source positioning device collects a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound collection module and a second sound collection module, respectively; performs frequency domain conversion processing on the first and second voice signals to obtain a first frequency domain signal and a second frequency domain signal; determines the first and second frequency domain signals as sound source characteristic signals if they include voice signals; and determines a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signals, and a preset angle calculation model, where the preset positioning model is used to determine probability values corresponding to different azimuth angles. That is to say, in the embodiment of the present application, after the sound source positioning device collects the audio frames emitted by the sound source to be positioned through different sound collection modules and obtains the first and second voice signals, it first performs frequency domain conversion processing on them, converting the time-domain voice signals into the first and second frequency domain signals.
The sound source positioning device can then determine the frequency domain signals that include a voice signal as sound source characteristic signals, and further determine the target azimuth angle of the sound source to be positioned based on these characteristic signals, a preset positioning model that can determine probability values for different azimuth angles of the sound source, and a preset angle calculation model, thereby accurately locating the sound source. In this way, the method performs frequency domain conversion on the voice signals obtained from the different sound collection channels and applies the preset positioning model only after confirming that the converted frequency domain signals include voice, which greatly reduces the amount of data the preset positioning model must process while preserving positioning accuracy, achieving a good balance between positioning precision and computational load. Moreover, because the method requires little computation, it is suitable for a variety of small devices and can be applied flexibly.
Drawings
Fig. 1 is a schematic flow chart of a first implementation process of a sound source localization method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a sound source localization apparatus according to an embodiment of the present application;
fig. 3 is a schematic flow chart illustrating an implementation process of a sound source localization method according to an embodiment of the present application;
fig. 4 is a schematic flow chart illustrating an implementation process of a sound source localization method according to an embodiment of the present application;
fig. 5 is a schematic flow chart of an implementation of a sound source localization method according to an embodiment of the present application;
fig. 6 is a first schematic diagram of a sound source localization processing procedure according to an embodiment of the present application;
fig. 7 is a schematic diagram of a sound source localization processing procedure according to an embodiment of the present application;
fig. 8 is a first schematic structural diagram of a sound source positioning device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a sound source positioning device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are illustrative of the relevant application and are not limiting of the application. It should be noted that, for the convenience of description, only the parts related to the related applications are shown in the drawings.
Before the embodiments of the present invention are described in further detail, the terms and expressions used in the embodiments are explained; the explanations below apply throughout.
With the rise of intelligent voice, the microphone array pickup technology has gradually developed into a popular technology in the voice recognition processing process. The sound source positioning method based on the microphone array is an important research direction of a voice recognition processing process, and has wide application occasions such as video conferences, voice enhancement, intelligent robots, intelligent homes, vehicle-mounted communication equipment and the like. For example, in a video conference system, sound source localization can realize that a camera is aimed at a speaker in real time; when the method is applied to a hearing-aid device, sound source position information can be provided for a hearing-impaired person.
Currently, sound source localization methods can be classified into two categories according to their localization parameters: localization based on interaural differences and localization based on head-related transfer functions. However, on the one hand, interaural-difference-based localization is easily disturbed by reverberation and noise in real environments, so its positioning performance is poor; on the other hand, although head-related-transfer-function-based localization achieves good sound source localization in three-dimensional space, it is computationally complex, and head-related transfer functions are highly individual: for different individuals or different surrounding environments, the actual transfer function may differ from the original one, so accurate localization cannot be achieved.
Furthermore, to solve the above problems, identify a sound source target more accurately, and guarantee positioning accuracy, a sound source positioning method based on a neural network has been proposed in the related art. However, this method still cannot balance positioning accuracy against computational load: while its positioning accuracy is high, the amount of computation is large and the computational complexity is high, making it unsuitable for smaller electronic devices.
In order to solve the problems of the existing sound source positioning mechanisms, embodiments of the present application provide a sound source positioning method and apparatus, and a computer storage medium. Specifically, after the sound source positioning device collects the audio frames emitted by a sound source to be positioned through different sound collection modules and obtains a first voice signal and a second voice signal, it performs frequency domain conversion processing on the voice signals to convert them from the time domain into a first frequency domain signal and a second frequency domain signal. The device then determines the frequency domain signals that include a voice signal as sound source characteristic signals, and further determines the target azimuth angle of the sound source to be positioned based on these characteristic signals, a preset positioning model that can determine probability values for different azimuth angles of the sound source, and a preset angle calculation model, thereby accurately locating the sound source. In this way, the method applies the preset positioning model only after confirming that the converted frequency domain signals include voice, which greatly reduces the amount of data the model must process while preserving positioning accuracy, achieving a good balance between positioning precision and computational load.
Because the sound source positioning method requires little computation, it is suitable for a variety of small devices and offers high application flexibility.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Fig. 1 is a schematic flow chart illustrating an implementation process of a sound source positioning method according to an embodiment of the present application, and as shown in fig. 1, in an embodiment of the present application, a method for a sound source positioning device to perform sound source positioning may include the following steps:
step 101, a first voice signal and a second voice signal corresponding to a sound source to be positioned are respectively collected through a first sound collection module and a second sound collection module.
In the embodiment of the application, the sound source positioning device can collect and process the audio frames sent by the sound source to be positioned through the first sound collection module and the second sound collection module respectively, so as to obtain the first voice signal and the second voice signal.
It should be understood that, in the embodiments of the present application, the sound source localization apparatus may be any electronic device equipped with a sound sensor, which achieves sound source localization through the "pick-up" function of the sensor. Examples are not limited to home and consumer electronics such as mobile phones, notebook computers, tablet computers, desktop computers, portable gaming devices, in-vehicle devices, and wearable devices (e.g., smart glasses); they also include security robots, service robots, and teleconferencing systems that can identify the service object making a sound, and even submarines and warships in the military field.
Note that, in the embodiments of the present application, the sound source localization apparatus is an electronic device provided with a sound sensor array. Specifically, the sound source positioning device is provided with a first sound collection module and a second sound collection module, and the two sound collection modules form a sound sensor array.
In particular, the first sound collection module and the second sound collection module may be disposed at different positions on the terminal, i.e., corresponding to spatially different positions. Optionally, the first sound collection module and the second sound collection module are disposed at symmetrical positions of the sound source positioning device, and may be disposed at vertically symmetrical positions, at horizontally symmetrical positions, or at front-back symmetrical positions.
For example, in an embodiment of the present application, fig. 2 is a schematic diagram of a sound source positioning device provided in the embodiment of the present application, and as shown in fig. 2, assuming that the sound source positioning device is Augmented Reality (AR) glasses, and the sound collection module may be a microphone, the sound source positioning device may set the first microphone 10 and the second microphone 20 at symmetrical positions on left and right sides of the AR glasses, respectively, and at this time, the first microphone 10 and the second microphone 20 form a microphone array.
It should be understood that, in the embodiments of the present application, the sound source localization apparatus is not limited to include only the first sound collection module and the second sound collection module, and the sound source localization apparatus may include L sound collection modules; wherein L is equal to or greater than 2. Further, the sound sensor array formed by the plurality of sound collection modules may be an array corresponding to different spatial positions and arranged according to a certain shape rule.
It should be noted that, in the embodiment of the present application, the sound source positioning device may perform audio frame acquisition from the same sound source, that is, the sound source to be positioned, through the first sound acquisition module and the second sound acquisition module which are disposed at different positions, that is, the same audio frame emitted by the sound source to be positioned is acquired by the microphones at different positions in the microphone array.
It should be understood that microphones at different spatial positions are at different distances from the same sound source, so for the same audio frame from the same sound source, the voice information collected by microphones at different positions differs. Therefore, in the embodiment of the present application, after the sound source to be positioned emits an audio frame at a certain time, the sound source positioning device acquires a first voice signal through the first sound collection module in the sound sensor array and, at the same time, a second voice signal through the second sound collection module, and the information carried by the two voice signals may differ. That is to say, after the sound source positioning device collects the same audio frame emitted by the sound source to be positioned through different sound collection channels, it obtains different first and second voice signals.
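The distance difference described above translates directly into an arrival-time difference between the two channels. The patent does not state this relation explicitly; the sketch below uses the standard far-field (plane-wave) model for a two-microphone array, and the 0.14 m spacing, azimuth convention, and speed of sound are illustrative assumptions, not values from the patent.

```python
import math

def tdoa_far_field(mic_spacing_m: float, azimuth_deg: float, c: float = 343.0) -> float:
    """Time-difference-of-arrival (seconds) between two microphones for a
    far-field source, using the plane-wave model tau = d * sin(theta) / c.
    Spacing, angle convention, and speed of sound are assumed values."""
    return mic_spacing_m * math.sin(math.radians(azimuth_deg)) / c

# A source 30 degrees off broadside of two microphones 0.14 m apart
# (a plausible spacing for glasses-mounted microphones as in Fig. 2):
delay_s = tdoa_far_field(0.14, 30.0)  # on the order of 0.2 ms
```

A source directly in front (0 degrees) yields zero delay, while a source at 90 degrees yields the maximum delay of d/c; localization methods exploit this monotonic relation between angle and delay.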
Further, in the embodiment of the present application, after the sound source positioning device respectively collects the first voice signal and the second voice signal corresponding to the sound source to be positioned through the first sound collection module and the second sound collection module, the sound source positioning device may further perform frequency domain conversion processing on the voice signals.
Step 102, performing frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain a first frequency domain signal and a second frequency domain signal.
In the embodiment of the present application, after the sound source positioning device obtains the first voice signal and the second voice signal through the first sound collection module and the second sound collection module, the sound source positioning device may further perform frequency domain conversion processing on the two paths of voice signals, so as to obtain the first frequency domain signal and the second frequency domain signal.
It should be noted that, in the embodiment of the present application, the voice signals are voice digital signals. Specifically, the sound source positioning device performs analog-to-digital conversion on the voice signals before the frequency domain conversion processing: when the two sound collection modules collect the audio frames emitted by the sound source to be positioned, the audio frames are converted into electrical signals, sampled, and passed through A/D conversion to obtain voice digital signals.
Further, the sound source positioning device performs frequency domain conversion processing on the two paths of voice digital signals, that is, the voice digital signals are converted from a time domain to a frequency domain.
In the embodiment of the present application, the sound source localization apparatus may use a hardware processor to perform the analog-to-digital conversion and the frequency domain conversion on the voice analog signals, for example a Digital Signal Processor (DSP) or an Advanced RISC Machine (ARM) processor.
Specifically, fig. 3 is a schematic diagram of a second implementation flow of the sound source localization method according to the embodiment of the present application, and as shown in fig. 3, the method for obtaining the first frequency domain signal and the second frequency domain signal by performing frequency domain conversion processing on the first speech signal and the second speech signal by the sound source localization device includes:
and 102a, respectively carrying out windowing processing on the first voice signal and the second voice signal to obtain a windowed first signal and a windowed second signal.
And 102b, respectively carrying out Fourier transform processing on the windowed first signal and the windowed second signal to obtain a first frequency domain signal and a second frequency domain signal.
Specifically, in the embodiment of the present application, the sound source positioning device may perform windowing on two voice digital signals obtained after analog-to-digital conversion, so as to obtain two windowed voice signals, and then continue to perform Fast Fourier Transform (FFT) processing on the two windowed voice signals, so as to further obtain two frequency domain signals.
It should be understood that, in the embodiment of the present application, the sound source positioning device performs processing on the speech signal in real time based on a frame division manner, that is, according to a preset frame length N, signals are sequentially acquired in real time frame by frame, and then are sequentially subjected to frequency domain conversion processing frame by frame.
Illustratively, the sound source positioning device collects the audio frames emitted by the sound source to be positioned through two microphones, the frame signal length being N, where N is an integer greater than zero. Sampling and A/D conversion yield a first voice signal x1(n) and a second voice signal x2(n). The sound source positioning device then uses the DSP module or ARM module to apply the window function ω(n) to x1(n) and x2(n), obtaining the first windowed signal x1(n)ω(n) and the second windowed signal x2(n)ω(n). Finally, the sound source positioning device performs FFT processing on the two windowed signals to obtain
First frequency domain signal:
X1,m(k) = FFT[x1,m(n)ω(n)]  (formula 1)
Second frequency domain signal:
X2,m(k) = FFT[x2,m(n)ω(n)]  (formula 2)
In formula 1 and formula 2, ω(n) is the window function, k is the frequency point index, m is the frame number, and n = 1, 2, …, N.
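The windowing and FFT steps of formulas 1 and 2 can be sketched as follows. The patent does not specify the window function ω(n); the Hamming window and NumPy's FFT used here are illustrative choices only, and the synthetic two-channel frame stands in for real microphone data.

```python
import numpy as np

def frame_to_frequency_domain(frame: np.ndarray) -> np.ndarray:
    """Apply a window to one length-N time-domain frame and FFT it,
    mirroring X_m(k) = FFT[x_m(n) * w(n)] from formulas 1 and 2.
    The Hamming window is an assumption; the patent leaves w(n) open."""
    window = np.hamming(len(frame))
    return np.fft.fft(frame * window)

# Two channels of one synthetic audio frame, frame length N = 256:
N = 256
x1 = np.sin(2 * np.pi * 8 * np.arange(N) / N)  # first voice signal x1(n)
x2 = 0.8 * x1                                  # second channel, attenuated copy
X1 = frame_to_frequency_domain(x1)             # first frequency domain signal
X2 = frame_to_frequency_domain(x2)             # second frequency domain signal
```

Because both the windowing and the FFT are linear, a channel that is a scaled copy of the other produces a spectrum scaled by the same factor, which is the kind of inter-channel relationship the later positioning stages exploit.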
Further, in this embodiment of the application, after the sound source positioning device performs frequency domain conversion processing on the first speech signal and the second speech signal to obtain the first frequency domain signal and the second frequency domain signal, the sound source positioning device may further determine whether the frequency domain signal includes the speech signal, and then perform corresponding subsequent processing according to a determination result.
And 103, if the first frequency domain signal and the second frequency domain signal comprise voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals.
In the embodiment of the present application, after the sound source localization apparatus performs frequency domain conversion processing on the first speech signal and the second speech signal to obtain the first frequency domain signal and the second frequency domain signal, if the sound source localization apparatus determines that the first frequency domain signal and the second frequency domain signal include speech signals, the sound source localization apparatus may further determine the two frequency domain signals as sound source feature signals.
It should be understood that there may be silent periods while the sound source to be positioned plays audio, that is, periods during which the audio stops for some reason. For example, during a call the speaker may be listening to the other party, may pause for a period of time due to thinking, a brief interruption, or the like, or may pause in the middle of speaking (hesitation, stuttering, breathing, and so on), so that the current frame contains no voice information.
Therefore, in the embodiment of the present application, the sound source positioning device may perform the voice endpoint detection processing on the frequency domain signal, so as to determine whether the audio frame sent by the sound source to be positioned includes a voice signal, that is, identify a mute period in the audio frame signal.
Optionally, in an embodiment of the present application, the sound source positioning device may implement identification of a silence period in the audio frame signal through a Voice Activity Detection (VAD) module.
Further, in the embodiment of the present application, after the sound source localization apparatus performs the voice endpoint detection processing on the frequency domain signals, if the detection result indicates that the frequency domain signals do not include a voice signal, that is, the sound source to be localized is not currently emitting speech and is in a silent period, the sound source positioning device may discard the two paths of frequency domain signals, continue to collect the next audio frame, convert it into frequency domain signals, and judge whether that audio frame includes a voice signal.
On the other hand, if the detection result is that the frequency domain signals include a voice signal, that is, the sound source to be positioned is currently emitting speech and is not in the mute period, the sound source positioning device may determine the two paths of frequency domain signals as sound source characteristic signals.
It should be noted that, in the embodiment of the present application, the sound source feature signal refers to two paths of frequency domain signals including a speech signal, and may be used as a target parameter for finally determining the sound source position.
Therefore, the method discards the frequency domain signals which do not comprise the voice signals and belong to silence, and only uses the two paths of frequency domain signals comprising the voice signals as the sound source characteristic signals, thereby realizing the identification and the elimination of the silence period from the audio frame signals of the sound source to be positioned, saving the telephone traffic resources and the bandwidth resources under the condition of not reducing the service quality, and being beneficial to reducing the end-to-end time delay felt by a user.
Further, in the embodiment of the present application, after the sound source localization apparatus determines the frequency domain signal including the voice signal as the sound source characteristic signal, the sound source localization apparatus may further achieve accurate localization of the sound source to be localized according to the sound source characteristic signal, the preset localization model and the preset angle calculation model.
And step 104, determining a target azimuth angle corresponding to the sound source to be positioned according to the preset positioning model, the sound source characteristic signal and the preset angle calculation model.
In the embodiment of the present application, after the sound source localization apparatus determines the frequency domain signal including the voice signal as the sound source characteristic signal, the sound source localization apparatus may further achieve accurate localization of the sound source to be localized according to the sound source characteristic signal, the preset localization model, and the preset angle calculation model.
It should be noted that, in the embodiments of the present application, the target azimuth refers to the spatial position corresponding to the sound source to be located, that is, the offset angle of the sound source to be located relative to the sound source positioning device.
Optionally, after determining that the two paths of frequency domain signals corresponding to the current audio frame include a voice signal, that is, after determining them as a sound source characteristic signal, the sound source positioning device may directly determine the azimuth angle corresponding to the sound source to be positioned based on this single sound source characteristic signal and the preset positioning model, thereby realizing accurate positioning of the sound source to be positioned, that is, "frame-by-frame positioning".
Optionally, the sound source positioning device may also determine an azimuth angle corresponding to the sound source to be positioned according to multiple groups of frequency domain signals corresponding to multiple times and determined as the sound source characteristic signals, the preset positioning model and the preset angle calculation model, so as to realize accurate positioning of the sound source to be positioned, that is, "multi-frame positioning".
Specifically, in the "multi-frame positioning", the sound source positioning device may preset the number of target frames for sound source positioning, and when it is determined that the number corresponding to the frequency domain signal of the sound source characteristic signal satisfies the number of target frames, the sound source positioning device may determine the target azimuth angle according to the plurality of sound source characteristic signals, a preset positioning model, and a preset angle calculation model.
It should be noted that, in the embodiment of the present application, the preset positioning model is used to determine probability values corresponding to different azimuth angles of the sound source to be positioned, that is, probability values of the sound source at each possible position, where the probability values corresponding to different positions are different. Furthermore, the sound source positioning device can further combine a plurality of angle probability values and a preset angle calculation model to realize accurate positioning of the sound source to be positioned.
The embodiment of the application provides a sound source positioning method, wherein a sound source positioning device respectively collects a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound collection module and a second sound collection module; respectively carrying out frequency domain conversion processing on the first voice signal and the second voice signal to obtain a first frequency domain signal and a second frequency domain signal; if the first frequency domain signal and the second frequency domain signal comprise voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals; determining a target azimuth angle corresponding to a sound source to be positioned according to a preset positioning model, a sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles. That is to say, in the embodiment of the present application, after the sound source positioning device collects the audio frames sent by the sound source to be positioned through different sound collection modules and obtains the first speech signal and the second speech signal, the sound source positioning device may first perform frequency domain conversion processing on the speech signals respectively, so as to convert the speech signals in the time domain into the first frequency domain signal and the second frequency domain signal. 
Furthermore, the sound source positioning device can determine the frequency domain signal including the voice signal as a sound source characteristic signal, so that a target azimuth corresponding to the sound source to be positioned is further determined based on the sound source characteristic signal, a preset positioning model which can be used for determining probability values of different azimuths of the sound source to be positioned and a preset angle calculation model, and accurate positioning of the sound source position is realized. Therefore, the sound source positioning method provided by the application carries out frequency domain conversion processing on the voice signals obtained by different sound acquisition channels, and the sound source positioning device further determines the target azimuth angle corresponding to the sound source to be positioned by combining the preset positioning model after determining that the converted frequency domain signals comprise the voice signals, so that the data required to be processed by the preset positioning model is greatly reduced, the positioning precision can be ensured, and the good balance between the positioning precision and the calculated amount is further realized. And because the sound source positioning method has small calculation amount, the method is suitable for various small-sized equipment, and the positioning method has higher application flexibility.
Further, fig. 4 is a schematic flow chart of a third implementation of the sound source positioning method provided in the embodiment of the present application, as shown in fig. 4, after the sound source positioning device obtains the first frequency domain signal and the second frequency domain signal, that is, after step 102, and if the first frequency domain signal and the second frequency domain signal include a speech signal, before determining the first frequency domain signal and the second frequency domain signal as the sound source feature signal, that is, before step 103, the method for performing sound source positioning by the sound source positioning device includes:
and 105, respectively performing voice endpoint detection processing on the first frequency domain signal and the second frequency domain signal to obtain a first voice energy value corresponding to the first voice signal and a second voice energy value corresponding to the second voice signal.
And 106, if the first voice energy value is greater than or equal to a preset energy threshold value and the second voice energy value is greater than or equal to the preset energy threshold value, determining that the first frequency domain signal and the second frequency domain signal comprise voice signals.
It should be noted that, in the embodiment of the present application, when the sound source positioning device performs the voice endpoint detection processing on the two paths of frequency domain signals to determine whether the corresponding audio frame includes a voice signal, the sound source positioning device may use the calculation of the voice signal energy value to implement the above determination process.
Specifically, the calculation method of the speech energy value is shown by the following formula:
[Formula 3 — rendered as an image in the original: the voice signal energy value VAD(k, m) is computed from the current-frame spectrum X_m(k), the previous-frame spectrum X_{m-1}(k), and the frame length N.]
wherein, in formula 3, X_m(k) is the frequency domain signal corresponding to the current audio frame, X_{m-1}(k) is the frequency domain signal corresponding to the previous audio frame, N is the frame length, and VAD(k, m) is the voice signal energy value.
It should be understood that when the energy of the voice signal is lower than a certain threshold value, the frame is considered to be in a mute state; otherwise, the frame is considered to include a voice signal. Therefore, in the embodiment of the present application, the sound source localization apparatus is preset with a voice signal energy threshold value [the threshold comparison formulas are rendered as images in the original], and when the calculated voice signal energy value is greater than or equal to this threshold value, it is determined that a voice signal is included. Therefore, after the speech signal energy value is calculated based on the frequency domain signal, the speech energy value can be compared with the preset energy threshold value, and whether a speech signal is included is determined according to the comparison result, that is, whether the frequency domain signal can be used as the sound source characteristic signal.
Specifically, after performing voice endpoint detection processing on the two paths of frequency domain signals, the sound source positioning device calculates a first voice energy value corresponding to the first frequency domain signal and a second voice energy value corresponding to the second frequency domain signal; if the first speech energy value is greater than or equal to the preset energy threshold value and the second speech energy value is also greater than or equal to the preset energy threshold value, the sound source positioning device can determine that the audio frames corresponding to the two paths of frequency domain signals include speech signals.
On the other hand, if the first speech energy value is smaller than the preset energy threshold value, or the second speech energy value is smaller than the preset energy threshold value, the sound source localization apparatus may determine that the audio frame corresponding to the frequency domain signals does not include a speech signal.
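Because formula 3 is rendered as an image in the published text, the exact energy expression is not recoverable here; the sketch below therefore uses a plain mean-squared-magnitude spectral energy as a hypothetical stand-in, and applies the two-channel threshold test of steps 105 and 106. The function names and the energy definition are assumptions, not the patented formula.

```python
# Hypothetical stand-in for formula 3 (the exact expression is an image in
# the original): per-frame speech energy as the mean squared magnitude of
# the frequency-domain bins.
def speech_energy(frame_bins):
    """frame_bins: list of complex FFT bins for one audio frame."""
    return sum(abs(x) ** 2 for x in frame_bins) / len(frame_bins)

def frame_contains_speech(bins_ch1, bins_ch2, energy_threshold):
    """Steps 105-106: both channels must meet or exceed the preset
    energy threshold for the frame to count as containing speech."""
    return (speech_energy(bins_ch1) >= energy_threshold and
            speech_energy(bins_ch2) >= energy_threshold)
```

A frame is discarded as silence whenever either channel falls below the threshold, matching the "either smaller than the threshold" rejection rule above.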
The sound source positioning device can discard the frequency domain signals which do not comprise the voice signals and belong to silence, only two paths of frequency domain signals comprising the voice signals are used as sound source characteristic signals, the silence period is identified and eliminated from the audio frame signals of the sound source to be positioned, telephone traffic resources and bandwidth resources are saved under the condition that the service quality is not reduced, and the end-to-end time delay felt by a user is reduced.
Further, fig. 5 is a schematic diagram of a fourth implementation flow of the sound source localization method provided in the embodiment of the present application, and as shown in fig. 5, the method for determining, by a sound source localization apparatus, a target azimuth corresponding to the sound source to be localized according to a preset localization model, the sound source characteristic signal, and a preset angle calculation model includes:
step 104a, inputting a plurality of sound source characteristic signals corresponding to a plurality of moments into a preset positioning model, and outputting a plurality of angle probability combinations corresponding to the plurality of sound source characteristic signals; the angle probability combination represents the corresponding relation between the azimuth angle and the probability.
And step 104b, determining a target azimuth angle based on the plurality of angle probability combinations and a preset angle calculation model.
It should be noted that, in the embodiment of the present application, when the sound source positioning device performs "multi-frame positioning", a target buffer area is first set to pre-store the frequency domain signal determined as the sound source characteristic signal, and the number of the preset target frames corresponding to the preset buffer area is M.
Specifically, the sound source positioning device may store the frequency domain signal currently determined as the sound source characteristic signal in the target buffer area, and input the multi-frame frequency domain signal into the preset positioning model when the number of signals in the target buffer area is equal to M. Wherein, each audio frame is composed of two paths of frequency domain signals to form a characteristic matrix as follows:
[Formula 4 — rendered as an image in the original: the feature matrix of the m-th audio frame is formed from the imaginary parts Imag(·) and the real parts Real(·) of the two frequency domain signals.]
wherein, in formula 4, Imag denotes taking the imaginary part, Real denotes taking the real part, and m denotes the index of the audio frame.
Further, the feature matrix corresponding to the M audio frames is shown as follows:
[Formula 5 — rendered as an image in the original: the overall feature matrix is obtained by stacking the per-frame feature matrices of the M audio frames.]
At this time, in formula 5, the dimension of the matrix formed by the M audio frames (with M = 10) is 10 × 2052.
Further, the feature matrix shown in formula 5 is processed by the preset positioning model through convolution, pooling, and other operations. For each audio frame of the sound source to be positioned, a plurality of initial probability values corresponding to a plurality of azimuth angles is first determined; the preset positioning model then selects, from these initial probability values, the initial azimuth angle corresponding to the maximum initial probability value as the candidate azimuth angle for that audio frame, and the correspondence between the candidate azimuth angle and its probability value is used as a candidate angle probability combination when determining the target azimuth angle. Correspondingly, the M audio frames yield M candidate angle probability combinations.
It should be noted that, in the embodiment of the present application, the predetermined positioning model is constructed based on a Convolutional Neural Network (CNN).
For example, in the embodiment of the present application, assuming that the preset target frame number M is equal to 10, the neural network corresponding to the preset positioning model is composed, in order, of one input layer, 3 convolutional layers interleaved with 3 pooling layers, two fully connected layers, and one output layer. When the number of frames buffered in the target buffer area equals 10, the plurality of sound source characteristic signals is input into the preset positioning model. The input feature parameters of the input layer are the frequency domain signals containing the speech signals determined in step 103, that is, the feature matrix composed of the real parts and imaginary parts of the frequency domain signals shown in formula 5 above. Each convolutional layer uses 64 filters of size 2 × 2 with an unrestricted convolution stride; the output of the previous layer is zero-padded before convolution so that the feature size is not reduced by the convolution, and the ReLU function is adopted as the activation function. Each pooling layer performs 2 × 2 max pooling with an unrestricted stride, and the output of the previous layer is zero-padded before pooling. After the three rounds of convolution and pooling, the multidimensional output of the last pooling layer is flattened into a one-dimensional output, namely 512 × 1 one-dimensional features, and the feature results are then mapped into 10 results through the first and second fully connected layers, that is, results corresponding to the 10 audio frames.
Further, the preset positioning model converts these results into probabilities through Softmax, representing the probability values corresponding to 10 different azimuth angles; that is, over the 10 audio frames, the correspondence between each audio frame's candidate azimuth angle and its maximum angle probability value constitutes the 10 angle probability combinations.
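The layer stack described above can be sketched as follows. This is a minimal PyTorch interpretation, not the patented implementation: the input layout (batch, 1, M, 2052), the "same" padding scheme, and the first fully connected layer's width of 512 are assumptions, and the actual flattened size depends on the unstated padding and stride details (here it comes out larger than the 512 mentioned above, which LazyLinear absorbs).

```python
import torch
import torch.nn as nn

class SoundSourceCNN(nn.Module):
    """Sketch of the described network: 3 x (2x2 conv, 64 filters, 'same'
    zero padding, ReLU; 2x2 max pool), then two fully connected layers
    mapping to 10 azimuth outputs. Layer widths are assumptions."""
    def __init__(self, n_classes=10):
        super().__init__()
        layers, in_ch = [], 1
        for _ in range(3):
            layers += [nn.Conv2d(in_ch, 64, kernel_size=2, padding="same"),
                       nn.ReLU(),
                       nn.MaxPool2d(kernel_size=2)]
            in_ch = 64
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512),        # first fully connected layer (width assumed)
            nn.ReLU(),
            nn.Linear(512, n_classes)) # second fully connected layer -> 10 outputs

    def forward(self, x):
        # x: (batch, 1, M, 2052) feature matrix; returns per-azimuth logits,
        # to which Softmax is applied to obtain the angle probabilities.
        return self.classifier(self.features(x))
```

Applying `torch.softmax` to the output logits yields the per-azimuth probability values that form the angle probability combinations.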
Furthermore, the sound source positioning device determines the azimuth angle corresponding to the maximum probability value in the multiple angle probability combinations as the target azimuth angle corresponding to the sound source to be positioned through a preset calculation model. Specifically, the specific method for determining the target azimuth is as follows:
θ_target = θ_i*, where i* = argmax_{1 ≤ i ≤ M} p(θ_ci)    (formula 6)
wherein, in formula 6, θ_i represents the candidate angle value of the sound source localization for the i-th audio frame, and p(θ_ci) represents the maximum probability value corresponding to the i-th audio frame in the input signal.
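Formula 6 amounts to selecting, among the M candidate (azimuth, probability) combinations, the azimuth whose probability is largest; a minimal sketch:

```python
def target_azimuth(angle_prob_combinations):
    """Formula 6: from the M per-frame (candidate azimuth, max probability)
    combinations, return the azimuth whose probability value is largest."""
    best_angle, _ = max(angle_prob_combinations, key=lambda pair: pair[1])
    return best_angle
```

For example, with combinations (30°, 0.62), (45°, 0.91), and (30°, 0.55), the target azimuth is 45°.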
The embodiment of the application provides a sound source positioning method, a sound source positioning device carries out frequency domain conversion processing on a voice signal collected through a sound sensor array, and after the obtained frequency domain signal comprises the voice signal, a target azimuth angle corresponding to a sound source to be positioned is further determined by combining the frequency domain signal, a preset positioning model and a preset angle calculation model, so that the sound source is accurately positioned, data needing to be processed by the preset positioning model are greatly reduced, positioning accuracy can be ensured, and good balance between the positioning accuracy and calculated quantity is further realized. And because the sound source positioning method has small calculation amount, the method is suitable for various small-sized equipment, and the positioning method has higher application flexibility.
Further, exemplarily, fig. 6 is a schematic view of a first sound source positioning processing procedure provided in the embodiment of the present application, and as shown in fig. 6, it is assumed that the sound source positioning device is a pair of smart glasses, and the first sound collection module and the second sound collection module are microphones, where the two microphones are symmetrically disposed on a left frame and a right frame of the pair of smart glasses.
Specifically, the terminal collects current audio frames sent by a sound source to be located through a first microphone and a second microphone respectively, and further obtains a first voice signal through the first microphone and a second voice signal through the second microphone (step S1); further, the sound source positioning device performs windowing and FFT transformation on the two obtained voice signals, i.e., the conversion from the time domain signal to the frequency domain signal, so as to obtain a first frequency domain signal and a second frequency domain signal (step S2). The sound source positioning device performs a voice detection process on the two paths of frequency domain signals through a voice endpoint detection module, i.e., a VAD module (step S3), so as to determine whether the current audio frame includes a voice signal according to the voice energy values respectively corresponding to the two paths of frequency domain signals (step S4).
Further, if it is determined that the first speech energy value corresponding to the first frequency domain signal and the speech energy value corresponding to the second frequency domain signal are both greater than the preset energy threshold, it is determined that the current speech signal is included, then the sound source positioning device may input the two frequency domain signals into a preset positioning model (step S5), and further obtain initial probability values corresponding to different azimuth angles of the current audio frame through the preset positioning model, and further determine an azimuth angle corresponding to the maximum probability value in the initial probability values as a target azimuth angle when the current audio frame is sent by the sound source to be positioned by using the preset positioning model.
Further, exemplarily, fig. 7 is a schematic view of a sound source positioning processing procedure provided in the embodiment of the present application, and as shown in fig. 7, it is assumed that the sound source positioning device is a pair of smart glasses, and the first sound collection module and the second sound collection module are microphones, where the two microphones are symmetrically disposed on a left frame and a right frame of the pair of smart glasses.
Specifically, the terminal collects current audio frames sent by a sound source to be located through a first microphone and a second microphone respectively, and further obtains a first voice signal through the first microphone and a second voice signal through the second microphone (step S1); further, the sound source positioning device performs windowing and FFT transformation on the two obtained voice signals, i.e., the conversion from the time domain signal to the frequency domain signal, so as to obtain a first frequency domain signal and a second frequency domain signal (step S2). The sound source positioning device performs a voice detection process on the two paths of frequency domain signals through a voice endpoint detection module, i.e., a VAD module (step S3), so as to determine whether the current audio frame includes a voice signal according to the voice energy values respectively corresponding to the two paths of frequency domain signals (step S4).
Further, if it is determined that the first frequency-domain signal and the second frequency-domain signal both include a speech signal, the sound source localization apparatus may buffer the two frequency-domain signals corresponding to the same frame in a preset storage space (step S6). Otherwise, if the speech signal is not included, the next frame of speech signal is collected continuously, i.e. the step S1 is skipped to.
It should be noted that the number of frame signals that can be buffered in the preset storage space is limited; therefore, after the sound source positioning device buffers the two frequency domain signals into the preset storage space, it needs to simultaneously determine whether the number of frame signals buffered in the preset storage space is equal to the target frame number, for example, 10 frames (step S7). If the number of frames is equal to 10, the sound source localization apparatus may input the 10 frames of frequency domain signals into the preset localization model, so as to obtain the probability values corresponding to different azimuth angles over the plurality of audio frames, that is, a plurality of angle probability combinations (step S8); further, the multiple angle probability combinations are input into the preset angle calculation model (step S9), and the azimuth angle corresponding to the maximum probability value is then used as the target azimuth angle corresponding to the sound source to be positioned, that is, the accurate position of the sound source is determined. On the other hand, if the number is not equal to 10 frames, the sound source localization apparatus may continue to wait for the buffering of frequency domain signals until the preset storage space holds 10 frames.
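The multi-frame flow above (steps S1 through S9) can be sketched as follows; `contains_speech` and `locate` are hypothetical stand-ins for the VAD module and for the preset positioning plus angle calculation models.

```python
M_TARGET = 10  # preset target frame count checked in step S7

def run_multiframe_localization(frame_stream, contains_speech, locate):
    """Sketch of the Fig. 7 flow: buffer only speech frames; once M_TARGET
    frames are buffered, hand them to the positioning model and reset."""
    buffer, azimuths = [], []
    for frame in frame_stream:              # steps S1-S2: per-frame signals
        if not contains_speech(frame):      # steps S3-S4: VAD energy check
            continue                        # silent frame: discard, collect next
        buffer.append(frame)                # step S6: buffer the speech frame
        if len(buffer) == M_TARGET:         # step S7: target frame count reached?
            azimuths.append(locate(buffer)) # steps S8-S9: model + angle calc
            buffer = []
    return azimuths
```

Frames failing the VAD check never enter the buffer, so the positioning model only ever sees sound source characteristic signals.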
The embodiment of the application provides a sound source positioning method, wherein after a sound source positioning device acquires a first synchronous voice signal and a second synchronous voice signal from a same sound source to be positioned through different sound acquisition modules, the sound source positioning device can firstly and respectively perform frequency domain conversion processing on the voice signals so as to convert the voice signals in a time domain into the first frequency domain signals and the second frequency domain signals. Furthermore, the sound source positioning device may input the frequency domain signal including the voice signal into a preset positioning model for determining probability values of different azimuth angles of the sound source, and further determine a target azimuth angle corresponding to the sound source to be positioned by combining with a preset calculation model, so as to determine the position of the sound source. Therefore, the sound source positioning method provided by the application carries out frequency domain conversion processing on the voice signals, and the sound source positioning device further determines the target azimuth angle corresponding to the sound source to be positioned by combining the preset positioning model after determining that the converted frequency domain signals comprise the voice signals, so that the data required to be processed by the preset positioning model is greatly reduced, the positioning precision can be ensured, and the good balance between the positioning precision and the calculated amount is further realized. And because the sound source positioning method has small calculation amount, the method is suitable for various small-sized equipment, and the positioning method has higher application flexibility.
Based on the foregoing embodiments, in another embodiment of the present application, fig. 8 is a schematic structural diagram of a sound source localization apparatus in an embodiment of the present application, as shown in fig. 8, a sound source localization apparatus 30 in an embodiment of the present application may include a collecting unit 31, a converting unit 32, a determining unit 33, a detecting unit 34, and an executing unit 35,
the acquisition unit 31 is configured to acquire a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound acquisition module and a second sound acquisition module, respectively;
the converting unit 32 is configured to perform frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain a first frequency domain signal and a second frequency domain signal;
the determining unit 33 is configured to determine the first frequency-domain signal and the second frequency-domain signal as sound source feature signals if the first frequency-domain signal and the second frequency-domain signal include speech signals; determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.
Further, in an embodiment of the present application, the first sound collection module and the second sound collection module are disposed at symmetrical positions.
Further, in an embodiment of the present application, the converting unit 32 is specifically configured to perform windowing on the first voice signal and the second voice signal respectively to obtain a windowed first signal and a windowed second signal; and performing Fast Fourier Transform (FFT) processing on the windowed first signal and the windowed second signal respectively to obtain a first frequency domain signal and a second frequency domain signal.
Further, in an embodiment of the present application, the detecting unit 34 is configured to, after obtaining a first frequency domain signal and a second frequency domain signal, and if the first frequency domain signal and the second frequency domain signal include a speech signal, before determining the first frequency domain signal and the second frequency domain signal as sound source feature signals, respectively perform speech endpoint detection processing on the first frequency domain signal and the second frequency domain signal, and obtain a first speech energy value corresponding to the first speech signal and a second speech energy value corresponding to the second speech signal.
Further, in an embodiment of the present application, the determining unit 33 is further configured to determine that the first frequency-domain signal and the second frequency-domain signal include the voice signal if the first voice energy value is greater than or equal to a preset energy threshold and the second voice energy value is greater than or equal to the preset energy threshold.
Further, in an embodiment of the present application, the determining unit 33 is specifically configured to input a plurality of sound source characteristic signals corresponding to a plurality of time instants to the preset positioning model, and output a plurality of angle probability combinations corresponding to the plurality of sound source characteristic signals; the angle probability combination represents the corresponding relation between the azimuth angle and the probability; and determining the target azimuth angle based on the plurality of angle probability combinations and the preset angle calculation model.
Further, in an embodiment of the present application, the determining unit 33 is further specifically configured to determine, by using the angle calculation model, an azimuth angle corresponding to a maximum probability in the multiple angle probability combinations as the target azimuth angle.
Further, in an embodiment of the present application, the executing unit 35 is configured to, after performing frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain a first frequency domain signal and a second frequency domain signal, if the first frequency domain signal and the second frequency domain signal do not include a voice signal, continue to perform the acquisition processing, the frequency domain conversion processing, and the voice endpoint detection processing of the next audio frame.
Further, in the embodiment of the present application, fig. 9 is a schematic structural diagram of a sound source positioning device proposed in the embodiment of the present application. As shown in fig. 9, the sound source positioning device 30 proposed in the embodiment of the present application may further include a processor 36 and a memory 37 storing instructions executable by the processor 36; furthermore, the sound source positioning device 30 may also include a communication interface 38 and a bus 39 for connecting the processor 36, the memory 37, and the communication interface 38.
In an embodiment of the present Application, the Processor 36 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the above processor functions may be other devices, and the embodiments of the present application are not limited in particular. The sound source localization apparatus 30 may further comprise a memory 37, which memory 37 may be connected to the processor 36, wherein the memory 37 is configured to store executable program code comprising computer operating instructions, and the memory 37 may comprise a high speed RAM memory, and may further comprise a non-volatile memory, such as at least two disk memories.
In the embodiment of the present application, the bus 39 is used to connect the communication interface 38, the processor 36, and the memory 37, and to enable intercommunication among these devices.
In the embodiment of the present application, the memory 37 is used for storing instructions and data.
Further, in an embodiment of the present application, the processor 36 is configured to collect, by the first sound collection module and the second sound collection module, a first voice signal and a second voice signal corresponding to a sound source to be positioned, respectively; respectively carrying out frequency domain conversion processing on the first voice signal and the second voice signal to obtain a first frequency domain signal and a second frequency domain signal; determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals if the first frequency domain signal and the second frequency domain signal comprise voice signals; determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.
In practical applications, the Memory 37 may be a volatile Memory (volatile Memory), such as a Random-Access Memory (RAM); or a non-volatile Memory (non-volatile Memory), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk (Hard Disk Drive, HDD) or a Solid-State Drive (SSD); or a combination of the above types of memories and provides instructions and data to the processor 36.
In addition, each functional module in this embodiment may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit can be realized in the form of hardware or in the form of a software functional module.
Based on this understanding, the technical solution of the present embodiment, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method of the present embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the application provides a sound source positioning device, which is characterized in that a first sound acquisition module and a second sound acquisition module are respectively used for acquiring a first voice signal and a second voice signal corresponding to a sound source to be positioned; respectively carrying out frequency domain conversion processing on the first voice signal and the second voice signal to obtain a first frequency domain signal and a second frequency domain signal; if the first frequency domain signal and the second frequency domain signal comprise voice signals, determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals; determining a target azimuth angle corresponding to a sound source to be positioned according to a preset positioning model, a sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles. That is to say, in the embodiment of the present application, after the sound source positioning device collects the audio frames sent by the sound source to be positioned through different sound collection modules and obtains the first speech signal and the second speech signal, the sound source positioning device may first perform frequency domain conversion processing on the speech signals respectively, so as to convert the speech signals in the time domain into the first frequency domain signal and the second frequency domain signal. 
Furthermore, the sound source positioning device can determine a frequency domain signal that includes a voice signal as a sound source characteristic signal, so that the target azimuth angle corresponding to the sound source to be positioned is further determined based on the sound source characteristic signal, a preset positioning model that can be used for determining probability values of different azimuth angles of the sound source to be positioned, and a preset angle calculation model, thereby realizing accurate positioning of the sound source position. In the sound source positioning method provided by the application, frequency domain conversion processing is performed on the voice signals obtained through different sound acquisition channels, and after determining that the converted frequency domain signals include voice signals, the sound source positioning device determines the target azimuth angle corresponding to the sound source to be positioned in combination with the preset positioning model. This greatly reduces the amount of data that the preset positioning model needs to process while ensuring positioning precision, thereby achieving a good balance between positioning precision and calculation amount. Moreover, because the sound source positioning method requires little calculation, it is suitable for various small-sized devices, which gives the positioning method high application flexibility.
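The processing chain described above — acquire per channel, convert to the frequency domain, keep only voice frames, then let the positioning model map the feature signals to angle probabilities and take the argmax — can be summarized in a short sketch. The window choice, energy threshold, model interface, and helper names are illustrative assumptions; the patent does not fix a concrete implementation:

```python
import numpy as np

def to_frequency_domain(frame: np.ndarray) -> np.ndarray:
    """Window a time-domain frame (Hamming assumed), then apply the FFT."""
    return np.fft.rfft(frame * np.hamming(len(frame)))

def locate(first_frames, second_frames, model, energy_threshold=1e-3):
    """Collect sound source feature signals over several audio frames, then
    let the positioning model map them to an {azimuth: probability} mapping
    and return the azimuth with the highest probability."""
    feature_signals = []
    for f1, f2 in zip(first_frames, second_frames):
        s1, s2 = to_frequency_domain(f1), to_frequency_domain(f2)
        # Voice endpoint detection: skip frames lacking voice in EITHER channel.
        if (np.sum(np.abs(s1) ** 2) < energy_threshold or
                np.sum(np.abs(s2) ** 2) < energy_threshold):
            continue
        feature_signals.append((s1, s2))
    angle_probs = model(feature_signals)  # assumed model output: {azimuth: probability}
    return max(angle_probs, key=angle_probs.get)
```

The sketch keeps the claimed two-stage structure: a cheap energy gate rejects non-voice frames before the (comparatively expensive) positioning model runs, which is the source of the reduced calculation amount described above.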
An embodiment of the present application provides a computer-readable storage medium, on which a program is stored, which when executed by a processor implements the sound source localization method as described above.
Specifically, the program instructions corresponding to a sound source localization method in the present embodiment may be stored on a storage medium such as an optical disc, a hard disk, or a USB disk, and when the program instructions corresponding to a sound source localization method in the storage medium are read and executed by an electronic device, the method includes the following steps:
respectively acquiring a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound acquisition module and a second sound acquisition module;
respectively carrying out frequency domain conversion processing on the first voice signal and the second voice signal to obtain a first frequency domain signal and a second frequency domain signal;
determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals if the first frequency domain signal and the second frequency domain signal comprise voice signals;
determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.
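The frequency domain conversion step above is, per the windowing-plus-Fourier-transform formulation elsewhere in this application, a windowed FFT applied frame by frame. A minimal sketch, where the frame length, hop size, and Hamming window are assumptions not fixed by this excerpt:

```python
import numpy as np

def enframe_and_transform(signal: np.ndarray,
                          frame_len: int = 512,
                          hop: int = 256) -> np.ndarray:
    """Split a 1-D voice signal into overlapping frames, window each frame,
    and apply the FFT to obtain one frequency-domain signal per frame."""
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])
    return np.fft.rfft(frames * window, axis=-1)
```

Applying this to each of the two collected voice signals yields the first and second frequency domain signals on which the voice endpoint detection operates.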
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.

Claims (10)

1. A sound source localization method, characterized in that the method comprises:
respectively acquiring a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound acquisition module and a second sound acquisition module;
respectively carrying out frequency domain conversion processing on the first voice signal and the second voice signal to obtain a first frequency domain signal and a second frequency domain signal;
determining the first frequency domain signal and the second frequency domain signal as sound source characteristic signals if the first frequency domain signal and the second frequency domain signal comprise voice signals;
determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.
2. The method of claim 1, wherein the first sound collection module and the second sound collection module are disposed in symmetrical positions.
3. The method according to claim 1, wherein the performing frequency domain conversion processing on the first speech signal and the second speech signal to obtain a first frequency domain signal and a second frequency domain signal respectively comprises:
windowing the first voice signal and the second voice signal respectively to obtain a windowed first signal and a windowed second signal;
and respectively carrying out Fourier transform processing on the windowed first signal and the windowed second signal to obtain the first frequency domain signal and the second frequency domain signal.
4. The method of claim 1, wherein after obtaining the first frequency-domain signal and the second frequency-domain signal and before determining the first frequency-domain signal and the second frequency-domain signal as sound source feature signals if the first frequency-domain signal and the second frequency-domain signal comprise speech signals, the method further comprises:
respectively performing voice endpoint detection processing on the first frequency domain signal and the second frequency domain signal to obtain a first voice energy value corresponding to the first voice signal and a second voice energy value corresponding to the second voice signal;
if the first voice energy value is greater than or equal to a preset energy threshold value and the second voice energy value is greater than or equal to the preset energy threshold value, determining that the first frequency domain signal and the second frequency domain signal comprise the voice signal.
5. The method according to claim 1, wherein the determining a target azimuth corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model comprises:
inputting a plurality of sound source characteristic signals corresponding to a plurality of moments into the preset positioning model, and outputting a plurality of angle probability combinations corresponding to the plurality of sound source characteristic signals; the angle probability combination represents the corresponding relation between the azimuth angle and the probability;
and determining the target azimuth angle based on the angle probability combinations and the preset angle calculation model.
6. The method of claim 5, wherein determining the target azimuth based on the plurality of angle probability combinations and the predetermined angle calculation model comprises:
and determining the azimuth angle corresponding to the maximum probability in the angle probability combinations as the target azimuth angle by using the preset angle calculation model.
7. The method according to any one of claims 1 to 4, wherein after the performing the frequency domain conversion processing on the first speech signal and the second speech signal respectively to obtain a first frequency domain signal and a second frequency domain signal, the method further comprises:
and if the first frequency domain signal and the second frequency domain signal do not comprise a voice signal, continuing to perform acquisition processing, frequency domain conversion processing and voice endpoint detection processing of a next audio frame.
8. A sound source localization device, characterized in that the sound source localization device comprises an acquisition unit, a conversion unit and a determination unit,
the acquisition unit is used for respectively acquiring a first voice signal and a second voice signal corresponding to a sound source to be positioned through a first sound acquisition module and a second sound acquisition module;
the conversion unit is configured to perform frequency domain conversion processing on the first voice signal and the second voice signal respectively to obtain a first frequency domain signal and a second frequency domain signal;
the determining unit is configured to determine the first frequency-domain signal and the second frequency-domain signal as sound source feature signals if the first frequency-domain signal and the second frequency-domain signal include speech signals; determining a target azimuth angle corresponding to the sound source to be positioned according to a preset positioning model, the sound source characteristic signal and a preset angle calculation model; the preset positioning model is used for determining probability values corresponding to different azimuth angles.
9. A sound source localization arrangement, comprising a processor and a memory having stored thereon instructions executable by the processor, wherein the instructions, when executed by the processor, implement the method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a program is stored, for use in a sound source localization arrangement, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.
CN202011340094.9A 2020-11-25 2020-11-25 Sound source positioning method and device and computer storage medium Active CN112466325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011340094.9A CN112466325B (en) 2020-11-25 2020-11-25 Sound source positioning method and device and computer storage medium

Publications (2)

Publication Number Publication Date
CN112466325A true CN112466325A (en) 2021-03-09
CN112466325B CN112466325B (en) 2024-06-04

Family

ID=74808124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011340094.9A Active CN112466325B (en) 2020-11-25 2020-11-25 Sound source positioning method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN112466325B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035900A (en) * 2022-07-01 2022-09-09 深圳魔耳智能声学科技有限公司 Sound source positioning method, sound source positioning device, computer equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008145610A (en) * 2006-12-07 2008-06-26 Univ Of Tokyo Sound source separation and localization method
CN101295015A (en) * 2007-04-23 2008-10-29 财团法人工业技术研究院 Sound source locating system and method
CN103439688A (en) * 2013-08-27 2013-12-11 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN108254721A (en) * 2018-04-13 2018-07-06 歌尔科技有限公司 A kind of positioning sound source by robot and robot
CN109285557A (en) * 2017-07-19 2019-01-29 杭州海康威视数字技术股份有限公司 A kind of orientation sound pick-up method, device and electronic equipment
CN109669159A (en) * 2019-02-21 2019-04-23 深圳市友杰智新科技有限公司 Auditory localization tracking device and method based on microphone partition ring array
CN110082726A (en) * 2019-04-10 2019-08-02 北京梧桐车联科技有限责任公司 Sound localization method and device, positioning device and storage medium
CN111323753A (en) * 2018-12-13 2020-06-23 蔚来汽车有限公司 Method for positioning voice source in automobile
CN111624553A (en) * 2020-05-26 2020-09-04 锐迪科微电子科技(上海)有限公司 Sound source positioning method and system, electronic equipment and storage medium
US20200296508A1 (en) * 2019-03-15 2020-09-17 Honda Motor Co., Ltd. Sound source localization device, sound source localization method, and program
CN111722184A (en) * 2019-03-20 2020-09-29 英特尔公司 Method and system for detecting angle of arrival of sound
CN111885414A (en) * 2020-07-24 2020-11-03 腾讯科技(深圳)有限公司 Data processing method, device and equipment and readable storage medium

Also Published As

Publication number Publication date
CN112466325B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
US11620983B2 (en) Speech recognition method, device, and computer-readable storage medium
CN107221336B (en) Device and method for enhancing target voice
CN107534725B (en) Voice signal processing method and device
TW201828719A (en) Pick-up method and system based on microphone array
WO2014161309A1 (en) Method and apparatus for mobile terminal to implement voice source tracking
CN110379439A (en) A kind of method and relevant apparatus of audio processing
CN111868823B (en) Sound source separation method, device and equipment
US12051429B2 (en) Transform ambisonic coefficients using an adaptive network for preserving spatial direction
CN109618274B (en) Virtual sound playback method based on angle mapping table, electronic device and medium
CN110675887A (en) Multi-microphone switching method and system for conference system
WO2021244056A1 (en) Data processing method and apparatus, and readable medium
CN113870893B (en) Multichannel double-speaker separation method and system
US7116788B1 (en) Efficient head related transfer function filter generation
CN105827793B (en) A kind of speech-oriented output method and mobile terminal
WO2018058989A1 (en) Audio signal reconstruction method and device
CN112466325B (en) Sound source positioning method and device and computer storage medium
CN110890100B (en) Voice enhancement method, multimedia data acquisition method, multimedia data playing method, device and monitoring system
CN112071332B (en) Method and device for determining pickup quality
CN117135280B (en) Voice call method and electronic equipment
CN113223544A (en) Audio direction positioning detection device and method and audio processing system
CN118435278A (en) Apparatus, method and computer program for providing spatial audio
WO2023088156A1 (en) Sound velocity correction method and apparatus
Wang et al. Framewise Multiple Sound Source Localization and Counting Using Binaural Spatial Audio Signals
US20240284134A1 (en) Apparatus, Methods and Computer Programs for Obtaining Spatial Metadata
EP4312214A1 (en) Determining spatial audio parameters

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant