
CN111048113B - Sound direction positioning processing method, device, system, computer equipment and storage medium - Google Patents


Info

Publication number
CN111048113B
Authority
CN
China
Prior art keywords
sound source
sound
source direction
predicted
target object
Prior art date
Legal status
Active
Application number
CN201911311585.8A
Other languages
Chinese (zh)
Other versions
CN111048113A (en)
Inventor
张明远 (Zhang Mingyuan)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201911311585.8A
Publication of CN111048113A
Application granted
Publication of CN111048113B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)
  • Studio Devices (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Navigation (AREA)
  • Image Processing (AREA)

Abstract

The application relates to a sound direction localization processing method, apparatus, system, computer device and storage medium. The method comprises: acquiring voice data collected from an environment; predicting at least one sound source direction based on the voice data; performing image acquisition according to the at least one predicted sound source direction, and identifying morphological features of the language expression part of a target object from the acquired images; and determining a final sound source direction from the at least one predicted sound source direction according to the degree of matching between the morphological features and the voice data. This scheme improves the accuracy of sound direction localization.

Description

Sound direction positioning processing method, device, system, computer equipment and storage medium
Technical Field
The present invention relates to the field of computer technology and the field of speech processing technology, and in particular, to a sound direction localization processing method, apparatus, system, computer device, and storage medium.
Background
With the rapid development of science and technology, voice technology is used more and more widely and is involved in many application scenarios, such as speech recognition or sound direction localization.
In the conventional method, sound is collected by a microphone and the sound direction is localized by analyzing the sound data alone. However, in noisy environments the collected sound data contains a large amount of noise, which can make the sound direction localization inaccurate.
Disclosure of Invention
Based on this, it is necessary to provide a sound direction localization processing method, apparatus, system, computer device and storage medium for solving the problem of inaccurate sound direction localization in the conventional method.
A sound direction positioning processing method comprises the following steps:
acquiring voice data collected from an environment;
predicting at least one sound source direction based on the speech data;
image acquisition is carried out according to at least one predicted sound source direction, and the morphological characteristics of the language expression part of the target object are identified from the acquired images;
and determining a final sound source direction from the predicted at least one sound source direction according to the matching degree between the morphological characteristics and the voice data.
In one embodiment, the voice data is at least two paths of voice data from the same sound source; the predicted sound source direction includes a sound emission angle;
predicting at least one sound source direction from the speech data comprises:
determining phase differences among the paths of voice data;
and predicting the sound emission angle corresponding to the voice data according to the phase differences.
In one embodiment, acquiring speech data collected from an environment includes:
collecting voice data from the environment through a sound collection device array to obtain at least two paths of voice data emitted from the same sound source; the sound collection device array comprises at least two sound collection devices;
image acquisition in accordance with the predicted at least one sound source direction comprises:
and while the sound collection device array continues to collect voice data, controlling the image collection device to collect images according to the predicted at least one sound source direction.
In one embodiment, the predicted at least one sound source direction is at least two; determining a final sound source direction from the predicted at least one sound source direction based on the degree of matching between the morphological feature and the speech data comprises:
obtaining predicted direction values corresponding to the predicted sound source directions respectively; a predicted direction value for characterizing a probability that the speech data originates from a sound source direction;
determining a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted sound source direction and the matching degree corresponding to the image acquired according to the predicted sound source direction; the sound source direction value is used for representing the probability that the predicted sound source direction is the final sound source direction;
and selecting the predicted sound source direction corresponding to the maximum sound source direction value as the final sound source direction.
In one embodiment, the target object is a person object; the language expression part is a lip; the morphological characteristics of the language expression part are mouth shape characteristics;
identifying, from the acquired image, a morphological feature of the linguistic expression part of the target object includes:
positioning a face area of a person object from the acquired image;
identifying a lip region in the face region;
and extracting the mouth shape features from the lip region.
In one embodiment, the acquired images are a continuous sequence of images; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence; determining a final sound source direction from the predicted at least one sound source direction based on the degree of matching between the morphological feature and the speech data comprises:
identifying continuous mouth shape features to obtain a first sentence;
performing voice recognition on the voice data to obtain a second sentence;
matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristics and the voice data;
and determining a final sound source direction from the predicted at least one sound source direction according to the matching degree.
In one embodiment, the image acquired in the final sound source direction includes at least two target objects; the matching degree is the first matching degree;
the method further comprises the steps of:
extracting voiceprint characteristic data of voice data;
searching voiceprint characteristic data of each target object from the stored voiceprint characteristic data;
matching the extracted voiceprint feature data with the searched voiceprint feature data to obtain a second matching degree corresponding to each target object;
and identifying the sounding object from the target object according to the first matching degree and the second matching degree.
In one embodiment, identifying the sounding object from the target objects according to the first degree of matching and the second degree of matching comprises:
obtaining a predicted direction value corresponding to the final sound source direction;
determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree;
and identifying the target object corresponding to the maximum sound direction value as the sounding object.
In one embodiment, searching the voiceprint feature data of each target object from the stored voiceprint feature data includes:
extracting, for each target object, external feature data of the target object from the images acquired according to the final sound source direction;
matching the external feature data with the stored external feature data of target objects;
and acquiring voiceprint feature data stored corresponding to the matched external feature data, and obtaining voiceprint feature data of the target object.
In one embodiment, the target object is a person object; the external feature data comprises face feature data;
extracting the external feature data of each target object from the image acquired according to the final sound source direction comprises the following steps:
positioning a face area corresponding to each target object from the image acquired according to the final sound source direction;
and carrying out face recognition on the face area to obtain face feature data.
In one embodiment, the method further comprises:
for the sound object, the extracted extrinsic feature data and voiceprint feature data of the sound object are stored corresponding to the sound object to update the extrinsic feature data and voiceprint feature data stored corresponding to the sound object.
An acoustic direction localization processing apparatus, the apparatus comprising:
the direction prediction module is used for acquiring voice data acquired from the environment; predicting at least one sound source direction based on the speech data;
the image acquisition module is used for carrying out image acquisition according to at least one predicted sound source direction and identifying the morphological characteristics of the language expression part of the target object from the acquired image;
And the direction positioning module is used for determining the final sound source direction from the predicted at least one sound source direction according to the matching degree between the morphological characteristics and the voice data.
In one embodiment, the voice data is at least two paths of voice data from the same sound source, and the predicted sound source direction includes a sound emission angle. The direction prediction module is further configured to determine phase differences among the paths of voice data, and to predict the sound emission angle corresponding to the voice data according to the phase differences.
In one embodiment, the direction prediction module is further configured to collect, through the sound collection device array, voice data from the environment, to obtain at least two paths of voice data from the same sound source; the sound collection device array comprises at least two sound collection devices; and in the process of keeping the voice data collected by the voice collection device array, controlling the image collection device to collect images according to the predicted at least one sound source direction.
In one embodiment, the predicted at least one sound source direction is at least two. The direction positioning module is also used for obtaining predicted direction values corresponding to the predicted sound source directions respectively; a predicted direction value for characterizing a probability that the speech data originates from a sound source direction; determining a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted sound source direction and the matching degree corresponding to the image acquired according to the predicted sound source direction; the sound source direction value is used for representing the probability that the predicted sound source direction is the final sound source direction; the predicted sound source direction corresponding to the maximum sound source direction value is selected as the final sound source direction.
In one embodiment, the target object is a person object; the language expression part is the lips; and the morphological features of the language expression part are mouth shape features. The direction positioning module is further configured to locate a face region of the person object from the acquired image, identify a lip region in the face region, and extract the mouth shape features from the lip region.
In one embodiment, the acquired images are a continuous sequence of images; the extracted mouth shape features are consecutive mouth shape features corresponding to the image sequence. The direction positioning module is also used for identifying continuous mouth shape characteristics to obtain a first sentence; performing voice recognition on the voice data to obtain a second sentence; matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristics and the voice data; and determining a final sound source direction from the predicted at least one sound source direction according to the matching degree.
In one embodiment, the image acquired in the final sound source direction includes at least two target objects; the degree of matching is a first degree of matching. The apparatus further comprises:
the sounding object recognition module is used for extracting voiceprint characteristic data of voice data; searching voiceprint characteristic data of each target object from the stored voiceprint characteristic data; matching the extracted voiceprint feature data with the searched voiceprint feature data to obtain a second matching degree corresponding to each target object; and identifying the sounding object from the target objects according to the first matching degree and the second matching degree.
In one embodiment, the sounding object recognition module is further configured to obtain a predicted direction value corresponding to the final sound source direction; determine the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree; and identify the target object corresponding to the maximum sound direction value as the sounding object.
In one embodiment, the sound-producing object recognition module is further configured to extract, for each target object, extrinsic feature data of each target object from an image collected in a final sound source direction; matching the external feature data with the stored external feature data of the target object; and acquiring voiceprint feature data stored corresponding to the matched external feature data, and obtaining voiceprint feature data of the target object.
In one embodiment, the target object is a person object; the external feature data includes face feature data. The sounding object recognition module is further configured to locate a face region corresponding to each target object from the image acquired according to the final sound source direction, and to perform face recognition on the face region to obtain face feature data.
In one embodiment, the apparatus further comprises:
the updating storage module is used for storing the extracted external feature data and voiceprint feature data of the sounding object corresponding to the sounding object so as to update the stored external feature data and voiceprint feature data corresponding to the sounding object.
A sound direction localization processing system, the system comprising: sound collection equipment and image collection equipment;
a sound collection device for obtaining voice data collected from an environment; predicting at least one sound source direction based on the speech data;
the image acquisition equipment is used for carrying out image acquisition according to at least one predicted sound source direction and identifying the morphological characteristics of the language expression part of the target object from the acquired image;
the sound collection device is further configured to determine a final sound source direction from the predicted at least one sound source direction based on the degree of matching between the morphological feature and the speech data.
A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps in the sound direction localization processing method of the embodiments of the present application.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps in the sound direction localization processing method of the embodiments of the present application.
According to the above sound direction localization processing method, apparatus, system, computer device and storage medium, at least one sound source direction is predicted for the voice data collected from the environment, image collection is carried out according to the predicted at least one sound source direction, and the final sound source direction is determined according to the degree of matching between the morphological features of the language expression part of the target object in the collected image and the voice data. That is, the voice data and the image data are combined, and the sound source direction is localized using multi-modal data. Thus, even in a noisy environment, the sound source direction can be localized with the assistance of the morphological features of the language expression part in the image, so that, compared with the traditional method of localizing the sound direction from sound data alone, the localization accuracy is improved.
Drawings
FIG. 1 is an application scenario diagram of a sound direction localization processing method in one embodiment;
FIG. 2 is an application scenario diagram of a sound direction localization processing method according to another embodiment;
FIG. 3 is a flow chart of a sound direction localization processing method according to an embodiment;
FIG. 4 is a schematic diagram of a microphone array in one embodiment;
FIG. 5 is a schematic diagram of an image acquisition device in one embodiment;
FIG. 6 is a flowchart illustrating steps of a process for identifying an object of sound production in one embodiment;
FIG. 7 is a flow diagram of a sound direction localization processing method in one embodiment;
FIG. 8 is a block diagram of a sound direction localization processing apparatus in one embodiment;
FIG. 9 is a block diagram of a sound direction localization processing apparatus in another embodiment;
FIG. 10 is a block diagram of a computer device in one embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is an application scenario diagram of a sound direction localization processing method in one embodiment. Referring to fig. 1, the application scenario includes a sound collection device 110 and an image collection device 120 that are communicatively connected. The sound collection device 110 is a device for collecting voice data. The image collection device 120 is a device for capturing images. It will be appreciated that the sound collection device 110 or the image collection device 120 may be a smart device with computer processing capabilities; that is, the sound direction localization processing method in the embodiments of the present application may be performed by the sound collection device 110 or the image collection device 120. For example, a smart speaker or a smart camera can execute the sound direction localization processing method in the embodiments of the present application.
The target object may speak in the environment. The sound collection device 110 may collect voice data from the environment and predict at least one sound source direction based on the voice data. While continuing to collect voice data through its sound collection device array, the sound collection device 110 may communicate with the image collection device 120 to control it to perform image collection according to the predicted at least one sound source direction. The sound collection device 110 may control the image collection device 120 to identify the morphological features of the language expression part of the target object from the collected image. The sound collection device 110 may then determine the degree of matching between the morphological features and the speech data, and determine a final sound source direction from the predicted at least one sound source direction according to that degree of matching.
It will be appreciated that the sound collection device 110 and the image collection device 120 may also be conventional devices without computer processing capabilities. The sound direction localization processing method in the embodiments of the present application is performed by the computer device 130 communicatively connected to the sound collection device 110 and the image collection device 120, respectively. The computer device 130 may be a desktop computer or a mobile terminal, which may include at least one of a cell phone, a tablet computer, a notebook computer, a personal digital assistant, a wearable device, and the like. Fig. 2 is a view of an application scenario of a sound direction localization processing method according to another embodiment.
The target object may speak in the environment. The computer device 130 may collect voice data from the environment by controlling the sound collection device 110 and predict at least one sound source direction based on the voice data. While the sound collection device 110 continues to collect voice data, the computer device 130 may control the image collection device 120 to capture images in the predicted at least one sound source direction. The computer device 130 may then identify, from the acquired image, the morphological features of the language expression part of the target object; determine the matching degree between the morphological features and the voice data; and determine a final sound source direction from the predicted at least one sound source direction according to the matching degree.
It can be appreciated that the sound direction localization processing method in the embodiments of the present application is equivalent to using artificial intelligence technology to automatically localize the sound source direction.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
It can be appreciated that the sound direction localization processing method in the embodiments of the present application may be applied to speech processing scenarios such as speech recognition or speech synthesis. Key speech technologies (Speech Technology) include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is becoming one of its most important interaction modes.
Fig. 3 is a flow chart of a sound direction localization processing method in an embodiment. The method can be applied to a computer device; for illustration, the computer device is taken to be the sound collection device 110 or the computer device 130 in fig. 1. Referring to fig. 3, the method specifically includes the following steps:
S302, acquiring voice data collected from the environment.
Wherein, the environment is composed of various natural factors. The environment may include a real environment and a virtual environment. The real environment is an environment existing in real life. The virtual environment is an environment obtained by simulating a real environment.
In particular, the voice data may be data that has been collected in advance, and the computer device may directly acquire the voice data collected from the environment. The computer device may also perform voice capture processing to capture voice data from the environment.
The target object may emit sound in the environment. The computer device may perform voice detection to detect whether there is voice input in the environment and, if so, collect the input voice data from the environment.
Among them, the target object is an object that provides a sound source in the environment, that is, an object that emits sound to produce the voice data. For example, if a person speaks in the environment and the computer device collects the voice data of that speech, the person is the target object. It will be appreciated that when the environment is a virtual environment, the target object may be a virtual object, i.e., a virtually generated target object.
It will be appreciated that when the computer device itself has a sound collection capability, it may collect the voice data from the environment by itself; for example, an intelligent sound collection device may collect the voice data directly. When the computer device does not have a sound collection function, it may collect the voice data from the environment by controlling a separate sound collection device.
In one embodiment, the computer device may collect voice data from the environment through a single sound collection channel sound collection device. In another embodiment, the computer device may also acquire voice data from the environment through the sound collection device array of the multiple sound collection channels to obtain at least two paths of voice data. It will be appreciated that at least two sound collection devices are included in the array of sound collection devices.
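Purely for illustration, multi-channel capture through such an array might look like the following sketch. It assumes the python-sounddevice library, a 16 kHz sampling rate, and a four-channel array; none of these choices are specified by the present application.

```python
import sounddevice as sd

SAMPLE_RATE = 16000   # assumed sampling rate
CHANNELS = 4          # assumed number of sound collection channels in the array

def collect_voice_data(seconds=2.0):
    """Record multi-channel voice data; each channel is one path of voice data."""
    frames = int(seconds * SAMPLE_RATE)
    recording = sd.rec(frames, samplerate=SAMPLE_RATE, channels=CHANNELS)
    sd.wait()                      # block until the recording is finished
    return recording               # numpy array of shape (frames, CHANNELS)
```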
S304, predicting at least one sound source direction according to the voice data.
The sound source direction refers to the direction of the sound source, i.e. the direction of generating voice data.
In particular, the computer device may analyze the speech data to predict at least one sound source direction.
It will be appreciated that the computer device may collect multiple voice data through multiple voice collection channels and predict at least one sound source direction by analyzing the multiple voice data. It is understood that the multi-path voice data means at least two paths of voice data. A sound collection channel collects a path of voice data.
In other embodiments, the computer device may also collect single-pass speech data through a single-pass speech collection channel, and predict at least one sound source direction based on the single-pass speech data.
In one embodiment, the voice data is at least two paths of voice data from the same sound source, and the predicted sound source direction includes a sound emission angle. In this embodiment, step S304 includes: determining phase differences among the paths of voice data; and predicting the sound emission angle corresponding to the voice data according to the phase differences.
The sound emission angle is the angle of the position of the sound-emitting object relative to the sound collection device.
Phase is the position of a wave within its cycle at a particular moment.
It will be appreciated that different paths of voice data are collected by different sound collection channels, and the distances between these channels and the target object differ, so the paths of voice data collected for the same sound source have different phases. Thus, there is a phase difference between the paths of voice data from the same sound source; in other words, the phase difference characterizes the difference in distance between the sound collection channels and the target object. Since the relative positions of the sound collection channels are known, the sound emission angle corresponding to the voice data can be predicted from the phase difference. The sound emission angle may be a planar angle, i.e., the angle, in the horizontal plane, of the position of the sound-emitting object relative to the sound collection device.
It should be noted that the computer device may calculate the phase differences between the multiple paths of voice data for each candidate direction and then predict the corresponding sound emission angles, i.e., obtain one or more predicted sound source directions.
In the above embodiments, predicting the sound emission angle, i.e., the sound source direction, from the phase differences between paths of speech data from the same sound source makes it possible to predict at least one sound source direction quickly and accurately.
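As an illustrative aside, the geometric relationship between a two-channel phase difference and the sound emission angle can be sketched as follows under a far-field (plane-wave) assumption; the function name, microphone spacing, and 343 m/s speed of sound are assumptions for the example, not values given in this application.

```python
import numpy as np

def emission_angle_from_phase(phase_diff_rad, frequency_hz, mic_distance_m, c=343.0):
    """Estimate the sound emission angle from the phase difference between two
    microphone channels, assuming a far-field (plane-wave) source."""
    # Convert the phase difference at a given frequency into a time delay.
    time_delay = phase_diff_rad / (2.0 * np.pi * frequency_hz)
    # Under the far-field assumption, delay = d * sin(theta) / c.
    sin_theta = np.clip(c * time_delay / mic_distance_m, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))

# Example: a quarter-cycle phase lag at 1 kHz across a 10 cm microphone spacing.
print(emission_angle_from_phase(np.pi / 2, 1000.0, 0.10))  # about 59 degrees
```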
S306, image acquisition is carried out according to at least one predicted sound source direction, and the morphological characteristics of the language expression part of the target object are identified from the acquired images.
Among them, the target object is an object that provides a sound source in the environment, that is, an object that emits sound to produce the voice data. The language expression part is a part of the body used to express language, and its morphology refers to its posture or appearance. It will be appreciated that the target object may be accompanied by a morphological change in the language expression part when speaking; for example, the target object may exhibit a change in the lips, a change in gestures, or a change in body posture. The morphological features of the language expression part change as the target object speaks to express different languages. That is, if the language expressed by the language expression part of the target object is different, the morphological features of the language expression part are different.
It is understood that the language expression site may be a site where a posture change can occur. It is understood that the language may be expressed by a change in posture.
In one embodiment, the language expression site includes at least one of a hand, a lip, and other body parts capable of a change in posture, etc. The morphological characteristics of the speech expression part include at least one of gesture characteristics, mouth shape characteristics and body posture characteristics.
In one embodiment, the target object may comprise a person object, i.e. the target object may comprise a person. The language expression site may include a lip. It will be appreciated that the sound produced by the person is ultimately output via the opening and closing of the lips, and therefore the speech expression site may include lips. The morphological features of the language expression site may then include mouth shape features. The mouth shape, i.e. the shape of the mouth of a person, is the shape of the lips when making a certain sound.
It will be appreciated that when the target object is a human object, the language expression site may also include a hand and other body parts where a change in posture can occur. It will be appreciated that the target object may express language by way of gesture changes (i.e., gestures) that occur with the hand, body gesture changes, and the like.
In other embodiments, the target object may also include a robotic object or an animal object, or the like. The language expression site may then comprise the mouth of a robot or the mouth of an animal subject.
In particular, the computer device may perform image acquisition according to the predicted at least one sound source direction, i.e. acquire images located in the sound source direction.
It should be noted that, the process of collecting voice data is a continuous process, and the computer device still collects voice data from the environment synchronously during the process of collecting images. That is, voice data is collected from the environment while image collection is performed in accordance with the predicted at least one sound source direction. Therefore, the voice data in the embodiments of the present application refers to continuously collected voice data.
It will be appreciated that when the computer device itself is provided with image acquisition functionality, image acquisition may be performed by the computer device itself in accordance with the predicted at least one sound source direction. When the computer equipment does not have the image acquisition function, the computer equipment can perform image acquisition according to the predicted at least one sound source direction by controlling the image acquisition equipment.
It should be noted that the computer device may be a device integrating the image capturing function and the sound capturing function. The computer device may be a device having an integrated image capturing function or a sound capturing function alone. The computer device may also be a device that does not have an integrated image capturing function and sound capturing function itself, but captures images by controlling the image capturing device and voice data by controlling the sound capturing device.
It will be appreciated that the target object speaks in the environment; therefore, after predicting at least one sound source direction from the speech data, the computer device performs image acquisition in that direction while continuing to collect the speech data. Thus, if the target object is still speaking, the speaking target object can be photographed and, at the same time, the content of the speech can be obtained from the collected voice data.
It is understood that at least two sound source directions may be predicted. The computer device may therefore perform image acquisition for each predicted sound source direction.
In particular, the computer device may identify the target object from the acquired image. When the target object is identified, the language expression part of the target object is positioned from the image area corresponding to the target object, and further, the morphological characteristics of the language expression part are extracted.
It will be appreciated that when the target object is identified from the acquired image, step S306 is performed. When the target object is not recognized from the acquired image, it may be determined that the voice data is noise data, and voice detection is continued to detect whether there is a voice input in the environment, and if so, step S302 and the subsequent steps are re-performed.
In one embodiment, when the acquired images are at least two consecutive images (i.e., a sequence of images), the computer device may then perform step S306 separately for each image acquired.
In one embodiment, at least two target objects may be included in the same acquired image, in which case the computer device may then perform step S306 for each target object. That is, from the acquired images, the morphological feature of the language expression site of each target object is identified. For example, a plurality of persons are included in one image, and then the morphological feature of the language expression part of each person can be identified from the image.
In one embodiment, step S302 of acquiring speech data collected from the environment includes: collecting voice data from the environment through a sound collection device array to obtain at least two paths of voice data emitted from the same sound source, the sound collection device array comprising at least two sound collection devices. In this embodiment, the image acquisition according to the predicted at least one sound source direction in step S306 includes: while the sound collection device array continues to collect voice data from the environment, controlling the image collection device to collect images according to the predicted at least one sound source direction.
The sound collection device array is an array obtained by arranging at least two sound collection devices according to a preset rule. Each sound collection device in the array is equivalent to one sound collection channel, so that voice can be collected from the environment to obtain at least two paths of voice data.
In one embodiment, the array of sound collection devices may be an array of microphones. Microphones are included in the microphone array. It will be appreciated that the microphone array may comprise at least two sound collection channels.
It will be appreciated that the phase differences between the various paths of speech data are used to characterize the phase differences of sound collection channels from the same source to different locations in the array of sound collection devices.
Fig. 4 is a schematic diagram of a microphone array in one embodiment. Referring to fig. 4, a sound collection device 402 with multiple microphones integrated therein includes multiple sound collection channels. As can be seen from fig. 4, there is a phase difference between the voice data collected for the user 404 in the sound collection channel located at the region S1 compared with the sound collection channel located at the region S2. Because there is a difference in distance between the location of the user 404 and the areas S1 and S2, there is a phase difference. It will be appreciated that there may also be a phase difference between the voice data collected by the sound collection channels at different locations in the same region.
In the above embodiment, the plurality of paths of voice data are collected by the sound collection device array, so that the phase differences between the plurality of paths of voice data have a certain regularity, and the efficiency of predicting at least one sound source direction is improved. Moreover, compared to a single sound collection device, the accuracy of collecting the voice data is improved, and the accuracy of predicting at least one sound source direction is further improved.
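One common, purely illustrative way to estimate the inter-channel delay that such a phase difference encodes is generalized cross-correlation with phase transform (GCC-PHAT); the sketch below is not the algorithm disclosed here, and the sampling rate and test signal are invented for the example.

```python
import numpy as np

def gcc_phat_delay(sig_a, sig_b, sample_rate):
    """Estimate the inter-channel time delay (seconds) via GCC-PHAT.
    A negative value means sig_b lags sig_a."""
    n = len(sig_a) + len(sig_b)
    spec = np.fft.rfft(sig_a, n=n) * np.conj(np.fft.rfft(sig_b, n=n))
    spec /= np.abs(spec) + 1e-12              # phase transform weighting
    cc = np.fft.irfft(spec, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / sample_rate

# Example: the second channel lags the first by 5 samples of a synthetic signal.
fs = 16000
t = np.arange(fs) / fs
a = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(fs)
b = np.roll(a, 5)
print(gcc_phat_delay(a, b, fs) * fs)          # close to -5 samples
```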
In one embodiment, the image collection device may be a panoramic camera, i.e., a camera that can collect a 360-degree panoramic image around its collection point. In other embodiments, the image collection device may instead be a single camera capable of rotating through a preset angle (e.g., 360 degrees), or at least two rotatable cameras; in this way, image acquisition can still be performed in any predicted sound source direction. It can be appreciated that using a panoramic camera to collect images improves the completeness of the images and thus the accuracy of sound source localization.
Fig. 5 is a schematic diagram of an image acquisition apparatus in one embodiment. Referring to fig. 5, the panoramic camera can collect 360-degree images around, so that no dead angle exists in image collection.
S308, determining a final sound source direction from the predicted at least one sound source direction according to the matching degree between the morphological characteristics and the voice data.
It will be appreciated that the morphological characteristics of the language expression site vary as the target object sounds, i.e., the morphological characteristics of the language expression site differ in the case where the target object emits different sounds. Therefore, the morphological characteristics of the language expression part can reflect the sound content of the target object.
Therefore, the computer device can match the morphological feature of the language expression part with the voice data to judge whether the sound content represented by the morphological feature of the language expression part is matched with the voice data or not, or judge the similarity between the sound content represented by the morphological feature of the language expression part and the voice data, so as to obtain the matching degree. In one embodiment, the degree of matching may include a match success or a match failure. In one embodiment, the degree of matching may also include a degree of similarity between the morphological features of the language expression site and the speech data.
It is understood that when the language expression part is the lips, the above-described process of matching the morphological features of the language expression part with the voice data corresponds to a lip-reading process: the lip language is recognized and then matched against the voice data.
It can be understood that when at least two sound source directions are predicted, the morphological features of the language expression part in the image collected for each predicted sound source direction can be matched with the voice data separately, so as to obtain a matching degree for each predicted sound source direction.
In one embodiment, when there is only one predicted sound source direction, the computer device may determine that this predicted direction is the final sound source direction when the matching succeeds or the similarity is greater than or equal to a preset threshold. It is understood that when the matching fails or the similarity is smaller than the preset threshold, it is determined that the predicted direction is not the final sound source direction; the computer device may then re-execute step S304 to correct the predicted sound source direction based on the matching degree.
In one embodiment, when at least two sound source directions are predicted, the computer device may determine, according to the matching degrees, the probability that each predicted sound source direction is the final sound source direction, and select the predicted sound source direction with the highest probability as the final sound source direction.
It will be appreciated that the computer device may determine the final sound source direction from the predicted at least one sound source direction based on only one factor of the degree of matching. The computer device may also determine the final sound source direction from the predicted at least one sound source direction, along with several other factors.
The sound direction localization processing method predicts at least one sound source direction from the voice data collected from the environment, performs image collection according to the predicted at least one sound source direction, and determines the final sound source direction according to the degree of matching between the morphological features of the language expression part of the target object in the acquired image and the voice data. That is, the voice data and the image data are combined, and the sound source direction is localized using multi-modal data. Thus, even in a noisy environment, the sound source direction can be localized with the assistance of the morphological features of the language expression part in the image, so that, compared with the traditional method of localizing the sound direction from sound data alone, the localization accuracy is improved: the accuracy is improved by 45% and more than 90% of environmental noise interference is effectively filtered out.
In one embodiment, at least two sound source directions are predicted. Determining a final sound source direction from the predicted sound source directions based on the degree of matching between the morphological features and the speech data comprises: obtaining a predicted direction value corresponding to each predicted sound source direction; determining a sound source direction value for each predicted sound source direction according to its predicted direction value and the matching degree corresponding to the image acquired in that direction; and selecting the predicted sound source direction corresponding to the maximum sound source direction value as the final sound source direction.
The predicted direction value characterizes the probability that the speech data originates from the corresponding predicted sound source direction. The sound source direction value characterizes the probability that the predicted sound source direction is the final sound source direction.
It will be appreciated that in step S304 the computer device may predict the sound source directions for the voice data and obtain a predicted direction value for each predicted direction. In one embodiment, the computer device may predict the sound source directions for the voice data through a DOA (Direction of Arrival) algorithm to obtain the predicted direction values of the predicted directions.
It can be understood that the matching degree corresponding to the image collected according to the predicted at least one sound source direction is the matching degree between the morphological feature of the language expression part of the target object in the image collected according to the predicted at least one sound source direction and the voice data.
The computer device may determine a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted at least one sound source direction and a degree of matching corresponding to an image collected according to the predicted at least one sound source direction. Thus, each predicted sound source direction corresponds to a sound source direction value. The computer device may determine a maximum sound source direction value therefrom and take as a final sound source direction a predicted sound source direction corresponding to the maximum sound source direction value.
In one embodiment, the degree of matching is a similarity value between the morphological features of the language expression part and the voice data. The computer device may then determine the sound source direction value of each predicted sound source direction according to its predicted direction value and the similarity value corresponding to the image collected in that direction.
Specifically, the computer device may directly add the predicted direction value and the similarity value to obtain a sound source direction value corresponding to the sound source direction. The computer device may also perform weighted average calculation on the predicted direction value and the similarity value according to a preset confidence coefficient, to obtain a sound source direction value corresponding to the sound source direction.
It can be understood that the preset confidence coefficients may be obtained by analyzing prior data, so that they reflect an empirically meaningful weighting rather than being arbitrarily specified.
In one embodiment, the language expression part is the lips. The matching degree is then the lip recognition result, i.e., the similarity value obtained after lip recognition processing. The computer device may determine the sound source direction value corresponding to a sound source direction according to the following formula:
Vd = DOA × W1 + Lip(face) × W2
where Vd is the sound source direction value of each predicted sound source direction; DOA is the predicted direction value of the predicted sound source direction calculated by the DOA algorithm; Lip(face) denotes the lip recognition result (i.e., the similarity value), with "face" indicating that the face region is first located in the image and the lip shape is then located for lip recognition; and W1 and W2 are the confidence coefficients corresponding to the predicted direction value and the lip recognition result, respectively.
In the above embodiment, the sound source direction value of each predicted sound source direction is determined from its predicted direction value and the matching degree of the image collected in that direction. This amounts to first predicting candidate sound source directions from the speech and then refining the prediction by combining the speech analysis result with the image recognition result, which improves the accuracy of the finally determined sound source direction.
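The following sketch merely illustrates the weighted combination and maximum-value selection described above; the candidate direction values, similarity values, and the confidence coefficients W1 = 0.6 and W2 = 0.4 are made-up example numbers, not values from this application.

```python
# Illustrative fusion of DOA prediction values and lip-recognition similarity values.
W1, W2 = 0.6, 0.4   # confidence coefficients for the DOA value and the lip result

candidates = [
    {"angle_deg": 30,  "doa_value": 0.70, "lip_similarity": 0.20},
    {"angle_deg": 120, "doa_value": 0.55, "lip_similarity": 0.90},
    {"angle_deg": 250, "doa_value": 0.40, "lip_similarity": 0.10},
]

for c in candidates:
    # Vd = DOA * W1 + Lip(face) * W2
    c["vd"] = c["doa_value"] * W1 + c["lip_similarity"] * W2

final = max(candidates, key=lambda c: c["vd"])
print(final["angle_deg"], round(final["vd"], 3))   # 120 0.69
```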
In one embodiment, the target object is a person object; the language expression part is the lips; and the morphological features of the language expression part are mouth shape features. In this embodiment, identifying the morphological features of the language expression part of the target object from the acquired image includes: positioning a face region of the person object from the acquired image; identifying a lip region in the face region; and extracting the mouth shape features from the lip region.
It will be appreciated that other objects than person objects may be included in the captured image. Therefore, the computer device can locate the face region of the person object from the acquired image. Further, a lip region is identified from the located face region.
Specifically, the computer device may match the object in the captured image with a preset person object template, thereby locating the face region of the person object from the captured image. The computer device may also convolve the image to locate the face region therefrom.
In one embodiment, the computer device may determine, according to a preset lip position, a corresponding region from the located face regions, and use the determined region as the lip region. In another embodiment, the computer device may also convolve the facial region image with a pre-trained convolutional neural network to identify the lip region therefrom.
Further, the computer device may extract mouth shape features from the identified lip region.
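The following is a minimal sketch of the "preset lip position" variant described above: a face bounding box is assumed to have already been located, the lower-middle part of the face is taken as the lip region, and a simple intensity profile serves as a stand-in mouth shape descriptor. The region proportions and the descriptor are illustrative assumptions, not the method's required implementation.

```python
import numpy as np

def lip_region_from_face(face_box):
    """face_box = (x, y, w, h) of a located face; return an assumed lip region
    covering the lower-middle part of the face."""
    x, y, w, h = face_box
    return (x + int(0.25 * w), y + int(0.65 * h), int(0.50 * w), int(0.25 * h))

def mouth_shape_feature(gray_image, lip_box):
    """A toy mouth shape descriptor: the mean intensity of the lip patch followed
    by its normalized row-wise intensity profile."""
    x, y, w, h = lip_box
    patch = gray_image[y:y + h, x:x + w].astype(np.float32)
    profile = patch.mean(axis=1)                    # one value per row of the patch
    return np.concatenate(([patch.mean()], profile / (profile.max() + 1e-6)))

# Usage with a synthetic grayscale frame and an assumed face detection result.
frame = np.random.randint(0, 255, (480, 640), dtype=np.uint8)
face = (200, 120, 160, 200)                          # hypothetical face bounding box
feature = mouth_shape_feature(frame, lip_region_from_face(face))
```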
It will be appreciated that when the acquired image is a single image, the computer device may determine whether the person object is speaking based on its corresponding mouth shape features. In one embodiment, when it is determined that the person object is speaking, it may be determined that the mouth shape feature matches the voice data. In another embodiment, when it is determined that the person object is speaking, the computer device may further identify the identity of the person object through face recognition and, based on that identity, look up the voiceprint feature data corresponding to the person object. The computer device may extract voiceprint feature data from the voice data and match the looked-up voiceprint feature data with the extracted voiceprint feature data; if they match, it determines that the mouth shape feature of the person object matches the voice data.
In one embodiment, the acquired images are a continuous sequence of images; the extracted mouth shape features are consecutive mouth shape features corresponding to the image sequence. In this embodiment, step S308 determines a final sound source direction from the predicted at least one sound source direction according to the matching degree between the morphological feature and the voice data, including: identifying continuous mouth shape features to obtain a first sentence; performing voice recognition on the voice data to obtain a second sentence; matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristics and the voice data; and determining a final sound source direction from the predicted at least one sound source direction according to the matching degree.
Specifically, the computer device may perform sentence recognition processing (i.e., perform lip recognition processing) on the continuous mouth shape feature, to obtain a first sentence. The computer device may perform a speech recognition process on the speech data to obtain a second sentence. The computer device may perform a matching process on the first sentence and the second sentence to obtain a degree of matching between the mouth shape feature and the voice data.
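As a sketch of this sentence matching step, the snippet below scores the lip-read first sentence against the ASR second sentence using a character-level similarity from the Python standard library; the patent does not prescribe a particular string matching algorithm, so this measure is only an assumption.

```python
import difflib

def sentence_match_degree(first_sentence: str, second_sentence: str) -> float:
    """Similarity in [0, 1] between the lip recognition result (first sentence)
    and the speech recognition result (second sentence)."""
    return difflib.SequenceMatcher(None, first_sentence, second_sentence).ratio()

# A higher value means the continuous mouth shape features agree better with
# the voice data collected in that predicted sound source direction.
degree = sentence_match_degree("turn on the light", "turn on the lights")
```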
In the above embodiment, the face region of the person object is located in the acquired image, the lip region is identified within the face region, and the mouth shape features are extracted from the lip region. The mouth shape features produced while speaking can thus be extracted accurately, which improves the accuracy of using lip shape to assist in locating the sounding direction.
In one embodiment, the image acquired in the final sound source direction includes at least two target objects; the matching degree between the morphological characteristics of the language expression part and the voice data is the first matching degree. As shown in fig. 6, the method further includes a sound object recognition processing step, specifically including the following steps:
s602, extracting voiceprint characteristic data of voice data.
A voiceprint feature is a characteristic of each object's voice that is unique to that object and distinguishes it from other objects. The final sound source direction is the direction from which the sound is ultimately determined to have been emitted. The sounding object is the object that produces the sound.
It will be appreciated that when the image acquired in the final sound source direction includes only one target object, that target object may be directly determined to be the sounding object. When the image acquired in the final sound source direction includes at least two target objects, the sounding object can be identified from among these target objects through the processing of steps S602 to S608.
S604, searching the voiceprint characteristic data of each target object from the stored voiceprint characteristic data.
The stored voiceprint feature data is prestored voiceprint feature data.
It is understood that the stored voiceprint feature data may not contain voiceprint feature data for each target object. Therefore, when the voiceprint feature data of each target object is found in the stored voiceprint feature data, steps S606 to S608 are performed; when it is not found, no further processing is required.
In one embodiment, the computer device may identify the identity of each target object by external feature recognition, and then, according to the determined identity, search the voiceprint feature data of each target object from the stored voiceprint feature data.
In one embodiment, step S604 of searching for the voiceprint feature data of each target object from the stored voiceprint feature data includes: for each target object, extracting external feature data of that target object from the image acquired according to the final sound source direction; matching the external feature data with stored external feature data of target objects; and obtaining the voiceprint feature data stored in correspondence with the matched external feature data, thereby obtaining the voiceprint feature data of each target object.
Specifically, the computer device may perform external feature recognition on each target object in the image acquired according to the final sound source direction, extract the external feature data corresponding to each target object, match the extracted external feature data with pre-stored external feature data, and obtain the voiceprint feature data stored in correspondence with the matched external feature data, thereby obtaining the voiceprint feature data of each target object. This is equivalent to looking up a person's pre-stored voiceprint features through the stored correspondence between external feature data and voiceprint data, so the voiceprint features can be determined quickly and accurately.
In one embodiment, the target object is a person object; the extrinsic feature data includes face feature data. Extracting the external feature data of each target object from the image acquired according to the final sound source direction includes: locating a face area corresponding to each target object in the image acquired according to the final sound source direction; and performing face recognition on the face area to obtain face feature data.
And S606, matching the extracted voiceprint feature data with the searched voiceprint feature data to obtain a second matching degree corresponding to each target object.
Specifically, for each target object, the computer device may perform similarity matching on the voiceprint feature data extracted for the target object and the searched voiceprint feature data, so as to obtain a second matching degree corresponding to the target object.
It is understood that the second degree of matching may include a similarity value. The second degree of matching may also include a match success or a match failure.
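A hedged sketch of steps S604 and S606 is shown below: the stored voiceprint of each target object is looked up via its external (face) features, and the second matching degree is then computed as a similarity between that stored voiceprint and the voiceprint extracted from the voice data. The registry contents, the threshold and the use of cosine similarity are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Pre-stored (external feature, voiceprint feature) pairs for registered objects.
registry = [
    {"face": np.array([0.1, 0.9, 0.3]), "voiceprint": np.array([0.7, 0.2, 0.5])},
    {"face": np.array([0.8, 0.1, 0.4]), "voiceprint": np.array([0.1, 0.9, 0.3])},
]

def second_matching_degree(face_feature, extracted_voiceprint, threshold=0.8):
    """Look up the stored voiceprint whose face entry best matches face_feature,
    then score it against the voiceprint extracted from the voice data."""
    best = max(registry, key=lambda entry: cosine(face_feature, entry["face"]))
    if cosine(face_feature, best["face"]) < threshold:
        return 0.0   # no stored voiceprint found for this target object
    return cosine(extracted_voiceprint, best["voiceprint"])
```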
S608, identifying the sounding object from the target objects according to the first matching degree and the second matching degree.
The first matching degree is the matching degree between the morphological characteristics of the language expression part of the target object in the image acquired according to the final sound source direction and the voice data. It will be appreciated that there is a corresponding first degree of matching for each target object in the image acquired in the final sound source direction.
In particular, the computer device may identify the sound object from the target object in combination with the first degree of matching and the second degree of matching.
It is understood that the computer device may identify the sound object from the target object based on the first degree of matching and the second degree of matching. The computer device may also analyze to identify the sound object from the target object based on factors other than the first degree of matching and the second degree of matching, such as a predicted direction value of the predicted sound source direction.
In this embodiment, multi-modal data such as the image recognition and voiceprint recognition results are combined to assist in locating the sounding object, which improves the accuracy of locating the sounding object.
In one embodiment, identifying the sound object from the target object based on the first degree of matching and the second degree of matching comprises: obtaining a predicted direction value corresponding to the final sound source direction; determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree; and identifying the target object corresponding to the maximum sound direction value as a sound object.
It will be appreciated that since the prediction of the sound source direction is affected by the different sound-generating objects, the final sound source direction also affects the recognition of the sound-generating objects to some extent. Therefore, the factor of the sound source direction can be taken into consideration when recognizing the sound object.
Specifically, the computer device may obtain a predicted direction value corresponding to the final sound source direction. For the same target object, the computer device may directly sum the predicted direction value and the first matching degree and the second matching degree corresponding to the target object to obtain the sound direction value of each target object. The computer device may also perform weighted average calculation on the predicted direction value and the first matching degree and the second matching degree corresponding to the same target object according to a preset confidence coefficient, so as to obtain a sound direction value of each target object.
It can be understood that when the sounding object is identified, the multi-dimensional and multi-modal factors such as the predicted direction value of the sound source direction, the image identification result, the voiceprint identification result and the like are considered, so that the accuracy of positioning the sounding object is improved.
In one embodiment, the target object is a person object; the language expression part is a lip. The computer device may determine the sound direction value of the target object according to the following formula:
Pd = DOA × W1 + Lip(Face) × W2 + Voiceprint × W3
where Pd is the sound direction value of each target object; DOA is the predicted direction value corresponding to the final sound source direction calculated by the DOA algorithm; Lip(Face) denotes the lip recognition result (i.e., the first matching degree); Voiceprint denotes the voiceprint recognition result (i.e., the second matching degree); Face indicates that the face region is first located in the image and the lip region is then located within it for lip recognition; and W1, W2 and W3 are the confidence coefficients corresponding to the predicted direction value, the lip recognition result and the voiceprint recognition result, respectively.
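For illustration, the sketch below evaluates Pd for each target object in the image and takes the object with the largest value as the sounding object; the confidence coefficients and the per-object scores are assumed example numbers, not values from the patent.

```python
# Illustrative sketch of Pd = DOA * W1 + Lip(Face) * W2 + Voiceprint * W3.
W1, W2, W3 = 0.4, 0.3, 0.3   # assumed confidence coefficients

def sound_direction_value(doa_value, first_matching, second_matching):
    """Combine the predicted direction value with the lip recognition result
    (first matching degree) and the voiceprint recognition result (second matching degree)."""
    return W1 * doa_value + W2 * first_matching + W3 * second_matching

targets = {
    "object_a": sound_direction_value(0.9, 0.85, 0.80),
    "object_b": sound_direction_value(0.9, 0.30, 0.25),
}
sounding_object = max(targets, key=targets.get)   # "object_a"
```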
In one embodiment, the first degree of matching may include a first similarity value between the morphological feature of the language expression site of the target object and the speech data. The second matching degree may include a second similarity value between the voiceprint feature data extracted for the target object and each of the voiceprint feature data found.
It will be appreciated that this amounts to identifying the sounding object by comprehensively considering the sound source direction, the voiceprint features and the lip recognition result, which can improve the accuracy of locating the sounding object. In other embodiments, the computer device may also take facial features into account when identifying the sounding object. For example, a third matching degree between the face feature data extracted for the target object and the found stored face feature data may be combined with the predicted direction value, the first matching degree (lip recognition result) and the second matching degree (voiceprint recognition result) to identify the sounding object from the target objects, further improving the accuracy of locating the sounding object.
It will be appreciated that in one embodiment the method further includes: extracting voiceprint feature data from the voice data; searching the stored voiceprint feature data for entries matching the extracted voiceprint feature data, and then retrieving the external feature data stored in correspondence with the matched voiceprint feature data; performing external feature recognition on each target object in the image collected in the final sound source direction and extracting the corresponding external feature data; and matching each piece of extracted external feature data against the retrieved external feature data to obtain an external feature matching degree. The computer device may then identify the sounding object from the target objects according to the predicted direction value corresponding to the final sound source direction and the external feature matching degree.
Further, the computer device may also look up the stored voiceprint feature data corresponding to each target object according to that object's extracted external feature data, and match the found voiceprint feature data against the voiceprint feature data extracted from the voice data to obtain a voiceprint recognition result.
Similarly, the computer device may identify the sounding object from the target objects according to factors such as the predicted direction value corresponding to the final sound source direction, the morphological recognition result of the language expression part (such as the lip recognition result), the external feature matching degree and the voiceprint recognition result.
It should be noted that, as the above embodiments show, the external feature recognition and the voiceprint recognition may be performed in different orders; either may be performed first, and this is not limited here.
In one embodiment, the method further comprises: for the sound object, the extracted extrinsic feature data and voiceprint feature data of the sound object are stored corresponding to the sound object to update the extrinsic feature data and voiceprint feature data stored corresponding to the sound object.
It can be appreciated that storing the extracted external feature data and voiceprint feature data in correspondence with the identified sounding object serves to mark them accordingly. This assists in correcting the stored voiceprint feature data and external feature data, so that the sounding object can be located more quickly and accurately when the stored data is used later.
In one embodiment, the computer device may further perform speech enhancement on the voice data in the final sound source direction, attenuate noise from other irrelevant directions, and store the enhanced voice data in correspondence with the final sound source direction. The enhanced voice data can then be used for speech recognition processing, improving the accuracy of speech recognition.
In one embodiment, the computer device may use a beamforming algorithm (a signal processing technique that controls the directivity of signal reception so that a chosen direction is favored) to enhance the voice data in the final sound source direction and filter out noise from uncorrelated directions.
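A minimal delay-and-sum sketch for a uniform linear microphone array is given below, assuming the final sound source direction has already been determined; the array geometry, sample rate and the circular-shift approximation are simplifying assumptions, not the patent's required enhancement algorithm.

```python
import numpy as np

def delay_and_sum(signals, angle_deg, mic_spacing=0.05, fs=16000, c=343.0):
    """signals: (num_mics, num_samples) array of synchronized microphone data.
    Steer the array toward angle_deg and average the delay-compensated channels,
    which enhances speech from that direction and attenuates uncorrelated noise."""
    num_mics, num_samples = signals.shape
    out = np.zeros(num_samples)
    for m in range(num_mics):
        # Per-microphone plane-wave delay for a source at angle_deg (far field).
        tau = m * mic_spacing * np.sin(np.deg2rad(angle_deg)) / c
        out += np.roll(signals[m], -int(round(tau * fs)))   # circular shift approximation
    return out / num_mics

mics = np.random.randn(4, 16000)          # one second of synthetic 4-channel audio
enhanced = delay_and_sum(mics, angle_deg=30.0)
```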
Fig. 7 is a flow chart illustrating a sound direction localization processing method according to an embodiment. Referring to fig. 7, when the system is turned on, a VAD (Voice Activity Detection) algorithm running on the microphone first detects whether there is voice input in the environment. If there is, the DOA algorithm module is started to calculate the sound source angle of the voice and thereby predict at least one sound source direction. Meanwhile, the camera synchronously captures images in the predicted sound source direction and it is confirmed whether a person is present; if so, the final sound source direction is located through lip recognition, and the speaker is located by further combining voiceprint recognition. The microphone array then starts the voice enhancement algorithm module to enhance the voice from the speaker's direction while attenuating noise from other irrelevant directions, records the enhanced voice of the final sound source direction, and may perform speech recognition on it. Face recognition may subsequently be performed to verify the identified speaker, and the face feature data and voiceprint feature data are stored correspondingly. Finally, the enhanced voice data, voiceprint feature data, face feature data, lip feature data (i.e., mouth shape feature data) and the final sound source direction may be stored in correspondence with one another, and the stored data may be used to assist in correcting the previously stored voiceprint feature data, face feature data, lip feature data and the like.
As shown in fig. 8, in one embodiment, a sound direction localization processing apparatus 800 is provided, which is provided in a computer device. The computer device may be a terminal or a server. The apparatus 800 includes: a direction prediction module 802, an image acquisition module 804, and a direction positioning module 806, wherein:
a direction prediction module 802, configured to acquire voice data collected from an environment; at least one sound source direction is predicted from the speech data.
The image acquisition module 804 is configured to perform image acquisition according to the predicted at least one sound source direction, and identify, from the acquired image, a morphological feature of a language expression part of the target object.
The direction positioning module 806 is configured to determine a final sound source direction from the predicted at least one sound source direction according to the matching degree between the morphological feature and the voice data.
In one embodiment, the voice data is at least two paths of voice data from the same sound source; the predicted sound source direction includes a voicing angle. The direction prediction module 802 is further configured to determine a phase difference between the voice data of each path; and predicting the sounding angle corresponding to the voice data according to the phase difference.
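A hedged sketch of this prediction step appears below: the delay between two microphone channels is estimated by cross-correlation and converted into a sounding angle under a far-field model. The microphone spacing, sample rate and the two-channel simplification are assumptions made for illustration only.

```python
import numpy as np

def estimate_sounding_angle(sig_a, sig_b, mic_spacing=0.1, fs=16000, c=343.0):
    """Estimate the sounding angle (degrees from broadside) of a single source
    from the inter-channel delay between two microphones of the array."""
    corr = np.correlate(sig_a, sig_b, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)            # delay in samples
    tau = lag / fs                                           # delay in seconds
    sin_theta = np.clip(tau * c / mic_spacing, -1.0, 1.0)    # tau = d * sin(theta) / c
    return float(np.degrees(np.arcsin(sin_theta)))
```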
In one embodiment, the direction prediction module 802 is further configured to collect, through the sound collection device array, voice data from the environment, to obtain at least two paths of voice data from the same sound source; the sound collection device array comprises at least two sound collection devices; and in the process of keeping the voice data collected by the voice collection device array, controlling the image collection device to collect images according to the predicted at least one sound source direction.
In one embodiment, the predicted at least one sound source direction is at least two. The direction positioning module 806 is further configured to obtain predicted direction values corresponding to the predicted directions of the sound sources respectively; a predicted direction value for characterizing a probability that the speech data originates from a sound source direction; determining a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted sound source direction and the matching degree corresponding to the image acquired according to the predicted sound source direction; the sound source direction value is used for representing the probability that the predicted sound source direction is the final sound source direction; the predicted sound source direction corresponding to the maximum sound source direction value is selected as the final sound source direction.
In one embodiment, the target object is a person object; the language expression part is the lips; the morphological features of the language expression part are mouth shape features. The direction positioning module 806 is further configured to locate a face region of the person object in the acquired image; identify a lip region in the face region; and extract the mouth shape features from the lip region.
In one embodiment, the acquired images are a continuous sequence of images; the extracted mouth shape features are consecutive mouth shape features corresponding to the image sequence. The direction positioning module 806 is further configured to identify continuous mouth shape features to obtain a first sentence; performing voice recognition on the voice data to obtain a second sentence; matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristics and the voice data; and determining a final sound source direction from the predicted at least one sound source direction according to the matching degree.
In one embodiment, the image acquired in the final sound source direction includes at least two target objects; the degree of matching is a first degree of matching. The apparatus further comprises:
a sound object recognition module 808 for extracting voiceprint feature data of the voice data; searching voiceprint characteristic data of each target object from the stored voiceprint characteristic data; matching the extracted voiceprint feature data with the searched voiceprint feature data to obtain a second matching degree corresponding to each target object; and identifying the sounding object from the target objects according to the first matching degree and the second matching degree.
In one embodiment, the sound object recognition module 808 is further configured to obtain a predicted direction value corresponding to the final sound source direction; determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree; and identifying the target object corresponding to the maximum sound direction value as a sound object.
In one embodiment, the sound-generating object recognition module 808 is further configured to extract, for each target object, extrinsic feature data of each target object from the image collected in the final sound source direction; matching the external feature data with the stored external feature data of the target object; and acquiring voiceprint feature data stored corresponding to the matched external feature data, and obtaining voiceprint feature data of the target object.
In one embodiment, the target object is a person object; the extrinsic feature data includes face feature data. The sound-producing object recognition module 808 is further configured to locate a face area corresponding to each target object in the image collected in the final sound source direction, and perform face recognition on the face area to obtain face feature data.
As shown in fig. 9, in one embodiment, the apparatus 800 further comprises: a sound object recognition module 808 and an update storage module 810; wherein:
the update storage module 810 is configured to store, for the sound object, the extracted extrinsic feature data and voiceprint feature data of the sound object, corresponding to the sound object, so as to update the extrinsic feature data and voiceprint feature data stored corresponding to the sound object.
The sound direction positioning processing device predicts at least one sound source direction through voice data collected from the environment and collects images according to the predicted at least one sound source direction; and determining the final sound source direction according to the matching degree between the morphological characteristics of the language expression part of the target object in the acquired image and the voice data. That is, the voice data and the image data are combined, and the sound source direction is localized by the multi-modal data. Thus, even in a noisy environment, the sound source direction can be positioned in an assisted manner by the morphological characteristics of the language expression part in the image, so that compared with the traditional method for positioning the sound direction according to the sound data only, the positioning accuracy is improved.
FIG. 10 is a block diagram of a computer device in one embodiment. Referring to fig. 10, the computer device may be the terminal 110 of fig. 1. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium of the computer device may store an operating system and a computer program. The computer program, when executed, causes the processor to perform a sound direction localization processing method. The processor of the computer device is used to provide computing and control capabilities, supporting the operation of the entire computer device. The internal memory may store a computer program which, when executed by the processor, causes the processor to perform a sound direction localization processing method. The network interface of the computer device is used for network communication. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, etc. The input device of the computer equipment can be a touch layer covered on a display screen, can be keys, a track ball or a touch pad arranged on a terminal shell, and can also be an external keyboard, a touch pad or a mouse and the like. The computer device may be a personal computer, a smart speaker, a mobile terminal or a vehicle-mounted device, the mobile terminal including at least one of a cell phone, a tablet computer, a personal digital assistant or a wearable device.
It will be appreciated by those skilled in the art that the structure shown in fig. 10 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, the sound direction localization apparatus provided in the present application may be implemented in the form of a computer program, where the computer program may run on a computer device as shown in fig. 10, and a non-volatile storage medium of the computer device may store respective program modules constituting the sound direction localization apparatus. Such as the direction prediction module 802, the image acquisition module 804, and the direction location module 806 shown in fig. 8. The computer program constituted by the respective program modules is for causing the computer device to execute the steps in the sound direction localization processing method of the respective embodiments of the present application described in the present specification.
For example, the computer device may acquire voice data collected from the environment through the direction prediction module 802 in the sound direction localization processing apparatus 800 as shown in fig. 8; at least one sound source direction is predicted from the speech data. The computer device may perform image acquisition according to the predicted at least one sound source direction through the image acquisition module 804, and identify the morphological feature of the language expression part of the target object from the acquired image. The computer device may determine a final sound source direction from the predicted at least one sound source direction by the direction localization module 806 based on the degree of matching between the morphological features and the speech data.
In one embodiment, there is provided a sound direction localization processing system including: sound collection equipment and image collection equipment; wherein:
a sound collection device for obtaining voice data collected from an environment; at least one sound source direction is predicted from the speech data.
And the image acquisition equipment is used for carrying out image acquisition according to the predicted at least one sound source direction and identifying the morphological characteristics of the language expression part of the target object from the acquired image.
The sound collection device is further configured to determine a final sound source direction from the predicted at least one sound source direction based on the degree of matching between the morphological feature and the speech data.
In one embodiment, the voice data is at least two paths of voice data from the same sound source; the predicted sound source direction includes a sound emission angle; the sound collection device is also used for determining the phase difference between the voice data of each path; and predicting the sounding angle corresponding to the voice data according to the phase difference.
In one embodiment, the sound collection device is an array of sound collection devices; wherein:
the sound collection equipment array is used for collecting voice data from the environment to obtain at least two paths of voice data sent from the same sound source; the sound collection device array comprises at least two sound collection devices; and in the process of keeping the voice data collected by the voice collection device array, controlling the image collection device to collect images according to the predicted at least one sound source direction.
In one embodiment, the predicted at least one sound source direction is at least two. The sound collection equipment is also used for obtaining predicted direction values corresponding to the predicted sound source directions respectively; a predicted direction value for characterizing a probability that the speech data originates from a sound source direction; determining a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted at least one sound source direction and a matching degree corresponding to an image acquired according to the predicted at least one sound source direction; the sound source direction value is used for representing the probability that the predicted sound source direction is the final sound source direction; the predicted sound source direction corresponding to the maximum sound source direction value is selected as the final sound source direction.
In one embodiment, the target object is a person object; the language expression part is the lips; the morphological characteristics of the language expression part are mouth shape characteristics; the image acquisition device is also used for locating a face area of the person object in the acquired image; identifying a lip region in the face region; and extracting the mouth shape features from the lip region.
In one embodiment, the acquired images are a continuous sequence of images; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence; the sound collection equipment is also used for identifying continuous mouth shape characteristics to obtain a first sentence; performing voice recognition on the voice data to obtain a second sentence; matching the first sentence with the second sentence to obtain the matching degree between the mouth shape characteristics and the voice data; and determining a final sound source direction from the predicted at least one sound source direction according to the matching degree.
In one embodiment, the image acquired in the final sound source direction includes at least two target objects; the matching degree is the first matching degree; the sound collection device is also used for extracting voiceprint characteristic data of the voice data; searching voiceprint characteristic data of each target object from the stored voiceprint characteristic data; matching the extracted voiceprint feature data with the searched voiceprint feature data to obtain a second matching degree corresponding to each target object; and identifying the sounding object from the target objects according to the first matching degree and the second matching degree.
In one embodiment, the sound collection device is further configured to obtain a predicted direction value corresponding to the final sound source direction; determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree; and identifying the target object corresponding to the maximum sound direction value as a sound object.
In one embodiment, the sound collection device is further configured to notify the image collection device to extract, for each target object, extrinsic feature data of each target object from the image collected in the final sound source direction; matching the external feature data with the stored external feature data of the target object; the sound collection device is also used for obtaining voiceprint feature data stored corresponding to the matched external feature data according to the matched external feature data provided by the image collection device, and obtaining voiceprint feature data of the target object.
In one embodiment, the target object is a person object; the external feature data comprises face feature data; the image acquisition device is also used for locating the face area corresponding to each target object in the image acquired according to the final sound source direction, and performing face recognition on the face area to obtain face feature data.
In one embodiment, the sound collection device is further configured to store, for the sound object, the extracted extrinsic feature data and voiceprint feature data of the sound object, corresponding to the sound object, to update the extrinsic feature data and voiceprint feature data stored corresponding to the sound object.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the sound direction localization processing method described above. The step of the sound direction localization processing method here may be a step in the sound direction localization processing method of each of the above embodiments.
In one embodiment, a computer readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the sound direction localization processing method described above. The step of the sound direction localization processing method here may be a step in the sound direction localization processing method of each of the above embodiments.
It should be noted that, the "first" and "second" in the embodiments of the present application are used only for distinction, and are not limited in terms of size, sequence, slave, etc.
It should be understood that the steps in the embodiments of the present application are not necessarily performed in the order indicated by the step numbers. Unless explicitly stated herein, the order of execution of these steps is not strictly limited and they may be executed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages; these sub-steps or stages are not necessarily performed at the same moment but may be performed at different times, and they need not be performed sequentially but may be performed in turn or alternately with at least part of the sub-steps or stages of other steps.
Those skilled in the art will appreciate that the processes implementing all or part of the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, and the program may be stored in a non-volatile computer readable storage medium, and the program may include the processes of the embodiments of the methods as above when executed. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the invention, which are described in detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of protection of the present invention is to be determined by the appended claims.

Claims (21)

1. A sound direction localization processing method, the method comprising:
acquiring voice data acquired from an environment;
predicting at least one sound source direction based on the speech data;
image acquisition is carried out according to at least one predicted sound source direction, and mouth shape characteristics of lips of a target object are identified from the acquired images; the mouth shape characteristics of the lips vary with the sound production of the target object to express different sound contents; the collected images are a continuous image sequence; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence;
Identifying the continuous mouth shape characteristics to obtain a first sentence;
performing voice recognition on the voice data to obtain a second sentence;
matching the first sentence with the second sentence to obtain a first matching degree between the mouth shape feature and the voice data, and determining a final sound source direction from at least one predicted sound source direction according to the first matching degree;
extracting voiceprint feature data of the voice data under the condition that at least two target objects are included in the image acquired according to the final sound source direction;
searching voiceprint characteristic data of each target object from stored voiceprint characteristic data;
matching the extracted voiceprint feature data with the searched voiceprint feature data to obtain a second matching degree corresponding to each target object;
and identifying a sound production object from the target object according to the first matching degree and the second matching degree.
2. The method of claim 1, wherein the voice data is at least two paths of voice data from the same sound source; the predicted sound source direction includes a sound emission angle;
said predicting at least one sound source direction from said speech data comprises:
Determining phase differences among the voice data;
and predicting the sounding angle corresponding to the voice data according to the phase difference.
3. The method of claim 1, wherein the acquiring voice data collected from the environment comprises:
collecting voice data from the environment through a voice collection device array to obtain at least two paths of voice data sent from the same sound source; the sound collection device array comprises at least two sound collection devices;
the image acquisition according to the predicted at least one sound source direction comprises:
and in the process of keeping the voice data collected by the voice collection device array, controlling the image collection device to collect images according to the predicted at least one sound source direction.
4. The method of claim 1, wherein the predicted at least one sound source direction is at least two; said determining a final sound source direction from the predicted at least one sound source direction based on a first degree of matching between the mouth shape feature and the speech data comprises:
obtaining predicted direction values corresponding to the predicted sound source directions respectively; the predicted direction value is used for representing the probability that the voice data originate from the sound source direction;
Determining a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted sound source direction and the first matching degree corresponding to the image acquired according to the predicted sound source direction; the sound source direction value is used for representing the probability that the predicted sound source direction is the final sound source direction;
the predicted sound source direction corresponding to the maximum sound source direction value is selected as the final sound source direction.
5. The method of claim 1, wherein the target object is a human object;
the identifying the mouth shape feature of the lip of the target object from the acquired image comprises:
positioning a face area of a person object from the acquired image;
identifying a lip region in the face region;
extracting a mouth shape feature from the lip region.
6. The method of claim 1, wherein the identifying a sound object from the target object based on the first degree of matching and the second degree of matching comprises:
acquiring a predicted direction value corresponding to the final sound source direction;
determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree;
And identifying the target object corresponding to the maximum sound direction value as a sound object.
7. The method of claim 6, wherein searching for voiceprint feature data for each of the target objects from the stored voiceprint feature data comprises:
extracting external feature data of each target object from the image acquired according to the final sound source direction aiming at each target object;
matching the external feature data with the stored external feature data of the target object;
and acquiring voiceprint feature data stored corresponding to the matched external feature data, and obtaining the voiceprint feature data of the target object.
8. The method of claim 7, wherein the target object is a human object; the external feature data comprises face feature data;
the extracting the external characteristic data of each target object from the image collected according to the final sound source direction comprises the following steps:
positioning a face area corresponding to each target object from the image acquired according to the final sound source direction;
and carrying out face recognition on the face area to obtain face feature data.
9. The method of claim 7, wherein the method further comprises:
And storing the extracted extrinsic feature data and the voiceprint feature data of the sound production object corresponding to the sound production object so as to update the extrinsic feature data and the voiceprint feature data stored corresponding to the sound production object.
10. An acoustic direction localization processing apparatus, the apparatus comprising:
the direction prediction module is used for acquiring voice data acquired from the environment; predicting at least one sound source direction based on the speech data;
the image acquisition module is used for carrying out image acquisition according to at least one predicted sound source direction and identifying the mouth shape characteristics of the lips of the target object from the acquired images; the mouth shape characteristics of the lips vary with the sound production of the target object to express different sound contents; the collected images are a continuous image sequence; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence;
the direction positioning module is used for identifying the continuous mouth shape characteristics to obtain a first sentence; performing voice recognition on the voice data to obtain a second sentence; matching the first sentence with the second sentence to obtain a first matching degree between the mouth shape feature and the voice data, and determining a final sound source direction from at least one predicted sound source direction according to the first matching degree;
The sound production object identification module is used for extracting voiceprint characteristic data of the voice data under the condition that at least two target objects are included in the image collected according to the final sound source direction; searching voiceprint characteristic data of each target object from stored voiceprint characteristic data; matching the extracted voiceprint feature data with the searched voiceprint feature data to obtain a second matching degree corresponding to each target object; and identifying a sound production object from the target object according to the first matching degree and the second matching degree.
11. The apparatus of claim 10, wherein the voice data is at least two paths of voice data from the same sound source; the predicted sound source direction includes a sound emission angle;
the direction prediction module is also used for determining phase differences among the voice data of each path; and predicting the sounding angle corresponding to the voice data according to the phase difference.
12. The apparatus of claim 10, wherein the direction prediction module is further configured to collect, by the sound collection device array, voice data from the environment to obtain at least two paths of voice data from the same sound source; the sound collection device array comprises at least two sound collection devices; and in the process of keeping the voice data collected by the voice collection device array, controlling the image collection device to collect images according to the predicted at least one sound source direction.
13. The apparatus of claim 12, wherein the predicted at least one sound source direction is at least two; the direction positioning module is also used for obtaining predicted direction values corresponding to the predicted sound source directions respectively; the predicted direction value is used for representing the probability that the voice data originate from the sound source direction; determining a sound source direction value corresponding to the sound source direction according to a predicted direction value corresponding to the predicted sound source direction and the first matching degree corresponding to the image acquired according to the predicted sound source direction; the sound source direction value is used for representing the probability that the predicted sound source direction is the final sound source direction; the predicted sound source direction corresponding to the maximum sound source direction value is selected as the final sound source direction.
14. The apparatus of claim 12, wherein the target object is a human object;
the direction positioning module is also used for positioning a face area of the person object from the acquired image; identifying a lip region in the face region; extracting a mouth shape feature from the lip region.
15. The apparatus of claim 10, wherein the sound object recognition module is further configured to obtain a predicted direction value corresponding to the final sound source direction; determining the sound direction value of each target object according to the predicted direction value, the first matching degree and the second matching degree; and identifying the target object corresponding to the maximum sound direction value as a sound object.
16. The apparatus of claim 10, wherein the utterance object recognition module is further configured to extract extrinsic feature data of each of the target objects from an image acquired in the final sound source direction for each of the target objects; matching the external feature data with the stored external feature data of the target object; and acquiring voiceprint feature data stored corresponding to the matched external feature data, and obtaining the voiceprint feature data of the target object.
17. The apparatus of claim 16, wherein the target object is a human object; the external feature data comprises face feature data;
the sound object identification module is also used for positioning a face area corresponding to each target object from the image acquired according to the final sound source direction; and carrying out face recognition on the face area to obtain face feature data.
18. The apparatus of claim 16, wherein the apparatus further comprises:
and the updating storage module is used for storing the extracted extrinsic feature data and the voiceprint feature data of the sounding object corresponding to the sounding object so as to update the extrinsic feature data and the voiceprint feature data stored corresponding to the sounding object.
19. A sound direction localization processing system, the system comprising: sound collection equipment and image collection equipment;
the sound collection device is used for obtaining voice data collected from the environment; predicting at least one sound source direction based on the speech data;
the image acquisition equipment is used for carrying out image acquisition according to at least one predicted sound source direction and identifying the mouth shape characteristics of the lips of the target object from the acquired images; the mouth shape characteristics of the lips vary with the sound production of the target object to express different sound contents; the collected images are a continuous image sequence; the extracted mouth shape features are continuous mouth shape features corresponding to the image sequence;
the sound collection equipment is also used for identifying the continuous mouth shape characteristics to obtain a first sentence; performing voice recognition on the voice data to obtain a second sentence; matching the first sentence with the second sentence to obtain a first matching degree between the mouth shape feature and the voice data, and determining a final sound source direction from at least one predicted sound source direction according to the first matching degree;
The sound collection device is further used for extracting voiceprint feature data of the voice data under the condition that at least two target objects are included in the image collected according to the final sound source direction; searching voiceprint characteristic data of each target object from stored voiceprint characteristic data; matching the extracted voiceprint feature data with the searched voiceprint feature data to obtain a second matching degree corresponding to each target object; and identifying a sound production object from the target object according to the first matching degree and the second matching degree.
20. A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 9.
21. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any of claims 1 to 9.
CN201911311585.8A 2019-12-18 2019-12-18 Sound direction positioning processing method, device, system, computer equipment and storage medium Active CN111048113B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911311585.8A CN111048113B (en) 2019-12-18 2019-12-18 Sound direction positioning processing method, device, system, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911311585.8A CN111048113B (en) 2019-12-18 2019-12-18 Sound direction positioning processing method, device, system, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111048113A CN111048113A (en) 2020-04-21
CN111048113B true CN111048113B (en) 2023-07-28

Family

ID=70237169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911311585.8A Active CN111048113B (en) 2019-12-18 2019-12-18 Sound direction positioning processing method, device, system, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111048113B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933136B (en) * 2020-08-18 2024-05-10 南京奥拓电子科技有限公司 Auxiliary voice recognition control method and device
CN112188363B (en) * 2020-09-11 2022-02-18 北京猎户星空科技有限公司 Audio playing control method and device, electronic equipment and readable storage medium
CN114333831A (en) * 2020-09-30 2022-04-12 华为技术有限公司 Signal processing method and electronic equipment
CN112562664A (en) * 2020-11-27 2021-03-26 上海仙塔智能科技有限公司 Sound adjusting method, system, vehicle and computer storage medium
CN112788278B (en) * 2020-12-30 2023-04-07 北京百度网讯科技有限公司 Video stream generation method, device, equipment and storage medium
CN113593572B (en) * 2021-08-03 2024-06-28 深圳地平线机器人科技有限公司 Method and device for positioning sound zone in space area, equipment and medium
CN114242072A (en) * 2021-12-21 2022-03-25 上海帝图信息科技有限公司 Voice recognition system for intelligent robot
CN115174959B (en) * 2022-06-21 2024-01-30 咪咕文化科技有限公司 Video 3D sound effect setting method and device
CN116030562A (en) * 2022-11-17 2023-04-28 北京声智科技有限公司 Data processing method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697976A (en) * 2018-12-14 2019-04-30 北京葡萄智学科技有限公司 A kind of pronunciation recognition methods and device
CN110210196A (en) * 2019-05-08 2019-09-06 北京地平线机器人技术研发有限公司 Identity identifying method and device
CN110276259A (en) * 2019-05-21 2019-09-24 平安科技(深圳)有限公司 Lip reading recognition methods, device, computer equipment and storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4462339B2 (en) * 2007-12-07 2010-05-12 ソニー株式会社 Information processing apparatus, information processing method, and computer program
JP2012038131A (en) * 2010-08-09 2012-02-23 Sony Corp Information processing unit, information processing method, and program
CN103902963B (en) * 2012-12-28 2017-06-20 联想(北京)有限公司 The method and electronic equipment in a kind of identification orientation and identity
US20160134785A1 (en) * 2014-11-10 2016-05-12 Echostar Technologies L.L.C. Video and audio processing based multimedia synchronization system and method of creating the same
CN104598796B (en) * 2015-01-30 2017-08-25 科大讯飞股份有限公司 Personal identification method and system
CN105957521B (en) * 2016-02-29 2020-07-10 青岛克路德机器人有限公司 Voice and image composite interaction execution method and system for robot
DE112017004363T5 (en) * 2016-08-29 2019-06-19 Groove X, Inc. AUTONOMOUS ROBOT DETECTING THE DIRECTION OF A SOUND SOURCE
RU174044U1 (en) * 2017-05-29 2017-09-27 Общество с ограниченной ответственностью ЛЕКСИ (ООО ЛЕКСИ) AUDIO-VISUAL MULTI-CHANNEL VOICE DETECTOR
CN107633627A (en) * 2017-09-22 2018-01-26 深圳怡化电脑股份有限公司 One kind is without card withdrawal method, apparatus, equipment and storage medium
CN108877787A (en) * 2018-06-29 2018-11-23 北京智能管家科技有限公司 Audio recognition method, device, server and storage medium
KR20190106921A (en) * 2019-08-30 2019-09-18 엘지전자 주식회사 Communication robot and method for operating the same
CN110545396A (en) * 2019-08-30 2019-12-06 上海依图信息技术有限公司 Voice recognition method and device based on positioning and denoising

Similar Documents

Publication Publication Date Title
CN111048113B (en) Sound direction positioning processing method, device, system, computer equipment and storage medium
CN112088402B (en) Federated neural network for speaker recognition
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
KR102339594B1 (en) Object recognition method, computer device, and computer-readable storage medium
CN112037791B (en) Conference summary transcription method, apparatus and storage medium
WO2016150001A1 (en) Speech recognition method, device and computer storage medium
CN110163380B (en) Data analysis method, model training method, device, equipment and storage medium
US11825278B2 (en) Device and method for auto audio and video focusing
Minotto et al. Multimodal multi-channel on-line speaker diarization using sensor fusion through SVM
JP2002251234A (en) Human interface system by plural sensor
CN110837758B (en) Keyword input method and device and electronic equipment
KR20210052036A (en) Apparatus with convolutional neural network for obtaining multiple intent and method therof
CN112507311A (en) High-security identity verification method based on multi-mode feature fusion
US20210110815A1 (en) Method and apparatus for determining semantic meaning of pronoun
CN114141230A (en) Electronic device, and voice recognition method and medium thereof
CN110941992B (en) Smile expression detection method and device, computer equipment and storage medium
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
CN114556469A (en) Data processing method and device, electronic equipment and storage medium
US11762052B1 (en) Sound source localization
CN110728993A (en) Voice change identification method and electronic equipment
CN109064720B (en) Position prompting method and device, storage medium and electronic equipment
KR20210048271A (en) Apparatus and method for performing automatic audio focusing to multiple objects
KR101171047B1 (en) Robot system having voice and image recognition function, and recognition method thereof
US20240212673A1 (en) Keyword spotting method based on neural network
CN115909505A (en) Control method and device of sign language recognition equipment, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40021520

Country of ref document: HK

GR01 Patent grant