CN112995566B - Sound source positioning method based on display device, display device and storage medium - Google Patents
- Publication number
- CN112995566B (application number CN201911305166.3A)
- Authority
- CN
- China
- Prior art keywords
- camera
- current
- sound source
- information
- user
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
- H04N7/147—Communication arrangements, e.g. identifying the communication as a video-communication, intermediate storage of the signals
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S5/00—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
- G01S5/18—Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
- G01S5/22—Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/61—Control of cameras or camera modules based on recognised objects
- H04N23/611—Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N23/00—Cameras or camera modules comprising electronic image sensors; Control thereof
- H04N23/60—Control of cameras or camera modules
- H04N23/64—Computer-aided capture of images, e.g. transfer from script file into camera, check of taken image quality, advice or proposal for image composition or decision on when to take image
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N7/00—Television systems
- H04N7/14—Systems for two-way working
- H04N7/141—Systems for two-way working between two video terminals, e.g. videophone
- H04N7/142—Constructional details of the terminal equipment, e.g. arrangements of the camera and the display
Abstract
The application discloses a sound source positioning method based on a display device, a display device, and a storage medium. The display device comprises a display screen, a microphone array, an adjusting structure connected with the display screen, and a camera arranged on the adjusting structure; the microphone array comprises a plurality of microphones. The method comprises the following steps: when a video call is carried out with a target contact, determining time delay information according to current voice information acquired by the plurality of microphones; acquiring first position information of the current sound source according to the time delay information and the position information of the plurality of microphones; controlling the camera through the adjusting structure to perform a first sound source tracking according to the first position information, so as to acquire video data; performing face recognition on the video data to obtain a face recognition result; and controlling the camera through the adjusting structure to perform a second sound source tracking according to the face recognition result, so as to adjust the face image of the current user to the middle position of the current video picture.
Description
Technical Field
The present disclosure relates to the field of communications technologies, and in particular, to a sound source positioning method based on a display device, a display device, and a storage medium.
Background
With the development of mobile internet technology, large-screen terminals have gradually been accepted by users and become a trend. Taking the smart television as an example, applying a camera and a microphone to the smart television allows a user to conduct remote video interaction, realizing functions such as video calls and video conferences, and providing convenience for people's life and work. When user A and user B conduct remote video interaction via a smart television, in order for user B to understand user A's situation more clearly, the camera usually needs to be aimed at user A while user A's voice information is acquired in real time, so as to realize the video call. However, in existing smart televisions equipped with a camera, the camera cannot be accurately aimed at user A during the video call.
Disclosure of Invention
The main purpose of the application is to provide a sound source positioning method based on a display device, a display device, and a storage medium, aiming at placing the face image of the current user in the middle position of the video playing picture and improving the user's experience in a video call scenario.
In order to achieve the above object, the present application provides a sound source positioning method based on a display device, where the display device includes a display screen, a microphone array, an adjustment structure connected to the display screen, and a camera disposed on the adjustment structure, and the microphone array includes a plurality of microphones; the method comprises the following steps:
When a video call is carried out with a target contact, determining time delay information according to current voice information acquired by the plurality of microphones; acquiring first position information of the current sound source according to the time delay information and the position information of the plurality of microphones; controlling the camera through the adjusting structure to perform a first sound source tracking according to the first position information, so as to acquire video data; performing face recognition on the video data to obtain a face recognition result; and controlling the camera through the adjusting structure to perform a second sound source tracking according to the face recognition result, so as to adjust the face image of the current user to the middle position of the current video picture.
In addition, to achieve the above object, the present application also provides a display device including a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the display device-based sound source localization method as described above when the computer program is executed.
In addition, to achieve the above object, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the display device-based sound source localization method as described above.
The application provides a sound source positioning method based on a display device, a display device, and a storage medium. During a video call, the current user of the television does not need to control the camera manually: the camera automatically performs accurate video positioning and tracking of the current user, and face recognition further improves the accuracy of sound source positioning, placing the face image of the current user in the middle position of the video call picture and improving the user's experience in a video call scenario.
Drawings
Fig. 1 is a schematic architecture diagram of a communication system according to an embodiment of the present application;
FIG. 2 is a flow chart of a display device-based sound source localization method provided in an embodiment of the present application;
fig. 3 is an application scenario schematic diagram of a display device-based sound source localization method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a display device according to an embodiment of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The sound source positioning method based on the display device is mainly applied to a display device. The display device may be a television, a refrigerator, a computer, a washing machine, or the like that has a display screen for displaying a preview interface. The following description takes a television set as the example display device, but the display device is not limited to a television set. The display device includes a display screen, an adjusting structure, a camera, and a microphone array.
The adjusting structure is connected with the display screen, and the camera and the microphone array are arranged on the adjusting structure. The microphone array includes a plurality of microphones for converting sound signals into electrical signals.
In some embodiments, the adjusting structure includes a driving mechanism, a power mechanism, and a transmission mechanism. The driving mechanism is used for driving the camera to rise or fall. The power mechanism and the transmission mechanism are used for driving the camera to rotate. Specifically, the power mechanism and/or the transmission mechanism can drive the camera to rotate about a first axis and/or a second axis: the power mechanism can drive the camera to rotate about the first axis, and the transmission mechanism can drive the camera to rotate about the second axis. Illustratively, the first axis extends in a horizontal direction, and the second axis is perpendicular to the first axis and extends in a vertical direction.
In some embodiments, the microphone array is disposed on the driving mechanism. The plurality of microphones form an array of a predetermined shape, for example spaced (or equally spaced) along the length of the display screen. Because the microphone array adds a spatial domain on top of the time and frequency domains, it can process speech signals received from different directions in space. Even if the television user moves during the video call with the target contact, the microphone array can track the moving user's direction in real time, realizing directional voice acquisition, improving the signal-to-noise ratio, obtaining high-quality voice signals, and improving the user's video experience.
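The directional acquisition described above is, in essence, beamforming. As an illustrative sketch only (the patent does not specify an algorithm), a minimal delay-and-sum beamformer for a linear array, assuming a far-field source and a known steering direction, might look like this:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Steer a linear microphone array toward `direction` (radians from
    broadside) by delaying each channel so the wavefront lines up, then
    averaging.  `signals` is an (n_mics, n_samples) array and
    `mic_positions` holds each microphone's offset along the array axis
    in metres."""
    out = np.zeros(signals.shape[1])
    for sig, x in zip(signals, mic_positions):
        # Far-field arrival-time difference for this microphone.
        tau = x * np.sin(direction) / c
        out += np.roll(sig, -int(round(tau * fs)))
    return out / len(mic_positions)
```

Speech arriving from the steered direction adds coherently while noise from other directions adds incoherently, which is what raises the signal-to-noise ratio.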
Illustratively, the camera is disposed on the power mechanism, the power mechanism is disposed on the transmission mechanism, and the transmission mechanism is disposed on the driving mechanism. The driving mechanism drives the transmission mechanism to move in the vertical direction, thereby driving the camera to rise or fall. The power mechanism can rotate the camera about the first axis. The transmission mechanism can drive the power mechanism to rotate about the second axis, so that the camera rotates about the second axis. Of course, in other embodiments, the first and second axes may intersect without being orthogonal. That is, the included angle between the first axis and the second axis may be set according to practical needs, for example 10°, 30°, 50°, 70°, 80°, or any other suitable angle between 10° and 90°, which is not limited herein.
The display device provided by this embodiment can drive the camera to rotate about the first axis and/or the second axis through the power mechanism and/or the transmission mechanism according to actual use requirements, so that the position of the camera can be adjusted in all directions, improving the image-capturing effect of the camera, meeting different requirements of users or the requirements of different users, and improving the user experience. When the adjusting structure is in the retracted position, the camera is hidden on the non-display side of the display screen and shielded by it, ensuring the user's privacy; at the same time, the camera does not affect the overall appearance of the television. When the adjusting structure is in the raised position, the camera is located above the display screen of the television; the camera is then not shielded by the display screen and can perform normal photographing, recording, and similar operations.
The power mechanism may be any suitable power structure, such as a gear-motor transmission structure, or a gear box having gears and a motor for driving the gears to rotate. The number of power mechanisms may be set according to actual needs, for example one, two, three, or more, as long as the camera can be driven to rotate about the first axis. When the power mechanism is a gear-motor transmission structure, the camera can be rotated to a preset position about the first axis at high speed, the strain on the structural members is low, and the product withstands repeated use well.
The transmission mechanism may be any suitable transmission, for example a gear-motor transmission. When the transmission mechanism is a gear-motor transmission structure, the camera can be rotated to a preset position about the second axis at high speed, the strain on the structural members is low, and the product withstands repeated use well.
The driving mechanism may be any suitable driving mechanism, for example a gear-motor driving mechanism. The number of driving mechanisms may be set according to actual demands, for example one, two, three, or more, as long as the camera can be driven to reciprocate in the vertical direction.
Referring to fig. 1, fig. 1 is a schematic architecture diagram of a communication system 100 according to an embodiment of the present application. The communication system 100 may include a display device (e.g., a television 101) and an electronic device (e.g., an electronic device 102). The television 101 may be connected (wired or wirelessly) with the electronic device 102 via one or more communication networks. Illustratively, the television 101 may establish a Wi-Fi connection with the electronic device 102 via the wireless fidelity (Wi-Fi) protocol, but the communication network may also be implemented using any other network communication protocol, which is not limited herein.
In some embodiments, the communication system 100 may also include an application server 103. The application server 103 may be one or more. The application server 103 is used to communicate with Applications (APP) installed on the television 101 and the electronic device 102 via one or more communication networks.
Taking the video call APP with the video call function as an example, the user a can use the video call APP in the electronic device 102 to perform a video call with the user B. For example, user a may invite user B to make a video call in video call APP of electronic device 102, and in turn, electronic device 102 may send a video call request to application server 103 inviting user B. The video call request may carry parameters such as an identifier of the user B in the video call APP (e.g., a nickname and an account number of the user B). After receiving the video call request, the application server 103 may determine that the receiver of the current call request is the user B according to the identifier of the user B in the video call APP in the video call request, and query that the electronic device associated with the user B is the television 101. Further, the application server 103 may forward the video call request sent by the electronic device 102 to the television 101 of user B.
If the television 101 detects that the user B accepts the video call request sent by the user a at this time, the television 101 may continue to send the acquired voice and image to the electronic device 102 in real time through the application server 103. Meanwhile, the electronic device 102 may also send the acquired voice and image to the television 101 in real time through the application server 103, so that the user a may perform video call with the user B.
Illustratively, the video call APP may be a WeChat APP, a QQ APP, a short message APP, or the like. When the user B uses the video call APP to carry out video call with the contact person, the video call APP can send the voice acquired by the microphone to the contact person in real time, and can send the image acquired by the camera to the contact person in real time.
The specific structure of the electronic device 102 and the television 101 may be the same or different. By way of example, the electronic device 102 may be a cell phone, tablet computer, wearable electronic device with wireless communication capability (e.g., a smart watch), smart television with wireless communication capability, desktop computer, laptop computer (Laptop), etc.
Referring to fig. 2, fig. 2 is a flowchart of a sound source positioning method based on a display device according to an embodiment of the present application. The sound source positioning method may include steps S201 to S205, specifically as follows:
S201, determining delay information according to current voice information acquired by a plurality of microphones when a video call is carried out with a target contact person.
The microphone can collect the sound emitted by the user, so that the current voice information is obtained.
Specifically, the user can install a video call APP having a video call (also referred to as a video phone) function in the television. The user can add one or more contacts when using the video call APP and record basic information of each contact, such as the name, phone number, address, mailbox or group to which the contact belongs.
The audio manager of the television set may include an audio mode. The audio mode may include a talk mode and a normal mode. For example, after the video call APP receives a video call request sent by a target contact, the microphone array may be turned on to collect the voice of the user and the camera may be turned on to collect the image of the user. The audio manager may be periodically requested to query the current audio mode before turning on the camera. When the current audio mode is detected to be a call mode, the user is informed of the video call request.
After the television receives the video call request, and before it detects that the user has accepted the request, the television will not raise the camera in response to the video call APP's request to open the camera. Accordingly, after receiving the video call request, the video call APP may request the television to raise the camera; the television then first determines whether the user has accepted the video call request, and raises the camera only if so. Thus, when the video call has not been answered, the television does not pop up the camera and disturb the user's experience. Meanwhile, if the user does not answer the video call request, the television saves one pop-up of the camera, prolonging the camera's service life.
It can be appreciated that when a user sends a video call request to a target contact through the television's video call APP, the camera and microphone array of the television may be turned on according to actual requirements: for example, when the video call request is sent to the target contact; or at a preset interval after the video call request is sent; or when the target contact accepts the video call request sent by the user through the television. The number of microphones in the microphone array may be set according to practical requirements, for example two, three, four, or more, which is not limited herein.
In some embodiments, determining the time delay information according to the current voice information collected by the plurality of microphones includes: receiving current voice information of the current user through the microphone array; and determining, from the current voice information, the time delays with which the current voice arrives at the different microphones, so as to obtain the time delay information.
Illustratively, taking the target contact Jimmy as an example, the target contact Jimmy uses the electronic device 102 to send a video call request to the current user Peter, inviting Peter to make a video call. After detecting the Jimmy invite Peter to video call operation, the electronic device 102 may send a video call request for inviting Peter to video call to the application server 103. Further, the application server 103 may send the video call request to Peter's television.
When it is detected that the user accepts the video call request from the target contact, the camera of the television is raised through the driving mechanism of the television, that is, the camera is driven to the raised position. If it is not detected that the user has accepted the video call request, the camera does not need to be raised, which avoids the raised camera disturbing the user's experience, and affecting the overall appearance of the television, when the video call is not connected. The microphone array may be lifted together with the camera, or at a different time, or at a predetermined interval after the camera has been raised.
When the camera is fully raised from the non-display side of the display screen of the television, the television may turn on the camera to begin capturing images and/or turn on the microphone to capture speech. Or the television may also turn on the camera to start capturing images and/or turn on the microphone to capture voice during the process of raising the camera, and the embodiment is not limited thereto. As shown in fig. 3, after the camera 10 is fully raised from the non-display side of the display screen of the television, the television may display a user interface 301 for video telephony with the target contact Jimmy. The television can display the image content 302 acquired by the camera 10 in real time in the user interface 301. And, the television can display the image content 303 sent from the target contact Jimmy in real time in the user interface 301. Meanwhile, after the user receives the video call request, the television can also send the current voice information acquired by the microphone to the target contact Jimmy in real time, and the television can receive and play the voice content sent by the target contact Jimmy in real time, so that the video call process is realized.
The time delay information may be obtained by selecting any suitable time delay estimation algorithm according to actual needs, for example the basic cross-correlation method, the generalized cross-correlation method, the cross-power spectrum phase method (Cross-power Spectrum Phase, CSP), the least-mean-square adaptive filtering method, or the like, which is not limited herein.
Illustratively, the generalized cross-correlation method obtains the cross-power spectrum between the voice signals received by any two microphones in the microphone array, applies a weighting in the frequency domain to suppress the influence of noise and reflections, and inverse-transforms the result to the time domain to obtain the cross-correlation function between the two voice signals. The peak position of the cross-correlation function is the relative time delay between the two voice signals, that is, the delay between the two microphones. The time delay information is the set of such delays for the different microphone pairs in the array.
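The frequency-domain weighting step described above is where generalized cross-correlation variants differ; the PHAT weighting, which keeps only the phase of the cross-power spectrum, is a common choice. A minimal sketch (an illustration, not the patent's implementation) could be:

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Estimate the delay (in seconds) by which signal `y` lags signal
    `x`, using PHAT-weighted generalized cross-correlation: cross-power
    spectrum, phase-only weighting, inverse transform, peak search."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    cross = Y * np.conj(X)
    cross /= np.abs(cross) + 1e-12        # PHAT: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n)
    max_lag = n // 2
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))  # centre zero lag
    lag = int(np.argmax(np.abs(cc))) - max_lag
    return lag / fs
```

Running this over every microphone pair yields the set of delays that constitutes the time delay information.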
S202, acquiring first position information of a current sound source according to the time delay information and position information of a plurality of microphones.
Specifically, after the time delay information is obtained, a current sound source is positioned by adopting a sound source positioning method according to the time delay information and the position information of a plurality of microphones, so that first position information of the current sound source is obtained.
In some embodiments, obtaining the first position information of the current sound source according to the time delay information and the position information of the plurality of microphones includes: constructing a plurality of hyperboloids using the time delay information and the positional relationship among the plurality of microphones, and obtaining the first position information of the current sound source by calculating the intersection point of the hyperboloids.
Specifically, after the delay between the current sound source's arrivals at a pair of microphones is obtained, the current sound source is known to lie on a hyperboloid whose foci are the positions of the pair of microphones and whose parameter is the sound-propagation distance corresponding to the arrival delay. With several pairs of microphones, several delays and thus several hyperboloids are obtained, and the sound source lies at the intersection of these hyperboloids, yielding the first position information of the current sound source.
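As a toy illustration of the intersection idea (a 2-D sketch with assumed microphone positions, not the patent's solver), one can simply search for the point whose hyperbola equations best match the measured delays:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def locate_2d(mics, delays, grid=None):
    """Brute-force 2-D source localisation.  `mics` is a list of (x, y)
    microphone positions; `delays[i]` is the arrival delay of mic i+1
    relative to mic 0, in seconds.  Returns the grid point whose pairwise
    path differences best satisfy the hyperbola equations."""
    if grid is None:
        grid = np.linspace(-3.0, 3.0, 121)  # 5 cm steps over a 6 m square
    mics = [np.asarray(m, dtype=float) for m in mics]
    best, best_err = None, np.inf
    for x in grid:
        for y in grid:
            p = np.array([x, y])
            d0 = np.linalg.norm(p - mics[0])
            # A point on the hyperbola for pair (0, i) satisfies
            # |p - mic_i| - |p - mic_0| = c * delay_i.
            err = sum((np.linalg.norm(p - m) - d0 - SPEED_OF_SOUND * t) ** 2
                      for m, t in zip(mics[1:], delays))
            if err < best_err:
                best, best_err = p, err
    return best
```

The intersection of the hyperbolas is the point where every residual vanishes; a practical implementation would use a closed-form or least-squares TDOA solver rather than a grid search.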
S203, controlling the camera through the adjusting structure to perform the first sound source tracking according to the first position information, so as to acquire video data.
Specifically, in order for the camera to capture an image of the current user, after the first position information is obtained, the position of the camera is adjusted through the power mechanism and/or the transmission mechanism of the adjusting structure to perform the first sound source tracking and collect video data.
In some embodiments, the adjusting structure includes a power mechanism and a transmission mechanism. Controlling the camera through the adjusting structure to perform the first sound source tracking according to the first position information includes: driving the camera to rotate about the first axis through the power mechanism and/or driving the camera to rotate about the second axis through the transmission mechanism according to the first position information, so that the camera performs the first sound source tracking.
In some embodiments, a user image is included in the video data acquired by the first sound source tracking. The user image may be an image of a person in the area in front of the television collected by the camera; the person in that area is generally the current user of the television, that is, the sound source.
Specifically, after the first position information is obtained, the camera may be driven to rotate about the first axis by controlling the power mechanism, and the camera may likewise be driven to rotate about the second axis by controlling the transmission mechanism. Either step may be performed alone, or both together, as long as the camera can perform the first sound source tracking according to the first position information.
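To make the tracking step concrete: given the first position information in camera-centred coordinates, the two rotation commands reduce to a pan angle (about the vertical second axis) and a tilt angle (about the horizontal first axis). The coordinate convention below is an assumption for illustration; the patent does not specify one:

```python
import math

def pan_tilt_from_position(x, y, z):
    """Convert a sound-source position (metres; camera at the origin,
    z pointing straight out of the screen, x to the right, y up) into
    pan and tilt angles in degrees.  A real adjusting structure would
    clamp these to its mechanical limits."""
    pan = math.degrees(math.atan2(x, z))                   # about the vertical axis
    tilt = math.degrees(math.atan2(y, math.hypot(x, z)))   # about the horizontal axis
    return pan, tilt
```

A source one metre to the right and one metre in front of the screen, at camera height, would thus command a 45° pan and no tilt.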
S204, performing face recognition on the video data to obtain a face recognition result.
Specifically, after the video data is obtained, face recognition is performed on video pictures in the video data, so as to obtain a face recognition result.
In some embodiments, the performing face recognition on the video data to obtain a face recognition result specifically includes: and inputting the current video picture of the video data into a pre-trained face recognition model to output a face recognition result, wherein the face recognition result comprises a face image of the current user.
The pre-trained face recognition model may be obtained by training an original neural network on a large number of face sample images. The original neural network may be, among others, a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory (LSTM) network, a YOLO9000 network, an AlexNet network, or a VGGNet.
And S205, controlling the camera to track a second sound source through the adjusting structure according to the face recognition result so as to adjust the face image of the current user to the middle position of the current video image.
Sound source localization alone is often not accurate enough because it is susceptible to noise, reverberation, and the like. To further improve the accuracy of sound source positioning, after the face recognition result is obtained, the camera is controlled through the adjusting structure to perform second sound source tracking, which adjusts the face image of the current user to the middle position of the current video picture. This ensures the quality of the images captured by the camera and improves the user's experience in a video call scenario.
The adjusting structure includes a power mechanism and a transmission mechanism. The controlling the camera to perform second sound source tracking through the adjusting structure according to the face recognition result includes the following steps: screening the collected face images according to a preset user gesture model to obtain the face image of the current user; and controlling the camera to perform second sound source tracking according to the current video picture of the video data and the face image corresponding to the current video picture, so as to place the face image of the current user at the middle position of the current video picture.
The preset user gesture model is a pre-trained user gesture model, which may be obtained by training an initial neural network on a large number of user gesture images. The initial neural network may be a convolutional neural network (CNN), a recurrent neural network (RNN), or the like.
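As an illustrative sketch of the screening step (the threshold, the selection rule, and `pose_score` — a hypothetical callable standing in for the preset user gesture model — are all assumptions):

```python
def screen_current_user(face_images, pose_score, threshold=0.5):
    """Screen detected face images and return the current user's face.

    `pose_score` is a hypothetical callable wrapping the preset user
    gesture model; `threshold` is an assumed cut-off below which a face
    is not considered to belong to the current (speaking) user.
    """
    scored = [(pose_score(face), face) for face in face_images]
    candidates = [(score, face) for score, face in scored if score >= threshold]
    if not candidates:
        return None  # no face passed the screening
    # Keep the best-scoring candidate as the current user's face image.
    return max(candidates)[1]
```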
In some embodiments, the adjusting structure includes a power mechanism and a transmission mechanism. The controlling the camera to perform second sound source tracking according to the current video picture of the video data and the face image corresponding to the current video picture includes the following steps: according to the current video picture of the video data and the face image corresponding to the current video picture, driving the camera to rotate along the first axis through the power mechanism and/or driving the camera to rotate along the second axis through the transmission mechanism, so that the camera performs second sound source tracking.
Specifically, according to the center point position of the current video picture and the center point coordinates of the face region of the corresponding face image, the camera is driven to rotate along the first axis through the power mechanism and/or along the second axis through the transmission mechanism, so that the camera performs second sound source tracking.
In some embodiments, the driving, according to the current video picture of the video data and the face image corresponding to the current video picture, the camera to rotate along the first axis through the power mechanism and to rotate along the second axis through the transmission mechanism includes: judging whether the face image of the current user is at the middle position of the current video picture according to the current video picture of the video data and the face image corresponding to the current video picture; if the face image of the current user is not at the middle position of the current video picture, determining shooting angle information of the camera; and driving the camera to rotate along the first axis through the power mechanism and/or driving the camera to rotate along the second axis through the transmission mechanism according to the shooting angle information of the camera and target shooting angle information.
Specifically, the judging, according to the current video picture of the video data and the face image corresponding to the current video picture, whether the face image of the current user is at the middle position of the current video picture specifically includes: comparing the center point coordinates of the current video picture with the center point coordinates of the face image; if the two sets of coordinates are the same, determining that the face image of the current user is at the middle position of the current video picture; and if they differ, determining that the face image of the current user is not at the middle position of the current video picture.
The center point coordinates of the face image and of the current video picture may be obtained during the face recognition process, that is, when the face image is determined in the video picture, both sets of coordinates are obtained as well. Alternatively, after the face image is determined by performing face recognition on the current video picture, the center point coordinates of the face image may be computed from the contour size of the face image, and the center point coordinates of the current video picture from the size of the current video picture.
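The centering check above reduces to comparing two center points. A minimal sketch (the bounding-box format and the pixel tolerance are assumptions; the text itself compares the coordinates for strict equality):

```python
def face_center(box):
    """Center point of a face bounding box given as (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def is_centered(frame_w, frame_h, box, tol=0.0):
    """Whether the face image sits at the middle position of the picture.

    The text compares the two center points for equality; `tol` (pixels)
    is an added assumption to make the check usable on real frames, where
    exact equality is rare.
    """
    cx, cy = face_center(box)
    return (abs(cx - frame_w / 2.0) <= tol and
            abs(cy - frame_h / 2.0) <= tol)
```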
In some embodiments, the determining the shooting angle information of the camera includes: calculating the area of the current user in the current video picture; determining a first distance between the current user and the camera according to the area; and determining shooting angle information of the camera according to the first distance, the center point coordinates of the current video picture and the center point coordinates of the face area in the current video picture.
In some implementations, the calculating the area of the current user in the current video picture includes: acquiring the area of the face region of the current user in the current video picture. How this area is obtained can be set flexibly according to actual needs; for example, it may be output during the face recognition process. The determining, according to the area, a first distance between the user and the camera includes: determining the first distance corresponding to the area according to a preset mapping relationship, where the first distance is the distance between the current user and the camera. The preset mapping relationship records the area, in the captured video picture, of the face region of a sample human body photographed by the camera at different shooting distances.
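A sketch of the area-to-distance lookup, assuming the preset mapping is stored as (area, distance) calibration pairs (the values below are illustrative). Between recorded samples we interpolate linearly and outside the table we clamp, neither of which the text prescribes:

```python
def estimate_distance(face_area, calibration):
    """First distance (current user to camera) looked up from face area.

    `calibration` is a list of (area_in_pixels, distance) pairs measured
    for a sample subject at known shooting distances. Larger face areas
    correspond to shorter distances.
    """
    table = sorted(calibration)  # ascending by area
    if face_area <= table[0][0]:
        return table[0][1]       # farther than the farthest sample: clamp
    if face_area >= table[-1][0]:
        return table[-1][1]      # closer than the closest sample: clamp
    for (a0, d0), (a1, d1) in zip(table, table[1:]):
        if a0 <= face_area <= a1:
            t = (face_area - a0) / (a1 - a0)
            return d0 + t * (d1 - d0)  # linear interpolation
```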
In some embodiments, the shooting angle information of the camera includes a first shooting angle in the horizontal direction and a second shooting angle in the vertical direction. The determining the shooting angle information of the camera according to the first distance, the center point coordinates of the current video picture, and the center point coordinates of the face region in the current video picture includes: determining a first relative orientation of the camera in the horizontal direction and a second relative orientation of the camera in the vertical direction according to the center point coordinates of the current video picture and the center point coordinates of the face region in the current video picture; determining the first shooting angle of the camera in the horizontal direction according to the first distance, the center point coordinates of the current video picture, and the center point coordinates of the face region in the current video picture; determining the second shooting angle of the camera in the vertical direction according to the center point coordinates of the current video picture and the center point coordinates of the face region in the current video picture; and taking the first relative orientation, the second relative orientation, the first shooting angle, and the second shooting angle as the shooting angle information of the camera.
In some embodiments, the determining the first shooting angle of the camera in the horizontal direction according to the first distance, the center point coordinate of the current video frame, and the center point coordinate of the face region in the current video frame includes: calculating a second distance between the center point of the current video picture and the center point of the face region in the current video picture according to the center point coordinate of the current video picture and the center point coordinate of the face region in the current video picture; and calculating the first shooting angle of the camera in the horizontal direction according to the first distance and the second distance.
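Under a small-scene, right-triangle assumption (and assuming the offset has already been converted to the same length units as the first distance, which in practice requires the camera intrinsics), the first shooting angle follows from basic trigonometry:

```python
import math

def first_shooting_angle(first_distance, frame_center, face_center):
    """First (horizontal) shooting angle of the camera, in degrees.

    The horizontal component of the second distance is the offset between
    the picture's center point and the face region's center point; the
    angle is then the arctangent of that offset over the first distance
    (the camera-to-user distance). Unit conversion from pixels to metres
    is assumed to have happened already.
    """
    second_distance = face_center[0] - frame_center[0]
    return math.degrees(math.atan2(second_distance, first_distance))
```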
In some embodiments, the shooting angle information of the camera includes a first shooting angle in the horizontal direction and a second shooting angle in the vertical direction, and the target shooting angle information includes a first target shooting angle in the horizontal direction and a second target shooting angle in the vertical direction. The driving, according to the shooting angle information of the camera and the target shooting angle information, the camera to rotate along the first axis through the power mechanism and to rotate along the second axis through the transmission mechanism includes: driving the camera to rotate along the second axis through the transmission mechanism according to the first shooting angle and the first target shooting angle, so as to adjust the shooting angle of the camera in the horizontal direction to the first target shooting angle; and driving the camera to rotate along the first axis through the power mechanism according to the second shooting angle and the second target shooting angle, so as to adjust the shooting angle of the camera in the vertical direction to the second target shooting angle.
According to the sound source positioning method based on the display device, while the current user makes a video call on the television, the camera does not need to be controlled manually: it automatically performs accurate video positioning and tracking of the current user. Combining face recognition further improves the accuracy of sound source positioning, so that the face image of the current user is placed at the middle position of the video call picture, improving the user's experience in the video call scenario.
Referring to fig. 4, fig. 4 is a schematic block diagram of a display device according to an embodiment of the present application.
As shown in fig. 4, the display device 400 may include a processor 402, a memory 403, and a communication interface 404 connected by a system bus 401, wherein the memory 403 may include a non-volatile computer readable storage medium and an internal memory.
The non-transitory computer readable storage medium may store a computer program. The computer program includes program instructions that, when executed, cause the processor to perform any of the display device-based sound source localization methods provided by the embodiments of the present application.
The memory 403 is used to store a computer program.
The processor 402 is used to execute the computer program and to implement the display device-based sound source localization method described above when executing the computer program.
The communication interface 404 is used for communication. Those skilled in the art will appreciate that the structure shown in fig. 4 is merely a block diagram of a portion of the structure associated with the present application and does not constitute a limitation of the display device 400 to which the present application is applied, and that a particular display device 400 may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
It should be appreciated that the bus 401 may be, for example, an I2C (Inter-Integrated Circuit) bus; the memory 403 may be a Flash chip, a read-only memory (ROM), a magnetic disk, an optical disc, a USB flash drive, a removable hard disk, or the like; the processor 402 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Wherein, in some embodiments, the display device comprises a display screen, a microphone array, an adjusting structure connected with the display screen, and a camera arranged on the adjusting structure, wherein the microphone array comprises a plurality of microphones; the processor 402 is configured to execute a computer program stored in the memory 403 to perform the steps of:
when a video call is carried out with a target contact person, determining delay information according to current voice information acquired by a plurality of microphones; acquiring first position information of a current sound source according to the time delay information and the position information of a plurality of microphones; controlling the camera to track a first sound source through the adjusting structure according to the first position information so as to acquire video data; performing face recognition on the video data to obtain a face recognition result; and controlling the camera to track the second sound source through the adjusting structure according to the face recognition result so as to adjust the face image of the current user to the middle position of the current video image.
In some embodiments, in determining the time delay information from the current voice information collected by the plurality of microphones, the processor 402 is configured to perform: receiving current voice information of a current user through the microphone array; and determining the time delay of the current voice to reach different microphones according to the current voice information so as to obtain the time delay information.
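A plain time-domain sketch of the delay estimation (real systems typically use generalized cross-correlation such as GCC-PHAT; the brute-force cross-correlation below, over integer sample lags, is only illustrative):

```python
def estimate_delay(ref, sig):
    """Estimate, in samples, how much later `sig` received the same
    waveform as `ref`, by locating the peak of their cross-correlation.

    `ref` and `sig` are equal-length lists of samples from two
    microphones. A positive result means the sound reached the second
    microphone later; the per-pair results form the time delay
    information once divided by the sampling rate.
    """
    n = len(ref)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-n + 1, n):
        score = sum(ref[i] * sig[i + lag]
                    for i in range(n)
                    if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```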
In some embodiments, when the first position information of the current sound source is obtained according to the time delay information and the position information of the plurality of microphones, the processor 402 is configured to perform: and constructing a plurality of hyperboloids by using the time delay information and the position relation between the time delay information and the microphones, and obtaining the first position information of the current sound source by calculating the intersection point of the hyperboloids.
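Each measured delay constrains the source to a hyperboloid (constant range difference to a microphone pair). Rather than intersecting the surfaces analytically, the sketch below grid-searches a 2D plane for the least-squares fit — assumptions: planar geometry, a known speed of sound, and delays referenced to the first microphone:

```python
import math

def locate_source(mics, delays, speed=343.0, span=5.0, step=0.05):
    """Brute-force the point whose TDOAs best match the measured delays.

    `mics` is a list of (x, y) microphone positions; `delays` are arrival
    times relative to mics[0] (so delays[0] == 0). The point minimizing
    the squared mismatch over all pairs approximates the intersection of
    the hyperbolas, i.e. the first position information of the sound source.
    """
    def dist(p, m):
        return math.hypot(p[0] - m[0], p[1] - m[1])

    best, best_err = None, float("inf")
    y = -span
    while y <= span:
        x = -span
        while x <= span:
            err = 0.0
            for m, tau in zip(mics[1:], delays[1:]):
                # predicted TDOA at (x, y) for this microphone pair
                diff = (dist((x, y), m) - dist((x, y), mics[0])) / speed
                err += (diff - tau) ** 2
            if err < best_err:
                best, best_err = (x, y), err
            x += step
        y += step
    return best
```

In practice a closed-form or iterative solver (and a 3D model) would replace the grid search; this only illustrates the geometry.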
In some embodiments, the adjustment structure includes a power mechanism and a transmission mechanism; the processor 402 is further configured to, when the camera is controlled to perform the first sound source tracking by the adjustment structure according to the first position information, perform: and driving the camera to rotate along a first axis through the power mechanism and/or driving the camera to rotate along a second axis through the transmission mechanism according to the first position information, so that the camera performs first sound source tracking.
In some embodiments, when the camera is controlled through the adjusting structure to perform second sound source tracking according to the face recognition result, so as to adjust the face image of the current user to the middle position of the current video picture, the processor 402 is configured to perform: screening the collected face images according to a preset user gesture model to obtain the face image of the current user; and controlling the camera to perform second sound source tracking according to the current video picture of the video data and the face image corresponding to the current video picture, so as to place the face image of the current user at the middle position of the current video picture.
In some embodiments, the adjustment structure includes a power mechanism and a transmission mechanism; the processor 402 is configured to perform, when controlling the camera to perform the second sound source tracking according to the current video frame of the video data and the face image corresponding to the current video frame: according to the current video picture of the video data and the face image corresponding to the current video picture, the camera is driven by the power mechanism to rotate along a first axis and/or is driven by the transmission mechanism to rotate along a second axis, so that the camera performs second sound source tracking.
In some embodiments, when the camera is driven to rotate along the first axis through the power mechanism and to rotate along the second axis through the transmission mechanism according to the current video picture of the video data and the face image corresponding to the current video picture, the processor 402 is configured to perform: judging whether the face image of the current user is at the middle position of the current video picture according to the current video picture of the video data and the face image corresponding to the current video picture; if the face image of the current user is not at the middle position of the current video picture, determining shooting angle information of the camera; and driving the camera to rotate along the first axis through the power mechanism and/or driving the camera to rotate along the second axis through the transmission mechanism according to the shooting angle information of the camera and the target shooting angle information.
In some embodiments, in the determining the shooting angle information of the camera, the processor 402 is configured to perform: calculating the area of the current user in the current video picture; determining a first distance between the current user and the camera according to the area; and determining shooting angle information of the camera according to the first distance, the center point coordinates of the current video picture and the center point coordinates of the face area in the current video picture.
In the foregoing embodiments, each embodiment is described with its own emphasis. For portions of an embodiment that are not described in detail, reference may be made to the foregoing detailed description of the display device-based sound source positioning method, which is not repeated here.
With the display device described above, while the current user makes a video call on the television, the camera does not need to be controlled manually: it automatically performs accurate video positioning and tracking of the current user. Combining face recognition further improves the accuracy of sound source positioning, so that the face image of the current user is placed at the middle position of the video call picture, improving the user's experience in the video call scenario.
The embodiments of the present application also provide a computer readable storage medium storing a computer program. The computer program includes program instructions which, when executed by a processor, cause the processor to implement any of the display device-based sound source positioning methods described above.
The computer readable storage medium may be an internal storage unit of the display device of the foregoing embodiment, for example, a hard disk or a memory of the display device. The computer readable storage medium may also be an external storage device of the display device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the display device.
Because the computer program stored in the computer readable storage medium can execute any of the display device-based sound source positioning methods provided in the embodiments of the present application, it can achieve the beneficial effects of any of those methods; see the foregoing embodiments for details, which are not repeated here.
The foregoing embodiment numbers of the present application are for description only and do not indicate the superiority or inferiority of the embodiments. While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (8)
1. The sound source positioning method based on the display device is characterized in that the display device comprises a display screen, a microphone array, an adjusting structure connected with the display screen and a camera arranged on the adjusting structure, wherein the microphone array comprises a plurality of microphones; the adjusting structure comprises a power mechanism and a transmission mechanism; the method comprises the following steps:
when a video call is carried out with a target contact person, determining delay information according to current voice information acquired by a plurality of microphones;
acquiring first position information of a current sound source according to the time delay information and the position information of a plurality of microphones;
controlling the camera to track a first sound source through the adjusting structure according to the first position information so as to acquire video data;
performing face recognition on the video data to obtain a face recognition result;
according to the face recognition result, controlling the camera to track a second sound source through the adjusting structure so as to adjust the face image of the current user to the middle position of the current video image;
and controlling the camera to track a second sound source through the adjusting structure according to the face recognition result so as to adjust the face image of the current user to the middle position of the current video image, wherein the method comprises the following steps:
judging whether the face image of the current user is at the middle position of the current video picture according to the current video picture of the video data and the face image corresponding to the current video picture;
if the face image of the current user is not at the middle position of the current video picture, calculating the area of the current user in the current video picture;
determining a first distance between the current user and the camera according to the area;
determining shooting angle information of the camera according to the first distance, the center point coordinates of the current video picture and the center point coordinates of the face area in the current video picture;
and driving the camera to rotate along a first axis through the power mechanism and/or driving the camera to rotate along a second axis through the transmission mechanism according to the shooting angle information of the camera and the target shooting angle information.
2. The method of claim 1, wherein determining delay information based on current voice information collected by a plurality of the microphones comprises:
receiving current voice information of a current user through the microphone array;
and determining the time delay of the current voice to reach different microphones according to the current voice information so as to obtain the time delay information.
3. The method of claim 1, wherein the obtaining the first location information of the current sound source according to the time delay information and the location information of the plurality of microphones comprises:
and constructing a plurality of hyperboloids by using the time delay information and the position relation between the time delay information and the microphones, and obtaining the first position information of the current sound source by calculating the intersection point of the hyperboloids.
4. The method of claim 1, wherein controlling the camera for a first sound source tracking by the adjustment structure according to the first position information comprises:
and driving the camera to rotate along a first axis through the power mechanism and/or driving the camera to rotate along a second axis through the transmission mechanism according to the first position information, so that the camera performs first sound source tracking.
5. The method according to any one of claims 1-4, wherein controlling the camera to perform a second sound source tracking by the adjustment structure according to the face recognition result, so as to adjust the face image of the current user to an intermediate position of the current video image, includes:
screening the collected face images according to a preset user gesture model to obtain a face image of the current user;
and controlling the camera to track a second sound source according to the current video picture of the video data and the face image corresponding to the current video picture so as to place the face image of the current user in the middle position of the current video picture.
6. The method of claim 5, wherein controlling the camera to perform the second sound source tracking according to the current video frame of the video data and the face image corresponding to the current video frame comprises:
according to the current video picture of the video data and the face image corresponding to the current video picture, the camera is driven by the power mechanism to rotate along a first axis and/or is driven by the transmission mechanism to rotate along a second axis, so that the camera performs second sound source tracking.
7. A display device, the display device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor for executing the computer program and for implementing the display device based sound source localization method as claimed in any one of claims 1 to 6 when the computer program is executed.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the display device-based sound source localization method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911305166.3A CN112995566B (en) | 2019-12-17 | 2019-12-17 | Sound source positioning method based on display device, display device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112995566A CN112995566A (en) | 2021-06-18 |
CN112995566B true CN112995566B (en) | 2024-04-05 |
Family
ID=76343648
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911305166.3A Active CN112995566B (en) | 2019-12-17 | 2019-12-17 | Sound source positioning method based on display device, display device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112995566B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113794830A (en) * | 2021-08-04 | 2021-12-14 | 深圳市沃特沃德信息有限公司 | Target track calibration method and device based on video and audio and computer equipment |
CN113702910A (en) * | 2021-08-31 | 2021-11-26 | 冠捷显示科技(厦门)有限公司 | Sound positioning method and device based on double microphones |
CN113762219A (en) * | 2021-11-03 | 2021-12-07 | 恒林家居股份有限公司 | Method, system and storage medium for identifying people in mobile conference room |
CN114466139A (en) * | 2022-01-30 | 2022-05-10 | 深圳市浩瀚卓越科技有限公司 | Tracking and positioning method, system, device, equipment, storage medium and product |
CN115550559B (en) * | 2022-04-13 | 2023-07-25 | 荣耀终端有限公司 | Video picture display method, device, equipment and storage medium |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102917232A (en) * | 2012-10-23 | 2013-02-06 | 深圳创维-Rgb电子有限公司 | Face recognition based 3D (three dimension) display self-adaptive adjusting method and face recognition based 3D display self-adaptive adjusting device |
CN103475849A (en) * | 2013-09-22 | 2013-12-25 | 广东欧珀移动通信有限公司 | Method and device for adjusting shooting angle of camera during video call |
CN103491302A (en) * | 2013-09-18 | 2014-01-01 | 潍坊歌尔电子有限公司 | System and method for adjusting and controlling camera of smart television |
CN103841357A (en) * | 2012-11-21 | 2014-06-04 | 中兴通讯股份有限公司 | Microphone array sound source positioning method, device and system based on video tracking |
CN104053088A (en) * | 2013-03-11 | 2014-09-17 | 联想(北京)有限公司 | Microphone array adjustment method, microphone array and electronic device |
CN105635657A (en) * | 2014-11-03 | 2016-06-01 | 航天信息股份有限公司 | Camera holder omnibearing interaction method and device based on face detection |
CN105898136A (en) * | 2015-11-17 | 2016-08-24 | 乐视致新电子科技(天津)有限公司 | Camera angle adjustment method, system and television |
CN107809596A (en) * | 2017-11-15 | 2018-03-16 | 重庆科技学院 | Video conference tracking system and method based on microphone array |
CN108391058A (en) * | 2018-05-17 | 2018-08-10 | Oppo广东移动通信有限公司 | Image capturing method, device, electronic device and storage medium |
CN109977770A (en) * | 2019-02-21 | 2019-07-05 | 安克创新科技股份有限公司 | A kind of auto-tracking shooting method, apparatus, system and storage medium |
CN110086992A (en) * | 2019-04-29 | 2019-08-02 | 努比亚技术有限公司 | Filming control method, mobile terminal and the computer storage medium of mobile terminal |
CN110175518A (en) * | 2019-04-19 | 2019-08-27 | 阿里巴巴集团控股有限公司 | Camera angle method of adjustment, device, equipment and the system of photographic device |
Also Published As
Publication number | Publication date |
---|---|
CN112995566A (en) | 2021-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112995566B (en) | Sound source positioning method based on display device, display device and storage medium |
US10873666B2 (en) | Camera tracking method and director device | |
US9179098B2 (en) | Video conferencing | |
US20100118112A1 (en) | Group table top videoconferencing device | |
US8749607B2 (en) | Face equalization in video conferencing | |
CN101189872A (en) | Normalized images for cameras | |
CN108766457B (en) | Audio signal processing method, audio signal processing device, electronic equipment and storage medium | |
US9516409B1 (en) | Echo cancellation and control for microphone beam patterns | |
CN110121048A (en) | Control method and control system of a conference all-in-one machine, and conference all-in-one machine |
US20170099458A1 (en) | Using the location of a near-end user in a video stream to adjust audio settings of a far-end system | |
CN112995564A (en) | Call method based on display device, display device and storage medium | |
CN109194916B (en) | Movable shooting system with image processing module | |
US10951860B2 (en) | Methods, systems, and apparatus for providing video communications | |
CN112995565B (en) | Camera adjustment method of display device, display device and storage medium | |
CN113395451B (en) | Video shooting method and device, electronic equipment and storage medium | |
CN109218612B (en) | Tracking shooting system and shooting method | |
US11902754B2 (en) | Audio processing method, apparatus, electronic device and storage medium | |
US20220337945A1 (en) | Selective sound modification for video communication | |
CN109194918B (en) | Shooting system based on mobile carrier | |
WO2023286680A1 (en) | Electronic device, program, and system | |
US20240046950A1 (en) | Methods, Systems, and Devices for Spectrally Adjusting Audio Gain in Videoconference and Other Applications | |
WO2023286678A1 (en) | Electronic device, program, and system | |
CN111541856A (en) | Video and audio processing device and video conference system thereof | |
US20240107225A1 (en) | Privacy protection in spatial audio capture | |
CN116389888A (en) | Video conference image acquisition method, electronic equipment and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||