WO2015070558A1

WO2015070558A1 - Video shooting control method and device

Info

Publication number: WO2015070558A1
Application number: PCT/CN2014/074831
Authority: WO
Inventors: 王静; 刘智辉; 张金亮
Original assignee: 华为技术有限公司
Priority date: 2013-11-14
Filing date: 2014-04-04
Publication date: 2015-05-21
Also published as: CN103595953A; CN103595953B

Abstract

Provided are a video shooting control method and device, which relate to the field of video images and can reduce the frequency of video switching while retaining the facial picture of a speaker, so that successive pictures are tightly connected and the output video is smoother. The method comprises: when a first speaker speaks, controlling a first shooting device to shoot a video of the first speaker; when the current speaker changes from the first speaker to a second speaker, controlling a second shooting device to shoot a video of the second speaker, wherein the second speaker is the next speaker at a position different from that of the first speaker; when the speaker subsequently changes again, controlling the first shooting device and the second shooting device in turn to alternately shoot a video of the current speaker; and after the video of the current speaker is acquired successfully, outputting the video of the current speaker. The present invention is used in a video conference.

Description

Method and device for controlling video shooting

This application claims priority to Chinese Patent Application No. 201310566974.1, entitled "A Method and Apparatus for Controlling Video Shooting", filed with the Chinese Patent Office on November 14, 2013, the entire contents of which are incorporated herein by reference.

Technical field

The present invention relates to the field of video images, and in particular, to a method and apparatus for controlling video capture.

Background art

In general, in a video conference, the camera captures a panoramic view of all participants at a fixed size and at a fixed angle. When the venue is relatively large, the camera may be far away from the speaker. From the captured picture it is then impossible to determine who is speaking or to see the speaker's facial expression, so valuable information of the conference is lost.

To avoid losing valuable information of the conference by capturing only the panoramic picture, two cameras can be used to capture the scene simultaneously. One camera is always used to capture the panoramic view of the venue, and the other camera is used to track the picture of the speaker.

When people in the venue speak alternately, the camera that tracks the speaker has to rotate and zoom (push/pull) its lens before it successfully acquires the picture of the current speaker. The video captured during this process is unstable and unclear, so during this time the displayed picture has to be switched to the panorama of the venue first. However, this kind of switching means that successive pictures are not tightly connected, and the video transmitted to the remote site is not smooth, which is very uncomfortable for the viewer.

Summary of the invention

Embodiments of the present invention provide a method and apparatus for controlling video capture, which can reduce the number of video switching while keeping the speaker's face picture, make the picture tightly connected, and output the video more smoothly.

In a first aspect, a method of controlling video capture is provided, including:

When a first speaker speaks, controlling a first camera device to capture a video of the first speaker;

When the current speaker changes from the first speaker to a second speaker, controlling a second camera device to capture a video of the second speaker, wherein the second speaker is the next speaker whose position is different from that of the first speaker;

When the speaker change occurs again, the first camera and the second camera are sequentially controlled to alternately capture the video of the current speaker;

After successfully acquiring the video of the current speaker, the video of the current speaker is output.

With reference to the first aspect, in a first possible implementation of the first aspect, the outputting the video of the current speaker includes: outputting the video of the current speaker in full screen.

In conjunction with the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the outputting the video of the current speaker in full screen includes:

Before the video of the current speaker is successfully acquired, outputting the video of the previous speaker of the current speaker in full screen;

After the video of the current speaker is successfully acquired, outputting the video of the current speaker in full screen.

In conjunction with the first aspect, in a third possible implementation of the first aspect, the outputting the video of the current speaker includes: simultaneously outputting, in picture-in-picture form, the video of the current speaker and the video of the previous speaker of the current speaker;

The picture-in-picture includes a first picture and a second picture that is contained in the first picture and is smaller than the first picture; the current speaker is output in the first picture, and the previous speaker of the current speaker is output in the second picture.

In conjunction with the third possible implementation of the first aspect, in a fourth possible implementation manner of the first aspect, the method further includes:

When the current speaker changes from the second speaker to a third speaker, controlling the first camera device to capture a video of the third speaker, wherein the third speaker is the next speaker whose position is different from that of the second speaker;

The simultaneously outputting, in picture-in-picture form, the video of the current speaker and the video of the previous speaker of the current speaker includes:

Before successfully acquiring the video of the third speaker: outputting the second speaker in the first picture and outputting a frozen picture of the first speaker in the second picture; or outputting the second speaker in the first picture and outputting, in the second picture, the third speaker whose shooting has started but whose video has not yet been successfully acquired;

After successfully acquiring the video of the third speaker: outputting the third speaker in the first picture, and outputting the second speaker in the second picture.
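The picture-in-picture rule of the fourth possible implementation can be illustrated with a minimal Python sketch. The function name `pip_layout`, the `show_transition` switch and the `live:`/`frozen:`/`acquiring:` labels are hypothetical and are not part of the application; the sketch only shows which source is placed in the first (large) picture and which in the second (small) picture.

```python
from typing import Tuple


def pip_layout(current: str, previous: str, before_previous: str,
               acquired: bool, show_transition: bool = False) -> Tuple[str, str]:
    """Return (first_picture, second_picture) for picture-in-picture output.

    current         -- the new speaker (e.g. the third speaker)
    previous        -- the speaker before the change (e.g. the second speaker)
    before_previous -- the speaker before that (e.g. the first speaker)
    acquired        -- True once the video of `current` is successfully acquired
    show_transition -- hypothetical switch selecting the alternative in which
                       the not-yet-acquired shot of `current` is shown
    """
    if acquired:
        # After acquisition: new speaker large, previous speaker small.
        return (f"live:{current}", f"live:{previous}")
    if show_transition:
        # Before acquisition, alternative 2: show the camera still moving.
        return (f"live:{previous}", f"acquiring:{current}")
    # Before acquisition, alternative 1: freeze the older speaker's picture.
    return (f"live:{previous}", f"frozen:{before_previous}")
```

In this sketch the first picture never shows an unstable shot: it changes to the new speaker only once `acquired` is true.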

With reference to the first aspect, in a fifth possible implementation of the first aspect, the outputting the video of the current speaker includes: simultaneously outputting, in dual-picture form, the video of the current speaker and the video of the previous speaker of the current speaker;

The dual picture includes two partial pictures, neither of which contains the other; one partial picture outputs the current speaker, and the other partial picture outputs the previous speaker of the current speaker.

In conjunction with the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, the method further includes:

The simultaneously outputting, in dual-picture form, the video of the current speaker and the video of the previous speaker of the current speaker includes:

Before successfully acquiring the video of the third speaker: outputting a frozen picture of the first speaker in one partial picture and outputting the second speaker in the other partial picture; or outputting, in one partial picture, the third speaker whose shooting has started but whose video has not yet been successfully acquired, and outputting the second speaker in the other partial picture;

After successfully acquiring the video of the third speaker: outputting the third speaker in the one partial picture and outputting the second speaker in the other partial picture.

In conjunction with the first aspect, in a seventh possible implementation of the first aspect, before the controlling the first camera to capture the video of the first speaker, the method further includes:

In the initial state, controlling the first camera device and the second camera device to capture a video of the entire venue, and outputting the captured video.

In conjunction with the first aspect, or any one of the first to seventh possible implementations of the first aspect, in an eighth possible implementation of the first aspect, before the controlling the first camera device to capture the video of the first speaker, the method further includes:

Setting a tracking flag for each of the first camera device and the second camera device, wherein the tracking flag of the first camera device is initially a first tracking flag, and the tracking flag of the second camera device is initially a second tracking flag;

The controlling the first camera device to capture the video of the first speaker when the first speaker speaks includes: when the first speaker speaks, controlling the first camera device, which has the first tracking flag, to capture the video of the first speaker, and after successfully acquiring the video of the first speaker, setting the tracking flag of the first camera device from the first tracking flag to the second tracking flag while setting the tracking flag of the second camera device from the second tracking flag to the first tracking flag;

The controlling the second camera device to capture the video of the second speaker when the current speaker changes from the first speaker to the second speaker includes: when the current speaker changes from the first speaker to the second speaker, controlling the second camera device, which has the first tracking flag, to capture the video of the second speaker, and after successfully acquiring the video of the second speaker, setting the tracking flag of the second camera device from the first tracking flag to the second tracking flag while setting the tracking flag of the first camera device from the second tracking flag to the first tracking flag.

In conjunction with the eighth possible implementation of the first aspect, in a ninth possible implementation of the first aspect, the controlling the first camera device and the second camera device in turn to alternately capture the video of the current speaker when a speaker change subsequently occurs includes: each time the speaker changes, controlling the camera device that has the first tracking flag to capture the video of the current speaker, and after successfully acquiring the video of the current speaker, interchanging the tracking flags of the first camera device and the second camera device.
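The tracking-flag scheme of the eighth and ninth possible implementations can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the applicant's implementation: the class names, the string values of the flags, and the `acquire` callback (assumed to return `True` once the video of the speaker has been successfully acquired) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

FIRST_FLAG = "first"    # the camera device holding this flag shoots the next speaker
SECOND_FLAG = "second"  # the camera device holding this flag keeps its current shot


@dataclass
class Camera:
    name: str
    flag: str


class FlagController:
    """Hypothetical controller that selects a camera device by its tracking flag."""

    def __init__(self) -> None:
        self.cameras: List[Camera] = [
            Camera("camera1", FIRST_FLAG),
            Camera("camera2", SECOND_FLAG),
        ]

    def on_speaker_change(self, speaker: str,
                          acquire: Callable[[Camera, str], bool]) -> str:
        # Each time the speaker changes, use the camera with the first flag.
        cam = next(c for c in self.cameras if c.flag == FIRST_FLAG)
        if acquire(cam, speaker):
            # Interchange the flags only after successful acquisition.
            for c in self.cameras:
                c.flag = SECOND_FLAG if c.flag == FIRST_FLAG else FIRST_FLAG
        return cam.name
```

Because the flags are interchanged only after a successful acquisition, a failed attempt leaves the same camera device responsible for the current speaker, while successive successful acquisitions alternate between the two devices.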

In conjunction with the ninth possible implementation of the first aspect, in a tenth possible implementation manner of the first aspect, controlling the camera to capture the video of the speaker includes:

Using the sound source localization technology, the camera is controlled to capture the video of the speaker.

In conjunction with the tenth possible implementation of the first aspect, in an eleventh possible implementation of the first aspect, the controlling the camera to capture the video of the speaker by using the sound source localization technology includes: controlling the camera to capture the video of the speaker by using the sound source localization technology in combination with a preset position or image recognition technology.

In conjunction with the first aspect, or any one of the first to eleventh possible implementations of the first aspect, in a twelfth possible implementation of the first aspect, the controlling the second camera to capture the video of the second speaker when the current speaker changes from the first speaker to the second speaker includes:

Determining whether the second speaker position is in an output screen of the first speaker; if the second speaker position is not in an output screen of the first speaker, controlling the second camera to shoot a video of the second speaker;

If the second speaker position is in the output screen of the first speaker, further determining whether the second speaker position is within a set area of the output screen of the first speaker;

If the second speaker position is within the set area, controlling the first camera device to capture the video of the second speaker;

And if the second speaker position is not within the set area, controlling the first camera device to track the second speaker so that the second speaker is positioned within the set area.
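The three-way decision of the twelfth possible implementation can be written as a small function. The following Python sketch is illustrative only; the enumeration `Action` and the function `choose_action` are hypothetical names, and the two boolean inputs are assumed to have been determined beforehand from the speaker position.

```python
from enum import Enum


class Action(Enum):
    SWITCH_TO_SECOND_CAMERA = 1  # new speaker is outside the current output picture
    KEEP_FIRST_CAMERA = 2        # new speaker is already inside the set area
    ADJUST_FIRST_CAMERA = 3      # in the picture, but outside the set area


def choose_action(in_output_picture: bool, in_set_area: bool) -> Action:
    """Decide how to shoot the second speaker relative to the first speaker's shot."""
    if not in_output_picture:
        return Action.SWITCH_TO_SECOND_CAMERA
    if in_set_area:
        return Action.KEEP_FIRST_CAMERA
    # Track with the first camera until the speaker lies in the set area.
    return Action.ADJUST_FIRST_CAMERA
```

Only the first branch causes a change of camera device; the other two keep the output on the first camera device, which is how the number of switchings is kept low.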

In a second aspect, an apparatus for controlling video capture is provided, including:

a control unit, configured to control, when the first speaker speaks, the first camera to capture a video of the first speaker;

The control unit is further configured to: when the current speaker changes from the first speaker to the second speaker, control the second camera device to capture a video of the second speaker, where the second speaker is the next speaker whose position is different from that of the first speaker;

The control unit is further configured to control the first camera device and the second camera device in turn to alternately capture a video of the current speaker when a speaker change subsequently occurs;

And a processing unit, coupled to the control unit, configured to output a video of the current speaker after successfully acquiring the video of the current speaker.

With reference to the second aspect, in a first possible implementation manner of the second aspect, the processing unit is specifically configured to:

Set the video of the current speaker to be displayed in full screen;

Output the video of the current speaker in full screen.

With reference to the first possible implementation of the second aspect, in a second possible implementation manner of the second aspect, the processing unit is specifically configured to:

Output the video of the previous speaker of the current speaker in full screen before the video of the current speaker is successfully acquired; and output the video of the current speaker in full screen after the video of the current speaker is successfully acquired.

With reference to the second aspect, in a third possible implementation manner of the second aspect, the processing unit is further configured to:

Setting a video of the current speaker and a video of a previous speaker of the current speaker in a picture-in-picture format;

The picture-in-picture includes a first picture and a second picture included in the first picture that is smaller than the first picture, and the current speaker is displayed in the first picture. Displaying the previous speaker of the current speaker in the second screen;

Simultaneously outputting a video of the current speaker and the previous speaker of the current speaker in the form of picture-in-picture;

In conjunction with the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the control unit is further configured to:

The processing unit is specifically configured to:

Before successfully acquiring the video of the third speaker: outputting the second speaker in the first picture and outputting a frozen picture of the first speaker in the second picture; or outputting the second speaker in the first picture and outputting, in the second picture, the third speaker whose shooting has started but whose video has not yet been successfully acquired;

After successfully acquiring the video of the third speaker: outputting the third speaker in the first picture and outputting the second speaker in the second picture.

With reference to the second aspect, in a fifth possible implementation manner of the second aspect, the processing unit is further configured to:

Setting the video of the current speaker and the video of the previous speaker of the current speaker to be displayed in dual-picture form;

The dual picture includes two partial pictures, neither of which contains the other; one partial picture displays the current speaker, and the other partial picture displays the previous speaker of the current speaker;

Simultaneously outputting, in dual-picture form, the video of the current speaker and the video of the previous speaker of the current speaker.

In conjunction with the fifth possible implementation of the second aspect, in a sixth possible implementation manner of the second aspect, the control unit is further configured to:

The processing unit is specifically configured to:

In conjunction with the second aspect, in a seventh possible implementation of the second aspect, the control unit is further configured to:

Before controlling the first camera to capture the video of the first speaker, in the initial state, controlling the first camera and the second camera to capture a video of the entire venue;

The processing unit is further configured to output the captured video.

In combination with the second aspect, or any one of the first to seventh possible implementations of the second aspect, in an eighth possible implementation of the second aspect, the control unit is further configured to:

The control unit is specifically configured to: when the first speaker speaks, control the first camera device, which has the first tracking flag, to capture the video of the first speaker, and after successfully acquiring the video of the first speaker, set the tracking flag of the first camera device from the first tracking flag to the second tracking flag while setting the tracking flag of the second camera device from the second tracking flag to the first tracking flag;

The control unit is specifically configured to: when the current speaker changes from the first speaker to the second speaker, control the second camera device, which has the first tracking flag, to capture the video of the second speaker, and after successfully acquiring the video of the second speaker, set the tracking flag of the second camera device from the first tracking flag to the second tracking flag while setting the tracking flag of the first camera device from the second tracking flag to the first tracking flag.

With reference to the eighth possible implementation manner of the second aspect, in a ninth possible implementation manner of the second aspect, the control unit is specifically configured to: each time the speaker changes, control the camera device that has the first tracking flag to capture the video of the current speaker, and after successfully acquiring the video of the current speaker, interchange the tracking flags of the first camera device and the second camera device.

With reference to the ninth possible implementation of the second aspect, in a tenth possible implementation manner of the second aspect, the control unit is specifically configured to:

In conjunction with the tenth possible implementation of the second aspect, in the eleventh possible implementation manner of the second aspect, the control unit is specifically configured to:

The sound source localization technique is combined with preset position or image recognition technology to control the camera to capture the speaker's video.

With reference to the second aspect, or any one of the first to the eleventh possible implementation manners of the second aspect, in the twelfth possible implementation manner of the second aspect, the control unit is specifically configured to: determine whether the second speaker position is in the output screen of the first speaker; and if the second speaker position is not in the output screen of the first speaker, control the second camera device to capture a video of the second speaker;

With the above technical solution, in the method and the device for controlling video shooting according to the present invention, when participants in the venue speak alternately, the first camera device and the second camera device are controlled in turn to alternately capture the video of the current speaker, and the video of the current speaker is output. Thus, even if several people in the venue speak in rapid alternation, the two camera devices can capture the facial pictures of those speakers. Moreover, in the technical solution provided by the present invention, the video of the current speaker is output only after a camera device has successfully acquired it. Compared with the prior art, which needs to switch to the panorama of the venue before the camera device successfully acquires the video of the next speaker, the present invention reduces the number of video switchings, so that successive pictures are tightly connected and the output video is smoother.

Brief description of the drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may obtain other drawings from these drawings without creative effort.

FIG. 1 is a flowchart of an embodiment of a method for controlling video capture according to the present invention;

FIG. 2A is a schematic diagram of shooting the changed speaker in the case where the position of the changed speaker is within the set area of the output screen of the speaker before the change;

FIG. 2B is a schematic diagram of shooting the changed speaker in the case where the position of the changed speaker is in the output screen of the speaker before the change but not in the set area of the screen;

FIG. 2C is a schematic diagram of shooting the changed speaker in the case where the position of the changed speaker is not in the output screen of the speaker before the change;

FIG. 3A is a flowchart of a specific embodiment of a method for controlling video capture according to the present invention;

FIG. 3B is another flowchart of a specific embodiment of a method for controlling video capture according to the present invention;

FIG. 4 is a schematic diagram of a specific embodiment of a method for controlling video capture according to the present invention;

FIG. 5A is a schematic diagram showing the effect of outputting the camera rotation/push-pull process when displaying in full screen;

FIG. 5B is a schematic diagram showing the effect of not outputting the camera rotation/push-pull process when displaying in full screen;

FIG. 6 is a flowchart of another embodiment of a method for controlling video shooting according to the present invention;

FIG. 7 is a schematic diagram of another embodiment of a method for controlling video shooting according to the present invention;

FIG. 8A is a schematic diagram showing the effect of outputting the camera rotation/push-pull process when displaying in picture-in-picture;

FIG. 8B is a schematic diagram showing the effect of not outputting the camera rotation/push-pull process when displaying in picture-in-picture;

FIG. 9 is a flowchart of still another embodiment of a method for controlling video capture according to the present invention;

FIG. 10 is a schematic diagram of still another embodiment of a method for controlling video capture according to the present invention;

FIG. 11A is a schematic diagram showing the effect of outputting the camera rotation/push-pull process when displaying in dual picture;

FIG. 11B is a schematic diagram showing the effect of not outputting the camera rotation/push-pull process when displaying in dual picture;

FIG. 12 is a structural block diagram of an embodiment of the apparatus for controlling video shooting according to the present invention;

FIG. 13A is a schematic structural diagram of another embodiment of the apparatus for controlling video shooting according to the present invention;

FIG. 13B is a schematic structural diagram of still another embodiment of the apparatus for controlling video shooting according to the present invention;

FIG. 14 is a schematic structural diagram of a further embodiment of the apparatus for controlling video shooting according to the present invention.

Detailed description

The technical solutions of the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

FIG. 1 is a flowchart of an embodiment of a method for controlling video capture according to the present invention. The method for controlling video capture provided by the embodiment of the present invention may be implemented by a device having a control processing function, which may be, for example, a camera, a video controller, a video terminal, or the like. As shown in FIG. 1, the method for controlling video shooting provided by the embodiment of the present invention includes:

S11, controlling the first camera device to capture the video of the first speaker when the first speaker speaks.

In the embodiment of the present invention, two sets of camera devices are provided, a first camera device and a second camera device, to capture the video of a speaker. The first camera device may be a camera module, and the second camera device may also be a camera module. Of course, within the scope of the present invention, the first camera device and the second camera device may each also comprise a plurality of camera modules, and the specific application of a plurality of camera modules can be derived by analogy from the application of one camera module. The first camera device and the second camera device may be connected and fixed together by a connecting device, or may be independent of each other. The camera device mentioned in the embodiment of the present invention may be a video camera or another terminal device having a camera function.

The method for controlling video shooting provided by the embodiment of the present invention can be applied to a video conference to capture and output the video of a speaker in a local venue, and can also be used to transmit the video of the local venue to a remote site, so that participants at the remote site can watch what is happening in the local venue.

After the camera devices are turned on and the video conference starts, if nobody in the local venue has spoken yet, both the first camera device and the second camera device can be controlled to capture the panoramic view of the local venue. If it is predetermined that the first camera device will be controlled to shoot the first speaker in the venue, the video captured by the second camera device is preferably output to the remote site. At this point, because there is no speaker, the participants at the remote site only need to watch the panorama of the local venue. When someone in the local venue starts speaking, that is, when the first speaker appears, the first camera device can be immediately controlled to capture the video of the first speaker, while the second camera device can still be controlled to capture the panorama of the local venue.

In an embodiment of the invention, the position of the speaker can be determined using sound source localization technology. Sound source localization alone may fail to acquire the position of the speaker accurately because of noise interference and the like. Therefore, the possible positions of speakers in the local venue can also be preset, and when the sound source localization technology acquires the position of the speaker, the result is judged in combination with these preset possible positions (that is, the preset positions), which gives higher accuracy. To obtain the position of the speaker still more accurately, sound source localization technology and image recognition technology can be combined.

Specifically, when a camera device (the first camera device or the second camera device) is controlled to capture the video of a speaker, a plurality of sound pickup microphones may form a sound pickup microphone array. When the first speaker speaks, the sound pickup microphone array picks up the sound of the local venue and, after audio front-end processing, passes it to the sound source locator. The sound source locator is a module with a sound source localization function in the device having the control processing function, and the sound pickup microphone array is composed of two or more sound pickup microphones distributed at different positions in the local venue. The sound source locator receives the sound picked up by the sound pickup microphone array and performs localization processing on it to obtain position information of the first speaker. The controller may then send a corresponding camera control command to the pan/tilt according to the position information, and the pan/tilt controls the first camera device to rotate to a suitable shooting angle so as to roughly obtain the video of the first speaker; the pan/tilt is used to receive and execute camera control commands sent by the controller.

Then, by combining the position information obtained by sound source localization with the preset position information or with image recognition technology (which may specifically be face recognition, face detection, lip motion detection, and the like), more accurate position information of the first speaker is obtained, and a new control command is generated and sent to the pan/tilt to control the first camera device to rotate/zoom and acquire a picture in which the first speaker appears at the required size; for example, the face of the first speaker may occupy 1/2, 1/3 or 1/4 of the entire picture.
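As an illustration of the size requirement, the following Python sketch computes how much the current shot would have to be magnified so that a detected face fills a target fraction of the picture. It is a sketch under assumptions that the application does not state: the fraction is interpreted as a fraction of the picture height, the face height is assumed to come from some face detector, and the function name `zoom_factor` is hypothetical.

```python
def zoom_factor(face_height_px: float, frame_height_px: float,
                target_fraction: float) -> float:
    """Magnification needed for the face to fill `target_fraction` of the frame height.

    Assumption: "occupies 1/2, 1/3 or 1/4 of the picture" is read as a
    fraction of the picture height; the application does not specify this.
    """
    if face_height_px <= 0 or frame_height_px <= 0:
        raise ValueError("heights must be positive")
    if not 0 < target_fraction <= 1:
        raise ValueError("target_fraction must be in (0, 1]")
    # A value above 1 means zoom in (push); below 1 means zoom out (pull).
    return target_fraction * frame_height_px / face_height_px
```

For example, a face 135 pixels high in a 1080-pixel-high picture needs roughly a fourfold magnification to fill half of the picture height.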

Because the accuracy of sound source localization technology is limited and it is easily disturbed by noise, the embodiment of the invention uses sound source localization technology in combination with the preset position or image recognition technology to determine the position of the speaker accurately, and then controls the camera device to shoot. It should be noted that, in the present invention, depending on the actual situation, sound source localization technology may be used alone, in combination with the preset position, in combination with image recognition technology, or in combination with both the preset position and image recognition technology at the same time.
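One simple way to combine a noisy sound source estimate with preset positions is to snap the estimate to the nearest preset when it lies close enough. The Python sketch below illustrates that idea only; the application does not prescribe a particular combination rule, and the function name `refine_with_presets`, the two-dimensional coordinates and the `max_distance` threshold are assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def refine_with_presets(estimate: Point, presets: Sequence[Point],
                        max_distance: float) -> Point:
    """Return the nearest preset position if it is within `max_distance`
    of the sound source estimate; otherwise keep the raw estimate."""
    if not presets:
        return estimate
    nearest = min(presets, key=lambda p: math.dist(p, estimate))
    if math.dist(nearest, estimate) <= max_distance:
        return nearest
    return estimate
```

With seats preset at (0, 0), (2, 0) and (4, 0), an estimate of (1.8, 0.3) and a threshold of 0.5 would be refined to the seat at (2, 0), while an estimate far from every seat is left unchanged.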

S12, when the current speaker changes from the first speaker to the second speaker, controlling the second camera device to capture a video of the second speaker, wherein the second speaker is the next speaker whose position is different from that of the first speaker.

The current speaker refers to the person currently speaking in the local venue; in steps S11 and S12 the current speaker is the first speaker and the second speaker, respectively. It should be noted that, in the interval between the change of the speaker position and the camera device successfully acquiring the video of the changed speaker, the current speaker is already the changed speaker, even though the camera device has not yet successfully acquired that speaker's video.

Similar to controlling the first camera device to capture the video of the first speaker, when sound source localization determines that the position of the speaker has changed, that is, the speaker has changed from the first speaker to a second speaker at a different position, the second camera device is controlled to rotate/zoom to a suitable shooting angle and shooting size. Then, as in step S11, in combination with the preset position or image recognition technology, the second camera device is further controlled to rotate/zoom as needed and to capture a video in which the second speaker appears at the required size.

It should be noted that if the speaker moves only slightly, for example by only one or two seats, the position of the speaker can be considered unchanged and the camera device does not need to be switched; moreover, as long as the speaker is still within the set area of the shooting picture, for example within the central area covering 80% of the entire picture, the camera device does not need to rotate/zoom for tracking. If the speaker changes to another speaker, but the two speakers merely take turns at the same position, or are so close to each other that both lie within the set area of one camera device's shooting picture, the position of the speaker can likewise be considered unchanged: the camera device does not need to be switched and does not need to rotate/zoom for tracking (see FIG. 2A, in which the solid line indicates the shooting picture and the broken line indicates the set area). Whether it is the same speaker or a different one, if the speaker position is in the output picture but not in the set area, the camera device does not need to be switched, but it can be rotated/zoomed slightly so that the changed speaker is in the middle of the picture (see FIG. 2B). In the following description, unless otherwise specified, a change of speaker or a change of speaker position means that the speaker position has changed and that the distance between the changed position and the center of the shooting picture is large enough that the camera device needs to be switched; this threshold can be set according to the actual scene (see FIG. 2C).
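The set area can be illustrated geometrically. The following Python sketch builds a centered set area inside the shooting picture and tests whether a speaker position falls inside it. It rests on assumptions not fixed by the text: the 80% figure is applied to the width and to the height separately, positions are pixel coordinates, and the names `Rect` and `set_area` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float

    def contains(self, x: float, y: float) -> bool:
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


def set_area(frame: Rect, fraction: float = 0.8) -> Rect:
    """Centered rectangle spanning `fraction` of the frame width and height.

    Assumption: "80% of the entire picture" is applied per dimension.
    """
    w = frame.width * fraction
    h = frame.height * fraction
    return Rect(frame.left + (frame.width - w) / 2,
                frame.top + (frame.height - h) / 2, w, h)
```

A position inside `set_area(frame)` corresponds to the situation of FIG. 2A, a position inside `frame` but outside the set area to FIG. 2B, and a position outside `frame` to FIG. 2C.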

S13. When a speaker change subsequently occurs, control the first camera and the second camera in turn to alternately capture the video of the current speaker.

Specifically, when the speaker subsequently changes from the second speaker to the next speaker after the second speaker, namely the third speaker, the first camera is controlled to capture the video of the third speaker. If a further speaker change occurs, that is, the speaker changes from the third speaker to the next speaker after the third speaker, namely the fourth speaker, the second camera is controlled to capture the video of the fourth speaker. This is repeated, ensuring that the first camera and the second camera alternately shoot the video of the current speaker.

For example, suppose there are four speakers, A, B, C, and D, in the local venue. A starts to talk first, so the first camera is controlled to shoot A; when the speaker changes from A to B, the second camera is controlled to shoot B; when the speaker changes from B to C, the first camera is again controlled to shoot C; and when the speaker changes from C to D, the second camera is again controlled to shoot D, and so on.
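The alternation in this example can be sketched in a few lines. The function name `assign_cameras` and the string labels are hypothetical and used only for illustration; consecutive entries for the same speaker are assumed to mean that the position has not changed, so no switch occurs.

```python
from itertools import cycle


def assign_cameras(speakers):
    """Pair each change of speaker with the camera whose turn it is."""
    cameras = cycle(["first camera", "second camera"])
    assignments = []
    previous = None
    for speaker in speakers:
        if speaker == previous:
            continue  # speaker position unchanged: keep the same camera
        assignments.append((speaker, next(cameras)))
        previous = speaker
    return assignments


print(assign_cameras(["A", "B", "C", "D"]))
```

With the sequence A, B, C, D this pairs A and C with the first camera and B and D with the second camera, as in the example above.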

When several people in the venue speak alternately in quick succession, the picture taken by the prior-art camera used to shoot the speaker will include several speakers, and if those speakers are far apart, their expressions cannot be observed in the captured picture, which results in loss of valuable information of the meeting. In the present invention, both the first camera and the second camera can track the speaker: while one camera tracks the current speaker, the other camera tracks the speaker after the change. This ensures that the first camera and the second camera cooperate and hand over seamlessly: when the first camera is shooting the current speaker, the second camera is used to shoot the next speaker after the current speaker; when the second camera is shooting the current speaker, the next speaker is shot by the first camera. In particular, when there are only two speakers, A and B, in the local venue, the first camera can keep tracking A and the second camera can keep tracking B. When the speakers alternate, the first camera and the second camera have each already adjusted their focal length, so there is no need to rotate/push and pull the camera. Thus, even if speakers in the venue alternate quickly, the two cameras can alternately capture the speakers' faces, more valuable information of the conference is retained, and the efficiency of video tracking is improved.

S14. After successfully acquiring the video of the current speaker, output the video of the current speaker.

Specifically, after the camera that shoots the current speaker successfully acquires the video of the current speaker, the video of the current speaker is output. Outputting the video of the current speaker includes outputting it in different ways (e.g., full screen, picture-in-picture, dual picture) on the display of the camera or on the display of the local site, and also outputting it to the remote site in different ways. It should be noted that the present invention does not limit the manner (such as the encoding and decoding used) in which the video captured at the local site is transmitted to the remote site. In the process of transmitting to the remote site, for example, the video of the current speaker can be sent to a video signal processor; after receiving the video of the current speaker, the video signal processor performs processing such as encoding and compression, and the resulting code stream is transmitted to the remote site through the network. After receiving the code stream, the remote site performs decoding and other processing to obtain the decoded video of the current speaker, which can then be displayed on the display of the remote site in different ways. In this way, participants at the remote site can view the local site on the display.

When the speaker changes, it takes a certain time for the camera to acquire the video of the speaker after the change. During this period, the prior art switches the screen to the panoramic view of the conference site, and switches it to the changed speaker only once the camera has successfully obtained the video of the changed speaker, which makes the video less smooth. In the embodiment of the present invention, before the video of the current speaker is successfully acquired in step S14, the method for controlling video capture may further include: outputting the video of the previous speaker of the current speaker. That is, before the video of the current speaker is successfully acquired, the video of the previous speaker is output; after the video of the current speaker is successfully acquired, the video of the current speaker is output. In this way, when the screen is output in full screen, the output is continuous and its quality remains high, because the blurred or shaking picture produced while the camera rotates/pushes and pulls to acquire the current speaker's video is never output.

Of course, in the embodiment of the present invention, the screen of the local site may be output not only in full screen but also as picture-in-picture, dual picture, or the like. When outputting in the form of picture-in-picture, after the video of the current speaker is successfully acquired, the current speaker may be output in the large picture (first picture) and the previous speaker of the current speaker in the small picture (second picture). When outputting in the form of dual picture, after the video of the current speaker is successfully acquired, the current speaker may be output in one of the two pictures, neither of which contains the other, and the previous speaker of the current speaker in the other picture. The specific implementation of these output forms is described separately in the following specific embodiments.

Further, in the embodiment of the present invention, in order to make it easier to control the two cameras to shoot the current speaker in turn and to output the video of the current speaker, a tracking flag may be set for each of the two cameras before shooting starts. For example, the initial tracking flags of the first camera and the second camera may be set to a first tracking flag and a second tracking flag respectively, and the tracking flags may be represented by numbers such as 0 and 1. The camera whose tracking flag is the first tracking flag may be set to capture the video of the current speaker, and the camera whose tracking flag is the second tracking flag to capture the video of the next speaker after the current speaker (or of the previous speaker). Moreover, after the video of the current speaker is successfully acquired, the tracking flags of the first camera and the second camera need to be interchanged.

In the case where tracking flags are set for the first camera and the second camera, step S11, controlling the first camera to capture the video of the first speaker when the first speaker speaks, may include: when the first speaker speaks, controlling the first camera, which has the first tracking flag, to capture the video of the first speaker, and, after the video of the first speaker is successfully acquired, setting the tracking flag of the first camera from the first tracking flag to the second tracking flag and setting the tracking flag of the second camera from the second tracking flag to the first tracking flag.

Step S12, controlling the second camera to capture the video of the second speaker when the current speaker changes from the first speaker to the second speaker, may include: when the current speaker changes from the first speaker to the second speaker, controlling the second camera, which has the first tracking flag, to capture the video of the second speaker, and, after the video of the second speaker is successfully acquired, setting the tracking flag of the second camera from the first tracking flag to the second tracking flag and setting the tracking flag of the first camera from the second tracking flag to the first tracking flag.

Step S13, controlling the first camera and the second camera in turn to alternately capture the video of the current speaker when a speaker change subsequently occurs, may include: each time a speaker change occurs, controlling the camera that has the first tracking flag to capture the video of the current speaker, and, after the video of the current speaker is successfully acquired, interchanging the tracking flags of the first camera and the second camera. In this way, the two cameras cooperate with each other, hand over seamlessly, and alternately capture the video of the current speaker.
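The flag handling of steps S11 to S13 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class `CameraPair`, its method names, and the choice of 0 and 1 as flag values are assumptions made for the sketch, and "successful acquisition" is modeled as an explicit call.

```python
FIRST_TRACKING_FLAG = 0   # holder is sent to shoot the speaker after a change
SECOND_TRACKING_FLAG = 1  # holder has already acquired its speaker


class CameraPair:
    """Two cameras whose tracking flags are interchanged after each acquisition."""

    def __init__(self):
        self.flags = {"first": FIRST_TRACKING_FLAG, "second": SECOND_TRACKING_FLAG}
        self.target = {"first": None, "second": None}

    def camera_with(self, flag):
        """Return the name of the camera currently holding `flag`."""
        return next(name for name, held in self.flags.items() if held == flag)

    def on_speaker_change(self, speaker):
        """Point the camera holding the first tracking flag at the new speaker."""
        camera = self.camera_with(FIRST_TRACKING_FLAG)
        self.target[camera] = speaker
        return camera

    def on_acquired(self):
        """Interchange the two flags once the video has been acquired."""
        self.flags["first"], self.flags["second"] = (
            self.flags["second"],
            self.flags["first"],
        )


pair = CameraPair()
order = []
for speaker in ["A", "B", "C", "D"]:
    order.append(pair.on_speaker_change(speaker))
    pair.on_acquired()
print(order)
```

Because the flags are interchanged only in `on_acquired`, the camera that still holds the second tracking flag keeps shooting the previous speaker while the other camera is moving.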

In the embodiment of the present invention, both the first camera and the second camera can track the speaker. When the first speaker speaks, the first camera is controlled to shoot the first speaker, while the second camera is in a standby state, ready to track and shoot the next speaker after the first speaker. When the current speaker changes from the first speaker to the second speaker (i.e., the next speaker at a position different from that of the first speaker), the second camera is controlled to shoot the second speaker; while the second camera is shooting, the first camera keeps shooting the first speaker and transitions to a state of readiness to track and shoot the next speaker at a position different from that of the second speaker. In this way, the first camera and the second camera cooperate with each other and hand over seamlessly. When the speaker changes, it takes a certain time for the camera to successfully acquire the video of the changed speaker. The prior art uses one camera exclusively for shooting the panoramic view of the local venue and the other camera exclusively for tracking and shooting the speaker; therefore, before the camera dedicated to tracking the speaker has successfully acquired the video of the current speaker, the screen has to be switched to the panorama of the venue, and it is switched to the changed speaker only once the camera has successfully acquired the current speaker's video, which makes the video less smooth. In the technical solution provided by the present invention, the video of the current speaker is output only after the camera has successfully acquired it, and until then the video of the previous speaker of the current speaker continues to be output.
In this way, compared with the prior art, which has to switch to the panorama of the local conference site before the camera successfully acquires the video of the next speaker, the present invention reduces the number of video switches, so that the pictures follow one another closely and the output video is smoother. Moreover, when several people in the local venue speak alternately in quick succession, the picture taken according to the prior art includes several speakers, and if those speakers are far apart, their expressions cannot be observed in the captured picture. In the present invention, owing to the cooperation of the first camera and the second camera, even if speakers in the local venue alternate quickly, the two cameras can alternately capture the face of the speaker.

For a better understanding of the present invention, the present invention will be further described with reference to Figures 3A through 10, and several specific embodiments. It is also to be noted that the embodiments set forth below are only a part of the embodiments of the present invention, and those skilled in the art can readily contemplate other embodiments, which are within the scope of the present invention.

In the following specific embodiments, tracking flags may be set for the cameras, and the video captured by the camera with a specified tracking flag may be output. For example, the initial tracking flag of the first camera may be set to 0 (i.e., the first tracking flag) and the initial tracking flag of the second camera to 1 (i.e., the second tracking flag), where the camera with tracking flag 0 is used to capture the video of the current speaker and the camera with tracking flag 1 is used to capture the video of the next speaker after the current speaker; this convention is used below for convenience of description. Of course, the tracking flag of the first camera may instead be set to 1 and that of the second camera to 0, or the tracking flags may be set in other ways, which is not limited by the present invention.

FIG. 3A is a flow chart of a specific embodiment of a method of controlling video capture in accordance with the present invention. FIG. 3B is another flow chart of a specific embodiment of a method of controlling video capture in accordance with the present invention.

As shown in FIG. 3A, the method for controlling video shooting provided by the specific embodiment of the present invention includes:

S31. At the beginning of the meeting, control two cameras to take a panoramic view of the local venue.

After the two cameras (the first camera and the second camera) are turned on, that is, at the beginning of the conference, no one at the local site has spoken yet. In order to transmit the layout of the local site to the remote site, the two cameras can be controlled to capture the panorama of the local site. The shooting angle and size can be set by the user; a preferred setting is one that includes all participants and the main conference scene. When the captured picture is transmitted from the local site to the remote site, the picture from either camera can be transmitted, since both cameras capture the panorama of the local site. Preferably, the picture captured by the camera with tracking flag 1 (i.e., the second camera) is transmitted.

S32. Control the first camera to capture the video of the first speaker by using sound source localization technology.

After the two cameras have been controlled to shoot the panoramic view of the conference site, when someone in the conference starts speaking, that is, when the first speaker appears, the sound collection microphone array picks up the sound of the local conference site and sends it to the sound source locator, which generates speaker position information according to a sound source localization technique. The controller then controls the camera with tracking flag 0 to capture the video of the first speaker at a suitable size according to the position information. After the camera with tracking flag 0 (i.e., the first camera) has captured the video of the first speaker at a suitable size, its tracking flag is set from 0 to 1, and the tracking flag of the other camera (i.e., the second camera) is set from 1 to 0.

S33. When the current speaker changes from the first speaker to the second speaker, control the second camera to capture a video of the second speaker, where the second speaker is the next speaker at a position different from that of the first speaker.

After the first camera has captured the video of the first speaker at a suitable size, the tracking flag of the first camera becomes 1 and the tracking flag of the second camera becomes 0. Thereafter, if the position of the speaker changes, that is, the first speaker is followed by a second speaker at a position different from that of the first speaker, the controller may control the camera with tracking flag 0 (i.e., the second camera) to capture the video of the second speaker, in the same manner as in S32. When the camera with tracking flag 0 has captured the video of the second speaker at a suitable size, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0.

S34. When a speaker change subsequently occurs, control the first camera and the second camera in turn to alternately capture the video of the current speaker.

After the second camera has captured the video of the second speaker at a suitable size, the tracking flag of the second camera becomes 1 and the tracking flag of the first camera becomes 0. Thereafter, if the speaker changes from the second speaker to the third speaker (i.e., the next speaker after the second speaker), the camera whose tracking flag is 0 (i.e., the first camera) is controlled to shoot the third speaker. After the camera with tracking flag 0 successfully acquires the video of the third speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0. Similarly, when the speaker changes from the third speaker to the fourth speaker (i.e., the next speaker after the third speaker), the camera whose tracking flag is 0 (i.e., the second camera) is controlled to shoot the fourth speaker. After the camera with tracking flag 0 successfully acquires the video of the fourth speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0. In this way, each time the speaker changes, the camera with tracking flag 0 (which may be the first camera or the second camera) is controlled to track the speaker after the change, and after that camera successfully acquires the video of the speaker, its tracking flag is set from 0 to 1 and the tracking flag of the other camera is set from 1 to 0.

S35. After the camera that captures the current speaker successfully acquires the video of the current speaker, the video of the current speaker is output in full screen.

After the camera whose tracking flag is 0 successfully acquires the video of the current speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0. Therefore, the video captured by the camera whose tracking flag is 1 after the change is the video of the current speaker. Here, outputting the video of the current speaker in full screen means that the output video comes from a single camera. The full-screen display may show only one speaker or several speakers. In the latter case, the speakers are relatively close to one another, so that the body language or facial information of each speaker can be observed in the captured video. Referring to step S12, if several speakers are too far apart for each of them to be observed in the video captured by the same camera, the position of the speaker may be considered to have changed, and the video of the speaker may be taken by the other camera. After the current speaker's video is transmitted to the remote site in full screen, the participants at the remote site can clearly observe a close-up picture of the current speaker. This close-up picture may contain important meeting information, so that important meeting information is retained as far as possible.

As shown in FIG. 4, in the three pictures from left to right, the first picture shows the display showing the panoramic view of the local site in full screen at the beginning of the meeting; the second picture shows the display showing the first speaker's video in full screen after the first speaker appears; the third picture shows the display showing the second speaker in full screen after the speaker has changed from the first speaker to the second speaker.

S36. Before the camera that captures the current speaker successfully acquires the video of the current speaker, output the video of the previous speaker of the current speaker.

It should be noted that step S36 is performed before step S35.

From the time the speaker changes until the camera successfully acquires the current speaker's video, the camera rotates/pushes and pulls, which produces a blurred or unstable picture. By outputting the video of the previous speaker of the current speaker during this period, output of the blurred or unstable picture is avoided.

For ease of understanding, the following description is made with reference to FIGS. 5A and 5B. As shown in FIG. 5A, in order from left to right, the three pictures are referred to as the first picture, the second picture, and the third picture. The speaker in the third picture is the next speaker after the speaker in the first picture. From the time the speaker changes until the camera successfully acquires the video of the third-picture speaker at a suitable size, if the picture taken while the camera is rotating/pushing and pulling were output directly, a blurred or unstable picture would appear, as in the second picture. Accordingly, during this period the specific embodiment of the present invention outputs the video of the first-picture speaker, and outputs the video of the third-picture speaker only after that video has been successfully acquired at a suitable size, so that the blurred or unstable picture is avoided (refer to FIG. 5B).
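The relationship between steps S35 and S36 can be illustrated with a short event-driven sketch. The function name `full_screen_output` and the event tuples are hypothetical; the sketch only assumes that the panorama of step S31 is on screen before the first speaker is acquired.

```python
def full_screen_output(events):
    """Return what is shown in full screen after each event.

    events: sequence of ("change", speaker) or ("acquired",) tuples.
    The picture on screen is replaced only when acquisition succeeds, so the
    previous speaker (or the initial panorama) stays on screen in between.
    """
    shown = "panorama"   # step S31: panorama at the beginning of the meeting
    pending = None       # speaker the moving camera is still trying to acquire
    timeline = []
    for event in events:
        if event[0] == "change":
            pending = event[1]
        elif event[0] == "acquired" and pending is not None:
            shown, pending = pending, None
        timeline.append(shown)
    return timeline


print(full_screen_output(
    [("change", "A"), ("acquired",), ("change", "B"), ("acquired",)]
))
```

In the printed timeline the output stays on A while the other camera is still moving toward B, which corresponds to FIG. 5B rather than FIG. 5A.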

In addition, according to the situation of the local site, the following situations may occur in the implementation process of the specific embodiment, and the corresponding processing manner is as follows:

(1) No one speaks at the local venue

The output screen is not switched, and the panoramic view of the local site is still output;

(2) A single person speaks at the local venue, and no one interjects

The output screen is the full-screen display of the current speaker;

(3) A single person speaks at the local venue and someone interjects, but only very briefly

The output screen is not switched, and the full-screen display of the main speaker is still output;

(4) A single person speaks at the local venue and moves

If the speaker moves but the head or body offset does not go beyond the current output screen and stays within the set center area of the screen, the camera is not switched and does not track, and the output screen is the full-screen display with the current speaker in the center area. If the speaker's movement keeps the speaker within the current output screen but may take, or has taken, the speaker beyond the set center area, the camera is not switched, but may track appropriately to keep the speaker in the center area. If the speaker's movement has taken the speaker beyond the current output screen, the camera is switched to track the speaker;

(5) The speaker at the local venue changes once, to an adjacent person or to another person

If the position of the speaker after the change does not go beyond the output screen before the change and lies within the set center area of the screen, the camera is not switched and does not track, and the output screen is the full-screen display with the changed speaker in the center area. If the position of the speaker after the change is within the output screen before the change but may be, or is, beyond the set center area, the camera is not switched, but may track appropriately to keep the changed speaker in the center area, and the output screen is the full-screen display with the changed speaker in the center area. If the position of the speaker after the change is beyond the output screen before the change, the camera is switched and the changed speaker is tracked;

(6) Several people at the local venue speak at the same time, that is, they compete to speak

In this case, the competing speech usually lasts only a very short time, and the output screen is not switched;

(7) Several people at the local venue hold a discussion and speak alternately, that is, the speaker position changes multiple times

The cameras alternately track the speaker after each position change, and the output screen is the full-screen display of the speaker after the change.
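Cases (3) and (6) above amount to ignoring speech that lasts only a very short time. One possible way to sketch this is a simple hold-off on the sound source localization results. The function name, the sample format, and the 2-second threshold are all assumptions made for illustration; the text only says the interjection is "very short" and gives no value.

```python
def stable_speaker_changes(samples, min_duration=2.0):
    """Confirm a speaker change only after it has persisted for min_duration.

    samples: list of (timestamp_seconds, position_id) pairs, as might be
    produced by a sound source locator (hypothetical format).
    Returns the (timestamp, position_id) pairs at which a change is confirmed.
    """
    confirmed = []
    current = None           # position of the confirmed current speaker
    candidate = None         # position that may become the current speaker
    candidate_since = None   # time at which the candidate first appeared
    for timestamp, position in samples:
        if position == current:
            candidate = None  # main speaker resumed: drop the interjection
            continue
        if position != candidate:
            candidate, candidate_since = position, timestamp
        if timestamp - candidate_since >= min_duration:
            current = position
            confirmed.append((timestamp, position))
            candidate = None
    return confirmed


samples = [(0, "A"), (1, "A"), (2, "A"), (3, "B"), (3.5, "A"),
           (5, "A"), (6, "C"), (7, "C"), (8, "C")]
print(stable_speaker_changes(samples))
```

In this run the brief interjection from position B at t = 3 never triggers a switch, while the sustained speech from position C does.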

In this embodiment, each time the position of the speaker changes, the camera with tracking flag 0 is controlled to track and shoot the speaker at the changed position, and after that camera successfully acquires a suitable video of the speaker, its tracking flag is set from 0 to 1 and the tracking flag of the other camera is set from 1 to 0. This ensures that at any moment one camera is shooting the current speaker while the other camera is available to shoot the next speaker after the current speaker. In other words, the two cameras cooperate and hand over seamlessly. When the position of the speaker changes, it takes a certain amount of time for the camera to successfully acquire the video of the changed speaker. During this period, the video of the previous speaker of the current speaker continues to be output, and the video of the current speaker is output only after the camera has successfully acquired it. Compared with the prior art, which has to switch to the panoramic view of the site and then switch to the changed speaker once the camera has successfully acquired the changed speaker's video, the present invention reduces the number of video switches, so that the pictures follow one another closely and the output video is smoother. Moreover, when several people in the venue speak alternately in quick succession, the picture taken by the prior-art camera used to shoot the speaker will include several speakers, and if those speakers are far apart, their expressions cannot be observed in the captured picture. In the present invention, owing to the cooperation of the first camera and the second camera, even if speakers in the venue alternate quickly, the two cameras can alternately shoot the face of the speaker.
In addition, by outputting the video of the current speaker in full screen, the participants at the remote site can observe the facial features of the current speaker more clearly. These facial features may contain important meeting information, so that more valuable meeting information is preserved.

FIG. 6 is a flow chart of another specific embodiment of a method of controlling video capture in accordance with the present invention.

As shown in FIG. 6 , the method for controlling video shooting provided by the specific embodiment of the present invention includes:

S61. At the beginning of the meeting, control two cameras to take a panoramic view of the local venue.

After the two cameras are turned on, that is, at the beginning of the meeting, no one in the local venue has spoken yet. In order to transmit the layout of the local site to the remote site, the two cameras can be controlled to capture the panorama of the local site. The shooting angle and size can be set by the user; a preferred setting is one that includes all participants and the main conference scene. In addition, when outputting the panoramic video of the local site, it is preferable to output the video captured by the camera with tracking flag 1.

S62. Control the first camera to capture the video of the first speaker in combination with the sound source localization technology and the preset position.

After the two cameras have been controlled to shoot the panoramic view of the conference site, when someone in the conference starts speaking, that is, when the first speaker appears, the position information of the first speaker is obtained by using sound source localization technology. This is then combined with the preset positions, that is, the exact position of the first speaker is determined using the preset positions at which speakers speak in the local venue. Specifically, among the plurality of preset positions, the one closest to the position obtained by sound source localization is taken as the exact position. The controller then controls the camera with tracking flag 0 to capture the video of the first speaker according to the exact position of the first speaker. After the camera with tracking flag 0 has captured a suitable video of the first speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0.
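Selecting the preset position closest to the sound-source estimate can be sketched as below. The function name `nearest_preset`, the use of two-dimensional coordinates, and the Euclidean distance are assumptions for illustration; the text does not specify how positions are represented or how closeness is measured.

```python
import math


def nearest_preset(located, presets):
    """Return (name, position) of the preset closest to the located position.

    located: (x, y) estimate from sound source localization.
    presets: dict mapping a preset name to its (x, y) position.
    """
    name = min(presets, key=lambda n: math.dist(located, presets[n]))
    return name, presets[name]


presets = {"seat1": (0.0, 0.0), "seat2": (1.0, 0.0), "seat3": (2.0, 0.0)}
print(nearest_preset((1.2, 0.1), presets))
```

The camera with tracking flag 0 would then be driven to the returned preset rather than to the raw, noisier estimate.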

S63. When the current speaker changes from the first speaker to the second speaker, control the second camera to capture a video of the second speaker, where the second speaker is the next speaker at a position different from that of the first speaker.

After the first camera successfully captures the video of the first speaker, the tracking flag of the first camera becomes 1 and the tracking flag of the second camera becomes 0. At this time, if the speaker changes, that is, the first speaker is followed by a second speaker at a position different from that of the first speaker, the controller may, as in step S62, control the camera with tracking flag 0 (i.e., the second camera) to capture the video of the second speaker. When the camera with tracking flag 0 successfully captures the video of the second speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0.

S64. When a speaker change subsequently occurs, control the first camera and the second camera in turn to alternately capture the video of the current speaker.

After the second camera successfully captures the video of the second speaker, the tracking flag of the second camera becomes 1 and the tracking flag of the first camera becomes 0. If the speaker changes from the second speaker to the third speaker, the camera with tracking flag 0 (i.e., the first camera) is controlled to shoot the third speaker. After the camera with tracking flag 0 successfully acquires a suitable video of the third speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera (i.e., the second camera) is set from 1 to 0. Similarly, when the speaker changes from the third speaker to the fourth speaker (i.e., the next speaker after the third speaker), the camera with tracking flag 0 (i.e., the second camera) is controlled to shoot the fourth speaker. After the camera with tracking flag 0 successfully acquires a suitable video of the fourth speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera (i.e., the first camera) is set from 1 to 0. When the speaker changes again, alternate shooting proceeds in the same manner.

S65. After the camera that captures the current speaker successfully acquires the video of the current speaker, simultaneously output the videos of the current speaker and of the previous speaker of the current speaker in the form of picture-in-picture. The picture-in-picture includes a first picture and a second picture that is contained in the first picture and is smaller than the first picture; the current speaker is output in the first picture, and the previous speaker of the current speaker is output in the second picture.

After the camera whose tracking flag is 0 successfully acquires the video of the current speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0. At this time, the camera with tracking flag 1 captures the video of the current speaker, and the camera with tracking flag 0 captures the video of the previous speaker of the current speaker. Here, simultaneously outputting the videos of the current speaker and of the previous speaker in the form of picture-in-picture means outputting the current speaker in the first picture and outputting the previous speaker of the current speaker in the second picture, which is contained in the first picture and is smaller than it. In this way, in addition to observing the facial expression of the current speaker, the participants at the remote site can also observe how one party reacts to the other party's speech. These expressions may contain important meeting information, so that important meeting information is retained as far as possible.

As shown in Fig. 7, among the three figures from left to right, the first figure shows that at the beginning of the meeting the panorama of the local site is output in picture-in-picture form; the second figure shows that after the first speaker appears, the first speaker is output in the large picture (i.e., the first picture) and the panorama of the local site is output in the lower right corner (i.e., the second picture); the third figure shows that after the speaker changes from the first speaker to the second speaker, the second speaker is output in the large picture and the first speaker is output in the lower right corner.

S66. Before the camera that shoots the current speaker successfully acquires the video of the current speaker, output the two speakers preceding the current speaker in the first picture and the second picture, respectively.

It should be noted that step S66 is performed before step S65.

During the period from the change of speaker until the camera successfully acquires the video of the current speaker, the camera rotates and zooms, which produces a blurred or unstable picture. For this reason, the two speakers preceding the current speaker may be output in the first picture and the second picture, respectively, so that the blurred or unstable picture is not output.

For ease of understanding, the following description refers to Figs. 8A and 8B. As shown in Fig. 8A, the three figures from left to right are referred to as the first figure, the second figure, and the third figure, respectively. The speaker in the lower right corner of the first figure (i.e., the second picture) is the previous speaker of the speaker in the large picture of the first figure (i.e., the first picture), and the speaker in the large picture of the first figure is the previous speaker of the speaker in the large picture of the third figure. Now the speaker changes from the speaker in the large picture of the first figure to the speaker in the large picture of the third figure. During the period from this change until the camera successfully acquires the video of the speaker in the large picture of the third figure, if the picture taken while the camera rotates and zooms is output directly, a blurred or unstable picture appears in the lower right corner of the second figure. Correspondingly, as shown in Fig. 8B, during this period the specific embodiment of the present invention outputs the live picture of the speaker in the large picture of the first figure (the large picture of the second figure) and a solidified (frozen) picture of that speaker's previous speaker (the lower right picture of the second figure), which avoids outputting the blurred or unstable picture.

Of course, according to actual needs, the output mode shown in the second figure of Fig. 8A can also be used in the process from the change of the speaker to the successful acquisition of the video of the current speaker by the camera.
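As a non-limiting sketch, the choice of what to output in the first and second pictures before and after acquisition (steps S66 and S65) might be expressed as follows. The function name `pip_output`, the list `history` of sources in speaking order, and the "live"/"frozen" labels are assumptions made for illustration; treating the site panorama as the entry that precedes the first speaker is likewise an assumption.

```python
from typing import List, Tuple

Screen = Tuple[str, str]  # (source name, "live" or "frozen")


def pip_output(history: List[str], acquired: bool) -> Tuple[Screen, Screen]:
    """Return (first_picture, second_picture) for picture-in-picture output.

    history lists the sources in speaking order; the last entry is the
    current speaker. The panorama is assumed to be the first entry.
    """
    if len(history) < 2:
        raise ValueError("need at least the panorama and one speaker")
    if acquired:
        # S65: current speaker in the large picture, previous speaker in the small one.
        return (history[-1], "live"), (history[-2], "live")
    # S66: the camera heading to the new speaker is still moving, so output the
    # two preceding sources; the older one is shown as a frozen picture because
    # its camera is the one that has been redirected.
    first = history[-2]
    second = history[-3] if len(history) >= 3 else history[-2]
    return (first, "live"), (second, "frozen")


history = ["panorama", "speaker A", "speaker B", "speaker C"]
print(pip_output(history, acquired=False))
print(pip_output(history, acquired=True))
```

With this hypothetical history, the transition output keeps speaker B live with a frozen picture of speaker A, and only after acquisition does speaker C take the large picture.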

(1), no one speaks at the local venue

The combination of the output pictures remains unchanged, and the panoramic view of the local site is still output;

(2), a single person speaks at the local venue, and no one interjects

The current speaker is output in the first picture, and the previous speaker of the current speaker is output in the second picture; the picture combination mode is unchanged;

The first screen outputs the presenter, the second screen does not switch or outputs the interjector, and preferably the second screen does not switch;

(4), a single person speaks at the local venue, there is a movement

If the speaker moves but the offset of the head or body does not exceed the currently output first picture and the speaker remains in the set center area of the first picture, the camera does not switch and does not track; the first picture outputs the current speaker in motion, the second picture does not change, and the output picture combination mode does not change. If the speaker's movement leaves the speaker within the currently output first picture but the speaker is about to exceed or has exceeded the set center area of the first picture, the camera does not switch but may track appropriately to keep the speaker in the set center area of the first picture; the second picture is unchanged, and the output picture combination mode is unchanged. If the speaker's movement takes the speaker beyond the currently output first picture, the camera is switched and the speaker is tracked; after the tracking succeeds, the speaker is output in the first picture, and the first picture from before the camera switch is moved to the second picture for output;

If the changed speaker position does not exceed the first picture from before the change and is located in the set center area of the first picture, the camera does not switch and does not track; the first picture outputs the picture with the changed speaker in the center area, and the second picture does not change. If the changed speaker position has not exceeded the first picture from before the change but is about to exceed or has exceeded the set center area of the first picture, the camera does not switch but may track appropriately so that the changed speaker is located in the center area of the first picture; the second picture remains unchanged. If the changed speaker position has exceeded the first picture from before the change, the camera is switched and the changed speaker is tracked; the first picture outputs the changed speaker, and the second picture outputs the speaker before the change;

In this case, the time taken by the interruption is usually very short, and the combination of the output pictures is unchanged;

(7), many people at the local venue discuss and speak alternately, that is, the speaker position changes multiple times

The cameras alternately track the speaker after each position change, and the combination of the output pictures changes: after each change, the current speaker is output in the first picture, and the previous speaker of the current speaker is output in the second picture.
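The three-way decision used in the movement and position-change scenarios above can be sketched as follows. This is a simplified illustration: the names `Rect`, `Action`, and `decide` are hypothetical, positions are assumed to be pixel coordinates in the currently output first picture, and the "about to exceed the center area" case is not modeled separately from "has exceeded".

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    HOLD = "no switch, no tracking"
    RECENTER = "no switch, same camera tracks to re-center"
    SWITCH = "switch to the idle camera and track"


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def decide(frame: Rect, center: Rect, x: float, y: float) -> Action:
    """Choose the camera action for a speaker located at (x, y)."""
    if not frame.contains(x, y):
        return Action.SWITCH    # speaker is outside the current first picture
    if center.contains(x, y):
        return Action.HOLD      # speaker is still inside the set center area
    return Action.RECENTER      # inside the picture but outside the center area


frame = Rect(0, 0, 1920, 1080)        # assumed size of the first picture
center = Rect(480, 270, 1440, 810)    # assumed set center area
print(decide(frame, center, 960, 540))
print(decide(frame, center, 100, 540))
print(decide(frame, center, 2500, 540))
```

The size of the set center area is a tuning choice in this sketch: a larger area means fewer re-centering movements, at the cost of a less well-framed speaker.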

In this embodiment, each time the position of the speaker changes, the camera with the tracking flag of 0 is controlled to track and shoot the speaker at the new position. After that camera successfully acquires a video of the speaker at a suitable size, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0. This ensures that at any moment one camera is shooting the current speaker while the other camera is idle and can be used to shoot the next speaker after the current speaker. In other words, the two cameras cooperate and hand over seamlessly. When the position of the speaker changes, it takes a certain amount of time for the camera to successfully acquire the video of the changed speaker. During this period, the video of the previous speaker of the current speaker continues to be output, and the video of the current speaker is output only after the camera successfully acquires it. The prior art, by contrast, must first switch to the panoramic view of the site and then switch to the changed speaker once the camera has successfully acquired that speaker's video. The present invention can therefore reduce the number of video switches, so that the pictures are tightly connected and the output video is smoother. Moreover, when multiple people in the venue speak in quick alternation, the picture shot by the prior-art camera includes multiple speakers; if those speakers are far apart, their expressions cannot be observed clearly in the captured picture. In the present invention, owing to the cooperation of the first camera and the second camera, even if speakers in the venue alternate quickly, the two cameras can alternately shoot the face of each speaker. In addition, the videos of the current speaker and of the previous speaker of the current speaker are output simultaneously in picture-in-picture form, so that participants at the remote site can clearly observe the facial close-up of the current speaker and, at the same time, see the change of speakers at the local venue and one party's reaction to the other. More valuable meeting information is thereby retained.

FIG. 9 is a flow chart of still another embodiment of the method for controlling video shooting of the present invention.

As shown in FIG. 9, taking the case where the camera device is a camera as an example, the method for controlling video shooting provided by the specific embodiment of the present invention includes:

S91. At the beginning of the meeting, control the two cameras to capture the panoramic view of the venue.

After the two cameras are turned on, that is, at the beginning of the conference, no one has yet spoken at the local conference site. To convey the layout of the local conference site to the remote conference site, the two cameras can be controlled to capture the panoramic view of the local conference site. The angle and size can be set by the user; a preferred setting is one that includes all the participants and the main conference scene. When the video of the panoramic view of the local conference site is output, it is preferable to output the video captured by the camera with the tracking flag of 1.

S92. Control the first camera to capture the video of the first speaker by using sound source localization technology and image recognition technology.

After the two cameras have been controlled to shoot the panoramic view of the conference site, when one of the participants starts speaking, that is, when the first speaker appears, the position of the first speaker is obtained by the sound source localization technique, and the camera with the tracking flag of 0 is controlled to turn to the corresponding angle. The image recognition technique is then used to further determine the exact location of the first speaker. The controller then controls the camera with the tracking flag of 0 to capture the video of the first speaker according to that exact position. After the camera with the tracking flag of 0 captures a suitable video of the first speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0.
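The coarse-to-fine positioning in this step can be illustrated with a small sketch. It assumes, hypothetically, that sound source localization yields a coarse pan angle in degrees, that image recognition yields the horizontal pixel coordinate of the detected face in the frame, and that the camera follows a simple pinhole model with a known horizontal field of view. The function name `refine_pan` and all of its parameters are illustrative and are not taken from this application.

```python
import math


def refine_pan(coarse_pan_deg: float, face_center_x: float,
               frame_width: float, hfov_deg: float) -> float:
    """Refine a coarse pan angle using the horizontal position of a detected face.

    Assumptions (hypothetical): positive pan turns the camera to the right,
    and the lens behaves like an ideal pinhole camera.
    """
    half_width = frame_width / 2.0
    # Focal length in pixels derived from the horizontal field of view.
    focal_px = half_width / math.tan(math.radians(hfov_deg) / 2.0)
    # Angular offset of the face from the optical axis.
    offset_deg = math.degrees(math.atan((face_center_x - half_width) / focal_px))
    return coarse_pan_deg + offset_deg


# Face detected exactly at the frame center: the coarse angle is kept.
print(refine_pan(30.0, 960, 1920, 60.0))
# Face detected at the right edge: the correction equals half the field of view.
print(refine_pan(30.0, 1920, 1920, 60.0))
```

A real controller would repeat this correction until the face sits in the set center area; the sketch shows a single correction step only.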

S93. When the current speaker changes from the first speaker to the second speaker, control the second camera to capture a video of the second speaker, where the second speaker is the next speaker at a position different from that of the first speaker.

After the first camera successfully captures the video of the first speaker, the tracking flag of the first camera becomes 1, and the tracking flag of the second camera becomes 0. At this time, if the speaker changes, that is, the first speaker is followed by a second speaker at a position different from that of the first speaker, the controller may, as in step S92, control the camera with the tracking flag of 0 (i.e., the second camera) to take a video of the second speaker. After the camera with the tracking flag of 0 captures a suitable video of the second speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0.

S94. When the speaker subsequently changes again, control the first camera and the second camera in turn to alternately capture the video of the current speaker.

After the second camera successfully captures the video of the second speaker, the tracking flag of the second camera becomes 1, and the tracking flag of the first camera becomes 0. If the speaker changes from the second speaker to the third speaker, the camera with the tracking flag of 0 (i.e., the first camera) is controlled to shoot the third speaker. After that camera successfully acquires a suitable video of the third speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera (i.e., the second camera) is set from 1 to 0. Similarly, when the speaker changes from the third speaker to the fourth speaker (i.e., the next speaker after the third speaker), the camera with the tracking flag of 0 (i.e., the second camera) is controlled to shoot the fourth speaker. After that camera successfully acquires a suitable video of the fourth speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera (i.e., the first camera) is set from 1 to 0. When the speaker changes again, alternate shooting proceeds in the same manner.

S95. After the camera that shoots the current speaker successfully acquires the video of the current speaker, simultaneously output the videos of the current speaker and of the previous speaker of the current speaker in dual-picture form. The dual picture includes two partial pictures, neither of which contains the other; one partial picture outputs the current speaker, and the other partial picture outputs the previous speaker of the current speaker.

After the camera with the tracking flag of 0 successfully acquires the video of the current speaker, its tracking flag is set from 0 to 1. At this time, the camera with the tracking flag of 1 captures the video of the current speaker, and the camera with the tracking flag of 0 captures the video of the previous speaker of the current speaker. Here, simultaneously outputting the videos of the current speaker and of the previous speaker in dual-picture form means outputting the current speaker in one picture and the previous speaker of the current speaker in another picture, where neither picture contains the other. In this way, in addition to observing the facial expression of the current speaker, participants at the remote site can also observe one party's reaction to the other party's speech. These expressions may contain important meeting information, so important meeting information can be retained.

As shown in Figure 10, among the three figures from left to right, the first figure shows that at the beginning of the meeting the panoramic view of the local site is output in dual-picture form; the second figure shows that after the first speaker appears, the first speaker is output in the left picture and the picture of the local site is output in the right picture; the third figure shows that after the speaker changes from the first speaker to the second speaker, the second speaker is output in the right picture and the first speaker is output in the left picture.
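The slot assignment implied by Figure 10 can be sketched as follows: a new speaker replaces whichever partial picture does not hold the speaker who has just finished, so the previous speaker stays where the viewer last saw them. The class name `DualScreen`, the slot labels, and the choice of the left picture for the first speaker are assumptions made for illustration, the last one following the figure description.

```python
class DualScreen:
    """Hypothetical sketch of dual-picture slot assignment."""

    def __init__(self):
        # At the start of the meeting both partial pictures show the panorama.
        self.slots = {"left": "panorama", "right": "panorama"}
        self.current_slot = None  # slot holding the current speaker

    def show_new_speaker(self, speaker):
        # Called once the new speaker's video has been successfully acquired.
        if self.current_slot is None:
            target = "left"  # assumed from Figure 10: the first speaker appears on the left
        else:
            target = "right" if self.current_slot == "left" else "left"
        self.slots[target] = speaker
        self.current_slot = target
        return dict(self.slots)


screen = DualScreen()
print(screen.show_new_speaker("first speaker"))
print(screen.show_new_speaker("second speaker"))
print(screen.show_new_speaker("third speaker"))
```

Keeping the previous speaker in place means that only one of the two partial pictures changes at each speaker change.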

S96. Before the camera that shoots the current speaker successfully acquires the video of the current speaker, output the two speakers preceding the current speaker in the two partial pictures of the dual picture, respectively.

It should be noted that step S96 is performed before step S95.

During the period from the change of speaker until the camera successfully acquires the video of the current speaker, the camera rotates and zooms, which produces a blurred or unstable picture. For this reason, the two speakers preceding the current speaker are output in the two partial pictures of the dual picture, respectively, so that the blurred or unstable picture is not output.

The following description refers to Figs. 11A and 11B. As shown in Fig. 11A, the three figures from left to right are referred to as the first figure, the second figure, and the third figure, respectively. The speaker on the right side of the first figure is the previous speaker of the speaker on the left side of the first figure, and the speaker on the left side of the first figure is the previous speaker of the speaker on the right side of the third figure. Now the speaker changes from the speaker on the left side of the first figure to the speaker on the right side of the third figure. During the period from this change until the camera successfully acquires a suitable video of the speaker on the right side of the third figure, if the picture taken while the camera rotates and zooms is output directly, a blurred or unstable picture appears on one side of the second figure. Correspondingly, as shown in FIG. 11B, during this period the specific embodiment of the present invention outputs the live picture of the first-figure speaker (the picture on the right side of the second figure) and a solidified (frozen) picture of that speaker's previous speaker (the picture on the left side of the second figure), which avoids outputting the blurred or unstable picture.

Of course, according to actual needs, the output mode shown in the second figure of Fig. 11A can also be used during the period from the change of speaker until the camera successfully acquires the video of the current speaker.

(1), no one speaks at the local venue

One part of the picture outputs the current speaker, and the other part of the picture outputs the previous speaker of the current speaker, and the picture combination mode is unchanged;

One part of the picture outputs the presenter, and the other part of the picture does not switch or outputs the interjector, preferably the other part of the picture does not switch;

(4), a single person speaks at the local venue, there is a movement

If the speaker moves but the offset of the head or body does not exceed the currently output picture and the speaker remains in the set center area of that picture, the camera does not switch and does not track, and the output picture combination mode does not change. If the speaker's movement leaves the speaker within the currently output picture but the speaker is about to exceed or has exceeded the set center area of that picture, the camera does not switch but may track appropriately to keep the speaker in the center area, and the output picture combination mode is unchanged. If the speaker's movement takes the speaker beyond the currently output picture, the camera is switched to track the speaker;

If the position of the next speaker does not exceed the output picture of the previous speaker and is located in the set center area of that picture, the camera does not switch and does not track, and the output picture is the picture with the next speaker in the center area. If the position of the next speaker has not exceeded the output picture of the previous speaker but is about to exceed or has exceeded the set center area of that picture, the camera does not switch but may track appropriately to keep the next speaker in the center area, and the output picture is the picture with the next speaker in the center area. If the position of the next speaker has exceeded the output picture of the previous speaker, the camera is switched to track the next speaker;

In this case, the time taken by the interruption is usually very short, and the combination of the output pictures is unchanged;

(7), many people at the local venue discuss and speak alternately, that is, the speaker position changes multiple times

The cameras alternately track the speaker after each position change, and the combination of the output pictures changes: after each change, one partial picture outputs the current speaker, and the other partial picture outputs the previous speaker of the current speaker.

In this embodiment, each time the position of the speaker changes, the camera with the tracking flag of 0 is controlled to track and shoot the speaker at the new position. After that camera successfully acquires a video of the speaker at a suitable size, its tracking flag is set from 0 to 1, and the tracking flag of the other camera is set from 1 to 0. This ensures that at any moment one camera is shooting the current speaker while the other camera can be used to shoot the next speaker after the current speaker. In other words, the two cameras cooperate and hand over seamlessly. When the position of the speaker changes, it takes a certain amount of time for the camera to successfully acquire the video of the changed speaker. During this period, the video of the previous speaker of the current speaker continues to be output, and the video of the current speaker is output only after the camera successfully acquires it. The prior art, by contrast, must first switch to the panoramic view of the site and then switch to the changed speaker once the camera has successfully acquired that speaker's video. The present invention can therefore reduce the number of video switches, so that the pictures are tightly connected and the output video is smoother. Moreover, when multiple people in the venue speak in quick alternation, the picture shot by the prior-art camera includes multiple speakers; if those speakers are far apart, their expressions cannot be observed clearly in the captured picture. In the present invention, owing to the cooperation of the first camera and the second camera, even if speakers in the venue alternate quickly, the two cameras can alternately shoot the face of each speaker. In addition, the videos of the current speaker and of the previous speaker of the current speaker are output in dual-picture form, so that, besides clearly observing the face of the current speaker, participants at the remote site can observe one local party's reaction to the other party's speech (this suits multi-person conversations, especially a dialogue between two people). More valuable meeting information is thereby retained.

Corresponding to a method for controlling video capture provided by an embodiment of the present invention, an embodiment of the present invention further provides an apparatus for controlling video capture. The apparatus for controlling video shooting provided by the embodiment of the present invention may be implemented by a device having a control processing function, which may be, for example, a camera, a video controller, a video terminal, or the like. As shown in FIG. 12, an apparatus for controlling video shooting according to an embodiment of the present invention includes:

The control unit 121 is configured to: control, when the first speaker speaks, the first camera device to capture a video of the first speaker; control, when the current speaker changes from the first speaker to the second speaker, the second camera device to capture a video of the second speaker, wherein the second speaker is the next speaker at a position different from that of the first speaker; and, when the speaker subsequently changes again, control the first camera device and the second camera device in turn to alternately capture a video of the current speaker.

The processing unit 122 is connected to the control unit 121 and is configured to output the video of the current speaker after the video of the current speaker is successfully acquired.

Optionally, in an embodiment, the control unit 121 is further configured to: before the first camera device captures the video of the first speaker, control the first camera device and the second camera device, in an initial state, to capture a video of the entire venue;

The processing unit 122 is further configured to output the captured video.

Optionally, in another embodiment, the control unit 121 is further configured to: separately set a tracking flag for the first camera device and for the second camera device, where the tracking flag of the first camera device is initially a first tracking flag, and the tracking flag of the second camera device is initially a second tracking flag.

The control unit 121 is specifically configured to: when the first speaker speaks, control the first camera device, which has the first tracking flag, to capture the video of the first speaker; and, after the video of the first speaker is successfully acquired, set the tracking flag of the first camera device from the first tracking flag to the second tracking flag and set the tracking flag of the second camera device from the second tracking flag to the first tracking flag.

The control unit 121 is specifically configured to: when the current speaker changes from the first speaker to the second speaker, control the second camera device, which has the first tracking flag, to capture the video of the second speaker; and, after the video of the second speaker is successfully acquired, set the tracking flag of the second camera device from the first tracking flag to the second tracking flag and, at the same time, set the tracking flag of the first camera device from the second tracking flag to the first tracking flag.

The control unit 121 is specifically configured to: at each subsequent speaker change, control the camera device having the first tracking flag to capture the video of the current speaker, and, after the video of the current speaker is successfully acquired, interchange the tracking flags of the first camera device and the second camera device.

Optionally, the control unit 121 is specifically configured to: determine whether the position of the second speaker is in the output picture of the first speaker; and, if the position of the second speaker is not in the output picture of the first speaker, control the second camera device to capture a video of the second speaker;

If the position of the second speaker is in the output picture of the first speaker, further determine whether the position of the second speaker is within a set area of the output picture of the first speaker; if the position of the second speaker is within the set area, control the first camera device to capture a video of the second speaker; if the position of the second speaker is not within the set area, control the first camera device to track the second speaker so that the second speaker is positioned in the set area.

Optionally, the control unit 121 may be specifically configured to: control the camera device to capture a video of the speaker by using a sound source localization technique.

Further, the control unit 121 may be specifically configured to: control the camera device to capture a video of the speaker by using a sound source localization technique in combination with a preset position or an image recognition technology.

It should be noted that the first imaging device and the second imaging device may be connected and fixed together by the connecting device, or may be independent of each other.

In this embodiment, when someone starts speaking, the control unit 121 controls one of the camera devices to capture the video of the current speaker, and the processing unit 122 outputs the video after it is successfully acquired. At this time, the other camera device is in a standby state, ready to track the next speaker. When the speaker subsequently changes, the control unit 121 can immediately control the camera device in the standby state to capture the video of the next speaker. Because it takes time from the change of the speaker's position until a suitable video of the changed speaker is obtained, the picture output to the remote site during this period does not need to be switched to the panorama of the site first; instead, the video of the speaker before the change continues to be output. The number of video switches is thereby reduced, so that the pictures are closely connected and the output video is smoother. Moreover, since the control unit 121 controls the two camera devices to alternately capture the video of the current speaker, even if speakers in the venue alternate quickly, the two camera devices can alternately capture the face of each speaker, and more valuable meeting information is retained.

Optionally, in another embodiment of the present invention, the processing unit 122 may output the video of the current speaker in full screen. The processing unit 122 is specifically configured to: after the video of the current speaker is successfully acquired, set the video of the current speaker to full-screen display and, after completing the setting, output the video of the current speaker in full screen; and, before the video of the current speaker is successfully acquired, output the video of the previous speaker of the current speaker in full screen.

By outputting the video of the current speaker in full screen, participants at the remote site can observe the facial features of the current speaker more clearly. These facial features may contain important meeting information, so valuable meeting information is further retained.

Optionally, in still another embodiment of the present invention, the processing unit 122 may simultaneously output the videos of the current speaker and of the previous speaker of the current speaker in picture-in-picture form.

The processing unit 122 is specifically configured to: after the video of the current speaker is successfully acquired, set the video of the current speaker and the video of the previous speaker of the current speaker to be displayed in picture-in-picture form, where the picture-in-picture includes a first picture and a second picture that is contained in the first picture and is smaller than the first picture, the current speaker is displayed in the first picture, and the previous speaker of the current speaker is displayed in the second picture; and, after the setting is completed, simultaneously output the videos of the current speaker and of the previous speaker of the current speaker in picture-in-picture form.

The control unit 121 is further configured to: when the current speaker changes from the second speaker to the third speaker, control the first camera device to capture a video of the third speaker, wherein the third speaker is the next speaker at a position different from that of the second speaker.

The processing unit 122 is specifically configured to: before the video of the third speaker is successfully acquired, output the second speaker in the first picture and a solidified (frozen) picture of the first speaker in the second picture, or output the second speaker in the first picture and, in the second picture, the third speaker whose shooting has started but whose video has not yet been successfully acquired; and, after the video of the third speaker is successfully acquired, output the third speaker in the first picture and the second speaker in the second picture.

Simultaneously outputting the videos of the current speaker and of the previous speaker of the current speaker in picture-in-picture form allows participants at the remote site to clearly observe the facial close-up of the current speaker and, at the same time, to see the change of speakers at the local venue and one party's reaction to the other. More valuable meeting information is thereby retained.

Optionally, in still another embodiment of the present invention, the processing unit 122 may simultaneously output the videos of the current speaker and of the previous speaker of the current speaker in dual-picture form. The processing unit 122 is specifically configured to: after the video of the current speaker is successfully acquired, set the video of the current speaker and the video of the previous speaker of the current speaker to be displayed in dual-picture form, where the dual picture includes two partial pictures, neither of which contains the other, one partial picture displays the current speaker, and the other partial picture displays the previous speaker of the current speaker; and, after the setting is completed, simultaneously output the videos of the current speaker and of the previous speaker of the current speaker in dual-picture form.

The processing unit 122 is specifically configured to: before the video of the third speaker is successfully acquired, output a solidified (frozen) picture of the first speaker in the one partial picture and the second speaker in the other partial picture, or output, in the one partial picture, the third speaker whose shooting has started but whose video has not yet been successfully acquired and the second speaker in the other partial picture; and, after the video of the third speaker is successfully acquired, output the third speaker in the one partial picture and the second speaker in the other partial picture.

Outputting the video of the current speaker and the previous speaker of the current speaker in dual-picture form lets participants at the remote site observe, in addition to the facial close-up of the current speaker, the reaction of one party at the local site to the other party's speech (this suits multi-person conversations, especially two people talking), so that more valuable conference information is retained.

It should be noted that, in the foregoing apparatus for controlling video shooting, the units are divided only according to functional logic; the division is not limited to the above, as long as the corresponding functions can be implemented. The specific names of the units are likewise only for distinguishing them from each other and are not intended to limit the scope of the present invention.

Other embodiments of the apparatus for controlling video shooting of the present invention are described below with reference to FIG. 13A through FIG. 13G. As shown in FIG. 13A, the device 13 for controlling video shooting provided by the embodiment of the present invention includes: a controller 131, configured to control the first camera module 132 to capture the video of the first speaker when the first speaker speaks; further configured to control the second camera module 133 to capture the video of the second speaker when the current speaker changes from the first speaker to the second speaker, where the second speaker is the next speaker whose position differs from that of the first speaker; and further configured to, when a speaker change subsequently occurs, control the first camera module 132 and the second camera module 133 in sequence to alternately capture the video of the current speaker.

The output processor 134 is coupled to the first camera module 132 and the second camera module 133 for outputting the video of the current speaker after successfully acquiring the video of the current speaker.

The output processor 134 may be integrated in the first camera module 132 or the second camera module 133, or may be separated from the first camera module 132 and the second camera module 133.

Optionally, the controller 131 is further configured to: in the initial state, before controlling the first camera module 132 to capture the video of the first speaker, control the first camera module 132 and the second camera module 133 to capture video of the entire conference site.

The output processor 134 is further configured to output the video of the entire conference site that is captured.

The first camera module 132 and the second camera module 133 may be independent of each other, or may be connected and fixed together by a connecting device to form a dual camera module. The first camera module 132 and the second camera module 133 may be integrated on the device 13 that controls video capture, or may be separate from the device 13 that controls video capture.

Optionally, in an embodiment, the controller 131 is further configured to set a tracking flag for each of the first camera module 132 and the second camera module 133, where the tracking flag of the first camera module 132 is initially a first tracking flag, and the tracking flag of the second camera module 133 is initially a second tracking flag.

The controller 131 is specifically configured to: when the first speaker speaks, control the first camera module 132 having the first tracking flag to capture the video of the first speaker and, after the video of the first speaker is successfully acquired, set the tracking flag of the first camera module 132 from the first tracking flag to the second tracking flag and simultaneously set the tracking flag of the second camera module 133 from the second tracking flag to the first tracking flag.

The controller 131 is specifically configured to: when the current speaker changes from the first speaker to the second speaker, control the second camera module 133 having the first tracking flag to capture the video of the second speaker and, after the video of the second speaker is successfully acquired, set the tracking flag of the second camera module 133 from the first tracking flag to the second tracking flag and simultaneously set the tracking flag of the first camera module 132 from the second tracking flag to the first tracking flag.

The controller 131 is specifically configured to: each time a speaker change subsequently occurs, control the camera module having the first tracking flag to capture the video of the current speaker and, after the video of the current speaker is successfully acquired, interchange the tracking flags of the first camera module 132 and the second camera module 133.
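The flag interchange described above can be illustrated with a minimal sketch. It is not the patented implementation: the class names, the string flag values, and the assumption that every capture succeeds immediately are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical flag values: the module holding FIRST_FLAG shoots the next speaker.
FIRST_FLAG = "first"
SECOND_FLAG = "second"


@dataclass
class CameraModule:
    name: str
    flag: str
    target: Optional[str] = None  # speaker currently framed, if any


class Controller:
    def __init__(self) -> None:
        # Initial state from the description: module 132 has the first flag.
        self.modules = [
            CameraModule("camera_module_132", FIRST_FLAG),
            CameraModule("camera_module_133", SECOND_FLAG),
        ]

    def on_speaker_change(self, speaker: str) -> str:
        # Pick the module that currently holds the first tracking flag.
        tracker = next(m for m in self.modules if m.flag == FIRST_FLAG)
        tracker.target = speaker  # assumption: acquisition succeeds at once
        # After successful acquisition, interchange the two flags.
        for module in self.modules:
            module.flag = SECOND_FLAG if module.flag == FIRST_FLAG else FIRST_FLAG
        return tracker.name


controller = Controller()
order = [controller.on_speaker_change(s) for s in ("speaker_1", "speaker_2", "speaker_3")]
print(order)
```

With three successive speaker changes the modules are selected alternately, starting with the first camera module, which matches the alternation described in the text.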

As shown in FIG. 13B, the apparatus 13 for controlling video shooting provided by the embodiment of the present invention further includes:

The pickup microphone array 135 and the sound source locator 136 are configured to acquire the position of a speaker by using sound source localization, where the sound source locator 136 performs the localization according to the sound picked up by the pickup microphone array 135. The controller 131 controls the camera module to capture the video of the speaker based on the position obtained by the sound source localization.

As shown in FIG. 13B, the apparatus 13 for controlling video shooting provided by the embodiment of the present invention further includes: an image locator 137, configured to locate a speaker by using image recognition techniques such as face detection, skin color detection, or lip motion detection. The controller 131 can control the camera module to capture the video of the speaker according to the position information obtained by the image recognition.

Optionally, the controller 131 controls the camera module to capture the video of the speaker according to the position obtained by sound source localization together with preset position information.
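The text does not say how the located position and the preset position information are combined. One plausible reading, sketched here purely as an assumption, is that preset positions are stored pan angles (for example, one per seat) and the located angle is snapped to the nearest preset when it is close enough. The function name, the use of degrees, and the tolerance value are hypothetical.

```python
from typing import Sequence


def choose_pan_angle(
    located_deg: float,
    presets_deg: Sequence[float],
    tolerance_deg: float = 10.0,
) -> float:
    """Return a preset pan angle near the located angle, else the located angle.

    Assumption: presets are pan angles in degrees; the source does not
    specify the representation of preset position information.
    """
    if not presets_deg:
        return located_deg
    nearest = min(presets_deg, key=lambda p: abs(p - located_deg))
    if abs(nearest - located_deg) <= tolerance_deg:
        return nearest
    return located_deg


seat_presets = [-30.0, 0.0, 30.0]
print(choose_pan_angle(32.0, seat_presets))  # close to the 30-degree preset
print(choose_pan_angle(50.0, seat_presets))  # no preset within tolerance
```

Snapping to a preset trades some accuracy for a stable, known-good framing; the fallback keeps the raw estimate when no preset is nearby.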

Optionally, the image locator 137 is specifically configured to determine whether the position of the second speaker is in the output picture of the first speaker. If the position of the second speaker is not in the output picture of the first speaker, the controller 131 controls the second camera module 133 to capture the video of the second speaker.

If the position of the second speaker is in the output picture of the first speaker, the image locator 137 further determines whether the position of the second speaker is within a set area of the output picture of the first speaker. If it is within the set area, the controller 131 controls the first camera module 132 to capture the video of the second speaker; if it is not within the set area, the controller 131 controls the first camera module 132 to track the second speaker so that the position of the second speaker falls within the set area.
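The three-way decision above can be sketched as follows. This is an illustrative sketch only: representing the output picture and the set area as axis-aligned rectangles in a shared coordinate system, and the action names returned, are assumptions not stated in the source.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def choose_action(position: Tuple[float, float], output_picture: Rect, set_area: Rect) -> str:
    """Decide which camera module handles the second speaker."""
    x, y = position
    if not output_picture.contains(x, y):
        # Outside the first speaker's output picture: use the second module.
        return "second_module_capture"
    if set_area.contains(x, y):
        # Already well framed: the first module simply keeps shooting.
        return "first_module_capture"
    # In the picture but outside the set area: the first module re-frames.
    return "first_module_track"


picture = Rect(0, 0, 100, 100)   # hypothetical output picture bounds
area = Rect(25, 25, 75, 75)      # hypothetical set area inside the picture
print(choose_action((150, 50), picture, area))
print(choose_action((50, 50), picture, area))
print(choose_action((10, 50), picture, area))
```

Keeping nearby speakers on the same module avoids a camera switch when a small adjustment of the current framing is enough.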

In this embodiment, when someone starts speaking, the controller 131 controls the first camera module 132 to capture the video of the current speaker, and the output processor 134 acquires and outputs that video. At this time, the second camera module 133 is in a standby state, ready to track the next speaker. When the speaker subsequently changes, the controller 131 can immediately control the second camera module 133 in the standby state to capture the video of the next speaker. Because it takes time from the change of speaker position until a suitable video of the new speaker is obtained, the picture output to the remote site during this period does not need to switch to a panorama of the site first; instead, the video of the speaker before the change continues to be output. This reduces the number of video switches, so that the pictures connect closely and the output video is smoother. Moreover, because the controller 131 controls the two camera modules to alternately capture the video of the current speaker, even if speakers at the site speak in rapid alternation, the two camera modules can alternately capture the speakers' facial pictures, retaining more valuable conference information.

Alternatively, in another embodiment of the present invention, the output processor 134 may output the video of the current speaker in full screen. The output processor 134 is specifically configured to: after successfully acquiring the video of the current speaker, set the video of the current speaker to be displayed in full screen and, after the setting is completed, output the video of the current speaker in full screen; and before successfully acquiring the video of the current speaker, output the video of the previous speaker of the current speaker in full screen.

Alternatively, in still another embodiment of the present invention, the output processor 134 may simultaneously output the video of the current speaker and the previous speaker of the current speaker in a picture-in-picture format.

The output processor 134 is specifically configured to: after successfully acquiring the video of the current speaker, set the video of the current speaker and the video of the previous speaker of the current speaker to be displayed in picture-in-picture form, where the picture-in-picture includes a first picture and a second picture that is contained in the first picture and is smaller than the first picture, the current speaker is displayed in the first picture, and the previous speaker of the current speaker is displayed in the second picture; and after the setting is completed, simultaneously output the video of the current speaker and the previous speaker of the current speaker in picture-in-picture form.

The controller 131 is further configured to: when the current speaker changes from the second speaker to the third speaker, control the first camera module 132 to capture the video of the third speaker, where the third speaker is the next speaker whose position differs from that of the second speaker.

The output processor 134 is specifically configured to: before the video of the third speaker is successfully acquired, output the second speaker in the first picture and output a frozen picture of the first speaker in the second picture; or output the second speaker in the first picture and output, in the second picture, the third speaker whose shooting has started but whose video has not yet been successfully acquired; and after the video of the third speaker is successfully acquired, output the third speaker in the first picture and output the second speaker in the second picture.
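The picture-in-picture assignment before and after the third speaker's video is acquired can be summarized in a small sketch. The function name, the `show_pending` switch that selects between the two "before" alternatives, and the string labels are hypothetical; real output would carry video streams rather than labels.

```python
from typing import Tuple


def pip_layout(
    first: str,
    second: str,
    third: str,
    third_acquired: bool,
    show_pending: bool = False,
) -> Tuple[str, str]:
    """Return (first_picture, second_picture) contents as labels."""
    if third_acquired:
        # New speaker takes the large picture; previous speaker moves to the inset.
        return (third, second)
    if show_pending:
        # Alternative: inset shows the third speaker while acquisition is in progress.
        return (second, f"{third} (acquiring)")
    # Default: inset shows a frozen picture of the first speaker.
    return (second, f"{first} (frozen)")


print(pip_layout("A", "B", "C", third_acquired=False))
print(pip_layout("A", "B", "C", third_acquired=False, show_pending=True))
print(pip_layout("A", "B", "C", third_acquired=True))
```

In every case the large picture never falls back to a panorama, which is how the description keeps the number of visible switches low.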

Simultaneously outputting the video of the current speaker and the previous speaker of the current speaker in picture-in-picture form lets participants at the remote site clearly observe the facial close-up of the current speaker while still seeing the change of speakers at the local site and the reaction of one party to the other's speech, so that valuable conference information is further retained.

Alternatively, in still another embodiment of the present invention, the output processor 134 may simultaneously output the video of the current speaker and the previous speaker of the current speaker in a two-picture form.

The output processor 134 is specifically configured to: after successfully acquiring the video of the current speaker, set the video of the current speaker and the video of the previous speaker of the current speaker to be displayed as a dual picture, where the dual picture includes two partial pictures that do not contain each other, one partial picture displays the current speaker, and the other partial picture displays the previous speaker of the current speaker; and after the setting is completed, simultaneously output the video of the current speaker and the previous speaker of the current speaker in dual-picture form.

The controller 131 is further configured to: when the current speaker changes from the second speaker to the third speaker, control the first camera module 132 to capture the video of the third speaker, where the third speaker is the next speaker whose position differs from that of the second speaker.

The output processor 134 is specifically configured to: before the video of the third speaker is successfully acquired, output a frozen picture of the first speaker in the one partial picture and output the second speaker in the other partial picture; or output, in the one partial picture, the third speaker whose shooting has started but whose video has not yet been successfully acquired, and output the second speaker in the other partial picture; and after the video of the third speaker is successfully acquired, output the third speaker in the one partial picture and output the second speaker in the other partial picture.
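In the dual-picture case the second speaker stays in the same partial picture throughout, and only the other partial picture is replaced. A minimal sketch of that in-place update follows; the pane keys `part_a` and `part_b`, the `show_pending` switch, and the labels are hypothetical names introduced for illustration.

```python
from typing import Dict


def dual_layout(
    first: str,
    second: str,
    third: str,
    third_acquired: bool,
    show_pending: bool = False,
) -> Dict[str, str]:
    """Return the contents of the two side-by-side partial pictures as labels."""
    # The second speaker keeps the same pane before and after acquisition.
    panes = {"part_a": f"{first} (frozen)", "part_b": second}
    if third_acquired:
        panes["part_a"] = third
    elif show_pending:
        panes["part_a"] = f"{third} (acquiring)"
    return panes


print(dual_layout("A", "B", "C", third_acquired=False))
print(dual_layout("A", "B", "C", third_acquired=True))
```

Because neither pane is subordinate to the other, the stale pane can be overwritten without moving the speaker who is still on camera.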

Outputting the video of the current speaker and the previous speaker of the current speaker in dual-picture form lets participants at the remote site observe, in addition to the facial close-up of the current speaker, the reaction of one party at the local site to the other party's speech, thus further retaining valuable conference information.

The device 13 for controlling video shooting provided by the embodiment of the present invention will be described below with reference to the accompanying drawings. As shown in FIG. 13G, the apparatus 13 for controlling video capture provided by the embodiment of the present invention includes:

The controller 131; the first camera module 132, whose initial tracking flag is set to 0; the second camera module 133, whose initial tracking flag is set to 1; the output processor 134; the microphone array 135; the sound source locator 136; the image locator 137; a main control module 138; a video module 139; a video signal processor 140; an audio module 141; an audio signal processor 142; a pickup microphone 143; a speaker 144; and a display 145. These parts may be integrated into one complete device or may be separate parts, and they work in coordination under the control of the controller 131 and the main control module 138.

After the device 13 for controlling video shooting is turned on, that is, when the conference starts, no one is speaking at the local conference site. To convey the layout of the local conference site to the remote conference site, the controller 131 can control the two camera modules to shoot a panorama of the conference site. After the camera modules capture the video of the local site, the video captured by the second camera module 133 is encoded and decoded by the video signal processor 140 in the video module 139 and, under the control of the main control module 138, transmitted to the remote site through the network.

When a person at the local site starts to talk, that is, when the first speaker appears, the microphone array 135 picks up the sound of the local site and sends it to the sound source locator 136. The sound of the local site may be sent to the sound source locator 136 directly, or may be sent after denoising or similar processing by an internal module of the audio module 141 (for example, a module having a preprocessing function). The controller 131 obtains the position information generated by the sound source locator 136 through sound source localization and, according to that position information, controls the first camera module 132 (that is, the camera module whose tracking flag is 0) to rotate to an appropriate angle and roughly capture the video of the first speaker. Then, the image locator 137 determines the exact position (including the face position) of the first speaker using image recognition technology based on the video of the first speaker acquired by the first camera module 132. Under the control of the controller 131, the first camera module 132 (that is, the camera module whose tracking flag is 0) rotates and zooms to capture a suitable video of the first speaker. After the first camera module 132 successfully captures the video of the first speaker, its tracking flag is set from 0 to 1, and the tracking flag of the second camera module 133 is set from 1 to 0.

After the first camera module 132 successfully captures the video of the first speaker, if the speaker changes, that is, the first speaker changes to the second speaker, the controller 131 can control the camera module whose tracking flag is 0 (that is, the second camera module 133) to capture the video of the second speaker; the method of controlling the shooting is the same as above. After the second camera module 133 captures a suitable video of the second speaker, its tracking flag is set from 0 to 1, and the tracking flag of the first camera module 132 is set from 1 to 0.

As described above, each time the speaker changes, the controller 131 controls the camera module whose tracking flag is 0 (which may be the first camera module 132 or the second camera module 133) to track and shoot the speaker after the change. Moreover, after that camera module successfully captures a suitable video of the speaker, its tracking flag is set from 0 to 1, and the tracking flag of the other camera module is set from 1 to 0.

After the camera module successfully captures the speaker's video, the output processor 134 retrieves the speaker's video from the camera module. After the video of the speaker is obtained, the output processor 134 can set the output mode of the video, and the obtained video of the speaker can be output in a full screen, picture-in-picture or dual-picture manner.

After the output mode of the video is set, the output processor 134 transmits the video of the speaker to the video signal processor 140, and the video signal processor 140 encodes the video of the speaker. Then, under the control of the main control module 138, the video of the speaker is transmitted from the video signal processor 140 to the remote conference site through the network.

Further, before the camera module successfully acquires the video of the current speaker, the main control module 138 can control the output processor 134 to output the video of the previous speaker of the current speaker. In addition, the audio signal processor 142 is used to process the sound of the speaker at the local site picked up by the pickup microphone 143. It should be noted that the sound picked up by the pickup microphone 143 differs from the sound picked up by the microphone array 135: the former is transmitted to the remote site together with the video captured by the camera module, and the latter is used for sound source localization. Both the speaker 144 and the display 145 are basic configurations of the device 13 for controlling video shooting and are used to output audio and video, respectively, at the local site.

The embodiments in this specification have been described in detail; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for related parts, refer to the description of the method embodiments.

It should be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the embodiments of the present invention. In addition, in the drawings of the device embodiments provided by the present invention, a connection between modules indicates that there is a communication connection between them, which can be implemented specifically as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

Those of ordinary skill in the art will appreciate that aspects of the invention, or possible implementations of the aspects, may be embodied as a system, method, or computer program product. Thus, aspects of the invention, or possible implementations of the aspects, may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, etc.), or an embodiment combining software and hardware aspects, all of which are collectively referred to herein as "circuits," "modules," or "systems." Furthermore, aspects of the invention, or possible implementations of the aspects, may take the form of a computer program product, which refers to computer readable program code stored in a computer readable medium.

The computer readable medium can be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium includes, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing, such as random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, or portable compact disc read-only memory (CD-ROM).

The processor in the computer reads the computer readable program code stored in the computer readable medium, so that the processor can perform the functional actions specified in each step, or combination of steps, in the flowchart, and produce a device that implements the functional actions specified in each block, or combination of blocks, of the block diagram.

The computer readable program code can be executed entirely on the user's computer, partly on the user's computer, as a separate software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. It should also be noted that in some alternative implementations, the functions noted in the steps of the flowcharts or in the blocks of the block diagrams may not occur in the order noted in the drawings. For example, two steps, or two blocks shown in succession, may be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.

The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the invention shall be determined by the scope of the claims.

Claims


1. A method of controlling video shooting, characterized by including:

When the first speaker speaks, control the first camera device to capture the video of the first speaker; when the current speaker changes from the first speaker to the second speaker, control the second camera device to capture the video of the second speaker, wherein the second speaker is the next speaker whose position is different from that of the first speaker;

When the speaker changes subsequently, the first camera device and the second camera device are sequentially controlled to alternately capture the video of the current speaker;

After successfully acquiring the video of the current speaker, the video of the current speaker is output.

2. The method according to claim 1, wherein the outputting the video of the current speaker includes: outputting the video of the current speaker in full screen.

3. The method according to claim 2, wherein the full-screen output of the video of the current speaker includes:

Before successfully acquiring the video of the current speaker, output the video of the previous speaker of the current speaker in full screen;

After successfully acquiring the video of the current speaker, the video of the current speaker is output in full screen.

4. The method according to claim 1, wherein the outputting the video of the current speaker includes: simultaneously outputting the videos of the current speaker and the previous speaker of the current speaker in picture-in-picture form;

Wherein, the picture-in-picture includes a first picture and a second picture that is contained in the first picture and is smaller than the first picture, the current speaker is output in the first picture, and the previous speaker of the current speaker is output in the second picture.

5. The method of claim 4, further comprising: when the current speaker changes from the second speaker to a third speaker, controlling the first camera device to capture a video of the third speaker, wherein the third speaker is the next speaker whose position is different from that of the second speaker;

The simultaneously outputting the videos of the current speaker and the previous speaker of the current speaker in picture-in-picture form includes:

Before successfully acquiring the video of the third speaker: output the second speaker in the first picture, and output the frozen picture of the first speaker in the second picture; or, output the second speaker in the first picture, and output, in the second picture, the third speaker who has started to be shot but whose video has not yet been successfully acquired;

After successfully acquiring the video of the third speaker: output the third speaker in the first picture, and output the second speaker in the second picture.

6. The method according to claim 1, wherein the outputting the video of the current speaker includes: simultaneously outputting the videos of the current speaker and the previous speaker of the current speaker in the form of a dual picture;

Wherein, the double picture includes two parts of the picture that are mutually exclusive, one part of the picture outputs the current speaker, and the other part of the picture outputs the previous speaker of the current speaker.

7. The method according to claim 6, further comprising:

When the current speaker changes from the second speaker to a third speaker, control the first camera device to capture a video of the third speaker, wherein the third speaker is the next speaker whose position is different from that of the second speaker;

The simultaneous outputting of the video of the current speaker and the previous speaker of the current speaker in the form of dual pictures includes:

Before successfully acquiring the video of the third speaker: output the frozen picture of the first speaker in the one part of the picture, and output the second speaker in the other part of the picture; or, output, in the one part of the picture, the third speaker who has started to be shot but whose video has not yet been successfully acquired, and output the second speaker in the other part of the picture;

After the video of the third speaker is successfully obtained: the third speaker is output in the part of the picture, and the second speaker is output in the other part of the picture.

8. The method according to claim 1, characterized in that, before controlling the first camera device to capture the video of the first speaker, the method further includes:

In the initial state, the first camera device and the second camera device are controlled to capture video of the entire venue, and the captured video is output.

9. The method according to any one of claims 1 to 8, characterized in that, before controlling the first camera device to capture the video of the first speaker, the method further includes:

Tracking flags are respectively set for the first camera device and the second camera device, wherein the tracking flag of the first camera device is initially the first tracking flag, and the tracking flag of the second camera device is initially the second tracking flag;

The step of controlling the first camera device to capture the video of the first speaker when the first speaker is speaking includes: when the first speaker is speaking, controlling the first camera device having the first tracking flag to capture the video of the first speaker, and after successfully acquiring the video of the first speaker, setting the tracking flag of the first camera device from the first tracking flag to the second tracking flag and simultaneously setting the tracking flag of the second camera device from the second tracking flag to the first tracking flag;

Controlling the second camera device to capture the video of the second speaker when the current speaker changes from the first speaker to the second speaker includes: when the current speaker changes from the first speaker to the second speaker, controlling the second camera device having the first tracking flag to capture the video of the second speaker, and after successfully acquiring the video of the second speaker, setting the tracking flag of the second camera device from the first tracking flag to the second tracking flag and simultaneously setting the tracking flag of the first camera device from the second tracking flag to the first tracking flag.

10. The method according to claim 9, characterized in that,

When the speaker changes subsequently, controlling the first camera device and the second camera device to alternately capture the video of the current speaker includes: each time the speaker changes subsequently, controlling the camera device having the first tracking flag to capture the video of the current speaker, and after successfully acquiring the video of the current speaker, exchanging the tracking flags of the first camera device and the second camera device.

11. The method according to claim 10, characterized in that controlling the camera device to capture the video of the speaker includes:

Using sound source localization technology, the camera device is controlled to capture video of the speaker.

12. The method according to claim 11, characterized in that using sound source localization technology to control the camera device to capture the video of the speaker includes: using sound source localization technology combined with a preset position or image recognition technology to control the camera device to capture the video of the speaker.

13. The method according to any one of claims 1 to 12, characterized in that controlling the second camera device to capture the video of the second speaker when the current speaker changes from the first speaker to the second speaker includes:

Determine whether the second speaker's position is in the output picture of the first speaker;

If the second speaker's position is not in the output picture of the first speaker, control the second camera device to capture the video of the second speaker;

If the second speaker's position is in the first speaker's output screen, then further determine whether the second speaker's position is within the set area of the first speaker's output screen;

If the second speaker's position is within the set area, control the first camera device to capture the video of the second speaker;

If the second speaker's position is not within the set area, the first camera device is controlled to track and photograph the second speaker so that the second speaker's position is within the set area.

14. A device for controlling video shooting, characterized by including:

A control unit, used to control the first camera device to capture the video of the first speaker when the first speaker speaks;

The control unit is also configured to control the second camera device to capture the video of the second speaker when the current speaker changes from the first speaker to the second speaker, where the second speaker is the next speaker whose position is different from that of the first speaker;

The control unit is also configured to sequentially control the first camera device and the second camera device to alternately capture the video of the current speaker when the speaker changes subsequently;

A processing unit, connected to the control unit, configured to output the video of the current speaker after successfully acquiring the video of the current speaker.

15. The device according to claim 14, wherein the processing unit is specifically configured to: set the current speaker's video to be displayed in full screen;

Output the video of the current speaker in full screen.

16. The device according to claim 15, wherein the processing unit is specifically configured to: before successfully acquiring the video of the current speaker, output the video of the previous speaker of the current speaker in full screen.

17. The device according to claim 14, characterized in that the processing unit is specifically configured to: set the video of the current speaker and the video of the previous speaker of the current speaker to be displayed in picture-in-picture form;

Wherein the picture-in-picture includes a first picture and a second picture that is contained in the first picture and is smaller than the first picture, the current speaker is displayed in the first picture, and the previous speaker of the current speaker is displayed in the second picture;

Videos of the current speaker and the previous speaker of the current speaker are simultaneously output in a picture-in-picture format.
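
By way of example only, the picture-in-picture arrangement of claim 17 might be laid out as in the following sketch. The function name `pip_layout`, the bottom-right placement, the default `scale` of 0.25, and the `margin` of 16 pixels are assumptions chosen for illustration; the claim only requires that the second picture be smaller than, and contained in, the first picture.

```python
def pip_layout(width: int, height: int, scale: float = 0.25, margin: int = 16):
    """Return (first_picture, second_picture) as (x, y, w, h) tuples.

    The first picture fills the output frame and shows the current speaker.
    The second picture is a smaller inset (assumed here to sit at the
    bottom-right corner) showing the previous speaker.
    """
    first_picture = (0, 0, width, height)
    inset_w = int(width * scale)
    inset_h = int(height * scale)
    inset_x = width - margin - inset_w
    inset_y = height - margin - inset_h
    second_picture = (inset_x, inset_y, inset_w, inset_h)
    return first_picture, second_picture
```

For a 1920 by 1080 output frame with the assumed defaults, the inset measures 480 by 270 pixels and lies wholly inside the first picture.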

18. The device according to claim 15, wherein the control unit is further configured to: when the current speaker changes from the second speaker to a third speaker, control the first camera device to capture the video of the third speaker, where the third speaker is the next speaker whose position is different from that of the second speaker;

The processing unit is specifically used for:

19. The device according to claim 14, wherein the processing unit is specifically configured to: set the video of the current speaker and the video of the previous speaker of the current speaker to be displayed in dual-picture form;

Wherein the dual picture includes two pictures, neither of which contains the other; one picture displays the current speaker, and the other picture displays the previous speaker of the current speaker;

The video of the current speaker and the previous speaker of the current speaker is simultaneously output in the form of a dual picture.

20. The device according to claim 19, wherein the control unit is further configured to: when the current speaker changes from the second speaker to a third speaker, control the first camera device to capture the video of the third speaker, wherein the third speaker is the next speaker whose position is different from that of the second speaker;

The processing unit is specifically used for:

21. The device according to claim 14, characterized in that, before controlling the first camera device to capture the video of the first speaker, the control unit is also used to:

In the initial state, the first camera device and the second camera device are controlled to capture video of the entire conference venue;

The processing unit is also used to output the video of the entire venue captured under the control of the control unit.

22. The device according to any one of claims 14-21, characterized in that the control unit is also used for:

The control unit is specifically used to: when the first speaker speaks, control the first camera device with the first tracking flag to capture the video of the first speaker, and after successfully acquiring the video of the first speaker, set the tracking flag of the first camera device from the first tracking flag to the second tracking flag, and simultaneously set the tracking flag of the second camera device from the second tracking flag to the first tracking flag;

The control unit is specifically configured to: when the current speaker changes from the first speaker to the second speaker, control the second camera device with the first tracking flag to capture the video of the second speaker, and after successfully acquiring the video of the second speaker, set the tracking flag of the second camera device from the first tracking flag to the second tracking flag, and simultaneously set the tracking flag of the first camera device from the second tracking flag to the first tracking flag.

23. The device according to claim 22, wherein the control unit is specifically configured to: each time the speaker subsequently changes, control the camera device with the first tracking flag to capture the video of the current speaker, and after successfully acquiring the video of the current speaker, exchange the tracking flags of the first camera device and the second camera device.
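
As a non-limiting illustration, the flag exchange recited in claims 22 and 23 might be sketched as follows. The class `CameraPair`, the constants `FIRST_FLAG` and `SECOND_FLAG`, the camera names, and the `acquired` parameter are hypothetical and introduced only for this sketch; the actual camera control and the test for successful acquisition are outside its scope.

```python
FIRST_FLAG = "first_tracking_flag"    # held by the camera that shoots the next speaker
SECOND_FLAG = "second_tracking_flag"  # held by the other camera


class CameraPair:
    """Two camera devices whose tracking flags are exchanged after each acquisition."""

    def __init__(self):
        # Assumed initial state: the first camera device holds the first flag.
        self.flags = {"camera_1": FIRST_FLAG, "camera_2": SECOND_FLAG}

    def camera_with_first_flag(self) -> str:
        return next(name for name, flag in self.flags.items() if flag == FIRST_FLAG)

    def on_speaker_change(self, acquired: bool = True) -> str:
        """Pick the camera holding the first flag; swap flags only on success."""
        camera = self.camera_with_first_flag()
        if acquired:
            self.flags = {
                name: SECOND_FLAG if flag == FIRST_FLAG else FIRST_FLAG
                for name, flag in self.flags.items()
            }
        return camera
```

Under these assumptions, successive successful acquisitions select `camera_1`, `camera_2`, `camera_1`, and so on, which corresponds to the alternating capture described above; when acquisition does not succeed, the flags are left unchanged and the same camera device is selected again.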

24. The device according to claim 23, characterized in that the control unit is specifically configured to: use sound source localization technology to control the camera device to capture the video of the speaker.

25. The device according to claim 24, wherein the control unit is specifically configured to: use sound source localization technology combined with preset position or image recognition technology to control the camera device to capture the video of the speaker.

26. The device according to any one of claims 14 to 25, characterized in that the control unit is specifically used for: