CN113038271A - Video automatic editing method, device and computer storage medium - Google Patents
Video automatic editing method, device and computer storage medium
- Publication number
- CN113038271A (application CN202110321530.6A)
- Authority
- CN
- China
- Prior art keywords
- video
- video frame
- frame
- value
- target person
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/441—Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
- H04N21/4415—Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4662—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/485—End-user interface for client configuration
- H04N21/4858—End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
The embodiment of the application discloses a method, a device and a computer storage medium for automatically clipping a video, so that the clipped video can maximally present the information of a target person while avoiding presenting the information of other unrelated persons. The embodiment of the application comprises the following steps: applying the calculated pose information quantization value and optical flow energy change value to the calculation of the return value of an action in a reinforcement learning algorithm; determining the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame; taking that next video frame as the new current video frame and returning to the step of calculating the return value of the action under the current video frame; and, at the same time, determining a video picture window based on the position and size of the target person in each video frame and extracting the video picture related to the target person according to that window, so that the finally synthesized video maximally presents the information related to the target person and avoids presenting the information of other unrelated persons.
Description
Technical Field
The embodiment of the application relates to the field of video clipping, in particular to a method and a device for automatically clipping a video and a computer storage medium.
Background
In the prior art, automatic video clipping can improve the efficiency of video editing work in fields such as security, education, and film and television entertainment. After a video is clipped, its data volume is greatly reduced and it occupies less storage space, so automatic video clipping can also relieve the storage burden of massive video collections by freeing storage space.
Existing automatic video clipping systems are mainly designed for videos such as dance videos, concert videos, outdoor activity videos and football match videos, and focus on making the video content richer and more diverse in order to increase interest and improve the viewing experience. However, in scenarios where a target person needs to be highlighted in the video, existing automatic video clipping systems do not perform well: because they focus on presenting more video content, they cannot focus on the target person and cannot present more information about that person. At the same time, the clipped video produced by existing systems presents information about other people unrelated to the target person, which may leak the privacy of those other people.
Disclosure of Invention
The embodiment of the application provides a method, a device and a computer storage medium for automatically clipping video, so that the clipped video can maximally present the information of a target person while avoiding presenting the information of other unrelated persons.
A first aspect of an embodiment of the present application provides a method for automatically editing a video, where the method includes:
calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
taking any video frame in any path of video as a current video frame;
calculating, based on a reinforcement learning algorithm, a return value of an action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, determining the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, taking the next video frame of the current video frame as the new current video frame, and returning to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one path of video respectively;
determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
and extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain a target composite video.
A second aspect of the embodiments of the present application provides an automatic video editing apparatus, including:
the computing unit is used for computing the face pose information of a target person of each video frame in at least one path of video, computing a pose information quantization value corresponding to the face pose information and computing an optical flow energy change value of each video frame;
the determining unit is used for taking any video frame in any path of video as a current video frame;
a clipping unit, configured to calculate, based on a reinforcement learning algorithm, a return value of an action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, determine the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, take the next video frame of the current video frame as the new current video frame, and return to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one path of video respectively;
the generating unit is used for determining a video frame sequence according to the sequence of the current video frame and obtaining an initial synthetic video based on the video frame sequence;
the determining unit is further configured to determine a position and a size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
and the extraction unit is used for extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain the target composite video.
A third aspect of the embodiments of the present application provides an automatic video editing apparatus, including:
a memory for storing a computer program; a processor for implementing the steps of the video automatic clipping method according to the aforementioned first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the method of the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
In this embodiment, the facial pose information quantization value and the optical flow energy change value of the target person are calculated for each video frame and applied to the calculation of the return value of an action in a reinforcement learning algorithm; the candidate video frame corresponding to the maximum return value is determined as the next video frame of the current video frame, that next video frame is taken as the new current video frame, and the method returns to the step of calculating the return value of the action under the current video frame. In this way, each video frame determined from the at least one path of video can maximally present information of the target person while avoiding pictures in which the target person is occluded. Meanwhile, a video picture window is determined based on the position and the size of the target person in the video frame, and the video picture about the target person is extracted according to the video picture window, so that the finally synthesized video maximally presents the information about the target person and avoids presenting the information about other unrelated persons.
Drawings
FIG. 1 is a flowchart illustrating an automatic video editing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating an exemplary method for automatically editing a video according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of facial pose information according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an automatic video editing apparatus according to an embodiment of the present application;
FIG. 5 is a schematic view of another structure of an automatic video editing apparatus according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a method, a device and a computer storage medium for automatically clipping video, so that the clipped video can maximally present the information of a target person while avoiding presenting the information of other unrelated persons.
Referring to fig. 1, an embodiment of an automatic video editing method according to the embodiment of the present application includes:
101. calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
The method of this embodiment can be applied to an automatic video clipping device, which can be a computer device with data processing capability, such as a terminal or a server.
In an application scenario in which a target person in a video needs to be highlighted, the embodiment acquires at least one path of video whose pictures each include the target person. The task of the embodiment is to automatically clip the at least one path of video so that the clipped video mainly presents information of the target person and of any object interacting with the target person, while ensuring that information of other unrelated persons is not displayed in the clipped video, thereby protecting the privacy of those unrelated persons.
After the at least one path of video is obtained, the facial pose information of the target person is calculated for each video frame of each path of video, and the pose information quantization value corresponding to the facial pose information is calculated. In addition, the embodiment proposes a way to determine whether the target person in the video picture is occluded: the occlusion situation of the target person is determined from the optical flow energy change value. Therefore, this step also calculates the optical flow energy change value of each video frame, which reflects the occlusion situation of the target person.
102. Taking any video frame in any path of video as a current video frame;
the present embodiment employs a reinforcement learning algorithm to determine each frame in the video generated by the clipping, the video frame being a state in the reinforcement learning algorithm. The user can designate any video frame in any path of video as the current video frame, so that the automatic video clipping device determines the current video frame according to the designation of the user, the current video frame is used as one state in the reinforcement learning algorithm, and the next state is determined according to the state corresponding to the current video frame in the subsequent steps.
103. Calculating a return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to execute the step of calculating the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame based on the reinforcement learning algorithm;
in the automatic clipping process, the present embodiment determines each video frame in the clip-generated video in turn. Specifically, after determining the current video frame, the video automatic clipping device calculates a return value of the action under the current video frame according to the quantized value of the pose information and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm.
The reinforcement learning algorithm of the present embodiment may specifically be a Markov decision process. In the reinforcement learning algorithm, the larger the return value of an action, the more meaningful the action is; the virtual agent optimizes its policy according to the action corresponding to the maximum return value and then takes the next action according to the optimized policy. Therefore, after calculating the return values of a plurality of actions under the current video frame, the candidate video frame selected by the action with the maximum return value is determined as the next video frame of the current video frame.
After the next video frame of the current video frame is determined, it is taken as the new current video frame, and the method returns to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame. An action under the current video frame refers to selecting one candidate video frame from each video of the at least one path of video.
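For illustration only (this is not part of the original disclosure), the selection loop described above can be sketched as follows; the reward function is passed in as a black box standing in for the pose-quantization and optical-flow terms, and all function and parameter names are assumptions introduced here.

```python
# Minimal sketch of the greedy frame-selection loop: at each step, evaluate one
# action per video channel (take that channel's frame at time t+1) and move to
# the candidate with the maximum return value.
def select_frame_sequence(videos, reward_fn, start=(0, 0)):
    """videos    : list of C lists of per-frame feature records
       reward_fn : callable(current_frame, candidate_frame) -> float
       start     : (channel, time) index of the user-designated starting frame
       returns   : list of (channel, time) pairs, one per selected frame"""
    c, t = start
    num_frames = len(videos[0])
    sequence = [(c, t)]
    while t + 1 < num_frames:
        # One action per video channel: pick that channel's frame at time t+1.
        candidates = [(ch, t + 1) for ch in range(len(videos))]
        rewards = [reward_fn(videos[c][t], videos[ch][tt])
                   for ch, tt in candidates]
        best = max(range(len(candidates)), key=lambda i: rewards[i])
        c, t = candidates[best]            # becomes the new current frame
        sequence.append((c, t))
    return sequence
```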
104. Determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
Each video frame can be determined in sequence through step 103, and the plurality of determined video frames have a sequential order. The video frame sequence can therefore be determined according to the order in which the current video frames were determined in step 103, and an initial composite video is obtained based on this video frame sequence.
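As an illustration of how the determined frame sequence can be assembled into the initial composite video, the sketch below uses OpenCV's VideoWriter; the codec and frame rate are ordinary choices assumed here, not values specified by the embodiment.

```python
# Illustrative assembly of the selected frames into the initial composite video.
import cv2

def write_initial_composite(videos, sequence, out_path, fps=25.0):
    """videos   : list of C lists of decoded frames (H x W x 3 numpy arrays)
       sequence : list of (channel, time) pairs from the selection step"""
    h, w = videos[0][0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (w, h))
    for c, t in sequence:
        writer.write(videos[c][t])
    writer.release()
```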
105. Determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
after the initial composite video is obtained, since the present embodiment aims to emphasize the target person in the video picture, the position and size of the target person in the initial composite video in each frame of picture are further determined, and the position and size of the video picture window of each frame in the initial composite video are determined according to the position and size of the target person in each frame of picture.
106. Extracting a video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video;
After the position and the size of the video picture window are determined, the video picture of each frame in the initial composite video is extracted based on that position and size, so that a target composite video is obtained and the automatic clipping of the video is realized.
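A minimal sketch of this extraction step is shown below, assuming each frame is a NumPy array and each window is given as an (x, y, w, h) tuple; resizing the crops to a common output resolution is an added assumption made only to produce a playable video.

```python
# Crop the per-frame video picture window out of the initial composite video to
# obtain the frames of the target composite video.
import cv2

def extract_windows(frames, windows, out_size=(1280, 720)):
    """frames  : list of H x W x 3 numpy arrays (initial composite video)
       windows : list of (x, y, w, h) tuples, one per frame
       returns : list of cropped, resized frames (target composite video)"""
    crops = []
    for frame, (x, y, w, h) in zip(frames, windows):
        x, y, w, h = int(x), int(y), int(w), int(h)
        x, y = max(x, 0), max(y, 0)
        crop = frame[y:y + h, x:x + w]
        crops.append(cv2.resize(crop, out_size))   # uniform output resolution
    return crops
```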
In this embodiment, the facial pose information quantization value and the optical flow energy change value of the target person are calculated for each video frame and applied to the calculation of the return value of an action in a reinforcement learning algorithm; the candidate video frame corresponding to the maximum return value is determined as the next video frame of the current video frame, that next video frame is taken as the new current video frame, and the method returns to the step of calculating the return value of the action under the current video frame. In this way, each video frame determined from the at least one path of video can maximally present information of the target person while avoiding pictures in which the target person is occluded. Meanwhile, a video picture window is determined based on the position and the size of the target person in the video frame, and the video picture about the target person is extracted according to the video picture window, so that the finally synthesized video maximally presents the information about the target person and avoids presenting the information about other unrelated persons.
The embodiments of the present application will be described in further detail below on the basis of the aforementioned embodiment shown in fig. 1. Referring to fig. 2, another embodiment of the method for automatically editing a video according to the embodiment of the present application includes:
201. calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
In this embodiment, the facial pose information of the target person in each video frame may be calculated according to a face pose estimation algorithm. Specifically, the face pose information calculated by the face pose estimation algorithm may be represented by a rotation matrix, a rotation vector, a quaternion, or Euler angles. Since Euler angles are more readable, it may be preferable to use Euler angles to represent the facial pose information. As shown in fig. 3, the facial pose information of the target person, such as the pitch angle (pitch), the yaw angle (yaw), and the roll angle (roll), can be calculated according to the face pose estimation algorithm.
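For illustration, a head-pose result expressed as a rotation vector (for example, the output of a PnP-based pose solver) can be converted into Euler angles as sketched below; which angle is called pitch, yaw or roll depends on the camera coordinate convention, which is an assumption here.

```python
# Convert a rotation vector into pitch / yaw / roll angles in degrees.
import cv2
import numpy as np

def rotation_vector_to_euler(rvec):
    R, _ = cv2.Rodrigues(np.asarray(rvec, dtype=float).reshape(3, 1))
    sy = np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)          # |cos(yaw)|
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw = np.degrees(np.arctan2(-R[2, 0], sy))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return pitch, yaw, roll
```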
In order to keep the consistency between the calculated face pose information and the symmetric structure of the face, the present embodiment uses a multivariate gaussian model to calculate a pose information quantization value corresponding to the face pose information. Further, for convenience of calculation, the calculated attitude information quantized value may be normalized, and the normalized attitude information quantized value is used in a subsequent calculation process.
Specifically, the optical flow energy change value in this embodiment is calculated as follows: calculate the optical flow information of each video frame in the at least one path of video and the optical flow information of the other video frames belonging to the same path of video as that frame; calculate the optical flow energy of each video frame according to its optical flow information, and the optical flow energy of the other video frames according to their optical flow information; calculate the optical flow energy difference between each video frame and the other video frames, as well as the interval time between them; and take the quotient of the optical flow energy difference and the interval time as the optical flow energy change value of each video frame.
For example, assume the video automatic clipping device acquires C paths of video (C ≥ 1), each containing T video frames. A video frame in the C paths of video can be denoted f_{c,t} (c = 1, …, C; t = 1, …, T), and the frame belonging to the same path of video and adjacent to f_{c,t} can be denoted f_{c,t+1}. The optical flow information of f_{c,t} and of f_{c,t+1} are calculated respectively; the optical flow energy of f_{c,t} is calculated from the optical flow information of f_{c,t}, and the optical flow energy of f_{c,t+1} is calculated from the optical flow information of f_{c,t+1}. The difference between the optical flow energies of f_{c,t} and f_{c,t+1} and the interval time between f_{c,t} and f_{c,t+1} are then computed, and the quotient of the optical flow energy difference and the interval time is taken as the optical flow energy change value of f_{c,t}.
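A hedged sketch of this computation is given below using dense Farneback optical flow from OpenCV; the embodiment does not spell out its exact energy definition, so summing the squared flow magnitudes over the frame is an assumption, as are the function names.

```python
# Optical flow energy and its change value for frames of one video channel.
import cv2
import numpy as np

def flow_energy(prev_bgr, next_bgr):
    """Dense Farneback flow between two frames; energy = sum of squared
    flow magnitudes (an assumed energy definition)."""
    g0 = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.sum(flow[..., 0] ** 2 + flow[..., 1] ** 2))

def flow_energy_change(frames, t, fps):
    """Energy difference between f_{c,t} and its neighbour f_{c,t+1} in the
    same channel, divided by their interval time (1 / fps seconds).
    Requires t + 2 < len(frames) so both per-frame flows can be computed."""
    e_t = flow_energy(frames[t], frames[t + 1])        # energy of f_{c,t}
    e_t1 = flow_energy(frames[t + 1], frames[t + 2])   # energy of f_{c,t+1}
    return (e_t1 - e_t) * fps
```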
In the multivariate Gaussian model, when the Euler angles of the face tend to 0, the pose information quantization value corresponding to the facial pose information is maximal; when the face deflects so that an Euler angle is no longer equal to 0, the pose information quantization value decreases. The magnitude of the change in the pose information quantization value caused by the Euler angles can be controlled by the covariance matrix. Therefore, the variance of each Euler angle can be set to adjust the degree of influence of that angle on the pose information quantization value.
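The pose information quantization value can be sketched as an unnormalized multivariate Gaussian over the three Euler angles, as below; the per-angle sigmas are illustrative placeholders, since the embodiment does not state the covariance values.

```python
# Pose quantization: maximal when the face looks straight at the camera
# (all Euler angles near 0), decreasing as the face turns away.
import numpy as np

def pose_quantization(pitch, yaw, roll, sigmas=(20.0, 30.0, 20.0)):
    angles = np.array([pitch, yaw, roll], dtype=float)   # degrees
    sig = np.array(sigmas, dtype=float)
    return float(np.exp(-0.5 * np.sum((angles / sig) ** 2)))   # in (0, 1]
```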
202. Taking any video frame in any path of video as a current video frame;
the operation performed in this step is similar to the operation performed in step 102 in the embodiment shown in fig. 1, and is not repeated here.
203. Calculating a return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to execute the step of calculating the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame based on the reinforcement learning algorithm;
In this embodiment, when calculating the return value of an action, a specific calculation method is to determine the transition probability from the current video frame to the candidate video frame under the action, calculate an initial return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, and take the product of the initial return value and the transition probability as the return value of the action.
Specifically, the transition probability is determined to be 1 when all of the preset conditions are met, and to be 0 when any one of the preset conditions is not satisfied. The preset conditions include: the current video frame and the candidate video frame are adjacent on the timeline, the target person is present in the picture of the candidate video frame, and the video index corresponding to the action is consistent with the index of the candidate video frame in the at least one path of video.
That the current video frame and the candidate video frame are adjacent on the timeline means that they are adjacent on the timeline of the video. For example, if the current video frame is the t-th frame of the first path of video, the candidate video frame may be the (t+1)-th frame of the same path of video or of another path of video (e.g., the second path, the third path, etc.).
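The transition probability and return value described above can be sketched as follows; the FrameInfo fields and the way the pose quantization and flow-energy change are weighted into the initial return value are assumptions, since the embodiment only states that both quantities enter the calculation.

```python
# Transition probability (0 or 1) and return value of an action.
from dataclasses import dataclass

@dataclass
class FrameInfo:
    channel: int            # which of the C video channels the frame is from
    t: int                  # position on the timeline
    has_target_person: bool
    pose_q: float           # pose information quantization value
    flow_change: float      # optical flow energy change value

def transition_probability(current, candidate, action_channel):
    adjacent = candidate.t == current.t + 1
    channel_match = candidate.channel == action_channel
    ok = adjacent and candidate.has_target_person and channel_match
    return 1 if ok else 0

def action_return(current, candidate, action_channel, w_pose=1.0, w_flow=1.0):
    # Initial return value from the candidate's pose quantization and flow
    # energy change, then multiplied by the transition probability.
    initial = w_pose * candidate.pose_q - w_flow * abs(candidate.flow_change)
    return initial * transition_probability(current, candidate, action_channel)
```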
204. Determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
the operation performed in this step is similar to the operation performed in step 104 in the embodiment shown in fig. 1, and is not described here again.
205. Determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
In this embodiment, the position and size of the target person in the video frame may be determined according to information about the target person and about an object interacting with the target person, where the object interacting with the target person may be the object on which the target person's line of sight is focused. For example, if the target person's gaze is focused on a chair, the determined position and size of the target person in the video frame should cover not only the target person but also that chair.
Specifically, the gaze direction of the target person in each frame of the initial composite video may be determined from the facial pose information of the target person in that frame; for example, a raised face may be interpreted as looking up, and a lowered face as looking down.
For example, looking to the left and looking to the right depend mainly on the yaw angle (yaw) in the face pose information, and therefore, the line of sight direction of the target person can be determined according to the yaw angle. To facilitate the subsequent calculation process, the gaze direction may be quantified, for example, the gaze direction may be represented numerically based on the following formula:
where g denotes the gaze direction value and φ denotes the yaw angle. The gaze direction value corresponding to a given yaw angle can therefore be determined from this formula.
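Since the formula itself is not reproduced in this text, the sketch below is only a plausible placeholder mapping the yaw angle to a discrete gaze value; the threshold and the three-way bucketing are assumptions.

```python
# Placeholder quantization of the gaze direction from the yaw angle.
def gaze_direction(yaw_deg, threshold=15.0):
    if yaw_deg > threshold:
        return 1     # looking toward one side
    if yaw_deg < -threshold:
        return -1    # looking toward the other side
    return 0         # roughly facing the camera
```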
After the value of the gaze direction is determined, the position and size of the target person in each frame of the initial composite video are further determined according to the gaze direction of the target person in that frame, and the position and size of the video picture window of each frame in the initial composite video are determined according to the position and size of the target person in that frame.
Specifically, the position and size of the target person in each frame of the initial composite video can be expressed as a quadruple consisting of the coordinates of the target person in the frame, from which the position of the target person in the video frame can be determined, together with the width and the height of the target person in the frame, from which the size of the target person in the video frame can be determined.
Meanwhile, the position and size of the video picture window of each frame in the initial composite video can be expressed as a quadruple consisting of the coordinates of the video picture window in the video frame, from which the specific position of the video picture window in each frame of the initial composite video can be determined, together with the width and the height of the video picture window, from which the size of the video picture window can be determined.
In this embodiment, the position and size of the video picture window are obtained by solving an objective function defined over the window parameters. Specifically, the position and size of the video picture window of each frame can be calculated according to the following objective function:

where c_t refers to the initial composite video, t refers to any one of the video frames of c_t, and g_t refers to the aforementioned gaze direction value.

Since the objective function is convex, it can be solved with a convex optimization algorithm to obtain the optimal position and size of the video picture window for each frame of the initial composite video c_t, i.e., the optimal solution of the window parameters.
Therefore, as can be seen from the objective function, when determining the position and size of the video picture window, the object with which the target person interacts along the line-of-sight direction is also taken into account, so that this object is included in the video picture window as well.
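Because the objective function is likewise not reproduced here, the following sketch is only an illustrative stand-in for the window-placement step: it centers a window on the target person and shifts it along the quantized gaze direction so that the object being looked at also tends to fall inside the window. All parameters are assumptions.

```python
# Illustrative window placement from the person box and quantized gaze value.
def window_for_frame(person_box, gaze, frame_w, frame_h,
                     scale=2.0, gaze_shift=0.25):
    """person_box : (x, y, w, h) of the target person in this frame
       gaze       : quantized gaze value (-1, 0 or 1)
       returns    : (x, y, w, h) of the video picture window"""
    px, py, pw, ph = person_box
    win_w = min(pw * scale, frame_w)
    win_h = min(ph * scale, frame_h)
    cx = px + pw / 2 + gaze * gaze_shift * win_w   # bias toward the gaze
    cy = py + ph / 2
    x = min(max(cx - win_w / 2, 0), frame_w - win_w)
    y = min(max(cy - win_h / 2, 0), frame_h - win_h)
    return (x, y, win_w, win_h)
```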
206. Extracting a video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video;
After the position and size of the video picture window are determined, the video picture of each frame in the initial composite video may be extracted based on that position and size, and the extracted video pictures constitute the target composite video. As described above, since the position and size of the video picture window are determined based on the target person and the object interacting with the target person, the video picture extracted through the window contains the information of the target person and of that object while excluding other unrelated persons. On the one hand, the information of the target person is presented maximally; on the other hand, the information of other unrelated persons is not presented, so the problem of privacy disclosure is avoided.
In this embodiment, the information of the target person is highlighted in the automatically clipped video and the privacy disclosure problem is avoided, so the technical scheme has practical application value and its realizability is improved.
The video automatic clipping method in the embodiment of the present application has been described above. Referring to fig. 4, the video automatic clipping device in the embodiment of the present application is described below; an embodiment of the video automatic clipping device in the embodiment of the present application includes:
a calculating unit 401, configured to calculate facial pose information of a target person of each video frame in at least one path of video, calculate a pose information quantization value corresponding to the facial pose information, and calculate an optical flow energy change value of each video frame;
a determining unit 402, configured to use any video frame in any channel of video as a current video frame;
a clipping unit 403, configured to calculate, based on a reinforcement learning algorithm, a return value of an action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, determine the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, take the next video frame of the current video frame as the new current video frame, and return to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one path of video respectively;
a generating unit 404, configured to determine a video frame sequence according to a sequence of a current video frame, and obtain an initial synthesized video based on the video frame sequence;
the determining unit 402 is further configured to determine a position and a size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
an extracting unit 405, configured to extract a video picture of each frame in the initial composite video based on the position and the size of the video picture window, so as to obtain a target composite video.
In a preferred embodiment of this embodiment, the calculating unit 401 is specifically configured to calculate the face pose information of the target person of each video frame according to a face pose estimation algorithm, where the face pose information includes a pitch angle, a yaw angle and a roll angle.
In a preferred embodiment of this embodiment, the calculating unit 401 is specifically configured to calculate the quantized value of the pose information corresponding to the facial pose information by using a multivariate gaussian model.
In a preferred embodiment of this embodiment, the calculating unit 401 is specifically configured to calculate optical flow information of each video frame and optical flow information of other video frames belonging to the same path of video as each video frame; calculate optical flow energy of each video frame according to the optical flow information of each video frame, calculate optical flow energy of the other video frames according to the optical flow information of the other video frames, and calculate an optical flow energy difference value between each video frame and the other video frames and an interval time between each video frame and the other video frames; and take the quotient of the optical flow energy difference value and the interval time as the optical flow energy change value of each video frame.
In a preferred implementation manner of this embodiment, the clipping unit 403 is specifically configured to determine a transition probability from the current video frame to the candidate video frame under the action; calculate an initial return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; and take the product of the initial return value and the transition probability as the return value.
In a preferred implementation manner of this embodiment, the clipping unit 403 is specifically configured to determine that the transition probability is 1 when the preset conditions are all met; and when any one of the preset conditions is not satisfied, determine that the transition probability is 0;
wherein the preset conditions include: the current video frame and the candidate video frame are adjacent on a timeline, the target person is present in a picture of the candidate video frame, and a video index corresponding to the action is consistent with an index of the candidate video frame in the at least one path of video.
In a preferred embodiment of this embodiment, the determining unit 402 is specifically configured to determine the gaze direction of the target person in each frame of the initial composite video according to the facial pose information of the target person in that frame; determine the position and size of the target person in each frame of the initial composite video according to the gaze direction of the target person in that frame; and determine the position and size of the video picture window of each frame in the initial composite video according to the position and size of the target person in that frame.
In this embodiment, operations performed by each unit in the automatic video editing apparatus are similar to those described in the embodiments shown in fig. 1 to fig. 2, and are not described again here.
In this embodiment, the calculating unit 401 calculates the facial pose information quantization value and the optical flow energy change value of the target person for each video frame; the clipping unit 403 applies the calculated pose information quantization value and optical flow energy change value to the calculation of the return value of an action in the reinforcement learning algorithm, determines the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, takes that next video frame as the new current video frame, and returns to the step of calculating the return value of the action under the current video frame. In this way, each video frame determined from the at least one path of video can maximally present information of the target person while avoiding frames in which the target person is occluded. Meanwhile, the determining unit 402 determines a video picture window based on the position and size of the target person in the video frame, and the extracting unit 405 extracts the video picture relating to the target person according to the video picture window, so that the finally synthesized video maximally presents information about the target person and avoids presenting information about other unrelated persons.
Referring to fig. 5, the automatic video editing apparatus in the embodiment of the present application is described below, where an embodiment of the automatic video editing apparatus in the embodiment of the present application includes:
the video automatic clipping device 500 may include one or more Central Processing Units (CPUs) 501 and a memory 505, where one or more applications or data are stored in the memory 505.
The video automatic clipping device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input-output interfaces 504, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The central processor 501 may perform the operations performed by the automatic video editing apparatus in the embodiments shown in fig. 1 to fig. 2, and details thereof are not repeated herein.
An embodiment of the present application further provides a computer storage medium, where one embodiment includes: the computer storage medium has stored therein instructions that, when executed on a computer, cause the computer to perform the operations performed by the video automatic clipping apparatus in the embodiments of fig. 1 to 2.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or the part thereof contributing to the prior art, may be embodied in whole or in part in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Claims (10)
1. A method for automatic video editing, the method comprising:
calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
taking any video frame in any path of video as a current video frame;
calculating, based on a reinforcement learning algorithm, a return value of an action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, determining the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, taking the next video frame of the current video frame as the new current video frame, and returning to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one path of video respectively;
determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
and extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain a target composite video.
2. The method of claim 1, wherein the calculating the facial pose information of the target person for each video frame in the at least one video comprises:
and calculating the face pose information of the target person of each video frame according to a face pose estimation algorithm, wherein the face pose information comprises a pitch angle, a yaw angle and a roll angle.
3. The method of claim 1, wherein the calculating a pose information quantization value corresponding to the facial pose information comprises:
and calculating the pose information quantization value corresponding to the facial pose information by using a multivariate Gaussian model.
4. The method according to claim 1, wherein said calculating an optical flow energy change value for each video frame comprises:
calculating optical flow information of each video frame and optical flow information of other video frames belonging to the same path of video as each video frame;
calculating optical flow energy of each video frame according to the optical flow information of each video frame, calculating optical flow energy of other video frames according to the optical flow information of other video frames, and calculating an optical flow energy difference value between each video frame and the other video frames and an interval time between each video frame and the other video frames;
and taking the quotient of the optical flow energy difference value and the interval time as the optical flow energy change value of each video frame.
5. The method of claim 1, wherein the calculating the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame comprises:
determining a transition probability of the current video frame to the candidate video frame under the action;
calculating an initial return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame;
taking a product of the initial return value and the transition probability as the return value.
6. The method of claim 5, wherein the determining the transition probability of the current video frame to the candidate video frame under the action comprises:
when the preset conditions are all met, determining that the transition probability is 1; when any one of the preset conditions is not satisfied, determining that the transition probability is 0;
wherein the preset conditions include: the current video frame and the candidate video frame are adjacent on a timeline, the target person is present in a picture of the candidate video frame, and a video index corresponding to the action is consistent with an index of the candidate video frame in the at least one path of video.
7. The method of claim 1, wherein determining the position and size of the video frame window for each frame of the initial composite video based on the position and size of the target person in each frame of the initial composite video comprises:
determining the sight line direction of the target person in each frame of the initial composite video according to the face pose information of the target person in each frame of the initial composite video;
determining the position and the size of the target person in each frame of the initial composite video according to the sight line direction of the target person in each frame of the initial composite video;
and determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video.
8. An apparatus for automatic video editing, the apparatus comprising:
the computing unit is used for computing the face pose information of a target person of each video frame in at least one path of video, computing a pose information quantization value corresponding to the face pose information and computing an optical flow energy change value of each video frame;
the determining unit is used for taking any video frame in any path of video as a current video frame;
a clipping unit, configured to calculate, based on a reinforcement learning algorithm, a return value of an action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, determine the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, take the next video frame of the current video frame as the new current video frame, and return to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one path of video respectively;
the generating unit is used for determining a video frame sequence according to the sequence of the current video frame and obtaining an initial synthetic video based on the video frame sequence;
the determining unit is further configured to determine a position and a size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
and the extraction unit is used for extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain the target composite video.
9. An apparatus for automatic video editing, the apparatus comprising:
a memory for storing a computer program; a processor for implementing the steps of the video automatic clipping method according to any of claims 1 to 7 when executing the computer program.
10. A computer storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110321530.6A CN113038271B (en) | 2021-03-25 | 2021-03-25 | Video automatic editing method, device and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110321530.6A CN113038271B (en) | 2021-03-25 | 2021-03-25 | Video automatic editing method, device and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113038271A true CN113038271A (en) | 2021-06-25 |
CN113038271B CN113038271B (en) | 2023-09-08 |
Family
ID=76473798
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110321530.6A Active CN113038271B (en) | 2021-03-25 | 2021-03-25 | Video automatic editing method, device and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113038271B (en) |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110310237A1 (en) * | 2010-06-17 | 2011-12-22 | Institute For Information Industry | Facial Expression Recognition Systems and Methods and Computer Program Products Thereof |
US20120093365A1 (en) * | 2010-10-15 | 2012-04-19 | Dai Nippon Printing Co., Ltd. | Conference system, monitoring system, image processing apparatus, image processing method and a non-transitory computer-readable storage medium |
US20150318020A1 (en) * | 2014-05-02 | 2015-11-05 | FreshTake Media, Inc. | Interactive real-time video editor and recorder |
CN106534967A (en) * | 2016-10-25 | 2017-03-22 | 司马大大(北京)智能系统有限公司 | Video editing method and device |
EP3410353A1 (en) * | 2017-06-01 | 2018-12-05 | eyecandylab Corp. | Method for estimating a timestamp in a video stream and method of augmenting a video stream with information |
US20190222776A1 (en) * | 2018-01-18 | 2019-07-18 | GumGum, Inc. | Augmenting detected regions in image or video data |
CN108805080A (en) * | 2018-06-12 | 2018-11-13 | 上海交通大学 | Multi-level depth Recursive Networks group behavior recognition methods based on context |
CN109618184A (en) * | 2018-12-29 | 2019-04-12 | 北京市商汤科技开发有限公司 | Method for processing video frequency and device, electronic equipment and storage medium |
US20200380274A1 (en) * | 2019-06-03 | 2020-12-03 | Nvidia Corporation | Multi-object tracking using correlation filters in video analytics applications |
CN110691202A (en) * | 2019-08-28 | 2020-01-14 | 咪咕文化科技有限公司 | Video editing method, device and computer storage medium |
CN111063011A (en) * | 2019-12-16 | 2020-04-24 | 北京蜜莱坞网络科技有限公司 | Face image processing method, device, equipment and medium |
CN111131884A (en) * | 2020-01-19 | 2020-05-08 | 腾讯科技(深圳)有限公司 | Video clipping method, related device, equipment and storage medium |
CN111294524A (en) * | 2020-02-24 | 2020-06-16 | 中移(杭州)信息技术有限公司 | Video editing method and device, electronic equipment and storage medium |
CN111800644A (en) * | 2020-07-14 | 2020-10-20 | 深圳市人工智能与机器人研究院 | Video sharing and acquiring method, server, terminal equipment and medium |
CN112203115A (en) * | 2020-10-10 | 2021-01-08 | 腾讯科技(深圳)有限公司 | Video identification method and related device |
Non-Patent Citations (2)
Title |
---|
LI GANG et al.: "A Survey of Automatic Face Recognition Methods" (人脸自动识别方法综述), Application Research of Computers (计算机应用研究) *
LI GANG et al.: "A Survey of Automatic Face Recognition Methods" (人脸自动识别方法综述), Application Research of Computers (计算机应用研究), no. 08, 28 August 2003 (2003-08-28) *
Also Published As
Publication number | Publication date |
---|---|
CN113038271B (en) | 2023-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11776131B2 (en) | Neural network for eye image segmentation and image quality estimation | |
Zhang et al. | Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks | |
US9813909B2 (en) | Cloud server for authenticating the identity of a handset user | |
WO2022078041A1 (en) | Occlusion detection model training method and facial image beautification method | |
CN108776775B (en) | Old people indoor falling detection method based on weight fusion depth and skeletal features | |
US20200327641A1 (en) | Method and apparatus for generating special deformation effect program file package, and method and apparatus for generating special deformation effects | |
CN111310705A (en) | Image recognition method and device, computer equipment and storage medium | |
Mocanu et al. | Deep-see face: A mobile face recognition system dedicated to visually impaired people | |
TW202141340A (en) | Image processing method, electronic device and computer readable storage medium | |
CN110443230A (en) | Face fusion method, apparatus and electronic equipment | |
CN113160244B (en) | Video processing method, device, electronic equipment and storage medium | |
CN111723707A (en) | Method and device for estimating fixation point based on visual saliency | |
JP6349448B1 (en) | Information processing apparatus, information processing program, and information processing method | |
JP2022185096A (en) | Method and apparatus of generating virtual idol, and electronic device | |
CN113038271A (en) | Video automatic editing method, device and computer storage medium | |
CN112714337A (en) | Video processing method and device, electronic equipment and storage medium | |
CN117455989A (en) | Indoor scene SLAM tracking method and device, head-mounted equipment and medium | |
WO2022222735A1 (en) | Information processing method and apparatus, computer device, and storage medium | |
JP2019040592A (en) | Information processing device, information processing program, and information processing method | |
Li et al. | Real-time human tracking based on switching linear dynamic system combined with adaptive Meanshift tracker | |
JP2009003615A (en) | Attention region extraction method, attention region extraction device, computer program, and recording medium | |
CN115426505B (en) | Preset expression special effect triggering method based on face capture and related equipment | |
US11810353B2 (en) | Methods, systems, and media for detecting two-dimensional videos placed on a sphere in abusive spherical video content | |
CN114356088B (en) | Viewer tracking method and device, electronic equipment and storage medium | |
US20230267671A1 (en) | Apparatus and method for synchronization with virtual avatar, and system for synchronization with virtual avatar |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |