CN113038271A - Video automatic editing method, device and computer storage medium - Google Patents
Video automatic editing method, device and computer storage medium
- Publication number
- CN113038271A (application CN202110321530.6A)
- Authority
- CN
- China
- Prior art keywords
- video
- video frame
- frame
- value
- target person
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/441—Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
- H04N21/4415—Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4662—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/485—End-user interface for client configuration
- H04N21/4858—End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
The embodiment of the application discloses a method, a device and a computer storage medium for automatically clipping a video, so that the clipped video can maximally present the information of a target person while avoiding presenting the information of other unrelated persons. The embodiment of the application comprises the following steps: applying the calculated pose information quantization value and optical flow energy change value to the calculation of the return value of an action in a reinforcement learning algorithm; determining the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame; taking that next video frame as the new current video frame and returning to the step of calculating the return value of the action under the current video frame; and, at the same time, determining a video picture window based on the position and size of the target person in each video frame and extracting the video picture related to the target person according to that window, so that the finally synthesized video maximally presents the information related to the target person and avoids presenting the information of other unrelated persons.
Description
Technical Field
The embodiment of the application relates to the field of video clipping, in particular to a method and a device for automatically clipping a video and a computer storage medium.
Background
In the prior art, automatic video clipping can improve the efficiency of video editing work in fields such as security, education, and film and television entertainment. After a video is clipped, its data volume is greatly reduced and it occupies less storage space, so automatic video clipping can also relieve the storage burden of massive video collections by freeing storage space.
Existing automatic video clipping systems are mainly designed for videos such as dance videos, concert videos, outdoor activity videos and football match videos, and focus on making the video content richer and more diverse in order to increase interest and improve the viewing experience. However, in scenarios where a target person needs to be highlighted in the video, existing automatic video clipping systems do not perform well: because they focus on presenting more video content, they cannot focus on the target person and cannot present more information about that person. At the same time, the clipped video produced by existing systems presents information about other people unrelated to the target person, which may leak the privacy of those other people.
Disclosure of Invention
The embodiment of the application provides a method, a device and a computer storage medium for automatically clipping video, so that the clipped video can maximally present the information of a target person while avoiding presenting the information of other unrelated persons.
A first aspect of an embodiment of the present application provides a method for automatically editing a video, where the method includes:
calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
taking any video frame in any path of video as a current video frame;
calculating, based on a reinforcement learning algorithm, a return value of an action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, determining the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, taking the next video frame of the current video frame as the new current video frame, and returning to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one path of video respectively;
determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
and extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain a target composite video.
A second aspect of the embodiments of the present application provides an automatic video editing apparatus, including:
the computing unit is used for computing the face pose information of a target person of each video frame in at least one path of video, computing a pose information quantization value corresponding to the face pose information and computing an optical flow energy change value of each video frame;
the determining unit is used for taking any video frame in any path of video as a current video frame;
a clipping unit, configured to calculate, based on a reinforcement learning algorithm, a return value of an action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, determine the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, take the next video frame of the current video frame as the new current video frame, and return to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one path of video respectively;
the generating unit is used for determining a video frame sequence according to the sequence of the current video frame and obtaining an initial synthetic video based on the video frame sequence;
the determining unit is further configured to determine a position and a size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
and the extraction unit is used for extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain the target composite video.
A third aspect of the embodiments of the present application provides an automatic video editing apparatus, including:
a memory for storing a computer program; a processor for implementing the steps of the video automatic clipping method according to the aforementioned first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the method of the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
In this embodiment, the facial pose information quantization value and the optical flow energy change value of the target person are calculated for each video frame and applied to the calculation of the return value of an action in a reinforcement learning algorithm; the candidate video frame corresponding to the maximum return value is determined as the next video frame of the current video frame, that next video frame is taken as the new current video frame, and the method returns to the step of calculating the return value of the action under the current video frame. In this way, each video frame determined from the at least one path of video can maximally present information of the target person while avoiding pictures in which the target person is occluded. Meanwhile, a video picture window is determined based on the position and the size of the target person in the video frame, and the video picture about the target person is extracted according to the video picture window, so that the finally synthesized video maximally presents the information about the target person and avoids presenting the information about other unrelated persons.
Drawings
FIG. 1 is a flowchart illustrating an automatic video editing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating an exemplary method for automatically editing a video according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of facial pose information according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an automatic video editing apparatus according to an embodiment of the present application;
FIG. 5 is a schematic view of another structure of an automatic video editing apparatus according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a method, a device and a computer storage medium for automatically clipping video, so that the clipped video can maximally present the information of a target person while avoiding presenting the information of other unrelated persons.
Referring to fig. 1, an embodiment of an automatic video editing method according to the embodiment of the present application includes:
101. calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
The method of this embodiment can be applied to an automatic video clipping device, which can be a computer device with data processing capability, such as a terminal or a server.
In an application scenario in which a target person in a video needs to be highlighted, the embodiment acquires at least one path of video whose pictures each include the target person. The task of the embodiment is to automatically clip the at least one path of video so that the clipped video mainly presents information of the target person and of any object interacting with the target person, while ensuring that information of other unrelated persons is not displayed in the clipped video, thereby protecting the privacy of those unrelated persons.
After the at least one path of video is obtained, the facial pose information of the target person is calculated for each video frame of each path of video, and the pose information quantization value corresponding to the facial pose information is calculated. In addition, the embodiment proposes a way to determine whether the target person in the video picture is occluded: the occlusion situation of the target person is determined from the optical flow energy change value. Therefore, this step also calculates the optical flow energy change value of each video frame, which reflects the occlusion situation of the target person.
102. Taking any video frame in any path of video as a current video frame;
the present embodiment employs a reinforcement learning algorithm to determine each frame in the video generated by the clipping, the video frame being a state in the reinforcement learning algorithm. The user can designate any video frame in any path of video as the current video frame, so that the automatic video clipping device determines the current video frame according to the designation of the user, the current video frame is used as one state in the reinforcement learning algorithm, and the next state is determined according to the state corresponding to the current video frame in the subsequent steps.
103. Calculating a return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to execute the step of calculating the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame based on the reinforcement learning algorithm;
in the automatic clipping process, the present embodiment determines each video frame in the clip-generated video in turn. Specifically, after determining the current video frame, the video automatic clipping device calculates a return value of the action under the current video frame according to the quantized value of the pose information and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm.
The reinforcement learning algorithm of the present embodiment may specifically be a Markov decision process. In the reinforcement learning algorithm, the larger the return value of an action, the more meaningful the action is; the virtual agent optimizes its policy according to the action corresponding to the maximum return value and then takes the next action according to the optimized policy. Therefore, after calculating the return values of a plurality of actions under the current video frame, the candidate video frame selected by the action with the maximum return value is determined as the next video frame of the current video frame.
After the next video frame of the current video frame is determined, it is taken as the new current video frame, and the method returns to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame. An action under the current video frame refers to selecting one candidate video frame from each video of the at least one path of video.
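For illustration only (this is not part of the original disclosure), the selection loop described above can be sketched as follows; the reward function is passed in as a black box standing in for the pose-quantization and optical-flow terms, and all function and parameter names are assumptions introduced here.

```python
# Minimal sketch of the greedy frame-selection loop: at each step, evaluate one
# action per video channel (take that channel's frame at time t+1) and move to
# the candidate with the maximum return value.
def select_frame_sequence(videos, reward_fn, start=(0, 0)):
    """videos    : list of C lists of per-frame feature records
       reward_fn : callable(current_frame, candidate_frame) -> float
       start     : (channel, time) index of the user-designated starting frame
       returns   : list of (channel, time) pairs, one per selected frame"""
    c, t = start
    num_frames = len(videos[0])
    sequence = [(c, t)]
    while t + 1 < num_frames:
        # One action per video channel: pick that channel's frame at time t+1.
        candidates = [(ch, t + 1) for ch in range(len(videos))]
        rewards = [reward_fn(videos[c][t], videos[ch][tt])
                   for ch, tt in candidates]
        best = max(range(len(candidates)), key=lambda i: rewards[i])
        c, t = candidates[best]            # becomes the new current frame
        sequence.append((c, t))
    return sequence
```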
104. Determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
Each video frame can be determined in sequence through step 103, and the plurality of determined video frames have a sequential order. The video frame sequence can therefore be determined according to the order in which the current video frames were determined in step 103, and an initial composite video is obtained based on this video frame sequence.
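As an illustration of how the determined frame sequence can be assembled into the initial composite video, the sketch below uses OpenCV's VideoWriter; the codec and frame rate are ordinary choices assumed here, not values specified by the embodiment.

```python
# Illustrative assembly of the selected frames into the initial composite video.
import cv2

def write_initial_composite(videos, sequence, out_path, fps=25.0):
    """videos   : list of C lists of decoded frames (H x W x 3 numpy arrays)
       sequence : list of (channel, time) pairs from the selection step"""
    h, w = videos[0][0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (w, h))
    for c, t in sequence:
        writer.write(videos[c][t])
    writer.release()
```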
105. Determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
after the initial composite video is obtained, since the present embodiment aims to emphasize the target person in the video picture, the position and size of the target person in the initial composite video in each frame of picture are further determined, and the position and size of the video picture window of each frame in the initial composite video are determined according to the position and size of the target person in each frame of picture.
106. Extracting a video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video;
After the position and the size of the video picture window are determined, the video picture of each frame in the initial composite video is extracted based on that position and size, so that a target composite video is obtained and the automatic clipping of the video is realized.
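A minimal sketch of this extraction step is shown below, assuming each frame is a NumPy array and each window is given as an (x, y, w, h) tuple; resizing the crops to a common output resolution is an added assumption made only to produce a playable video.

```python
# Crop the per-frame video picture window out of the initial composite video to
# obtain the frames of the target composite video.
import cv2

def extract_windows(frames, windows, out_size=(1280, 720)):
    """frames  : list of H x W x 3 numpy arrays (initial composite video)
       windows : list of (x, y, w, h) tuples, one per frame
       returns : list of cropped, resized frames (target composite video)"""
    crops = []
    for frame, (x, y, w, h) in zip(frames, windows):
        x, y, w, h = int(x), int(y), int(w), int(h)
        x, y = max(x, 0), max(y, 0)
        crop = frame[y:y + h, x:x + w]
        crops.append(cv2.resize(crop, out_size))   # uniform output resolution
    return crops
```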
In this embodiment, the facial pose information quantization value and the optical flow energy change value of the target person are calculated for each video frame and applied to the calculation of the return value of an action in a reinforcement learning algorithm; the candidate video frame corresponding to the maximum return value is determined as the next video frame of the current video frame, that next video frame is taken as the new current video frame, and the method returns to the step of calculating the return value of the action under the current video frame. In this way, each video frame determined from the at least one path of video can maximally present information of the target person while avoiding pictures in which the target person is occluded. Meanwhile, a video picture window is determined based on the position and the size of the target person in the video frame, and the video picture about the target person is extracted according to the video picture window, so that the finally synthesized video maximally presents the information about the target person and avoids presenting the information about other unrelated persons.
The embodiments of the present application will be described in further detail below on the basis of the aforementioned embodiment shown in fig. 1. Referring to fig. 2, another embodiment of the method for automatically editing a video according to the embodiment of the present application includes:
201. calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
In this embodiment, the facial pose information of the target person in each video frame may be calculated according to a face pose estimation algorithm. Specifically, the face pose information calculated by the face pose estimation algorithm may be represented by a rotation matrix, a rotation vector, a quaternion, or Euler angles. Since Euler angles are more readable, it may be preferable to use Euler angles to represent the facial pose information. As shown in fig. 3, the facial pose information of the target person, such as the pitch angle (pitch), the yaw angle (yaw), and the roll angle (roll), can be calculated according to the face pose estimation algorithm.
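For illustration, a head-pose result expressed as a rotation vector (for example, the output of a PnP-based pose solver) can be converted into Euler angles as sketched below; which angle is called pitch, yaw or roll depends on the camera coordinate convention, which is an assumption here.

```python
# Convert a rotation vector into pitch / yaw / roll angles in degrees.
import cv2
import numpy as np

def rotation_vector_to_euler(rvec):
    R, _ = cv2.Rodrigues(np.asarray(rvec, dtype=float).reshape(3, 1))
    sy = np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)          # |cos(yaw)|
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw = np.degrees(np.arctan2(-R[2, 0], sy))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return pitch, yaw, roll
```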
In order to keep the consistency between the calculated face pose information and the symmetric structure of the face, the present embodiment uses a multivariate gaussian model to calculate a pose information quantization value corresponding to the face pose information. Further, for convenience of calculation, the calculated attitude information quantized value may be normalized, and the normalized attitude information quantized value is used in a subsequent calculation process.
Specifically, the optical flow energy change value in this embodiment is calculated as follows: calculate the optical flow information of each video frame in the at least one path of video and the optical flow information of the other video frames belonging to the same path of video as that frame; calculate the optical flow energy of each video frame according to its optical flow information, and the optical flow energy of the other video frames according to their optical flow information; calculate the optical flow energy difference between each video frame and the other video frames, as well as the interval time between them; and take the quotient of the optical flow energy difference and the interval time as the optical flow energy change value of each video frame.
For example, assume the video automatic clipping device acquires C paths of video (C ≥ 1), each containing T video frames. A video frame in the C paths of video can be denoted f_{c,t} (c = 1, …, C; t = 1, …, T), and the frame belonging to the same path of video and adjacent to f_{c,t} can be denoted f_{c,t+1}. The optical flow information of f_{c,t} and of f_{c,t+1} are calculated respectively; the optical flow energy of f_{c,t} is calculated from the optical flow information of f_{c,t}, and the optical flow energy of f_{c,t+1} is calculated from the optical flow information of f_{c,t+1}. The difference between the optical flow energies of f_{c,t} and f_{c,t+1} and the interval time between f_{c,t} and f_{c,t+1} are then computed, and the quotient of the optical flow energy difference and the interval time is taken as the optical flow energy change value of f_{c,t}.
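A hedged sketch of this computation is given below using dense Farneback optical flow from OpenCV; the embodiment does not spell out its exact energy definition, so summing the squared flow magnitudes over the frame is an assumption, as are the function names.

```python
# Optical flow energy and its change value for frames of one video channel.
import cv2
import numpy as np

def flow_energy(prev_bgr, next_bgr):
    """Dense Farneback flow between two frames; energy = sum of squared
    flow magnitudes (an assumed energy definition)."""
    g0 = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.sum(flow[..., 0] ** 2 + flow[..., 1] ** 2))

def flow_energy_change(frames, t, fps):
    """Energy difference between f_{c,t} and its neighbour f_{c,t+1} in the
    same channel, divided by their interval time (1 / fps seconds).
    Requires t + 2 < len(frames) so both per-frame flows can be computed."""
    e_t = flow_energy(frames[t], frames[t + 1])        # energy of f_{c,t}
    e_t1 = flow_energy(frames[t + 1], frames[t + 2])   # energy of f_{c,t+1}
    return (e_t1 - e_t) * fps
```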
In the multivariate Gaussian model, when the Euler angles of the face tend to 0, the pose information quantization value corresponding to the facial pose information is maximal; when the face deflects so that an Euler angle is no longer equal to 0, the pose information quantization value decreases. The magnitude of the change in the pose information quantization value caused by the Euler angles can be controlled by the covariance matrix. Therefore, the variance of each Euler angle can be set to adjust the degree of influence of that angle on the pose information quantization value.
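The pose information quantization value can be sketched as an unnormalized multivariate Gaussian over the three Euler angles, as below; the per-angle sigmas are illustrative placeholders, since the embodiment does not state the covariance values.

```python
# Pose quantization: maximal when the face looks straight at the camera
# (all Euler angles near 0), decreasing as the face turns away.
import numpy as np

def pose_quantization(pitch, yaw, roll, sigmas=(20.0, 30.0, 20.0)):
    angles = np.array([pitch, yaw, roll], dtype=float)   # degrees
    sig = np.array(sigmas, dtype=float)
    return float(np.exp(-0.5 * np.sum((angles / sig) ** 2)))   # in (0, 1]
```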
202. Taking any video frame in any path of video as a current video frame;
the operation performed in this step is similar to the operation performed in step 102 in the embodiment shown in fig. 1, and is not repeated here.
203. Calculating a return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to execute the step of calculating the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame based on the reinforcement learning algorithm;
In this embodiment, when calculating the return value of an action, a specific calculation method is to determine the transition probability from the current video frame to the candidate video frame under the action, calculate an initial return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, and take the product of the initial return value and the transition probability as the return value of the action.
Specifically, the transition probability is determined to be 1 when all of the preset conditions are met, and to be 0 when any one of the preset conditions is not satisfied. The preset conditions include: the current video frame and the candidate video frame are adjacent on the timeline, the target person is present in the picture of the candidate video frame, and the video index corresponding to the action is consistent with the index of the candidate video frame in the at least one path of video.
That the current video frame and the candidate video frame are adjacent on the timeline means that they are adjacent on the timeline of the video. For example, if the current video frame is the t-th frame of the first path of video, the candidate video frame may be the (t+1)-th frame of the same path of video or of another path of video (e.g., the second path, the third path, etc.).
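The transition probability and return value described above can be sketched as follows; the FrameInfo fields and the way the pose quantization and flow-energy change are weighted into the initial return value are assumptions, since the embodiment only states that both quantities enter the calculation.

```python
# Transition probability (0 or 1) and return value of an action.
from dataclasses import dataclass

@dataclass
class FrameInfo:
    channel: int            # which of the C video channels the frame is from
    t: int                  # position on the timeline
    has_target_person: bool
    pose_q: float           # pose information quantization value
    flow_change: float      # optical flow energy change value

def transition_probability(current, candidate, action_channel):
    adjacent = candidate.t == current.t + 1
    channel_match = candidate.channel == action_channel
    ok = adjacent and candidate.has_target_person and channel_match
    return 1 if ok else 0

def action_return(current, candidate, action_channel, w_pose=1.0, w_flow=1.0):
    # Initial return value from the candidate's pose quantization and flow
    # energy change, then multiplied by the transition probability.
    initial = w_pose * candidate.pose_q - w_flow * abs(candidate.flow_change)
    return initial * transition_probability(current, candidate, action_channel)
```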
204. Determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
the operation performed in this step is similar to the operation performed in step 104 in the embodiment shown in fig. 1, and is not described here again.
205. Determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
In this embodiment, the position and size of the target person in the video frame may be determined according to information about the target person and about an object interacting with the target person, where the object interacting with the target person may be the object on which the target person's line of sight is focused. For example, if the target person's gaze is focused on a chair, the determined position and size of the target person in the video frame should cover not only the target person but also that chair.
Specifically, the gaze direction of the target person in each frame of the initial composite video may be determined from the facial pose information of the target person in that frame; for example, a raised face may be interpreted as looking up, and a lowered face as looking down.
For example, looking to the left and looking to the right depend mainly on the yaw angle (yaw) in the face pose information, and therefore, the line of sight direction of the target person can be determined according to the yaw angle. To facilitate the subsequent calculation process, the gaze direction may be quantified, for example, the gaze direction may be represented numerically based on the following formula:
where g denotes the gaze direction value and φ denotes the yaw angle. The gaze direction value corresponding to a given yaw angle can therefore be determined from this formula.
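Since the formula itself is not reproduced in this text, the sketch below is only a plausible placeholder mapping the yaw angle to a discrete gaze value; the threshold and the three-way bucketing are assumptions.

```python
# Placeholder quantization of the gaze direction from the yaw angle.
def gaze_direction(yaw_deg, threshold=15.0):
    if yaw_deg > threshold:
        return 1     # looking toward one side
    if yaw_deg < -threshold:
        return -1    # looking toward the other side
    return 0         # roughly facing the camera
```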
After the value of the gaze direction is determined, the position and size of the target person in each frame of the initial composite video are further determined according to the gaze direction of the target person in that frame, and the position and size of the video picture window of each frame in the initial composite video are determined according to the position and size of the target person in that frame.
Specifically, the position and size of the target person in each frame of the initial composite video can be expressed as a quadruple consisting of the coordinates of the target person in the frame, from which the position of the target person in the video frame can be determined, together with the width and the height of the target person in the frame, from which the size of the target person in the video frame can be determined.
Meanwhile, the position and size of the video picture window of each frame in the initial composite video can be expressed as a quadruple consisting of the coordinates of the video picture window in the video frame, from which the specific position of the video picture window in each frame of the initial composite video can be determined, together with the width and the height of the video picture window, from which the size of the video picture window can be determined.
In this embodiment, the position and size of the video picture window are obtained by solving an objective function defined over the window parameters. Specifically, the position and size of the video picture window of each frame can be calculated according to the following objective function:

where c_t refers to the initial composite video, t refers to any one of the video frames of c_t, and g_t refers to the aforementioned gaze direction value.

Since the objective function is convex, it can be solved with a convex optimization algorithm to obtain the optimal position and size of the video picture window for each frame of the initial composite video c_t, i.e., the optimal solution of the window parameters.
Therefore, as can be seen from the objective function, when determining the position and size of the video picture window, the object with which the target person interacts along the line-of-sight direction is also taken into account, so that this object is included in the video picture window as well.
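Because the objective function is likewise not reproduced here, the following sketch is only an illustrative stand-in for the window-placement step: it centers a window on the target person and shifts it along the quantized gaze direction so that the object being looked at also tends to fall inside the window. All parameters are assumptions.

```python
# Illustrative window placement from the person box and quantized gaze value.
def window_for_frame(person_box, gaze, frame_w, frame_h,
                     scale=2.0, gaze_shift=0.25):
    """person_box : (x, y, w, h) of the target person in this frame
       gaze       : quantized gaze value (-1, 0 or 1)
       returns    : (x, y, w, h) of the video picture window"""
    px, py, pw, ph = person_box
    win_w = min(pw * scale, frame_w)
    win_h = min(ph * scale, frame_h)
    cx = px + pw / 2 + gaze * gaze_shift * win_w   # bias toward the gaze
    cy = py + ph / 2
    x = min(max(cx - win_w / 2, 0), frame_w - win_w)
    y = min(max(cy - win_h / 2, 0), frame_h - win_h)
    return (x, y, win_w, win_h)
```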
206. Extracting a video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video;
After the position and size of the video picture window are determined, the video picture of each frame in the initial composite video may be extracted based on that position and size, and the extracted video pictures constitute the target composite video. As described above, since the position and size of the video picture window are determined based on the target person and the object interacting with the target person, the video picture extracted through the window contains the information of the target person and of that object while excluding other unrelated persons. On the one hand, the information of the target person is presented maximally; on the other hand, the information of other unrelated persons is not presented, so the problem of privacy disclosure is avoided.
In this embodiment, the information of the target person is highlighted in the automatically clipped video and the privacy disclosure problem is avoided, so the technical scheme has practical application value and its realizability is improved.
The video automatic clipping method in the embodiment of the present application has been described above. Referring to fig. 4, the video automatic clipping device in the embodiment of the present application is described below; an embodiment of the video automatic clipping device in the embodiment of the present application includes:
a calculating unit 401, configured to calculate facial pose information of a target person of each video frame in at least one path of video, calculate a pose information quantization value corresponding to the facial pose information, and calculate an optical flow energy change value of each video frame;
a determining unit 402, configured to use any video frame in any channel of video as a current video frame;
a clipping unit 403, configured to calculate, based on a reinforcement learning algorithm, a return value of an action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, determine the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, take the next video frame of the current video frame as the new current video frame, and return to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one path of video respectively;
a generating unit 404, configured to determine a video frame sequence according to a sequence of a current video frame, and obtain an initial synthesized video based on the video frame sequence;
the determining unit 402 is further configured to determine a position and a size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
an extracting unit 405, configured to extract a video picture of each frame in the initial composite video based on the position and the size of the video picture window, so as to obtain a target composite video.
In a preferred embodiment of this embodiment, the calculating unit 401 is specifically configured to calculate the face pose information of the target person of each video frame according to a face pose estimation algorithm, where the face pose information includes a pitch angle, a yaw angle and a roll angle.
In a preferred embodiment of this embodiment, the calculating unit 401 is specifically configured to calculate the quantized value of the pose information corresponding to the facial pose information by using a multivariate gaussian model.
In a preferred embodiment of this embodiment, the calculating unit 401 is specifically configured to calculate optical flow information of each video frame and optical flow information of other video frames belonging to the same path of video as each video frame; calculate optical flow energy of each video frame according to the optical flow information of each video frame, calculate optical flow energy of the other video frames according to the optical flow information of the other video frames, and calculate an optical flow energy difference value between each video frame and the other video frames and an interval time between each video frame and the other video frames; and take the quotient of the optical flow energy difference value and the interval time as the optical flow energy change value of each video frame.
In a preferred implementation manner of this embodiment, the clipping unit 403 is specifically configured to determine a transition probability from the current video frame to the candidate video frame under the action; calculate an initial return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; and take the product of the initial return value and the transition probability as the return value.
In a preferred implementation manner of this embodiment, the clipping unit 403 is specifically configured to determine that the transition probability is 1 when the preset conditions are all met; and when any one of the preset conditions is not satisfied, determine that the transition probability is 0;
wherein the preset conditions include: the current video frame and the candidate video frame are adjacent on a timeline, the target person is present in a picture of the candidate video frame, and a video index corresponding to the action is consistent with an index of the candidate video frame in the at least one path of video.
In a preferred embodiment of this embodiment, the determining unit 402 is specifically configured to determine the gaze direction of the target person in each frame of the initial composite video according to the facial pose information of the target person in that frame; determine the position and size of the target person in each frame of the initial composite video according to the gaze direction of the target person in that frame; and determine the position and size of the video picture window of each frame in the initial composite video according to the position and size of the target person in that frame.
In this embodiment, operations performed by each unit in the automatic video editing apparatus are similar to those described in the embodiments shown in fig. 1 to fig. 2, and are not described again here.
In this embodiment, the calculating unit 401 calculates the facial pose information quantization value and the optical flow energy change value of the target person for each video frame; the clipping unit 403 applies the calculated pose information quantization value and optical flow energy change value to the calculation of the return value of an action in the reinforcement learning algorithm, determines the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, takes that next video frame as the new current video frame, and returns to the step of calculating the return value of the action under the current video frame. In this way, each video frame determined from the at least one path of video can maximally present information of the target person while avoiding frames in which the target person is occluded. Meanwhile, the determining unit 402 determines a video picture window based on the position and size of the target person in the video frame, and the extracting unit 405 extracts the video picture relating to the target person according to the video picture window, so that the finally synthesized video maximally presents information about the target person and avoids presenting information about other unrelated persons.
Referring to fig. 5, the automatic video editing apparatus in the embodiment of the present application is described below, where an embodiment of the automatic video editing apparatus in the embodiment of the present application includes:
the video automatic clipping device 500 may include one or more Central Processing Units (CPUs) 501 and a memory 505, where one or more applications or data are stored in the memory 505.
The video automatic clipping device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input-output interfaces 504, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The central processor 501 may perform the operations performed by the automatic video editing apparatus in the embodiments shown in fig. 1 to fig. 2, and details thereof are not repeated herein.
An embodiment of the present application further provides a computer storage medium, where one embodiment includes: the computer storage medium has stored therein instructions that, when executed on a computer, cause the computer to perform the operations performed by the video automatic clipping apparatus in the embodiments of fig. 1 to 2.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or the part thereof contributing to the prior art, may be embodied in whole or in part in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Claims (10)
1. A method for automatic video editing, the method comprising:
calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
taking any video frame in any path of video as a current video frame;
calculating, based on a reinforcement learning algorithm, a return value of an action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, determining the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, taking the next video frame of the current video frame as the new current video frame, and returning to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one path of video respectively;
determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
and extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain a target composite video.
2. The method of claim 1, wherein the calculating the facial pose information of the target person for each video frame in the at least one video comprises:
and calculating the face pose information of the target person of each video frame according to a face pose estimation algorithm, wherein the face pose information comprises a pitch angle, a yaw angle and a roll angle.
3. The method of claim 1, wherein the calculating a pose information quantization value corresponding to the facial pose information comprises:
and calculating the pose information quantization value corresponding to the facial pose information by using a multivariate Gaussian model.
4. The method according to claim 1, wherein said calculating an optical flow energy change value for each video frame comprises:
calculating optical flow information of each video frame and optical flow information of other video frames belonging to the same path of video as each video frame;
calculating optical flow energy of each video frame according to the optical flow information of each video frame, calculating optical flow energy of other video frames according to the optical flow information of other video frames, and calculating an optical flow energy difference value between each video frame and the other video frames and an interval time between each video frame and the other video frames;
and taking the quotient of the optical flow energy difference value and the interval time as the optical flow energy change value of each video frame.
5. The method of claim 1, wherein the calculating the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame comprises:
determining a transition probability of the current video frame to the candidate video frame under the action;
calculating an initial return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame;
taking a product of the initial return value and the transition probability as the return value.
6. The method of claim 5, wherein the determining the transition probability of the current video frame to the candidate video frame under the action comprises:
when the preset conditions are all met, determining that the transition probability is 1; when any one of the preset conditions is not satisfied, determining that the transition probability is 0;
wherein the preset conditions include: the current video frame and the candidate video frame are adjacent on a timeline, the target person is present in a picture of the candidate video frame, and a video index corresponding to the action is consistent with an index of the candidate video frame in the at least one path of video.
7. The method of claim 1, wherein determining the position and size of the video frame window for each frame of the initial composite video based on the position and size of the target person in each frame of the initial composite video comprises:
determining the sight line direction of the target person in each frame of the initial composite video according to the face pose information of the target person in each frame of the initial composite video;
determining the position and the size of the target person in each frame of the initial composite video according to the sight line direction of the target person in each frame of the initial composite video;
and determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video.
8. An apparatus for automatic video editing, the apparatus comprising:
the computing unit is used for computing the face pose information of a target person of each video frame in at least one path of video, computing a pose information quantization value corresponding to the face pose information and computing an optical flow energy change value of each video frame;
the determining unit is used for taking any video frame in any path of video as a current video frame;
a clipping unit, configured to calculate, based on a reinforcement learning algorithm, a return value of an action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame, determine the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, take the next video frame of the current video frame as the new current video frame, and return to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one path of video respectively;
the generating unit is used for determining a video frame sequence according to the sequence of the current video frame and obtaining an initial synthetic video based on the video frame sequence;
the determining unit is further configured to determine a position and a size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
and the extraction unit is used for extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain the target composite video.
9. An apparatus for automatic video editing, the apparatus comprising:
a memory for storing a computer program; a processor for implementing the steps of the video automatic clipping method according to any of claims 1 to 7 when executing the computer program.
10. A computer storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110321530.6A CN113038271B (en) | 2021-03-25 | 2021-03-25 | Video automatic editing method, device and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110321530.6A CN113038271B (en) | 2021-03-25 | 2021-03-25 | Video automatic editing method, device and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113038271A true CN113038271A (en) | 2021-06-25 |
CN113038271B CN113038271B (en) | 2023-09-08 |
Family
ID=76473798
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110321530.6A Active CN113038271B (en) | 2021-03-25 | 2021-03-25 | Video automatic editing method, device and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113038271B (en) |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110310237A1 (en) * | 2010-06-17 | 2011-12-22 | Institute For Information Industry | Facial Expression Recognition Systems and Methods and Computer Program Products Thereof |
US20120093365A1 (en) * | 2010-10-15 | 2012-04-19 | Dai Nippon Printing Co., Ltd. | Conference system, monitoring system, image processing apparatus, image processing method and a non-transitory computer-readable storage medium |
US20150318020A1 (en) * | 2014-05-02 | 2015-11-05 | FreshTake Media, Inc. | Interactive real-time video editor and recorder |
CN106534967A (en) * | 2016-10-25 | 2017-03-22 | 司马大大(北京)智能系统有限公司 | Video editing method and device |
EP3410353A1 (en) * | 2017-06-01 | 2018-12-05 | eyecandylab Corp. | Method for estimating a timestamp in a video stream and method of augmenting a video stream with information |
US20190222776A1 (en) * | 2018-01-18 | 2019-07-18 | GumGum, Inc. | Augmenting detected regions in image or video data |
CN108805080A (en) * | 2018-06-12 | 2018-11-13 | 上海交通大学 | Multi-level depth Recursive Networks group behavior recognition methods based on context |
CN109618184A (en) * | 2018-12-29 | 2019-04-12 | 北京市商汤科技开发有限公司 | Method for processing video frequency and device, electronic equipment and storage medium |
US20200380274A1 (en) * | 2019-06-03 | 2020-12-03 | Nvidia Corporation | Multi-object tracking using correlation filters in video analytics applications |
CN110691202A (en) * | 2019-08-28 | 2020-01-14 | 咪咕文化科技有限公司 | Video editing method, device and computer storage medium |
CN111063011A (en) * | 2019-12-16 | 2020-04-24 | 北京蜜莱坞网络科技有限公司 | Face image processing method, device, equipment and medium |
CN111131884A (en) * | 2020-01-19 | 2020-05-08 | 腾讯科技(深圳)有限公司 | Video clipping method, related device, equipment and storage medium |
CN111294524A (en) * | 2020-02-24 | 2020-06-16 | 中移(杭州)信息技术有限公司 | Video editing method and device, electronic equipment and storage medium |
CN111800644A (en) * | 2020-07-14 | 2020-10-20 | 深圳市人工智能与机器人研究院 | Video sharing and acquiring method, server, terminal equipment and medium |
CN112203115A (en) * | 2020-10-10 | 2021-01-08 | 腾讯科技(深圳)有限公司 | Video identification method and related device |
Non-Patent Citations (2)
Title |
---|
LI GANG et al.: "A Survey of Automatic Face Recognition Methods" (人脸自动识别方法综述), Application Research of Computers (计算机应用研究) *
LI GANG et al.: "A Survey of Automatic Face Recognition Methods" (人脸自动识别方法综述), Application Research of Computers (计算机应用研究), no. 08, 28 August 2003 (2003-08-28) *
Also Published As
Publication number | Publication date |
---|---|
CN113038271B (en) | 2023-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11776131B2 (en) | Neural network for eye image segmentation and image quality estimation | |
Zhang et al. | Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks | |
US9813909B2 (en) | Cloud server for authenticating the identity of a handset user | |
WO2022078041A1 (en) | Occlusion detection model training method and facial image beautification method | |
CN108776775B (en) | Old people indoor falling detection method based on weight fusion depth and skeletal features | |
US20200327641A1 (en) | Method and apparatus for generating special deformation effect program file package, and method and apparatus for generating special deformation effects | |
CN111310705A (en) | Image recognition method and device, computer equipment and storage medium | |
Mocanu et al. | Deep-see face: A mobile face recognition system dedicated to visually impaired people | |
TW202141340A (en) | Image processing method, electronic device and computer readable storage medium | |
CN110443230A (en) | Face fusion method, apparatus and electronic equipment | |
CN113160244B (en) | Video processing method, device, electronic equipment and storage medium | |
CN111723707A (en) | Method and device for estimating fixation point based on visual saliency | |
JP6349448B1 (en) | Information processing apparatus, information processing program, and information processing method | |
JP2022185096A (en) | Method and apparatus of generating virtual idol, and electronic device | |
CN113038271A (en) | Video automatic editing method, device and computer storage medium | |
CN112714337A (en) | Video processing method and device, electronic equipment and storage medium | |
CN117455989A (en) | Indoor scene SLAM tracking method and device, head-mounted equipment and medium | |
WO2022222735A1 (en) | Information processing method and apparatus, computer device, and storage medium | |
JP2019040592A (en) | Information processing device, information processing program, and information processing method | |
Li et al. | Real-time human tracking based on switching linear dynamic system combined with adaptive Meanshift tracker | |
JP2009003615A (en) | Attention region extraction method, attention region extraction device, computer program, and recording medium | |
CN115426505B (en) | Preset expression special effect triggering method based on face capture and related equipment | |
US11810353B2 (en) | Methods, systems, and media for detecting two-dimensional videos placed on a sphere in abusive spherical video content | |
CN114356088B (en) | Viewer tracking method and device, electronic equipment and storage medium | |
US20230267671A1 (en) | Apparatus and method for synchronization with virtual avatar, and system for synchronization with virtual avatar |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |