
CN113038271A - Video automatic editing method, device and computer storage medium - Google Patents

Video automatic editing method, device and computer storage medium

Info

Publication number
CN113038271A
Authority
CN
China
Prior art keywords
video
video frame
frame
value
target person
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110321530.6A
Other languages
Chinese (zh)
Other versions
CN113038271B (en)
Inventor
黄锐
胡攀文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Artificial Intelligence and Robotics
Original Assignee
Shenzhen Institute of Artificial Intelligence and Robotics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Artificial Intelligence and Robotics filed Critical Shenzhen Institute of Artificial Intelligence and Robotics
Priority to CN202110321530.6A priority Critical patent/CN113038271B/en
Publication of CN113038271A publication Critical patent/CN113038271A/en
Application granted granted Critical
Publication of CN113038271B publication Critical patent/CN113038271B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/441Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N21/4415Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4858End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiment of the application discloses a method and a device for automatically clipping a video and a computer storage medium, so that the clipped video can maximally present the information of a target person and avoid presenting the information of other, unrelated persons. The embodiment of the application comprises the following steps: applying the calculated pose information quantization value and optical flow energy change value to the calculation of the return value of an action in a reinforcement learning algorithm; determining the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame; taking the next video frame of the current video frame as the new current video frame and returning to the step of calculating the return value of the action under the current video frame; meanwhile, determining a video picture window based on the position and size of the target person in the video frame, and extracting the video picture related to the target person according to the video picture window, so that the finally synthesized video maximally presents the information related to the target person and avoids presenting the information of other, unrelated persons.

Description

Video automatic editing method, device and computer storage medium
Technical Field
The embodiment of the application relates to the field of video clipping, in particular to a method and a device for automatically clipping a video and a computer storage medium.
Background
In the prior art, automatic video clipping can improve the efficiency of video editing in fields such as security, education, and film and television entertainment. After a video is clipped, its data volume is greatly reduced and the storage space it occupies shrinks, so automatic video clipping can also relieve the problem of storing massive amounts of video: once a video has been automatically clipped, more storage space can be released.
Existing automatic video clipping systems are mainly designed for videos such as dance videos, concert videos, outdoor activity videos and football match videos, and focus on making the video content richer and more diversified so as to increase interest and improve the viewing experience. However, in scenarios where a target person needs to be highlighted in the video, existing systems do not handle the task well: because they focus on presenting more video content, they cannot focus on the target person or present more information about the target person. Meanwhile, an existing automatic video clipping system presents, in the clipped video, information about other people unrelated to the target person, which may leak the privacy of those other people.
Disclosure of Invention
The embodiment of the application provides a method and a device for automatically editing video and a computer storage medium, so that the video generated by editing can maximally present the information of a target person and avoid presenting the information of other unrelated persons.
A first aspect of an embodiment of the present application provides a method for automatically editing a video, where the method includes:
calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
taking any video frame in any path of video as a current video frame;
calculating a return value of the action under the current video frame according to a posture information quantization value and an optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, returning to execute the reinforcement learning algorithm, and calculating the return value of the action under the current video frame according to the posture information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one video respectively;
determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of each frame of the target person in the initial composite video;
and extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain a target composite video.
A second aspect of the embodiments of the present application provides an automatic video editing apparatus, including:
the computing unit is used for computing the face pose information of a target person of each video frame in at least one path of video, computing a pose information quantization value corresponding to the face pose information and computing an optical flow energy change value of each video frame;
the determining unit is used for taking any video frame in any path of video as a current video frame;
a clipping unit, configured to calculate a return value of an action under a current video frame according to a pose information quantization value and an optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determine a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, take the next video frame of the current video frame as a new current video frame, and return to executing the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one video respectively;
the generating unit is used for determining a video frame sequence according to the sequence of the current video frame and obtaining an initial synthetic video based on the video frame sequence;
the determining unit is further configured to determine a position and a size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
and the extraction unit is used for extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain the target composite video.
A third aspect of the embodiments of the present application provides an automatic video editing apparatus, including:
a memory for storing a computer program; a processor for implementing the steps of the video automatic clipping method according to the aforementioned first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the method of the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
in this embodiment, by calculating a facial pose information quantization value and an optical flow energy change value of the target person for each video frame, applying the calculated pose information quantization value and optical flow energy change value to the calculation of the return value of an action in a reinforcement learning algorithm, determining the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, taking the next video frame of the current video frame as the new current video frame, and returning to the step of calculating the return value of the action under the current video frame, the video frame determined from the at least one video each time can maximally present information of the target person and avoid presenting pictures in which the target person is occluded. Meanwhile, a video picture window is determined based on the position and the size of the target person in the video frame, and a video picture about the target person is extracted according to the video picture window, so that the finally synthesized video maximally presents the information about the target person and avoids presenting the information of other, unrelated persons.
Drawings
FIG. 1 is a flowchart illustrating an automatic video editing method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating an exemplary method for automatically editing a video according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of facial pose information according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an automatic video editing apparatus according to an embodiment of the present application;
FIG. 5 is a schematic view of another structure of an automatic video editing apparatus according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a method and a device for automatically editing video and a computer storage medium, so that the video generated by editing can maximally present the information of a target person and avoid presenting the information of other unrelated persons.
Referring to fig. 1, an embodiment of an automatic video editing method according to the embodiment of the present application includes:
101. calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
the method of the embodiment can be applied to an automatic video clipping device, which can be a computer device with data processing capability, such as a terminal and a server.
In an application scenario that a target person in a video needs to be highlighted, the embodiment acquires at least one path of video, where a video picture of each path of video includes the target person, and the task of the embodiment is to automatically clip the at least one path of video, so that the video generated by clipping mainly presents information of the target person and information of an object interacting with the target person, and ensure that information of other unrelated persons is not displayed in the video generated by clipping, thereby protecting the security of privacy information of the other unrelated persons.
After at least one path of video is obtained, face posture information of a target person of each video frame of each path of video is calculated, and a posture information quantization value corresponding to the face posture information is calculated. In addition, the embodiment also proposes a method for determining the situation that the target person in the video picture is occluded, namely, the occlusion situation of the target person in the video picture is determined according to the optical flow energy change value. Therefore, this step also calculates the optical flow energy change value for each video frame, and reflects the blocking situation of the target person by the optical flow energy change value.
102. Taking any video frame in any path of video as a current video frame;
the present embodiment employs a reinforcement learning algorithm to determine each frame in the video generated by the clipping, the video frame being a state in the reinforcement learning algorithm. The user can designate any video frame in any path of video as the current video frame, so that the automatic video clipping device determines the current video frame according to the designation of the user, the current video frame is used as one state in the reinforcement learning algorithm, and the next state is determined according to the state corresponding to the current video frame in the subsequent steps.
103. Calculating a return value of the action under the current video frame according to the attitude information quantization value and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to execute a step of calculating the return value of the action under the current video frame according to the attitude information quantization value and the optical flow energy change value of the current video frame based on the reinforcement learning algorithm;
in the automatic clipping process, the present embodiment determines each video frame in the clip-generated video in turn. Specifically, after determining the current video frame, the video automatic clipping device calculates a return value of the action under the current video frame according to the quantized value of the pose information and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm.
The reinforcement learning algorithm of the present embodiment may specifically be a Markov decision process. In the reinforcement learning algorithm, the larger the return value of an action is, the more meaningful the action is; the virtual agent in the reinforcement learning algorithm optimizes its strategy according to the action corresponding to the maximum return value, and then takes the next action according to the optimized strategy. Therefore, after the return values of the several actions under the current video frame are calculated, the candidate video frame selected by the action with the maximum return value is determined as the next video frame of the current video frame.
After the next video frame of the current video frame is determined, the next video frame of the current video frame is taken as the new current video frame, and the process returns to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame. An action under the current video frame refers to selecting one candidate video frame from each video of the at least one video respectively.
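To make the selection loop above concrete, the following is a minimal sketch of the greedy frame-selection procedure it describes. It is an illustration, not the patent's implementation: the frame containers and the `action_return` callable (which would compute the return value from the pose information quantization value, the optical flow energy change value and the transition probability, as detailed later) are assumptions.

```python
def select_frames(videos, action_return, start=(0, 0)):
    """Greedy frame selection sketch.

    videos[c][t]   -> frame t of video channel c (all channels have T frames)
    action_return  -> callable(current, candidate) giving the return value of
                      the action that moves from `current` to `candidate`,
                      where both are (channel, frame_index) pairs
    start          -> (channel, index) of the user-designated starting frame
    """
    num_channels = len(videos)
    num_frames = len(videos[0])
    channel, t = start
    selected = [(channel, t)]

    while t + 1 < num_frames:
        # One action per channel: select that channel's frame at time t + 1.
        candidates = [(c, t + 1) for c in range(num_channels)]
        returns = [action_return((channel, t), cand) for cand in candidates]
        # The candidate with the maximum return value becomes the new current frame.
        channel, t = candidates[returns.index(max(returns))]
        selected.append((channel, t))
    return selected
```

The resulting `selected` list is the video frame sequence from which the initial composite video of step 104 would be assembled.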
104. Determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
each video frame can be determined in sequence through step 103, and the determined plurality of video frames have a sequential determination order, so that the sequence of the video frames can be determined according to the sequential determination order of the current video frame in step 103, and an initial composite video is obtained based on the sequence of the video frames.
105. Determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of each frame of the target person in the initial composite video;
after the initial composite video is obtained, since the present embodiment aims to emphasize the target person in the video picture, the position and size of the target person in the initial composite video in each frame of picture are further determined, and the position and size of the video picture window of each frame in the initial composite video are determined according to the position and size of the target person in each frame of picture.
106. Extracting a video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video;
after the position and the size of the video picture window are determined, the video picture of each frame in the initial composite video is extracted based on the position and the size of the video picture window, so that a target composite video is obtained, and the automatic clipping of the video is realized.
In this embodiment, by calculating a facial pose information quantization value and an optical flow energy change value of the target person for each video frame, applying the calculated pose information quantization value and optical flow energy change value to the calculation of the return value of an action in a reinforcement learning algorithm, determining the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, taking the next video frame of the current video frame as the new current video frame, and returning to the step of calculating the return value of the action under the current video frame, the video frame determined from the at least one video each time can maximally present information of the target person and avoid presenting pictures in which the target person is occluded. Meanwhile, a video picture window is determined based on the position and the size of the target person in the video frame, and a video picture about the target person is extracted according to the video picture window, so that the finally synthesized video maximally presents the information about the target person and avoids presenting the information of other, unrelated persons.
The embodiments of the present application will be described in further detail below on the basis of the aforementioned embodiment shown in fig. 1. Referring to fig. 2, another embodiment of the method for automatically editing a video according to the embodiment of the present application includes:
201. calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
In this embodiment, the face pose information of the target person of each video frame may be calculated according to a face pose estimation algorithm. Specifically, the face pose information calculated by the face pose estimation algorithm may be represented by a rotation matrix, a rotation vector, a quaternion, or Euler angles. Since Euler angles are more readable, it may be preferable to use Euler angles to represent the facial pose information. As shown in fig. 3, the face pose information of the target person, namely the pitch angle (pitch), the yaw angle (yaw), and the roll angle (roll), can be calculated according to the face pose estimation algorithm.
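The patent does not prescribe a particular face pose estimation algorithm. As one common realization, the Euler angles can be estimated by solving a PnP problem between detected 2D facial landmarks and a generic 3D face model; the sketch below assumes OpenCV, six landmark points, approximate camera intrinsics, and a widely used generic set of 3D model points, all of which are illustrative choices rather than the patent's method.

```python
import cv2
import numpy as np

# Generic 3D face model points (nose tip, chin, eye corners, mouth corners),
# in arbitrary model units; values are a common illustrative approximation.
MODEL_POINTS = np.array([
    [0.0, 0.0, 0.0],           # nose tip
    [0.0, -330.0, -65.0],      # chin
    [-225.0, 170.0, -135.0],   # left eye outer corner
    [225.0, 170.0, -135.0],    # right eye outer corner
    [-150.0, -150.0, -125.0],  # left mouth corner
    [150.0, -150.0, -125.0],   # right mouth corner
], dtype=np.float64)

def face_euler_angles(landmarks_2d, frame_size):
    """Estimate (pitch, yaw, roll) in degrees from six 2D facial landmarks.

    landmarks_2d: (6, 2) array of pixel coordinates matching MODEL_POINTS;
    frame_size:   (height, width) of the video frame.
    """
    h, w = frame_size
    focal = w  # crude intrinsics: focal length ~ image width, principal point at centre
    camera_matrix = np.array([[focal, 0, w / 2],
                              [0, focal, h / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume negligible lens distortion
    ok, rvec, _ = cv2.solvePnP(MODEL_POINTS, np.asarray(landmarks_2d, dtype=np.float64),
                               camera_matrix, dist_coeffs, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None
    rot, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    # Decompose into rotations about the x (pitch), y (yaw) and z (roll) axes.
    sy = np.sqrt(rot[0, 0] ** 2 + rot[1, 0] ** 2)
    pitch = np.degrees(np.arctan2(rot[2, 1], rot[2, 2]))
    yaw = np.degrees(np.arctan2(-rot[2, 0], sy))
    roll = np.degrees(np.arctan2(rot[1, 0], rot[0, 0]))
    return pitch, yaw, roll
```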
In order to keep the calculated face pose information consistent with the symmetric structure of the face, the present embodiment uses a multivariate Gaussian model to calculate the pose information quantization value corresponding to the face pose information. Further, for convenience of calculation, the calculated pose information quantization value may be normalized, and the normalized pose information quantization value is used in the subsequent calculation process.
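A minimal sketch of such a quantization, assuming a zero-mean multivariate Gaussian over the Euler-angle vector with a diagonal covariance (the actual mean, covariance and normalization used in the patent are not specified in this excerpt):

```python
import numpy as np

def pose_quantization(pitch, yaw, roll, variances=(400.0, 400.0, 400.0)):
    """Quantize facial pose with a zero-mean multivariate Gaussian.

    The density is normalized by its value at the mean, so a frontal face
    (all Euler angles near 0) yields a value close to 1, and larger
    deflections yield smaller values. The per-angle variances control how
    strongly each Euler angle reduces the quantization value.
    """
    angles = np.array([pitch, yaw, roll], dtype=np.float64)
    inv_cov = np.linalg.inv(np.diag(variances))
    # exp(-0.5 * x^T Sigma^-1 x); the Gaussian's constant factor cancels
    # when dividing by the density at the mean, giving a value in (0, 1].
    return float(np.exp(-0.5 * angles @ inv_cov @ angles))
```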
Specifically, the specific way of calculating the optical flow energy change value in this embodiment is to calculate the optical flow information of each video frame in at least one video channel and calculate the optical flow information of other video frames belonging to the same video channel as the video frame, calculate the optical flow energy of each video frame according to the optical flow information of each video frame, calculate the optical flow energy of the other video frames according to the optical flow information of the other video frames, calculate the optical flow energy difference value between each video frame and the other video frames and the interval time between each video frame and the other video frames, and take the quotient of the optical flow energy difference value and the interval time as the optical flow energy change value of each video frame.
For example, assuming that the video automatic clipping device acquires C channels of video (C ≥ 1), each of which includes T video frames, a certain video frame in the C channels of video can be represented as f_{c,t} (c = 1, …, C; t = 1, …, T), and the video frame that belongs to the same channel of video as f_{c,t} and is adjacent to it may be denoted as f_{c,t+1}. The optical flow information of f_{c,t} and the optical flow information of f_{c,t+1} are calculated respectively; the optical flow energy of f_{c,t} is calculated according to the optical flow information of f_{c,t}, and the optical flow energy of f_{c,t+1} is calculated according to the optical flow information of f_{c,t+1}; the difference between the optical flow energy of f_{c,t} and that of f_{c,t+1} is then calculated, together with the interval time between f_{c,t} and f_{c,t+1}, and the quotient of the optical flow energy difference and the interval time is taken as the optical flow energy change value of f_{c,t}.
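A sketch of this computation for one channel, assuming dense Farnebäck optical flow and taking the optical flow energy of a frame to be the sum of squared flow components towards the next frame; the patent does not fix a particular optical flow algorithm, energy definition, or pairing of frames, so these are illustrative choices.

```python
import cv2
import numpy as np

def optical_flow_energy(frame, next_frame):
    """Optical flow energy of `frame`, using dense flow towards the next frame."""
    g0 = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.sum(flow ** 2))  # sum of squared flow components

def flow_energy_change(frames, t, fps):
    """Optical flow energy change value of frame f_{c,t} in one video channel.

    frames: list of frames of a single channel; fps: frame rate, so the
    interval time between f_{c,t} and f_{c,t+1} is 1 / fps.
    """
    e_t = optical_flow_energy(frames[t], frames[t + 1])
    e_next = optical_flow_energy(frames[t + 1], frames[t + 2])
    return (e_next - e_t) * fps  # energy difference divided by the 1/fps interval
```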
In the multivariate Gaussian model, when the angle of the Euler angle of the face tends to 0, the quantized value of the posture information corresponding to the face posture information is maximum; when the human face deflects to generate an Euler angle, namely the Euler angle is not equal to 0, the quantized value of the posture information corresponding to the face posture information is reduced. The magnitude of the change in the attitude information quantization value caused by the angle of the euler angle can be controlled by the variance matrix. Therefore, the variance of the euler angles can be set to adjust the degree of influence of the euler angles on the quantized values of the attitude information.
202. Taking any video frame in any path of video as a current video frame;
the operation performed in this step is similar to the operation performed in step 102 in the embodiment shown in fig. 1, and is not repeated here.
203. Calculating a return value of the action under the current video frame according to the attitude information quantization value and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to execute a step of calculating the return value of the action under the current video frame according to the attitude information quantization value and the optical flow energy change value of the current video frame based on the reinforcement learning algorithm;
in this embodiment, when calculating the return value of the action, a specific calculation method is to determine a transition probability from the current video frame to the candidate video frame under the action, calculate an initial return value of the action under the current video frame according to the quantized value of the attitude information of the current video frame and the optical flow energy change value, and take the product of the initial return value and the transition probability as the return value of the action.
Specifically, the transition probability is determined as follows: when the preset conditions are all met, the transition probability is determined to be 1; when any one of the preset conditions is not met, the transition probability is determined to be 0. The preset conditions include: the current video frame and the candidate video frame are adjacent on the time line, the target person exists in the picture of the candidate video frame, and the video index corresponding to the action is consistent with the index of the candidate video frame in the at least one channel of video.
Here, "the current video frame and the candidate video frame are adjacent on the time line" means that they are adjacent on the time line of the video. For example, if the current video frame is the t-th frame of the first channel of video, the candidate video frame may be the (t+1)-th frame of the first channel of video or of another channel of video (e.g., the second channel, the third channel, etc.).
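The following sketch combines the two parts just described: the transition probability checks the three preset conditions, and the return value is the initial return value multiplied by that probability. How exactly the pose information quantization value and the optical flow energy change value are combined into the initial return value, and which frame's values enter it, is not spelled out in this excerpt, so the weighted form below (rewarding a frontal face and penalizing large optical-flow energy changes, which the description associates with occlusion) is only an assumption.

```python
def transition_probability(current, candidate, action_channel, target_visible):
    """1 if all preset conditions hold, otherwise 0.

    current, candidate:   (channel, frame_index) pairs;
    action_channel:       video index selected by the action;
    target_visible[c][t]: whether the target person appears in frame t of channel c.
    """
    (_, t_cur), (cand_c, t_cand) = current, candidate
    adjacent = (t_cand == t_cur + 1)                    # adjacent on the time line
    has_target = bool(target_visible[cand_c][t_cand])  # target person in candidate frame
    index_match = (cand_c == action_channel)            # action's video index matches candidate's
    return 1 if (adjacent and has_target and index_match) else 0

def action_return(current, candidate, action_channel, pose_q, flow_change,
                  target_visible, alpha=1.0, beta=1.0):
    """Return value of an action = initial return value x transition probability."""
    cand_c, t_cand = candidate
    initial = alpha * pose_q[cand_c][t_cand] - beta * abs(flow_change[cand_c][t_cand])
    p = transition_probability(current, candidate, action_channel, target_visible)
    return initial * p
```

With the data arguments bound (for example via functools.partial), a function of this shape could serve as the `action_return` callable in the frame-selection sketch given earlier.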
204. Determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
the operation performed in this step is similar to the operation performed in step 104 in the embodiment shown in fig. 1, and is not described here again.
205. Determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of each frame of the target person in the initial composite video;
in this embodiment, the position and size of the target person in the video frame may be determined according to information of the target person and an object interacting with the target person, where the information of the object interacting with the target person may be information of an object focused by the line of sight direction of the target person. For example, if the target person's gaze direction focuses on a chair, the determined position and size of the target person in the video frame should include the chair on which the target person's gaze direction focuses in addition to the target person.
Specifically, the gaze direction of the target person in each frame of the initial composite video may be determined according to the facial pose information of the target person in that frame; for example, a raised face may be interpreted as looking upward, and a lowered face as looking downward.
For example, looking to the left and looking to the right depend mainly on the yaw angle (yaw) in the face pose information, and therefore, the line of sight direction of the target person can be determined according to the yaw angle. To facilitate the subsequent calculation process, the gaze direction may be quantified, for example, the gaze direction may be represented numerically based on the following formula:
[Equation image not reproduced: the line-of-sight direction value g is given as a function of the yaw angle φ.]

wherein g denotes the line-of-sight direction and φ denotes the yaw angle. Therefore, the value of the line-of-sight direction corresponding to a certain yaw angle can be determined according to this formula.
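Since the formula itself is reproduced above only as an image, the following is merely an illustrative quantization consistent with the surrounding text: the line-of-sight direction value g is derived from the yaw angle φ, with a symmetric dead zone for a roughly frontal gaze. The threshold is an assumption, not the patent's formula.

```python
def gaze_direction_value(yaw_deg, threshold=15.0):
    """Illustrative numeric line-of-sight direction derived from the yaw angle.

    Returns -1 when the target person looks to one side, +1 for the other
    side, and 0 when looking roughly straight ahead.
    """
    if yaw_deg < -threshold:
        return -1
    if yaw_deg > threshold:
        return 1
    return 0
```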
After the value of the sight line direction is determined, the position and the size of each frame of the target person in the initial composite video are further determined according to the sight line direction of each frame of the target person in the initial composite video, and the position and the size of the video picture window of each frame in the initial composite video are determined according to the position and the size of each frame of the target person in the initial composite video.
Specifically, the position and size of the target person in each frame of the initial composite video can be expressed by four quantities: the two coordinates of the target person in the frame, according to which the position of the target person in the video frame can be determined, and the width and height of the target person in the frame, according to which the size of the target person in the video frame can be determined.
Meanwhile, the position and size of the video picture window of each frame in the initial composite video may likewise be expressed by four quantities: the two coordinates of the video picture window in the video frame, according to which the specific position of the video picture window in each frame of the initial composite video can be determined, and the width and height of the video picture window, according to which the size of the video picture window can be determined.
In this embodiment, the position and size of the video picture window of each frame are obtained by solving an objective function (shown as an equation image in the original publication and not reproduced here), in which c_t refers to the initial composite video, t refers to any one of the video frames of c_t, and g_t refers to the aforementioned value of the line-of-sight direction. Since the objective function is a convex function, it can be solved by using a convex optimization algorithm, thereby obtaining the optimal position and size of the video picture window of each frame in the initial composite video, i.e. the optimal solution for the window parameters.
Therefore, as can be seen from the objective function described above, when determining the position and size of the video picture window, the information about the object that the target person interacts with in the line-of-sight direction is taken into account, so that this object is also included in the video picture window.
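Because the objective function is given above only as an image, the following sketch substitutes a much-simplified stand-in rather than the patent's convex objective: the window is scaled relative to the target person's bounding box, its centre is shifted towards the line-of-sight direction so that a looked-at object tends to fall inside the window, and the result is clamped to the frame. The scale and shift factors are assumptions for illustration.

```python
import numpy as np

def fit_window(person_box, gaze_value, frame_w, frame_h, scale=2.5, gaze_shift=0.5):
    """Compute a video picture window (x, y, w, h) for one frame.

    person_box: (x, y, w, h) of the target person in the frame;
    gaze_value: -1 / 0 / +1 line-of-sight direction value;
    scale:      window size relative to the person's bounding box;
    gaze_shift: horizontal shift of the window centre, as a fraction of half
                the window width, towards the direction the person is looking.
    """
    px, py, pw, ph = person_box
    win_w = int(min(frame_w, scale * pw))
    win_h = int(min(frame_h, scale * ph))
    cx = px + pw / 2 + gaze_value * gaze_shift * win_w / 2
    cy = py + ph / 2
    x = int(np.clip(cx - win_w / 2, 0, frame_w - win_w))
    y = int(np.clip(cy - win_h / 2, 0, frame_h - win_h))
    return x, y, win_w, win_h
```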
206. Extracting a video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video;
after the position and size of the video picture window are determined, a video picture of each frame in the initial composite video may be extracted based on the position and size of the video picture window, the extracted multiple frames of video pictures constituting the target composite video. According to the above description, since the video frame window determines the position and the size based on the target person and the object interacting with the target person, the video frame extracted from the video frame window includes the information of the target person and the information of the object interacting with the target person, and the information of other unrelated persons is prevented from being presented in the video frame, so that the information of the target person can be presented maximally on one hand, the information of other unrelated persons is also prevented from being presented on the other hand, and the problem of privacy disclosure is avoided.
In the embodiment, the information of the target person is highlighted on the automatic video editing, and the privacy disclosure problem is avoided, so that the technical scheme has a practical application value, and the realizability of the scheme is improved.
In the above description of the video automatic clipping method in the embodiment of the present application, referring to fig. 4, the following description of the video automatic clipping device in the embodiment of the present application, and an embodiment of the video automatic clipping device in the embodiment of the present application includes:
a calculating unit 401, configured to calculate facial pose information of a target person of each video frame in at least one path of video, calculate a pose information quantization value corresponding to the facial pose information, and calculate an optical flow energy change value of each video frame;
a determining unit 402, configured to use any video frame in any channel of video as a current video frame;
a clipping unit 403, configured to calculate a return value of an action under a current video frame according to a pose information quantization value and an optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determine a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, take the next video frame of the current video frame as a new current video frame, and return to executing the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one video respectively;
a generating unit 404, configured to determine a video frame sequence according to a sequence of a current video frame, and obtain an initial synthesized video based on the video frame sequence;
the determining unit 402 is further configured to determine a position and a size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
an extracting unit 405, configured to extract a video picture of each frame in the initial composite video based on the position and the size of the video picture window, so as to obtain a target composite video.
In a preferred embodiment of this embodiment, the calculating unit 401 is specifically configured to calculate the face pose information of the target person of each video frame according to a face pose estimation algorithm, where the face pose information includes a pitch angle, a yaw angle, and a roll angle.
In a preferred embodiment of this embodiment, the calculating unit 401 is specifically configured to calculate the quantized value of the pose information corresponding to the facial pose information by using a multivariate gaussian model.
In a preferred embodiment of this embodiment, the calculating unit 401 is specifically configured to calculate optical flow information of each video frame and optical flow information of other video frames belonging to the same video path as the each video frame; calculating optical flow energy of each video frame according to the optical flow information of each video frame, calculating optical flow energy of other video frames according to the optical flow information of other video frames, and calculating an optical flow energy difference value between each video frame and the other video frames and an interval time between each video frame and the other video frames; and taking the quotient of the optical flow energy difference value and the interval time as the optical flow energy change value of each video frame.
In a preferred implementation manner of this embodiment, the clipping unit 403 is specifically configured to determine a transition probability from the current video frame to the candidate video frame under the action; calculate an initial return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; and take the product of the initial return value and the transition probability as the return value.
In a preferred implementation manner of this embodiment, the clipping unit 403 is specifically configured to determine that the transition probability is 1 when the preset conditions are all met, and to determine that the transition probability is 0 when any one of the preset conditions is not met;
wherein the preset conditions include: the current video frame and the candidate video frame are adjacent on a time line, the target person exists in a picture of the candidate video frame, and a video index corresponding to the action is consistent with an index of the candidate video frame in the at least one path of video.
In a preferred embodiment of this embodiment, the determining unit 402 is specifically configured to determine, according to the facial pose information of the target person in each frame of the initial composite video, the gaze direction of the target person in each frame of the initial composite video; determine the position and the size of the target person in each frame of the initial composite video according to the gaze direction of the target person in each frame of the initial composite video; and determine the position and the size of the video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video.
In this embodiment, operations performed by each unit in the automatic video editing apparatus are similar to those described in the embodiments shown in fig. 1 to fig. 2, and are not described again here.
In this embodiment, the calculating unit 401 calculates a facial pose information quantization value and an optical flow energy change value of the target person for each video frame, and the clipping unit 403 applies the calculated pose information quantization value and optical flow energy change value to the calculation of the return value of an action in the reinforcement learning algorithm, determines the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, takes the next video frame of the current video frame as the new current video frame, and returns to the step of calculating the return value of the action under the current video frame, so that each video frame determined from the at least one video can maximally present information of the target person and avoid presenting pictures in which the target person is occluded. Meanwhile, the determining unit 402 determines a video picture window based on the position and size of the target person in the video frame, and the extracting unit 405 extracts the video picture about the target person according to the video picture window, so that the finally synthesized video maximally presents the information about the target person and avoids presenting the information of other, unrelated persons.
Referring to fig. 5, the automatic video editing apparatus in the embodiment of the present application is described below, where an embodiment of the automatic video editing apparatus in the embodiment of the present application includes:
the video automatic clipping device 500 may include one or more Central Processing Units (CPUs) 501 and a memory 505, where one or more applications or data are stored in the memory 505.
Memory 505 may be volatile storage or persistent storage, among others. The program stored in memory 505 may include one or more modules, each of which may include a sequence of instruction operations for a video automatic clipping device. Further, the central processor 501 may be arranged to communicate with the memory 505, and perform a series of instruction operations in the memory 505 on the video automatic clipping device 500.
The video automatic clipping device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input/output interfaces 504, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The central processor 501 may perform the operations performed by the automatic video editing apparatus in the embodiments shown in fig. 1 to fig. 2, and details thereof are not repeated herein.
An embodiment of the present application further provides a computer storage medium, where one embodiment includes: the computer storage medium has stored therein instructions that, when executed on a computer, cause the computer to perform the operations performed by the video automatic clipping apparatus in the embodiments of fig. 1 to 2.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.

Claims (10)

1. A method for automatic video editing, the method comprising:
calculating the face pose information of a target person of each video frame in at least one path of video, calculating a pose information quantization value corresponding to the face pose information, and calculating an optical flow energy change value of each video frame;
taking any video frame in any path of video as a current video frame;
calculating a return value of the action under the current video frame according to a posture information quantization value and an optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, returning to execute the reinforcement learning algorithm, and calculating the return value of the action under the current video frame according to the posture information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one video respectively;
determining a video frame sequence according to the sequence of the current video frame, and obtaining an initial synthetic video based on the video frame sequence;
determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of each frame of the target person in the initial composite video;
and extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain a target composite video.
2. The method of claim 1, wherein the calculating the facial pose information of the target person for each video frame in the at least one video comprises:
and calculating the face posture information of the target person of each video frame according to a face posture estimation algorithm, wherein the face posture information comprises a pitch angle, a yaw angle and a rotation angle.
3. The method of claim 1, wherein the calculating a pose information quantization value corresponding to the facial pose information comprises:
and calculating the quantized value of the posture information corresponding to the face posture information by using a multivariate Gaussian model.
4. The method according to claim 1, wherein said calculating an optical flow energy change value for each video frame comprises:
calculating optical flow information of each video frame and optical flow information of other video frames belonging to the same path of video with each video frame;
calculating optical flow energy of each video frame according to the optical flow information of each video frame, calculating optical flow energy of other video frames according to the optical flow information of other video frames, and calculating an optical flow energy difference value between each video frame and the other video frames and an interval time between each video frame and the other video frames;
and taking the quotient of the optical flow energy difference value and the interval time as the optical flow energy change value of each video frame.
5. The method of claim 1, wherein the calculating the return value of the action under the current video frame according to the quantized value of the pose information and the optical flow energy change value of the current video frame comprises:
determining a transition probability of the current video frame to the candidate video frame under the action;
calculating an initial return value of the action under the current video frame according to the attitude information quantization value and the optical flow energy change value of the current video frame;
taking a product of the initial return value and the transition probability as the return value.
6. The method of claim 5, wherein the determining the transition probability of the current video frame to the candidate video frame under the action comprises:
when a preset condition is met, determining that the transition probability is 1; when any one of the preset conditions is not satisfied, determining that the transition probability is 0;
wherein the preset conditions include: the current video frame and the candidate video frame are adjacent on a time line, the target person exists in a picture of the candidate video frame, and a video index corresponding to the action is consistent with an index of the candidate video frame in the at least one path of video.
7. The method of claim 1, wherein determining the position and size of the video frame window for each frame of the initial composite video based on the position and size of the target person in each frame of the initial composite video comprises:
determining the sight line direction of the target person in each frame of the initial composite video according to the face pose information of the target person in each frame of the initial composite video;
determining the position and the size of each frame of the target person in the initial composite video according to the sight line direction of each frame of the target person in the initial composite video;
and determining the position and the size of a video picture window of each frame in the initial composite video according to the position and the size of each frame of the target person in the initial composite video.
8. An apparatus for automatic video editing, the apparatus comprising:
the computing unit is used for computing the face pose information of a target person of each video frame in at least one path of video, computing a pose information quantization value corresponding to the face pose information and computing an optical flow energy change value of each video frame;
the determining unit is used for taking any video frame in any path of video as a current video frame;
a clipping unit, configured to calculate a return value of an action under a current video frame according to a pose information quantization value and an optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determine a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, take the next video frame of the current video frame as a new current video frame, and return to executing the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the pose information quantization value and the optical flow energy change value of the current video frame; wherein the action is to select a candidate video frame from each video of the at least one video respectively;
the generating unit is used for determining a video frame sequence according to the sequence of the current video frame and obtaining an initial synthetic video based on the video frame sequence;
the determining unit is further configured to determine a position and a size of a video picture window of each frame in the initial composite video according to the position and the size of the target person in each frame of the initial composite video;
and the extraction unit is used for extracting the video picture of each frame in the initial composite video based on the position and the size of the video picture window to obtain the target composite video.
9. An apparatus for automatic video editing, the apparatus comprising:
a memory for storing a computer program; a processor for implementing the steps of the video automatic clipping method according to any of claims 1 to 7 when executing the computer program.
10. A computer storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 7.
CN202110321530.6A 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium Active CN113038271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110321530.6A CN113038271B (en) 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110321530.6A CN113038271B (en) 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium

Publications (2)

Publication Number Publication Date
CN113038271A true CN113038271A (en) 2021-06-25
CN113038271B CN113038271B (en) 2023-09-08

Family

ID=76473798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110321530.6A Active CN113038271B (en) 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium

Country Status (1)

Country Link
CN (1) CN113038271B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110310237A1 (en) * 2010-06-17 2011-12-22 Institute For Information Industry Facial Expression Recognition Systems and Methods and Computer Program Products Thereof
US20120093365A1 (en) * 2010-10-15 2012-04-19 Dai Nippon Printing Co., Ltd. Conference system, monitoring system, image processing apparatus, image processing method and a non-transitory computer-readable storage medium
US20150318020A1 (en) * 2014-05-02 2015-11-05 FreshTake Media, Inc. Interactive real-time video editor and recorder
CN106534967A (en) * 2016-10-25 2017-03-22 司马大大(北京)智能系统有限公司 Video editing method and device
EP3410353A1 (en) * 2017-06-01 2018-12-05 eyecandylab Corp. Method for estimating a timestamp in a video stream and method of augmenting a video stream with information
US20190222776A1 (en) * 2018-01-18 2019-07-18 GumGum, Inc. Augmenting detected regions in image or video data
CN108805080A (en) * 2018-06-12 2018-11-13 上海交通大学 Multi-level depth Recursive Networks group behavior recognition methods based on context
CN109618184A (en) * 2018-12-29 2019-04-12 北京市商汤科技开发有限公司 Method for processing video frequency and device, electronic equipment and storage medium
US20200380274A1 (en) * 2019-06-03 2020-12-03 Nvidia Corporation Multi-object tracking using correlation filters in video analytics applications
CN110691202A (en) * 2019-08-28 2020-01-14 咪咕文化科技有限公司 Video editing method, device and computer storage medium
CN111063011A (en) * 2019-12-16 2020-04-24 北京蜜莱坞网络科技有限公司 Face image processing method, device, equipment and medium
CN111131884A (en) * 2020-01-19 2020-05-08 腾讯科技(深圳)有限公司 Video clipping method, related device, equipment and storage medium
CN111294524A (en) * 2020-02-24 2020-06-16 中移(杭州)信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN111800644A (en) * 2020-07-14 2020-10-20 深圳市人工智能与机器人研究院 Video sharing and acquiring method, server, terminal equipment and medium
CN112203115A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Video identification method and related device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李刚等 (Li Gang et al.): "人脸自动识别方法综述" [A Survey of Automatic Face Recognition Methods], 《计算机应用研究》 [Application Research of Computers] *
李刚等 (Li Gang et al.): "人脸自动识别方法综述" [A Survey of Automatic Face Recognition Methods], 《计算机应用研究》 [Application Research of Computers], no. 08, 28 August 2003 (2003-08-28) *

Also Published As

Publication number Publication date
CN113038271B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
US11776131B2 (en) Neural network for eye image segmentation and image quality estimation
Zhang et al. Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks
US9813909B2 (en) Cloud server for authenticating the identity of a handset user
WO2022078041A1 (en) Occlusion detection model training method and facial image beautification method
CN108776775B (en) Old people indoor falling detection method based on weight fusion depth and skeletal features
US20200327641A1 (en) Method and apparatus for generating special deformation effect program file package, and method and apparatus for generating special deformation effects
CN111310705A (en) Image recognition method and device, computer equipment and storage medium
Mocanu et al. Deep-see face: A mobile face recognition system dedicated to visually impaired people
TW202141340A (en) Image processing method, electronic device and computer readable storage medium
CN110443230A (en) Face fusion method, apparatus and electronic equipment
CN113160244B (en) Video processing method, device, electronic equipment and storage medium
CN111723707A (en) Method and device for estimating fixation point based on visual saliency
JP6349448B1 (en) Information processing apparatus, information processing program, and information processing method
JP2022185096A (en) Method and apparatus of generating virtual idol, and electronic device
CN113038271A (en) Video automatic editing method, device and computer storage medium
CN112714337A (en) Video processing method and device, electronic equipment and storage medium
CN117455989A (en) Indoor scene SLAM tracking method and device, head-mounted equipment and medium
WO2022222735A1 (en) Information processing method and apparatus, computer device, and storage medium
JP2019040592A (en) Information processing device, information processing program, and information processing method
Li et al. Real-time human tracking based on switching linear dynamic system combined with adaptive Meanshift tracker
JP2009003615A (en) Attention region extraction method, attention region extraction device, computer program, and recording medium
CN115426505B (en) Preset expression special effect triggering method based on face capture and related equipment
US11810353B2 (en) Methods, systems, and media for detecting two-dimensional videos placed on a sphere in abusive spherical video content
CN114356088B (en) Viewer tracking method and device, electronic equipment and storage medium
US20230267671A1 (en) Apparatus and method for synchronization with virtual avatar, and system for synchronization with virtual avatar

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant