
CN110868632A - Video processing method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN110868632A
CN110868632A CN201911035914.0A CN201911035914A
Authority
CN
China
Prior art keywords
video
segment
image
image frame
moving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911035914.0A
Other languages
Chinese (zh)
Other versions
CN110868632B (en)
Inventor
胡风
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201911035914.0A
Publication of CN110868632A
Application granted
Publication of CN110868632B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/4312 - Generation of visual interfaces for content selection or interaction involving specific graphical features, e.g. screen layout, special fonts or colors, blinking icons, highlights or animations
    • H04N 21/435 - Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N 21/44008 - Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/4586 - Content update operation triggered locally, e.g. by comparing the version of software modules in a DVB carousel to the version stored locally
    • H04N 21/4782 - Web browsing, e.g. WebTV
    • H04N 21/4788 - Supplemental services communicating with other users, e.g. chatting
    • H04N 21/4884 - Data services, e.g. news ticker, for displaying subtitles
    • H04N 21/8456 - Structuring of content by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application discloses a video processing method, a video processing apparatus, a storage medium and an electronic device. The video processing method includes the following steps: dividing a video to be processed to obtain a plurality of video segments; updating the video segments according to the image frames before and after each division position; determining a key image frame of each updated video segment; and generating an abstract image of the video to be processed according to the key image frames so as to display the video to be processed. In this way, the video content can be acquired without playing the video, with high flexibility and a fast acquisition speed.

Description

Video processing method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of computers, and in particular to a video processing method and apparatus, a storage medium and an electronic device.
Background
With the popularization of smart phones and the development of the internet, video has become an increasingly important information carrier and is indispensable to people's work, life and entertainment.
At present, the popularization of smart phones with cameras has greatly promoted the production and spread of videos. However, obtaining video content requires waiting for the video to play, and video playing is unsuitable on many occasions, such as during a meeting or in a library or other places where making sound is inappropriate, so it is difficult for users to learn the video content. The way of obtaining video content is therefore too limited and inflexible.
Disclosure of Invention
The embodiments of the present application provide a video processing method and apparatus, a storage medium and an electronic device, with which video content can be acquired without playing the video and which offer high flexibility.
The embodiment of the application provides a video processing method, which comprises the following steps:
dividing a video to be processed to obtain a plurality of video segments;
updating the video clip according to the front and back image frames at each division position;
determining a key image frame of the updated video clip;
and generating an abstract image of the video to be processed according to the key image frame so as to display the video to be processed.
An embodiment of the present application further provides a video processing apparatus, including:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a video processing instruction, the video processing instruction carries a video to be processed, and the video to be processed comprises a plurality of continuous image frames;
the dividing module is used for dividing the video to be processed to obtain a plurality of video segments;
the updating module is used for updating the video clip according to the front and rear image frames at each division position;
a determining module, configured to determine an updated key image frame of the video segment;
and the generating module is used for generating the abstract image of the video to be processed according to the key image frame so as to display the video to be processed.
Wherein, the update module specifically includes:
the determining unit is used for taking two image frames before and after the current division position as a target image frame;
the first moving unit is used for moving the target image frame to a previous video segment corresponding to the division position to obtain a first moving segment, and taking a next video segment corresponding to the division position after moving as a second moving segment;
the second moving unit is used for moving the target image frame to the next video segment corresponding to the division position to obtain a third moving segment, and taking the previous video segment corresponding to the division position after moving as a fourth moving segment;
an updating unit, configured to update the video segment according to the first moving segment, the second moving segment, the third moving segment, and the fourth moving segment.
Wherein the updating unit is specifically configured to:
determining the difference degree between the first moving segment and the corresponding second moving segment to obtain a first difference degree;
determining the difference degree between the third moving segment and the corresponding fourth moving segment to obtain a second difference degree;
determining the difference degree between the previous video clip and the corresponding next video clip to obtain a third difference degree;
updating the video segment according to the first difference degree, the second difference degree and the third difference degree.
Wherein the updating unit is specifically configured to:
taking the difference with the largest value among the first difference, the second difference and the third difference as a target difference;
when the target difference degree is the first difference degree or the second difference degree, or the updated frequency does not reach the preset frequency, taking the moved front and back two video segments corresponding to the target difference degree as corresponding updated video segments, updating the dividing position according to the updated video segments, and then returning to execute the step of taking the front and back two image frames at the current dividing position as target image frames;
and when the target difference degree is the third difference degree or the updated times reach preset times, taking the front and rear video clips corresponding to the target difference degree as corresponding updated video clips, and stopping updating.
Wherein the updating unit is specifically configured to:
determining a gray value array corresponding to each image frame;
and determining the square difference between any one video segment and the corresponding other video segment according to the gray value array so as to obtain the difference degree.
Wherein the updating unit is specifically configured to:
determining a gray scale map for each of the image frames;
reading gray values of pixel points in the gray map according to a preset sequence;
and generating a gray value array corresponding to the image frame according to the read gray value.
Wherein the determining module is specifically configured to:
filtering out a front preset image frame and a rear preset image frame of each updated video clip to obtain a corresponding filtered clip;
determining the image quality score and/or the image aesthetic score of each image frame in the filtered segment;
and determining the key image frame from the corresponding filtered segment according to the image quality score and/or the image aesthetic score.
Wherein the determining module is specifically configured to:
filtering the image frames in the filtered segments according to the image quality scores and/or the image aesthetic scores to obtain retained segments;
determining corresponding center time according to the generation time of each image frame in the updated video clip;
determining a key image frame from the retention segment according to the center time.
Wherein the generating module comprises:
a first generating unit, configured to generate target subtitle information of a corresponding video segment according to the subtitle information of the image frames in each video segment when subtitle information exists on every image frame in the video to be processed; and to generate the target subtitle information of each video segment according to the audio frames of the video to be processed and the image frames, or the audio frames, when the video to be processed includes image frames without subtitle information;
and the second generating unit is used for generating the abstract image of the video to be processed according to the target subtitle information and the key image frame.
Wherein the second generating unit is specifically configured to:
generating the target caption information on the key image frame to obtain a corresponding image-text frame;
splicing the image-text frames in sequence according to the arrangement sequence of the video clips;
and generating the abstract image of the video to be processed according to the spliced image-text frame.
The first generating unit is specifically configured to:
when the subtitle information does not exist on each image frame in the video to be processed, determining the text information of the audio frame corresponding to each video clip; generating target subtitle information of each video clip according to the text information;
when subtitle information does not exist on a part of image frames in the video to be processed, determining character information of the audio frames corresponding to the part of image frames; and generating target subtitle information of each video clip according to the text information and the subtitle information.
The embodiment of the application also provides a computer readable storage medium, wherein a plurality of instructions are stored in the storage medium, and the instructions are suitable for being loaded by a processor to execute any one of the video processing methods.
An embodiment of the present application further provides an electronic device, which includes a processor and a memory, where the processor is electrically connected to the memory, the memory is used to store instructions and data, and the processor is used to execute any of the steps in the video processing method.
With the video processing method and apparatus, storage medium and electronic device provided by the embodiments of the application, the video to be processed is divided into a plurality of video segments, the video segments are updated according to the image frames before and after each division position, the key image frame of each updated video segment is determined, and an abstract image of the video to be processed is generated from the key image frames to display the video to be processed. The video content can thus be obtained without playing the video, with high flexibility and a fast acquisition speed; the method is applicable to various scenarios and has few usage restrictions.
Drawings
The technical solution and other advantages of the present application will become apparent from the detailed description of the embodiments of the present application with reference to the accompanying drawings.
Fig. 1 is a scene schematic diagram of a video processing system according to an embodiment of the present application.
Fig. 2 is a schematic flowchart of a video processing method according to an embodiment of the present application.
Fig. 3 is a schematic diagram of reference factors in image quality evaluation provided in an embodiment of the present application.
Fig. 4 is another schematic flow chart of a video processing method according to an embodiment of the present application.
Fig. 5 is a schematic block diagram of a video processing flow according to an embodiment of the present disclosure.
Fig. 6 is a schematic view illustrating a user operation flow in a video content display process according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of another video processing apparatus according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of another video processing apparatus according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a video processing method, a video processing device, a storage medium and electronic equipment.
Referring to fig. 1, fig. 1 is a schematic view of a video processing system. The video processing system may include any of the video processing apparatuses provided in the embodiments of the present application, and the video processing apparatus may be integrated in an electronic device. The electronic device may be a server or a client; the server may be a backend server of a video editing website, of a web browser or of instant messaging software, and the client may be a video playing device, a smart phone, a tablet computer, or the like.
The electronic equipment can divide a video to be processed to obtain a plurality of video segments; updating the video clip according to the front and back image frames at each division position; determining the updated key image frame of the video clip; and generating an abstract image of the video to be processed according to the key image frame so as to display the video to be processed.
The video to be processed may be a video played on a video website, a video shared or sent in a chat with a friend, a video inserted in an information article, and the like. A user may generate a video processing instruction, for example by long-pressing the video, to trigger the processing of that video as the video to be processed. The dividing manner may be set in advance; for example, the number of segments into which videos of different durations should be equally divided is preset, and the video to be processed is equally divided based on that number. The video segments may be updated multiple times, and the division positions change with each update. The key image frame is usually the most representative image frame with the best shooting effect in the corresponding video segment. The abstract image may be a long image formed by splicing a plurality of images from top to bottom or from left to right; it may be displayed reduced in size, or the original image may be displayed by scrolling up and down or left and right. The abstract image includes an image part and a text part and is mainly used to present the main content of the video segments in pictorial form.
For example, referring to fig. 1, when user A chats with friend B, then for a shared video sent by friend B, user A may generate a video processing instruction by long-pressing the shared video on the chat interface to trigger its processing. Specifically, the shared video may first be divided equally into n video segments, the lengths of the video segments are adjusted (i.e., updated) multiple times according to the two image frames before and after each latest division position, a key image frame is then selected from each adjusted video segment, a spliced long image (i.e., the abstract image) is formed from the key image frames, and the spliced long image is displayed on the chat interface, so as to show the user the main content of the shared video without playing it.
As shown in fig. 2, fig. 2 is a schematic flowchart of a video processing method provided in this embodiment. The video processing method is applied to an electronic device such as a server or a client, where the server may be a backend server of a video editing website, of a web browser or of instant messaging software, and the client may be a video playing device, a smart phone, a tablet computer, or the like. The specific flow may be as follows:
s101, dividing a video to be processed to obtain a plurality of video segments.
In this embodiment, the video to be processed may be a video played on a video website, a video shared or sent in a chat with a friend, a video inserted into an information webpage, and the like. Before processing the video to be processed, it may first be detected whether a video processing instruction has been received; the video processing instruction indicates the video to be processed. The video processing instruction may be generated automatically by the system, for example when a user uploads a produced video to a video editing website and the website backend server generates the instruction automatically, or it may be triggered manually by the user, for example by long-pressing the video or right-clicking it and selecting an option when the user browses a webpage into which the video is inserted, receives a chat message carrying the video from a friend, or is playing the video.
The dividing manner is usually equal division, and the number of equal parts may be set in advance; for example, the number of segments into which videos of different durations should be equally divided is preset, the actual number of segments is determined according to the actual duration of the video to be processed, and the video is then divided equally. Alternatively, the positions in the video to be processed where a black screen or an abrupt change of picture occurs may be detected, and the video segments may be divided at those positions.
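As a concrete illustration of the equal-division step, the following Python sketch divides a decoded frame sequence into n roughly equal segments; the frame list and the choice of n are assumptions made for illustration, not part of the original text.

```python
from typing import Any, List

def split_into_segments(frames: List[Any], n: int) -> List[List[Any]]:
    """Divide a decoded frame sequence into n roughly equal video segments."""
    seg_len = max(1, len(frames) // n)
    segments = [frames[i * seg_len:(i + 1) * seg_len] for i in range(n - 1)]
    segments.append(frames[(n - 1) * seg_len:])  # last segment absorbs the remainder
    return segments
```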
And S102, updating the video clip according to the front and back image frames at each division position.
In this embodiment, dividing the video to be processed by time, black screen or abrupt picture change is only a preliminary division, and two adjacent video segments may still contain image frames with partially similar content. So that different video segments express different video content as far as possible, the preliminarily divided video segments can be adjusted further.
For example, the step S102 may specifically include:
taking two image frames before and after the current division position as target image frames;
moving the target image frame to a previous video segment corresponding to the division position to obtain a first moving segment, and taking a next video segment corresponding to the division position after moving as a second moving segment;
moving the target image frame to the next video segment corresponding to the division position to obtain a third moving segment, and taking the previous video segment corresponding to the division position after moving as a fourth moving segment;
updating the video segment according to the first moving segment, the second moving segment, the third moving segment, and the fourth moving segment.
In this embodiment, two changes may be applied to the currently divided video segments: one assigns the two image frames before and after the division position to the preceding video segment, and the other assigns them to the following video segment. Whether and how to update the original video segments is then decided by comparing the changed preceding and following video segments with the original ones.
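The two candidate moves can be sketched as follows, under one plausible reading of the description in which the target frames are the last frame of the preceding segment and the first frame of the following segment; the helper name and the plain-list representation of a segment are illustrative assumptions.

```python
def candidate_moves(prev_seg, next_seg):
    """Build the two candidate re-divisions around one division position."""
    # move both target frames into the preceding segment (move 1)
    first_moving = prev_seg + [next_seg[0]]    # preceding segment after move 1 (first moving segment)
    second_moving = next_seg[1:]               # following segment after move 1 (second moving segment)
    # move both target frames into the following segment (move 2)
    third_moving = [prev_seg[-1]] + next_seg   # following segment after move 2 (third moving segment)
    fourth_moving = prev_seg[:-1]              # preceding segment after move 2 (fourth moving segment)
    return first_moving, second_moving, third_moving, fourth_moving
```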
For example, the step of "updating the video segment according to the first moving segment, the second moving segment, the third moving segment and the fourth moving segment" may further include:
determining the difference degree between the first moving segment and the corresponding second moving segment to obtain a first difference degree;
determining the difference degree between the third moving segment and the corresponding fourth moving segment to obtain a second difference degree;
determining the difference degree between the previous video clip and the corresponding next video clip to obtain a third difference degree;
and updating the video segment according to the first difference, the second difference and the third difference.
In this embodiment, the first difference, the second difference and the third difference may be determined by similar methods, wherein the determining the difference may include:
determining a gray value array corresponding to each image frame;
and determining the square difference between any one video segment and the corresponding other video segment according to the gray value array so as to obtain the difference degree.
In this embodiment, the gray value array is a one-dimensional array composed of a plurality of gray values, and each image frame corresponds to one gray value array. Any one of the video segments may be one of the first moving segment, the second moving segment, the third moving segment, the preceding video segment and the following video segment. The average of all the gray value arrays corresponding to a video segment can be taken to obtain its average gray value array, and the squared difference between the average gray value arrays of two corresponding video segments then gives the degree of difference. Generally, the larger the squared difference between the two average gray value arrays, the more dissimilar the contents of the corresponding two video segments, that is, the greater the degree of difference; besides the squared difference, the difference may also be expressed by a cosine value.
The step of determining the gray value array corresponding to each image frame specifically includes:
determining a gray scale map for each of the image frames;
reading the gray values of the pixel points in the gray map according to a preset sequence;
and generating a gray value array corresponding to the image frame according to the read gray value.
In this embodiment, an image frame with RGB channels may first be converted into a single-channel grayscale map, and the gray value of each pixel in the grayscale map is then read in a fixed order to form an array, which serves as the gray value array.
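A minimal sketch of the gray value array and the squared-difference measure, assuming frames are OpenCV-style BGR numpy arrays; resizing every frame to a common size is an added assumption so that the arrays can be compared element-wise.

```python
import cv2
import numpy as np

def gray_value_array(frame: np.ndarray, size=(64, 36)) -> np.ndarray:
    """Single-channel gray map of a frame, flattened in a fixed pixel order."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size)                 # common size so arrays are comparable
    return gray.flatten().astype(np.float64)

def segment_difference(seg_a, seg_b) -> float:
    """Squared difference between the average gray value arrays of two segments."""
    mean_a = np.mean([gray_value_array(f) for f in seg_a], axis=0)
    mean_b = np.mean([gray_value_array(f) for f in seg_b], axis=0)
    return float(np.mean((mean_a - mean_b) ** 2))
```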
Wherein, the step of updating the video segment according to the first difference, the second difference and the third difference specifically includes:
taking the difference with the largest value among the first difference, the second difference and the third difference as a target difference;
when the target difference degree is the first difference degree or the second difference degree, or the updated frequency does not reach the preset frequency, taking the moved front and back two video segments corresponding to the target difference degree as corresponding updated video segments, updating the dividing position according to the updated video segments, and then returning to execute the step of taking the front and back two image frames at the current dividing position as target image frames;
and when the target difference degree is the third difference degree or the updated times reach the preset times, taking the front and rear video segments corresponding to the target difference degree as the corresponding updated video segments, and stopping updating.
In this embodiment, the preset number of times may be a maximum number of iteration steps set in advance, for example 4. Generally, if the pair of video segments with the largest difference is the unmoved pair, the current division is already optimal and the division position does not need to be adjusted. If the pair with the largest difference is a moved pair, the current division is not yet optimal; the division position is adjusted with that pair as the new reference, and the difference analysis and comparison are performed again, until the pair with the largest difference is the unmoved pair or the number of adjustments (i.e., updates) reaches the maximum number of iteration steps, at which point the updating process ends.
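One plausible reading of this stopping rule is sketched below, reusing the candidate_moves and segment_difference helpers from the earlier sketches; max_iter stands in for the preset number of updates and is an illustrative default.

```python
def adjust_boundary(prev_seg, next_seg, max_iter=4):
    """Iteratively refine one division position, following the stopping rule above."""
    for _ in range(max_iter):
        if len(prev_seg) < 2 or len(next_seg) < 2:
            break                                    # nothing left to move safely
        fm, sm, tm, fo = candidate_moves(prev_seg, next_seg)
        d1 = segment_difference(fm, sm)              # first degree of difference
        d2 = segment_difference(tm, fo)              # second degree of difference
        d3 = segment_difference(prev_seg, next_seg)  # third degree of difference
        if d3 >= d1 and d3 >= d2:
            break                                    # unmoved pair differs most: division is optimal
        # keep the moved pair with the larger difference and try again
        prev_seg, next_seg = (fm, sm) if d1 >= d2 else (fo, tm)
    return prev_seg, next_seg
```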
And S103, determining the updated key image frame of the video clip.
For example, the step S103 may specifically include:
filtering out a front preset image frame and a rear preset image frame of each updated video clip to obtain a corresponding filtered clip;
determining the image quality score and/or the image aesthetic score of each image frame in the filtered segment;
and determining the key image frame from the corresponding filtered segment according to the image quality score and/or the image aesthetic score.
In this embodiment, the numbers of leading and trailing frames to filter out can be set manually, for example 1 or 2. After the final video segments are determined, a highly representative image frame needs to be selected from each segment as the key image frame. Generally, the first and last image frames of a video segment are unlikely to become key image frames, while image frames around the middle of the segment are more likely to. Therefore, when selecting a key image frame, the image frame at the middle moment of the video segment can be used directly, or the one or more (i.e., the preset number of) image frames at the start and end of the segment can be filtered out first and the key image frame then selected from the remaining frames.
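A minimal sketch of this filtering step, assuming the segment is a plain list of frames and k is the manually set preset count.

```python
def filter_segment(segment, k=1):
    """Drop the first k and last k frames of an updated segment (k is the preset count)."""
    return segment[k:len(segment) - k] if len(segment) > 2 * k else segment
```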
Because object motion, lens shake, rotation and the like during shooting easily cause some frames to appear blurred or distorted, or the subject to be captured poorly (for example, a person's eyes are closed in a portrait), when selecting the key image frame the frames in the remaining video segment need to be analysed from the two evaluation angles of image quality score and/or image aesthetic score, and the key image frame is selected accordingly.
The image quality score is obtained after image quality evaluation of the video segment. Image quality evaluation is one of the basic technologies in image processing: the image is analysed for its characteristics and its quality is then evaluated. Referring to fig. 3, fig. 3 shows the reference factors involved in image quality evaluation, which can be roughly grouped into nine types: exposure, sharpness, focus, artifacts, color, flash, anti-shake, noise and texture. Each type involves at least one reference factor; focusing, for example, involves two reference factors, namely focusing speed and focusing repeatability. Different types of reference factors cause different image defects. Most of the evaluation is realised by means of deep learning models, which can help filter out image frames with low image quality (i.e., a low image quality score), such as frames with low definition or much noise, from the video segments.
The image aesthetic score is obtained after image aesthetic evaluation of the video segment. Image aesthetic evaluation is a technology that lets a computer or robot "discover the beauty of an image" and "understand the beauty of an image", imitating human judgements of whether an image looks good or bad. It is mainly based on high-level semantic features such as composition, facial expression and color, and is mostly realised by means of deep learning models. It can help filter out image frames with a low aesthetic value (i.e., a low image aesthetic score), such as frames in which the person's eyes are closed or the person is at the corner of the shot, and select frames with a high aesthetic value, such as frames in which the person has a rich expression and is at the center of the picture.
Wherein, the step of determining the key image frame from the corresponding filtered segment according to the image quality score and/or the image aesthetic score specifically comprises the following steps:
filtering the image frames in the filtered segments according to the image quality scores and/or the image aesthetic scores to obtain retained segments;
determining corresponding center time according to the generation time of each image frame in the updated video clip;
key image frames are determined from the retention segment according to the center time.
In this embodiment, the video segment may be filtered twice, separately or simultaneously, based on the image quality score and/or the image aesthetic score, keeping the image frames whose image quality score and/or image aesthetic score reaches a certain threshold, and then the image frame closest to the center of the video segment is selected from the remaining image frames (i.e., the retained segment) as the key image frame.
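The selection of the key image frame can be sketched as follows, assuming each frame record carries a timestamp and pre-computed quality and aesthetic scores; the dictionary layout, the thresholds and the scoring models themselves are assumptions made for illustration.

```python
def select_key_frame(frames, q_thresh=0.5, a_thresh=0.5):
    """Pick the retained frame closest to the segment's center time.

    Each element of `frames` is assumed to look like
    {"image": ..., "time": <seconds>, "quality": <score>, "aesthetic": <score>}.
    """
    retained = [f for f in frames
                if f["quality"] >= q_thresh and f["aesthetic"] >= a_thresh]
    if not retained:            # nothing passes the thresholds: fall back to all frames
        retained = frames
    center = (frames[0]["time"] + frames[-1]["time"]) / 2.0
    return min(retained, key=lambda f: abs(f["time"] - center))
```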
And S104, generating an abstract image of the video to be processed according to the key image frame so as to display the video to be processed.
In this embodiment, the abstract image may be a long image formed by splicing a plurality of images from top to bottom or from left to right, and the display may be reduced during the display, or the original image may be displayed by scrolling up and down or scrolling left and right. The abstract image comprises a picture part and a character part and is mainly used for displaying the main content of the video clip in a picture and text mode.
For example, the step S104 may specifically include:
when the subtitle information exists on each image frame in the video to be processed, generating target subtitle information corresponding to the video clip according to the subtitle information of the image frame in each video clip;
when the video to be processed comprises the image frame without the subtitle information, generating target subtitle information of each video segment according to the audio frame of the video to be processed and the image frame or the audio frame;
and generating the abstract image of the video to be processed according to the target subtitle information and the key image frame.
In this embodiment, subtitle information refers to non-image content that displays explanatory text, similar to the dialogue of drama, film and stage works, in text form; for example, the title, credits, lyrics, dialogue and commentary of a film can be displayed as subtitle information. Subtitle information also refers to text added in the post-production of a film (i.e., video); it usually appears in the lower area of the video, but can also appear at both sides or in the upper area. Detection of subtitle information may be implemented by means of OCR (Optical Character Recognition) technology; based on deep learning, OCR can recognise the text in the video as well as its font and font size. The target subtitle information may be the main or the entire narrative content contained in a single video segment.
It should be noted that a video file generally contains two parts, image data and audio data, which correspond one-to-one in playing time. The image data of a video file can be extracted with tools such as FFmpeg or Free Studio, and the image frames in the image data are saved in chronological order in a fixed format such as JPEG; the audio data of the video file can be extracted with video format conversion software such as Format Factory and stored in a fixed format such as MP3.
There may not be subtitle information on every image frame, and some videos carry no subtitles at all. When every image frame in the video has subtitle information, the textual narrative content (i.e., the subtitle information) can be obtained directly by means of OCR. When the video contains image frames without subtitle information, the textual narrative content can be obtained by recognising the audio data alone with speech recognition, or by combining speech recognition with OCR. That is, the above step of "generating the target subtitle information of each video segment from the audio frames of the video to be processed and the image frames, or from the audio frames" may specifically include:
when the subtitle information does not exist on each image frame in the video to be processed, determining the text information of the audio frame corresponding to each video clip; generating target subtitle information of each video clip according to the character information;
when subtitle information does not exist on a part of image frames in the video to be processed, determining character information of the audio frames corresponding to the part of image frames; and generating target subtitle information of each video clip according to the text information and the subtitle information.
In this embodiment, the speech recognition technology may be ASR (Automatic Speech Recognition). When ASR and OCR are combined to obtain the textual narrative content, the image frames with subtitle information can be processed by OCR, the image frames without subtitle information can be processed by ASR, and the two results are then combined into the complete subtitle information of a single video segment. This complete subtitle information can be used directly as the target subtitle information of the video segment, or the subtitle information of the core content can be selected from it as the target subtitle information.
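A hedged sketch of combining OCR and ASR results into the target subtitle text of one segment; ocr_subtitle and asr_transcribe are hypothetical placeholders for an OCR engine and a speech-recognition service, not real APIs, and the merging rule shown is a simplification of the description above.

```python
def ocr_subtitle(frame) -> str:
    """Hypothetical placeholder for an OCR engine that reads on-screen subtitles."""
    return ""

def asr_transcribe(audio_clip) -> str:
    """Hypothetical placeholder for a speech-recognition (ASR) service."""
    return ""

def segment_subtitles(frames, audio_clip) -> str:
    """Combine OCR and ASR results into the target subtitle text of one segment."""
    texts = [t for t in (ocr_subtitle(f) for f in frames) if t]
    if not texts:                       # no on-screen subtitles at all: rely on ASR
        texts = [asr_transcribe(audio_clip)]
    # drop consecutive duplicates (the same subtitle usually spans many frames)
    merged = [t for i, t in enumerate(texts) if i == 0 or t != texts[i - 1]]
    return " ".join(merged).strip()
```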
In addition, the step of "generating the abstract image of the video to be processed according to the target subtitle information and the key image frame" includes:
generating the target caption information on the key image frame to obtain a corresponding image-text frame;
splicing the image-text frames in sequence according to the arrangement sequence of the video clips;
and generating the abstract image of the video to be processed according to the spliced image-text frame.
In this embodiment, if subtitle information already exists on a key image frame, it may first be removed and the target subtitle information then generated on the key image frame. The generation position may be defined manually, for example the lower region of the key image frame; during generation, the size, font and layout of the characters in the target subtitle information can all be adjusted. The key image frames with the generated target subtitle information are then spliced together in chronological order to form a long image; the splicing may be from left to right, from top to bottom, and so on.
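A sketch of the splicing step with OpenCV, assuming all key frames share one resolution and each target subtitle fits on a single line; the font settings are illustrative choices (note that cv2.putText only handles ASCII text, so non-Latin subtitles would need a different drawing library).

```python
import cv2
import numpy as np

def build_summary_image(key_frames, subtitles) -> np.ndarray:
    """Draw each segment's target subtitle on its key frame and stitch top-to-bottom."""
    panels = []
    for frame, text in zip(key_frames, subtitles):
        panel = frame.copy()
        h = panel.shape[0]
        # draw the subtitle in the lower region of the key frame
        cv2.putText(panel, text, (10, h - 20), cv2.FONT_HERSHEY_SIMPLEX,
                    0.8, (255, 255, 255), 2, cv2.LINE_AA)
        panels.append(panel)
    return cv2.vconcat(panels)   # splice the image-text frames from top to bottom
```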
With the video processing method provided by the application, a video processing instruction is obtained, the instruction carrying the video to be processed, which comprises a plurality of consecutive image frames; the video to be processed is divided into a plurality of video segments, the video segments are updated according to the image frames before and after each division position, the key image frame of each updated video segment is determined, and the abstract image of the video to be processed is generated from the key image frames to display the video. The video content can thus be obtained without playing the video, with high flexibility and a fast acquisition speed; the method is applicable to various scenarios and has few usage restrictions.
According to the method described in the above embodiment, the following description will be made in detail by taking an example in which the video processing method is applied to a client.
Referring to fig. 4 and fig. 5, fig. 4 is a schematic flowchart of a video processing method according to an embodiment of the present application, and fig. 5 is a schematic block diagram of a video processing flow according to an embodiment of the present application, where the video processing method includes the following steps:
s201, a video processing instruction is obtained, wherein the video processing instruction carries a video to be processed, and the video to be processed comprises a plurality of continuous image frames and audio frames.
For example, the video to be processed may be a video played by a video website, a video shared or sent by a friend while chatting, or a video inserted in an information webpage. The video processing instruction may be generated by a user through manual triggering, for example, when the user browses a webpage into which a video is inserted, or receives a chat message carrying the video sent by a friend, the video processing instruction may be generated by triggering in a manner of pressing the video for a long time or clicking a right button to select an option.
S202, the video to be processed is divided to obtain a plurality of video segments.
For example, assuming that the provided to-be-processed video includes a frame sequence F, and the number of segments to be divided is determined to be n according to the actual duration of the to-be-processed video, the frame sequence F may be equally divided into n subsequences, each subsequence being a video segment.
S203, taking the front and the back image frames at the current dividing position as target image frames, moving the target image frames to the front video segment at the corresponding dividing position to obtain a first moving segment, taking the back video segment at the corresponding dividing position after moving as a second moving segment, simultaneously moving the target image frames to the back video segment at the corresponding dividing position to obtain a third moving segment, and taking the front video segment at the corresponding dividing position after moving as a fourth moving segment.
S204, determining the difference degree between the first moving segment and the corresponding second moving segment to obtain a first difference degree, determining the difference degree between the third moving segment and the corresponding fourth moving segment to obtain a second difference degree, and determining the difference degree between the previous video segment and the corresponding next video segment to obtain a third difference degree.
For example, when the boundary between the two currently divided adjacent video segments A1 and A2 is moved, there are two moving methods: one assigns the two image frames before and after the division position to the preceding segment (method ①), giving video segments B1 and B2 after the move; the other assigns them to the following segment (method ②), giving video segments C1 and C2 after the move. To determine whether the division position of A1 and A2 needs to be optimised, the degrees of difference between A1 and A2, between B1 and B2, and between C1 and C2 need to be calculated respectively.
In this embodiment, the first difference, the second difference and the third difference may be determined by a similar method, wherein the determining the difference may include:
determining a gray value array corresponding to each image frame;
and determining the square difference between any one video segment and the corresponding other video segment according to the gray value array so as to obtain the difference degree.
In this embodiment, the gray value array is a one-dimensional array composed of a plurality of gray values, and each image frame corresponds to one gray value array. Any one of the video segments may be one of the first moving segment, the second moving segment, the third moving segment, the preceding video segment and the following video segment. The average of all the gray value arrays corresponding to a video segment can be taken to obtain its average gray value array, and the squared difference between the average gray value arrays of two corresponding video segments then gives the degree of difference. Generally, the larger the squared difference between the two average gray value arrays, the more dissimilar the contents of the corresponding two video segments, that is, the greater the degree of difference; besides the squared difference, the difference may also be expressed by a cosine value.
The step of determining the gray value array corresponding to each image frame specifically includes:
determining a gray scale map for each of the image frames;
reading the gray values of the pixel points in the gray map according to a preset sequence;
and generating a gray value array corresponding to the image frame according to the read gray value.
In this embodiment, an image frame with RGB channels may first be converted into a single-channel grayscale map, and the gray value of each pixel in the grayscale map is then read in a fixed order to form an array, which serves as the gray value array.
S205, the difference with the largest value among the first difference, the second difference and the third difference is taken as a target difference.
S206, when the target difference is the third difference or the updated times reach the preset times, taking the front and back video clips corresponding to the target difference as the corresponding updated video clips, and stopping updating.
S207, when the target difference degree is the first difference degree or the second difference degree, or the updated times do not reach the preset times, taking the moved front and rear video clips corresponding to the target difference degree as corresponding updated video clips, updating the division position according to the updated video clips, and then returning to execute the step S203.
For example, if the difference between A1 and A2 is the largest, the division position of the two video segments is already optimal and no further optimisation is needed. If the difference between B1 and B2 is the largest, the current division position can be improved using moving method ①, and the optimisation step for A1 and A2 is repeated based on the optimised B1 and B2. If the difference between C1 and C2 is the largest, the current division position can be improved using moving method ②, and the optimisation step is repeated based on the optimised C1 and C2, until the division is optimal or the maximum number of optimisations is reached.
S208, filtering out the front preset image frame and the rear preset image frame of each updated video clip to obtain the corresponding filtered clip, and determining the image quality score and/or the image aesthetic score of each image frame in the filtered clip.
S209, filtering the image frames in the filtered segment according to the image quality score and/or the image aesthetic score to obtain a reserved segment, and determining the corresponding center time according to the generation time of each image frame in the updated video segment.
For example, the first and last two image frames of each updated video segment may be filtered, and then the remaining image frames are input into the trained image quality assessment model and/or image aesthetic assessment model to obtain the image quality score and/or image aesthetic score of each image frame, and the image frames with lower image quality score and/or image aesthetic score are filtered out.
And S210, determining a key image frame from the retention segment according to the central moment.
And S211, when subtitle information exists on each image frame in the video to be processed, generating target subtitle information corresponding to the video clip according to the subtitle information of the image frame in each video clip.
S212, when the image frame without the subtitle information is included in the video to be processed, generating the target subtitle information of each video segment according to the audio frame of the video to be processed and the image frame or the audio frame.
In this embodiment, the step S212 may specifically include:
when the subtitle information does not exist on each image frame in the video to be processed, determining the text information of the audio frame corresponding to each video clip; generating target subtitle information of each video clip according to the character information;
when subtitle information does not exist on a part of image frames in the video to be processed, determining character information of the audio frames corresponding to the part of image frames; and generating target subtitle information of each video clip according to the text information and the subtitle information.
For example, for an image frame with subtitle information, the character description content can be obtained directly by means of an OCR technology, for an image frame without subtitle information, the character description content can be obtained by recognizing audio data through an ASR technology, and then the target subtitle information is generated according to the character description content of the same video segment.
And S213, generating the abstract image of the video to be processed according to the target subtitle information and the key image frame.
For example, referring to fig. 6, when a user long-presses a certain video, the background may be triggered to process the video to obtain a plurality of key image frames and the target subtitle information for each of them (for the specific processing steps, refer to steps S202-S212 above). If subtitle information already exists on a key image frame, it may first be removed; the target subtitle information is then generated in the lower area of the key image frame with the specified font, character type and character size, and the key image frames are spliced together from top to bottom in chronological order to form a long image, i.e., the abstract image, which is displayed to the user. The user may browse the full abstract image by scrolling up and down.
According to the methods described in the above embodiments, the present embodiment will be further described from the perspective of a video processing apparatus, which may be specifically implemented as a stand-alone entity or integrated in an electronic device.
Referring to fig. 7, fig. 7 specifically illustrates a video processing apparatus provided in an embodiment of the present application, which is applied to an electronic device, and the video processing apparatus may include: a dividing module 10, an updating module 20, a determining module 30 and a generating module 40, wherein:
(1) partitioning module 10
The dividing module 10 is configured to divide a video to be processed to obtain a plurality of video segments.
In this embodiment, the video to be processed may be a video played on a video website, a video shared or sent in a chat with a friend, a video inserted into an information webpage, and the like. Before the video is processed, it may be detected whether a video processing instruction is received; the video processing instruction instructs the apparatus to process the video. The instruction may be generated automatically by the system, for example when a user uploads a produced video to a video editing website, the website back-end server may generate the instruction automatically; it may also be triggered manually by the user, for example by long-pressing the video or right-clicking and selecting an option while browsing a webpage into which the video is inserted, receiving a chat message carrying the video, or playing the video.
The division is usually an equal division, and the number of equal parts can be set in advance; for example, the number of segments into which videos of different durations should be divided is preset, the actual number of segments is determined according to the actual duration of the video to be processed, and the video is then divided equally. Alternatively, positions in the video to be processed where a black screen or an abrupt change of picture occurs can be detected, and the video segments are divided at those positions.
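As an illustration only, a minimal Python sketch of the equal-division approach, assuming the frames have already been decoded into a list and that the duration-to-segment-count mapping (a placeholder here) has been configured in advance:

```python
def divide_equally(frames, duration_seconds):
    """Split a list of decoded frames into roughly equal segments.

    The mapping from duration to segment count is illustrative only;
    in practice it would be configured in advance.
    """
    if duration_seconds < 60:
        num_segments = 4
    elif duration_seconds < 300:
        num_segments = 8
    else:
        num_segments = 12

    segment_len = max(1, len(frames) // num_segments)
    segments = [frames[i:i + segment_len]
                for i in range(0, len(frames), segment_len)]
    # Merge a tiny trailing remainder into the last full segment.
    if len(segments) > 1 and len(segments[-1]) < segment_len // 2:
        segments[-2].extend(segments.pop())
    return segments
```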
(2) Update module 20
And the updating module 20 is used for updating the video segment according to the front image frame and the rear image frame at each division position.
In this embodiment, dividing the video to be processed by time, by black screen, or by abrupt picture change is only a preliminary division, and two adjacent video segments may still contain image frames with similar content. So that different video segments express different video content as far as possible, the preliminarily divided video segments can be further adjusted.
For example, referring to fig. 8, the update module 20 specifically includes:
a determining unit 21, configured to take two previous and next image frames at the current division position as target image frames;
the first moving unit 22 is configured to move the target image frame to a previous video segment corresponding to the division position to obtain a first moving segment, and use a next video segment corresponding to the division position after the movement as a second moving segment;
a second moving unit 23, configured to move the target image frame to the next video segment corresponding to the division position to obtain a third moving segment, and use the previous video segment corresponding to the division position after moving as a fourth moving segment;
an updating unit 24, configured to update the video segment according to the first moving segment, the second moving segment, the third moving segment, and the fourth moving segment.
In this embodiment, two changes may be applied to the currently divided video segments: one classifies the two image frames before and after the division position into the previous video segment, and the other classifies them into the next video segment. Whether and how to update the original video segments is then decided according to the changed video segments and the original video segments.
The updating unit 24 is specifically configured to:
determining the difference degree between the first moving segment and the corresponding second moving segment to obtain a first difference degree;
determining the difference degree between the third moving segment and the corresponding fourth moving segment to obtain a second difference degree;
determining the difference degree between the previous video clip and the corresponding next video clip to obtain a third difference degree;
and updating the video segment according to the first difference, the second difference and the third difference.
In this embodiment, the first difference, the second difference and the third difference can be determined by similar methods, wherein the updating unit 24 is specifically configured to:
determining a gray value array corresponding to each image frame;
and determining the square difference between any one video segment and the corresponding other video segment according to the gray value array so as to obtain the difference degree.
In this embodiment, the gray value array is a one-dimensional array composed of a plurality of gray values, and each image frame corresponds to one gray value array. Any one of the video segments may be the first moving segment, the second moving segment, the third moving segment, the fourth moving segment, the previous video segment or the next video segment. The average of all the gray value arrays corresponding to a video segment can be taken to obtain an average gray value array, and the square difference between the average gray value arrays of two corresponding video segments then gives the difference degree. Generally, the larger the square difference between the two average gray value arrays, the more dissimilar the contents of the two video segments, that is, the greater the difference degree; besides the square difference, the difference degree may also be expressed by a cosine value.
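A sketch of this computation, assuming each frame has already been reduced to a gray value array of the same length; the mean-squared form of the "square difference" is an assumption about normalization:

```python
import numpy as np

def segment_mean_gray(gray_arrays):
    """Average the gray value arrays of all image frames in one segment."""
    return np.mean(np.stack(gray_arrays), axis=0)

def difference_degree(segment_a, segment_b):
    """Mean squared difference between the average gray value arrays of two
    segments; the larger the value, the less similar the segments are."""
    return float(np.mean((segment_mean_gray(segment_a)
                          - segment_mean_gray(segment_b)) ** 2))
```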
The updating unit 24 is specifically configured to:
determining a gray scale map for each of the image frames;
reading the gray values of the pixel points in the gray map according to a preset sequence;
and generating a gray value array corresponding to the image frame according to the read gray value.
In this embodiment, the image frame in the RGB channels may first be converted into a single-channel grayscale image, and the gray value of each pixel in the grayscale image is then read in a fixed order to form an array, which serves as the gray value array.
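A sketch with OpenCV, assuming BGR frames as returned by cv2 and using a row-by-row reading order for the "preset sequence"; the downscaling to a fixed size is an added assumption that keeps the arrays short and comparable:

```python
import cv2
import numpy as np

def gray_value_array(frame_bgr, size=(32, 32)):
    """Convert an image frame to a single-channel grayscale image and flatten
    its pixel gray values row by row into a one-dimensional array."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size)
    return gray.flatten().astype(np.float32)
```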
The updating unit 24 is specifically configured to:
taking the difference with the largest value among the first difference, the second difference and the third difference as a target difference;
when the target difference degree is the first difference degree or the second difference degree and the number of updates has not reached the preset number of times, taking the two moved video segments corresponding to the target difference degree as the corresponding updated video segments, updating the division position according to the updated video segments, and then returning to the step of taking the two image frames before and after the current division position as the target image frames;
and when the target difference degree is the third difference degree, or the number of updates reaches the preset number of times, taking the two video segments corresponding to the target difference degree as the corresponding updated video segments and stopping the update.
In this embodiment, the preset number of times may be a maximum number of iterations set in advance, for example 4. Generally, if the video segments with the largest difference are the untransformed video segments, the current division can be regarded as the best and the division position does not need to be adjusted. If the video segments with the largest difference are transformed ones, the current division is not yet the best; the division position is then adjusted with the two video segments having the largest difference as the new reference, and the difference analysis and comparison are performed again, until the video segments with the largest difference are untransformed video segments or the number of adjustments (i.e., the number of updates) reaches the maximum number of iterations, at which point the updating process ends.
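Read this way, the boundary adjustment can be sketched as the loop below; it reuses the gray value arrays from the earlier sketches, and the bookkeeping is simplified to a single division position:

```python
import numpy as np

def difference_degree(seg_a, seg_b):
    """Mean squared difference of the two segments' average gray value arrays
    (same definition as in the earlier sketch)."""
    return float(np.mean((np.mean(np.stack(seg_a), axis=0)
                          - np.mean(np.stack(seg_b), axis=0)) ** 2))

def adjust_boundary(prev_seg, next_seg, max_iters=4):
    """Iteratively move the two frames around the division position into the
    previous or the next segment, keeping whichever arrangement yields the
    largest difference degree; stop when keeping the current division wins
    or when the iteration limit is reached."""
    for _ in range(max_iters):
        if len(prev_seg) < 2 or len(next_seg) < 2:
            break
        first_move = (prev_seg + [next_seg[0]], next_seg[1:])    # both boundary frames into previous
        third_move = (prev_seg[:-1], [prev_seg[-1]] + next_seg)  # both boundary frames into next
        candidates = {
            "first": difference_degree(*first_move),
            "third": difference_degree(*third_move),
            "keep": difference_degree(prev_seg, next_seg),
        }
        best = max(candidates, key=candidates.get)
        if best == "keep":
            break  # the unmoved division already has the largest difference
        prev_seg, next_seg = first_move if best == "first" else third_move
    return prev_seg, next_seg
```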
(3) Determination module 30
A determining module 30, configured to determine the key image frame of each updated video segment.
For example, the determining module 30 is specifically configured to:
filtering out a preset number of leading image frames and a preset number of trailing image frames from each updated video segment to obtain a corresponding filtered segment;
determining the image quality score and/or the image aesthetic score of each image frame in the filtered segment;
and determining the key image frame from the corresponding filtered segment according to the image quality score and/or the image aesthetic score.
In this embodiment, the numbers of leading and trailing image frames to filter out can be set manually, for example 1 or 2. After the final video segments are determined, a highly representative image frame needs to be selected from each video segment as the key image frame. Generally, the starting and ending image frames of a video segment are unlikely to become key image frames, while the image frame at the middle moment of the segment is more likely to. Therefore, when selecting the key image frame, the image frame at the middle moment of the video segment can be used directly as the key image frame, or one or more (i.e., the preset number of) image frames at the start and end of the segment can be filtered out first, and the key image frame is then selected from the remaining frames.
Because object motion, lens shake, rotation and the like during shooting easily make some images blurred or distorted, or capture a person poorly (for example, with closed eyes in a portrait), when selecting the key image frame the image frames in the remaining segment need to be analyzed from the two evaluation angles of image quality score and/or image aesthetic score, and the key image frame is selected from them accordingly.
The image quality score is obtained after image quality evaluation is performed on the video segment. Image quality evaluation is one of the basic technologies in image processing: the image is analyzed for its characteristics and its quality is then evaluated. Referring to fig. 3, fig. 3 shows the reference factors involved in image quality evaluation, which can be roughly divided into 9 types: exposure, sharpness, focus, artifact, color, flash, anti-shake, noise and texture. Each type involves at least one reference factor; for example, focus involves two reference factors, focusing speed and focusing repetition precision. Different reference factors correspond to different image defects. Most of the evaluation is implemented by means of deep learning models, which can help filter out image frames of low image quality (i.e., with a low image quality score), such as frames with low definition or much noise, from the video segment.
The image aesthetic score is obtained after image aesthetic evaluation is performed on the video segment. Image aesthetic evaluation is a technology that enables a computer or robot to "discover the beauty of an image" and "understand the beauty of an image", imitating how humans judge whether an image is beautiful. It mainly evaluates high-level semantic features such as composition, facial expression and color, and is mostly implemented by means of deep learning models. It can help filter out image frames with low aesthetic value (i.e., a low image aesthetic score), such as frames where a person has closed eyes or is located at the corner of the shot, and select image frames with high aesthetic value, such as frames where the person has a rich expression and is located at the center.
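Both evaluations rely on trained deep learning models, which are not reproduced here; purely as an illustration, a crude sharpness proxy such as the variance of the Laplacian could stand in for an image quality score in a prototype:

```python
import cv2

def rough_quality_score(frame_bgr):
    """Variance of the Laplacian as a simple sharpness measure -- a stand-in
    for a trained image quality model, not the evaluation described above."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())
```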
Wherein the determining module 30 is specifically configured to:
filtering the image frames in the filtered segments according to the image quality scores and/or the image aesthetic scores to obtain retained segments;
determining corresponding center time according to the generation time of each image frame in the updated video clip;
key image frames are determined from the retention segment according to the center time.
In this embodiment, the video segment may be filtered twice, separately or simultaneously, based on the image quality score and/or the image aesthetic score, keeping the image frames whose image quality score and/or image aesthetic score reaches a certain threshold; the image frame closest to the center moment of the video segment is then selected from the remaining image frames (i.e., the retained segment) as the key image frame.
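A sketch of this selection, assuming each frame is represented as a (timestamp, image, score) record and that a single score threshold stands in for the quality/aesthetic filtering:

```python
def select_key_frame(frames, num_trim=1, score_threshold=0.5):
    """frames: list of (timestamp, image, score) tuples in time order.

    Drop the leading/trailing frames and the low-scoring frames, then pick
    the retained frame whose timestamp is closest to the segment's center."""
    if len(frames) > 2 * num_trim:
        frames = frames[num_trim:-num_trim]
    retained = [f for f in frames if f[2] >= score_threshold] or frames
    center_time = (frames[0][0] + frames[-1][0]) / 2
    return min(retained, key=lambda f: abs(f[0] - center_time))
```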
(4) Generating module 40
And the generating module 40 is configured to generate an abstract image of the to-be-processed video according to the key image frame, so as to display the to-be-processed video.
In this embodiment, the abstract image may be a long image formed by splicing a plurality of images from top to bottom or from left to right; it may be displayed scaled down, or displayed at the original size by scrolling up and down or left and right. The abstract image includes a picture part and a text part and is mainly used to present the main content of the video segments in graphic and textual form.
For example, referring to fig. 9, the generating module 40 includes:
a first generating unit 41, configured to generate target subtitle information of a corresponding video segment according to subtitle information of an image frame in each video segment when subtitle information exists on each image frame in the video to be processed; when the video to be processed comprises the image frame without the subtitle information, generating target subtitle information of each video segment according to the audio frame of the video to be processed and the image frame or the audio frame;
and a second generating unit 42, configured to generate a summary image of the to-be-processed video according to the target subtitle information and the key image frame.
In this embodiment, subtitle information refers to non-image content that displays explanatory text, similar to the dialogue in drama, film and stage works, in textual form; for example the title, credits, lyrics, dialogue and commentary of a film can be displayed as subtitle information. Subtitle information also refers to text added in the post-production of the film (i.e., video); it usually appears in the lower area of the video, but can also appear at both sides or in the upper area. Subtitle information can be detected by means of OCR (Optical Character Recognition); based on deep learning, OCR can recognize the text in the video as well as its font and font size. The target subtitle information may be the main part, or all, of the narration contained in a single video segment.
It should be noted that a video file generally includes two parts, image data and audio data, which correspond one-to-one in playing time. The image data can be extracted with tools such as FFmpeg or Free Studio, and the image frames in it are saved in chronological order in a fixed format such as jpeg; the audio data can be extracted with video format conversion software (for example, Format Factory) and stored in a fixed format such as mp3.
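As a minimal sketch of separating the two streams with FFmpeg called from Python (assuming the ffmpeg binary is on the PATH; the paths, naming pattern and sampling rate are placeholders):

```python
import os
import subprocess

def extract_streams(video_path, frame_dir="frames", audio_path="audio.mp3", fps=1):
    """Dump image frames as numbered jpeg files and the audio track as mp3."""
    os.makedirs(frame_dir, exist_ok=True)
    # Sample image frames at the given rate and save them in time order.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}",
         os.path.join(frame_dir, "frame_%05d.jpg")],
        check=True)
    # Drop the video stream and keep only the audio, encoded as mp3.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vn", "-q:a", "2", audio_path],
        check=True)
```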
Since subtitle information may not exist on every image frame, and some videos carry no subtitles at all, the processing differs by case: when every image frame in the video carries subtitle information, the textual narration (i.e., the subtitle information) can be obtained directly by means of OCR; when the video contains image frames without subtitle information, the textual narration can be obtained by recognizing the audio data with a speech recognition technology alone, or by combining speech recognition with OCR. That is, the first generating unit 41 is specifically configured to:
when no subtitle information exists on any image frame in the video to be processed, determining the text information of the audio frames corresponding to each video segment, and generating the target subtitle information of each video segment according to the text information;
when subtitle information does not exist on a part of the image frames in the video to be processed, determining the text information of the audio frames corresponding to that part of the image frames, and generating the target subtitle information of each video segment according to the text information and the subtitle information.
In this embodiment, the speech recognition technology may be ASR (Automatic Speech Recognition). When ASR and OCR are combined to obtain the textual narration, the image frames with subtitle information can be processed by OCR and the image frames without subtitle information by ASR; the two results are then merged into the complete subtitle information of a single video segment. The complete subtitle information can be used directly as the target subtitle information of the video segment, or the subtitle information covering the core content can be selected from it as the target subtitle information.
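A sketch of this combination; ocr_text and asr_text are hypothetical caller-supplied recognition functions (any OCR/ASR backend could be plugged in), and the whole segment's audio is transcribed whenever any frame lacks a caption, although a finer implementation would restrict ASR to the time ranges of those frames:

```python
def segment_subtitles(frames_with_caption_flag, audio_clip, ocr_text, asr_text):
    """Combine OCR and ASR output into the subtitle text of one segment.

    frames_with_caption_flag: list of (frame, has_caption) pairs;
    ocr_text / asr_text: caller-supplied recognition functions (hypothetical).
    """
    parts = []
    need_asr = False
    for frame, has_caption in frames_with_caption_flag:
        if has_caption:
            parts.append(ocr_text(frame))
        else:
            need_asr = True
    if need_asr:  # fall back to speech recognition for uncaptioned frames
        parts.append(asr_text(audio_clip))
    # Deduplicate while keeping order, since captions repeat across frames.
    seen, merged = set(), []
    for p in parts:
        if p and p not in seen:
            seen.add(p)
            merged.append(p)
    return " ".join(merged)
```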
The second generating unit 42 is specifically configured to:
generating the target caption information on the key image frame to obtain a corresponding image-text frame;
splicing the image-text frames in sequence according to the arrangement sequence of the video clips;
and generating the abstract image of the video to be processed according to the spliced image-text frame.
In this embodiment, if subtitle information already exists on a key image frame, it may be removed first, and the target subtitle information is then generated on the key image frame. The generation position can be defined manually, for example the lower area of the key image frame; during generation, the size, font and layout of the text in the target subtitle information can all be adjusted. The key image frames carrying the generated target subtitle information are then spliced together in chronological order to form a long image; the splicing may be from left to right, from top to bottom, and the like.
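A sketch of the splicing with Pillow, drawing the target subtitle text in a band under each key frame and stacking the frames from top to bottom; the font path and the band height are placeholders:

```python
from PIL import Image, ImageDraw, ImageFont

def build_summary_image(key_frames, subtitles, font_path="font.ttf",
                        caption_height=60, font_size=28):
    """key_frames: list of PIL.Image objects in time order;
    subtitles: one target subtitle string per key frame."""
    font = ImageFont.truetype(font_path, font_size)
    width = max(img.width for img in key_frames)
    panels = []
    for img, text in zip(key_frames, subtitles):
        # One panel per key frame: the frame on top, its subtitle below.
        panel = Image.new("RGB", (width, img.height + caption_height), "white")
        panel.paste(img, ((width - img.width) // 2, 0))
        draw = ImageDraw.Draw(panel)
        draw.text((10, img.height + 10), text, fill="black", font=font)
        panels.append(panel)
    # Stack the panels vertically into one long summary image.
    summary = Image.new("RGB", (width, sum(p.height for p in panels)), "white")
    y = 0
    for panel in panels:
        summary.paste(panel, (0, y))
        y += panel.height
    return summary
```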
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, in the video processing apparatus provided in this embodiment, the dividing module 10 divides the video to be processed to obtain a plurality of video segments, the updating module 20 updates the video segments according to the two image frames before and after each division position, the determining module 30 determines the key image frame of each updated video segment, and the generating module 40 generates the abstract image of the video to be processed according to the key image frames so as to display the video to be processed. The video content can thus be obtained without playing the video; the method is flexible, fast, suitable for various scenes and subject to few usage restrictions.
Correspondingly, the embodiment of the invention also provides a video processing system, which comprises any one of the video processing devices provided by the embodiment of the invention, and the video processing device can be integrated in electronic equipment.
The electronic equipment can divide a video to be processed to obtain a plurality of video segments; updating the video clip according to the front and back image frames at each division position; determining the updated key image frame of the video clip; and generating an abstract image of the video to be processed according to the key image frame so as to display the video to be processed.
The specific implementation of each device can be referred to the previous embodiment, and is not described herein again.
Since the video processing system may include any video processing apparatus provided in the embodiment of the present invention, beneficial effects that can be achieved by any video processing apparatus provided in the embodiment of the present invention can be achieved, for details, see the foregoing embodiment, and are not described herein again.
Accordingly, an embodiment of the present invention further provides an electronic device, as shown in fig. 10, which shows a schematic structural diagram of the electronic device according to the embodiment of the present invention, specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, Radio Frequency (RF) circuitry 403, a power supply 404, an input unit 405, and a display unit 406. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 10 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The RF circuit 403 may be used for receiving and transmitting signals during information transmission and reception, and in particular, for receiving downlink information of a base station and then processing the received downlink information by the one or more processors 401; in addition, data relating to uplink is transmitted to the base station. In general, the RF circuitry 403 includes, but is not limited to, an antenna, at least one Amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuitry 403 may also communicate with networks and other devices via wireless communications. The wireless communication may use any communication standard or protocol, including but not limited to Global System for mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The electronic device further includes a power supply 404 (e.g., a battery) for supplying power to the various components, and preferably, the power supply 404 is logically connected to the processor 401 via a power management system, so that functions of managing charging, discharging, and power consumption are implemented via the power management system. The power supply 404 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may further include an input unit 405, and the input unit 405 may be used to receive input numeric or character information and generate a keyboard, mouse, joystick, optical or trackball signal input in relation to user settings and function control. Specifically, in one particular embodiment, input unit 405 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (e.g., operations by a user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 401, and can receive and execute commands sent by the processor 401. In addition, touch sensitive surfaces may be implemented using various types of resistive, capacitive, infrared, and surface acoustic waves. The input unit 405 may include other input devices in addition to the touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The electronic device may also include a display unit 406, and the display unit 406 may be used to display information input by or provided to the user as well as various graphical user interfaces of the electronic device, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 406 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 401 to determine the type of the touch event, and then the processor 401 provides a corresponding visual output on the display panel according to the type of the touch event. Although in FIG. 10 the touch sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch sensitive surface may be integrated with the display panel to implement input and output functions.
Although not shown, the electronic device may further include a camera, a bluetooth module, and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
dividing a video to be processed to obtain a plurality of video segments;
updating the video clip according to the front and back image frames at each division position;
determining the updated key image frame of the video clip;
and generating an abstract image of the video to be processed according to the key image frame so as to display the video to be processed.
The electronic device can achieve the beneficial effects that can be achieved by any video processing apparatus provided in the embodiments of the present invention, which are detailed in the foregoing embodiments and are not described herein again.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
The foregoing detailed description has provided a video processing method, an apparatus, a storage medium, and an electronic device according to embodiments of the present invention, and specific examples are applied herein to illustrate the principles and implementations of the present invention, and the above descriptions of the embodiments are only used to help understanding the method and the core concept of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (15)

1. A video processing method, comprising:
dividing a video to be processed to obtain a plurality of video segments;
updating the video clip according to the front and back image frames at each division position;
determining a key image frame of the updated video clip;
and generating an abstract image of the video to be processed according to the key image frame so as to display the video to be processed.
2. The video processing method according to claim 1, wherein the updating the video segment according to the two previous and next image frames at each division position comprises:
taking two image frames before and after the current division position as target image frames;
moving the target image frame to a previous video segment corresponding to the division position to obtain a first moving segment, and taking a next video segment corresponding to the division position after moving as a second moving segment;
moving the target image frame to the next video segment corresponding to the division position to obtain a third moving segment, and taking the previous video segment corresponding to the division position after moving as a fourth moving segment;
updating the video segment according to the first moving segment, the second moving segment, the third moving segment, and the fourth moving segment.
3. The video processing method according to claim 2, wherein said updating the video segment according to the first moving segment, the second moving segment, the third moving segment, and the fourth moving segment comprises:
determining the difference degree between the first moving segment and the corresponding second moving segment to obtain a first difference degree;
determining the difference degree between the third moving segment and the corresponding fourth moving segment to obtain a second difference degree;
determining the difference degree between the previous video clip and the corresponding next video clip to obtain a third difference degree;
updating the video segment according to the first difference degree, the second difference degree and the third difference degree.
4. The video processing method according to claim 3, wherein said updating the video segment according to the first degree of difference, the second degree of difference, and the third degree of difference comprises:
taking the difference with the largest value among the first difference, the second difference and the third difference as a target difference;
when the target difference degree is the first difference degree or the second difference degree and the number of updates does not reach a preset number of times, taking the moved front and back two video segments corresponding to the target difference degree as corresponding updated video segments, updating the dividing position according to the updated video segments, and then returning to execute the step of taking the front and back two image frames at the current dividing position as target image frames;
and when the target difference degree is the third difference degree or the updated times reach preset times, taking the front and rear video clips corresponding to the target difference degree as corresponding updated video clips, and stopping updating.
5. The video processing method according to claim 3, wherein the determining the degree of difference comprises:
determining a gray value array corresponding to each image frame;
and determining the square difference between any one video segment and the corresponding other video segment according to the gray value array so as to obtain the difference degree.
6. The method of claim 5, wherein the determining the gray value array corresponding to each image frame comprises:
determining a gray scale map for each of the image frames;
reading gray values of pixel points in the gray map according to a preset sequence;
and generating a gray value array corresponding to the image frame according to the read gray value.
7. The video processing method according to any of claims 1-6, wherein said determining the updated key image frames of the video segment comprises:
filtering out a front preset image frame and a rear preset image frame of each updated video clip to obtain a corresponding filtered clip;
determining the image quality score and/or the image aesthetic score of each image frame in the filtered segment;
and determining the key image frame from the corresponding filtered segment according to the image quality score and/or the image aesthetic score.
8. The video processing method according to claim 7, wherein said determining key image frames from corresponding filtered segments according to the image quality score and/or image aesthetic score comprises:
filtering the image frames in the filtered segments according to the image quality scores and/or the image aesthetic scores to obtain retained segments;
determining corresponding center time according to the generation time of each image frame in the updated video clip;
determining a key image frame from the retention segment according to the center time.
9. The video processing method according to any one of claims 1 to 6, wherein the generating a summary image of the video to be processed according to the key image frame comprises:
when subtitle information exists on each image frame in the video to be processed, generating target subtitle information corresponding to the video clip according to the subtitle information of the image frame in each video clip;
when the video to be processed comprises the image frame without the subtitle information, generating target subtitle information of each video segment according to the audio frame of the video to be processed and the image frame or the audio frame;
and generating the abstract image of the video to be processed according to the target subtitle information and the key image frame.
10. The video processing method according to claim 9, wherein the generating a summary image of the video to be processed according to the target subtitle information and the key image frame comprises:
generating the target caption information on the key image frame to obtain a corresponding image-text frame;
splicing the image-text frames in sequence according to the arrangement sequence of the video clips;
and generating the abstract image of the video to be processed according to the spliced image-text frame.
11. The video processing method according to claim 9, wherein said generating the target caption information of each of the video segments according to the audio frame and the image frame of the video to be processed, or the audio frame, comprises:
when the subtitle information does not exist on each image frame in the video to be processed, determining the text information of the audio frame corresponding to each video clip; generating target subtitle information of each video clip according to the text information;
when subtitle information does not exist on a part of image frames in the video to be processed, determining text information of the audio frames corresponding to the part of image frames; and generating target subtitle information of each video clip according to the text information and the subtitle information.
12. A video processing apparatus, comprising:
the dividing module is used for dividing the video to be processed to obtain a plurality of video segments;
the updating module is used for updating the video clip according to the front and rear image frames at each division position;
a determining module, configured to determine an updated key image frame of the video segment;
and the generating module is used for generating the abstract image of the video to be processed according to the key image frame so as to display the video to be processed.
13. The video processing apparatus according to claim 12, wherein the update module specifically includes:
the determining unit is used for taking two image frames before and after the current division position as a target image frame;
the first moving unit is used for moving the target image frame to a previous video segment corresponding to the division position to obtain a first moving segment, and taking a next video segment corresponding to the division position after moving as a second moving segment;
the second moving unit is used for moving the target image frame to the next video segment corresponding to the division position to obtain a third moving segment, and taking the previous video segment corresponding to the division position after moving as a fourth moving segment;
an updating unit, configured to update the video segment according to the first moving segment, the second moving segment, the third moving segment, and the fourth moving segment.
14. A computer-readable storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor to perform the video processing method of any of claims 1 to 11.
15. An electronic device comprising a processor and a memory, the processor being electrically connected to the memory, the memory being configured to store instructions and data, the processor being configured to perform the steps of the video processing method according to any one of claims 1 to 11.
CN201911035914.0A 2019-10-29 2019-10-29 Video processing method and device, storage medium and electronic equipment Active CN110868632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911035914.0A CN110868632B (en) 2019-10-29 2019-10-29 Video processing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN110868632A true CN110868632A (en) 2020-03-06
CN110868632B CN110868632B (en) 2022-09-09

Family

ID=69653828

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107592568A (en) * 2017-09-08 2018-01-16 维沃移动通信有限公司 A kind of video broadcasting method and terminal device
CN109063611A (en) * 2018-07-19 2018-12-21 北京影谱科技股份有限公司 A kind of face recognition result treating method and apparatus based on video semanteme
CN109819346A (en) * 2019-03-13 2019-05-28 联想(北京)有限公司 Video file processing method and processing device, computer system and readable storage medium storing program for executing

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113727200A (en) * 2021-08-27 2021-11-30 游艺星际(北京)科技有限公司 Video abstract information determination method and device, electronic equipment and storage medium
CN114063863A (en) * 2021-11-29 2022-02-18 维沃移动通信有限公司 Video processing method and device and electronic equipment
CN114063863B (en) * 2021-11-29 2024-10-15 维沃移动通信有限公司 Video processing method and device and electronic equipment
CN114461852A (en) * 2022-02-16 2022-05-10 中国平安人寿保险股份有限公司 Audio and video abstract extraction method, device, equipment and storage medium
CN114461852B (en) * 2022-02-16 2024-10-18 中国平安人寿保险股份有限公司 Audio and video abstract extraction method, device, equipment and storage medium
CN115866347A (en) * 2023-02-22 2023-03-28 北京百度网讯科技有限公司 Video processing method and device and electronic equipment
CN115866347B (en) * 2023-02-22 2023-08-01 北京百度网讯科技有限公司 Video processing method and device and electronic equipment
CN118626676A (en) * 2024-08-12 2024-09-10 成都莫声科技有限公司 Method and system for generating activity war report based on real-time data
CN118626676B (en) * 2024-08-12 2024-10-18 成都莫声科技有限公司 Method and system for generating activity war report based on real-time data
