CN117237843A - Video highlight extraction method and device, computer equipment and storage medium - Google Patents
Video highlight extraction method and device, computer equipment and storage medium

- Publication number: CN117237843A
- Application number: CN202311186901.XA
- Authority: CN (China)
- Legal status: Pending
Abstract
The application discloses a video highlight extraction method. The method comprises the following steps: acquiring a subtitle file of a target video, wherein the subtitle file comprises subtitle text and a timestamp corresponding to the subtitle text; performing content recognition on the subtitle text through a trained large language model to obtain a content recognition result, wherein the content recognition result comprises at least one highlight subtitle text; determining the timestamp corresponding to the at least one highlight subtitle text according to the subtitle file, and extracting at least one primary video segment from the target video according to the determined timestamp; scoring the at least one primary video segment through a trained aesthetic model to obtain an aesthetic score value for each primary video segment; and screening the primary video segments according to the aesthetic score values of the primary video segments to obtain video highlight segments. The application can accurately screen out video highlight segments.
Description
Technical Field
The present application relates to the field of video technologies, and in particular, to a method and apparatus for extracting a video highlight, a computer device, and a storage medium.
Background
As networks evolve, the amount of video on them keeps growing. Video is a rich medium that carries a large amount of content; however, the truly engaging or critical parts of a video are often buried in lengthy material. With today's fast pace of life, people do not have much time to spend watching lengthy videos, and there is therefore a need for highlight clips extracted from videos.
In the prior art, screening highlight video segments mostly requires manual intervention, that is, a person watches the video and manually selects and clips the highlight segments according to preset standards or requirements. However, manually screening highlight video clips is time-consuming and labor-intensive.
Disclosure of Invention
In view of the above, the present application provides a video highlight extraction method, apparatus, computer device, and computer-readable storage medium to solve the above problems.
The application provides a video highlight extraction method, which comprises the following steps:
acquiring a subtitle file of a target video, wherein the subtitle file comprises a subtitle text and a timestamp corresponding to the subtitle text;
performing content recognition on the caption text through the trained large language model to obtain a content recognition result, wherein the content recognition result comprises at least one highlight caption text;
Determining a time stamp corresponding to the at least one highlight subtitle text according to the subtitle file, and extracting at least one primary video segment from the target video according to the determined time stamp;
scoring the at least one primary video segment through the trained aesthetic model to obtain aesthetic scoring values of the primary video segments;
and screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments.
Optionally, the method further comprises:
detecting the scene switching frequency of the at least one primary selected video segment to obtain the scene switching frequency of each primary selected video segment;
determining the dynamic performance scoring value of each primary selected video segment according to the scene switching frequency of each primary selected video segment;
the step of screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments comprises the following steps:
and screening the primary selected video segments according to the aesthetic score value and the dynamic performance score value of each primary selected video segment to obtain video highlight segments.
Optionally, the method further comprises:
acquiring the number of barrages corresponding to each primary video segment;
Determining the heat scoring value of each primary selected video segment according to the bullet screen quantity corresponding to each primary selected video segment;
the step of screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments comprises the following steps:
and screening the primary selected video segments according to the aesthetic score value and the heat score value of each primary selected video segment to obtain video highlight segments.
Optionally, the method further comprises:
acquiring the number of barrages corresponding to each primary video segment;
determining the heat scoring value of each primary selected video segment according to the bullet screen quantity corresponding to each primary selected video segment;
the step of screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments comprises the following steps:
and screening the primary selected video segments according to the aesthetic score value, the dynamic performance score value and the heat score value of each primary selected video segment to obtain video highlight segments.
Optionally, the screening the primary selected video segments according to the aesthetic score value, the dynamic performance score value and the heat score value of each primary selected video segment to obtain the video highlight segment includes:
acquiring the aesthetic grading value, the dynamic performance grading value and the weight value corresponding to the heat grading value;
Calculating the total score value of each primary selected video segment according to the aesthetic score value, the dynamic performance score value, the heat score value, the weight value corresponding to the aesthetic score value, the weight value corresponding to the dynamic performance score value and the weight value corresponding to the heat score value of each primary selected video segment;
and selecting the initially selected video clips with total score values meeting preset conditions as video highlight clips.
Optionally, the obtaining the aesthetic score value, the dynamic performance score value, and the weight value corresponding to the heat score value includes:
acquiring type information of the target video;
and determining the aesthetic grading value, the dynamic performance grading value and the weight value corresponding to the heat grading value according to the type information.
Optionally, the content recognition of the caption text through the trained large language model, and obtaining a content recognition result includes:
inputting a prompt word into the large language model, wherein the prompt word comprises character information, the subtitle text, output guidance and output examples;
and carrying out content recognition on the caption text through the large language model to obtain a content recognition result.
Optionally, scoring the at least one primary video segment with the trained aesthetic model to obtain aesthetic scoring values for each primary video segment includes:
Respectively performing frame extraction processing on each primary selected video segment to obtain at least one key frame;
scoring key frames corresponding to each primary video segment through the trained aesthetic model to obtain aesthetic scoring values of the key frames of each primary video segment;
the aesthetic score value of each primary video segment is determined based on the aesthetic score values of the key frames of each primary video segment.
Optionally, the screening the primary selected video segments according to the aesthetic score values of the primary selected video segments to obtain video highlight segments includes:
detecting whether aesthetic grading values lower than a preset grading value exist in the aesthetic grading values of key frames of each primary selected video segment;
if the aesthetic score value lower than the preset score value exists in the aesthetic score values of the key frames of the current primary video clips, deleting the current primary video clips;
and screening the primary video segments according to the aesthetic score values of the key frames of the other primary video segments to obtain video highlight segments, wherein the other primary video segments are the primary video segments in which no aesthetic score value of a key frame is lower than the preset score value.
Optionally, the method further comprises:
inputting caption text corresponding to the video highlight into the large language model;
and generating the content description corresponding to the video highlight through the large language model.
The application also provides a video highlight extraction device, which comprises:
the acquisition module is used for acquiring a subtitle file of a target video, wherein the subtitle file comprises a subtitle text and a timestamp corresponding to the subtitle text;
the recognition module is used for carrying out content recognition on the caption text through the trained large language model to obtain a content recognition result, wherein the content recognition result comprises at least one highlight caption text;
the determining module is used for determining a time stamp corresponding to the at least one highlight subtitle text according to the subtitle file, and extracting at least one primary video segment from the target video according to the determined time stamp;
the scoring module is used for scoring the at least one primary video segment through the trained aesthetic model to obtain aesthetic scoring values of the primary video segments;
and the screening module is used for screening the primary selected video clips according to the aesthetic grading values of the primary selected video clips to obtain video highlight clips.
The application also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
The application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the above method.
According to the video highlight extraction method, the subtitle file of the target video is obtained, and the subtitle file comprises a subtitle text and a timestamp corresponding to the subtitle text; performing content recognition on the caption text through the trained large language model to obtain a content recognition result, wherein the content recognition result comprises at least one highlight caption text; determining a time stamp corresponding to the at least one highlight subtitle text according to the subtitle file, and extracting at least one primary video segment from the target video according to the determined time stamp; scoring the at least one primary video segment through the trained aesthetic model to obtain aesthetic scoring values of the primary video segments; and screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments. According to the video highlight extraction method, global analysis is carried out on the basis of the video subtitle files through the trained large language model, so that highlight text is positioned, then a primary video segment is extracted based on a timestamp corresponding to the highlight text, finally, the picture characteristics of the primary video segment are analyzed according to the aesthetic model, and therefore the final video highlight is screened out. The video highlight extraction method can accurately and rapidly extract highlight video clips without manual watching.
Drawings
FIG. 1 is a schematic view of an application environment of an embodiment of a video highlight extraction method according to the present application;
FIG. 2 is a flowchart of an embodiment of a video highlight extraction method according to the present application;
FIG. 3 is a detailed schematic diagram of steps for obtaining a content recognition result by performing content recognition on the caption text through a trained large language model according to an embodiment of the present application;
FIG. 4 is a detailed schematic diagram of the step of scoring the at least one primary video segment with a trained aesthetic model to obtain aesthetic scoring values for each primary video segment in an embodiment of the present application;
FIG. 5 is a detailed schematic diagram of the steps for selecting a primary video segment according to the aesthetic score value of each primary video segment to obtain a video highlight according to an embodiment of the present application;
FIG. 6 is a detailed schematic diagram of steps for determining dynamic performance score values of each of the initially selected video clips according to an embodiment of the present application;
FIG. 7 is a detailed schematic diagram of a step of determining a heat score value for each of the first selected video clips according to an embodiment of the present application;
FIG. 8 is a detailed schematic diagram of a step of filtering primary video clips according to aesthetic score values, dynamic performance score values and heat score values of each primary video clip to obtain a video highlight according to an embodiment of the present application;
FIG. 9 is a detailed schematic diagram of steps for obtaining the aesthetic score, the dynamic performance score, and the weight corresponding to the heat score according to an embodiment of the present application;
FIG. 10 is a flowchart of another embodiment of a video highlight extraction method according to the present application;
FIG. 11 is a block diagram illustrating a video highlight extraction apparatus according to an embodiment of the present application;
fig. 12 is a schematic hardware structure of a computer device for performing the video highlight extraction method according to an embodiment of the present application.
Detailed Description
Advantages of the application are further illustrated in the following description, taken in conjunction with the accompanying drawings and detailed description.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
In the description of the present application, it should be understood that the numerical references before the steps do not identify the order in which the steps are performed, but are merely used to facilitate description of the present application and to distinguish between each step, and thus should not be construed as limiting the present application.
The following is a term explanation of the present application:
large language model: the large language model is a natural language processing model constructed by utilizing a deep learning technology, can process a large amount of text data, and has strong language understanding and generating capability. It uses billions of parameters to train, understand semantics, context, and generate natural language responses. Large language models achieve significant achievements in a number of fields, such as natural language understanding, dialog systems, text summaries, etc.
Shot boundary detection algorithm (scene cut): is a technique commonly used in video editing and processing for detecting switching points between different scenes in a video sequence. The method identifies the position of scene change by analyzing the difference between adjacent frames, thereby realizing automatic segmentation of video shots.
Visual aesthetic algorithm: visual aesthetic algorithms are a method based on computer vision and artificial intelligence techniques aimed at evaluating and optimizing the aesthetic appeal of an image by analyzing its visual characteristics such as color, composition, symmetry, etc. The algorithm utilizes machine learning and deep learning technologies, so that a computer can better understand and simulate human perception and preference of aesthetics, and beneficial assistance is provided for the fields of image editing, artistic creation, visual design and the like.
ASR (Automatic Speech Recognition) algorithm: a method that uses artificial intelligence technology to convert a speech signal into text. It consists of an acoustic model, a language model, a pronunciation dictionary, and other components, and recognizes and decodes the speech input into the corresponding text output.
OCR (Optical Character Recognition ) algorithm: is a technology for automatically converting text information in an image into editable text. The method utilizes computer vision and pattern recognition technology to convert scanned documents, pictures or handwritten characters into digital texts through the steps of image preprocessing, character segmentation, feature extraction, classification and the like.
An exemplary application environment for the present application is provided below. Fig. 1 schematically illustrates an application environment of a video highlight extraction method according to an embodiment of the present application.
In an exemplary embodiment, the system of the application environment may include a terminal device 10 and a server 20, wherein the terminal device 10 is connected with the server 20 through a wireless or wired network. The terminal device 10 is deployed with a client, which may be a web client, an APP client, or the like. The terminal device 10 may be a mobile terminal, a fixed terminal, or the like. The server 20 may be a rack server, a blade server, a tower server, or a cabinet server (a stand-alone server, or a server cluster composed of multiple servers), etc. The network may include various network devices such as routers, switches, multiplexers, hubs, modems, bridges, repeaters, firewalls, and/or proxy devices, etc. The network may also include physical links such as coaxial cable links, twisted-pair cable links, fiber-optic links, combinations thereof, and/or the like.
In the related art, a person generally watches the video and manually selects and clips the highlight segments according to preset standards or requirements. Alternatively, unique features or fingerprints of the video are extracted through video fingerprinting technology, and corresponding segments are then quickly searched for and matched, as video highlight segments, from the secondary videos uploaded by a large number of users.
However, obtaining video highlights manually is time-consuming and laborious. It is only suitable for small-scale extraction tasks; its efficiency is low in large-scale video processing and cannot meet the requirement for high efficiency. The effectiveness of the video fingerprinting technique depends on the quality of feature extraction; if the extracted features are not sufficiently accurate or rich, matching may be inaccurate.
Based on the problems, the method and the device acquire the subtitle file of the target video, wherein the subtitle file comprises a subtitle text and a timestamp corresponding to the subtitle text; performing content recognition on the caption text through the trained large language model to obtain a content recognition result, wherein the content recognition result comprises at least one highlight caption text; determining a time stamp corresponding to the at least one highlight subtitle text according to the subtitle file, and extracting at least one primary video segment from the target video according to the determined time stamp; scoring the at least one primary video segment through the trained aesthetic model to obtain aesthetic scoring values of the primary video segments; and screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments. According to the video highlight extraction method, global analysis is carried out on the basis of the video subtitle files through the trained large language model, so that highlight text is positioned, then a primary video segment is extracted based on a timestamp corresponding to the highlight text, finally, the picture characteristics of the primary video segment are analyzed according to the aesthetic model, and therefore the final video highlight is screened out. The video highlight extraction method can accurately and rapidly extract highlight video clips without manual watching.
In the following, several embodiments will be provided in the above exemplary application environment to illustrate the video highlight extraction scheme in the present application. Fig. 2 is a flowchart illustrating a video highlight extraction method according to an embodiment of the application. The flow diagrams in the method embodiments are not intended to limit the order in which the steps are performed. As can be seen from the figure, the video highlight extraction method provided in the present embodiment includes:
step S20, acquiring a subtitle file of a target video, wherein the subtitle file comprises a subtitle text and a timestamp corresponding to the subtitle text.
Specifically, subtitles exist in different forms in the target video, and the subtitle file is acquired in correspondingly different manners. When the subtitles exist in the target video in the form of external subtitles (soft subtitles), the subtitle file may be acquired directly from the video file corresponding to the target video. When no subtitles exist in the target video, the audio in the target video can be converted into subtitle text through automatic speech recognition technology; for example, an ASR algorithm is used to convert the audio in the target video into the corresponding subtitle text. After the subtitle text is obtained, the timestamp of each subtitle text is determined according to the timestamps of the audio. When the subtitles exist in the target video in the form of hard subtitles, the hard subtitles in the target video can be converted into the subtitle file through text character recognition technology; for example, the hard subtitles in the target video are recognized through an OCR algorithm to obtain the subtitle text. After the subtitle text is obtained, the timestamp of each subtitle text is determined according to the timestamp of each frame in the target video.
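Whichever route is taken, the result is the same structure: subtitle text paired with timestamps. The following is a minimal illustrative sketch, assuming the soft-subtitle case where an SRT-style file is available (the ASR and OCR branches would produce the same structure); it is not part of the claimed method.

```python
import re
from dataclasses import dataclass
from typing import List

@dataclass
class SubtitleEntry:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # subtitle text for this time span

_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")

def _to_seconds(stamp: str) -> float:
    h, m, s, ms = map(int, _TIME.search(stamp).groups())
    return h * 3600 + m * 60 + s + ms / 1000.0

def parse_srt(srt_text: str) -> List[SubtitleEntry]:
    """Parse SRT-style subtitle text into entries of (timestamp, subtitle text)."""
    entries = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start_str, end_str = [p.strip() for p in lines[1].split("-->")]
        entries.append(SubtitleEntry(
            start=_to_seconds(start_str),
            end=_to_seconds(end_str),
            text=" ".join(lines[2:]).strip(),
        ))
    return entries
```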
And S21, carrying out content recognition on the caption text through the trained large language model to obtain a content recognition result, wherein the content recognition result comprises at least one highlight caption text.
Specifically, the large language model may be ChatGPT from OpenAI, Baidu's Wenxin Yiyan (ERNIE Bot), and the like.
The large language model has strong context-awareness capability and can quickly capture key information in the subtitle text. In the process of performing content recognition on the subtitle text, the large language model analyzes the vocabulary, recognizes the parts related to the theme, and rapidly locates the highlights in the subtitle text, so as to obtain at least one highlight subtitle text.
In an exemplary embodiment, in order to improve accuracy, referring to fig. 3, the performing content recognition on the subtitle text through the trained large language model, to obtain a content recognition result includes:
step S30, inputting a prompt word into the large language model, wherein the prompt word comprises character information, the caption text, output guidance and output examples.
Specifically, the prompt word (prompt) can help the large language model to understand the user intention more quickly, so that the content recognition result output by the large language model is more accurate.
The character information is used to specify which character's subtitle text needs to be output.
The output guidance is used to determine in what form the large language model outputs the content recognition result.
The output example serves as a reference example for the large language model when it outputs the content recognition result.
And S31, carrying out content recognition on the caption text through the large language model to obtain a content recognition result.
Specifically, after the large language model receives the prompt word, content recognition is performed on the subtitle text according to the requirement in the prompt word, and a content recognition result is obtained.
According to the application, the subtitle text is input into the large language model in a prompt word (prompt) mode, so that the accuracy of the large language model in recognition of the subtitle text is improved.
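An illustrative sketch of assembling such a prompt follows; the wording of the character information, output guidance and output example is hypothetical and not the application's actual prompt, and `call_large_language_model` is a placeholder for whatever model interface is used.

```python
from typing import List

def build_highlight_prompt(character_info: str, subtitle_lines: List[str]) -> str:
    """Compose a prompt containing character information, the subtitle text,
    output guidance and an output example."""
    parts = [
        # character information: which character's subtitles to focus on
        f"You are analyzing video subtitles. Focus on lines involving {character_info}.",
        # the subtitle text itself
        "Subtitles:",
        *subtitle_lines,
        # output guidance: the form of the recognition result
        "Return only the subtitle lines that correspond to highlight (exciting or key) "
        "moments, one per line, copied verbatim.",
        # output example: a reference for the expected form
        'Example output:\n"We finally reached the summit!"',
    ]
    return "\n".join(parts)

# Hypothetical usage:
# result = call_large_language_model(build_highlight_prompt("the main host", lines))
# highlight_lines = [l for l in result.splitlines() if l.strip()]
```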
And S22, determining a time stamp corresponding to the at least one highlight subtitle text according to the subtitle file, and extracting at least one primary video segment from the target video according to the determined time stamp.
Specifically, since the subtitle file includes the time stamp corresponding to each subtitle text, after the highlight subtitle text is obtained, the obtained highlight subtitle text and the subtitle text in the subtitle file can be subjected to matching processing, and after the matched subtitle text is found, the time stamp corresponding to the found subtitle text is used as the time stamp corresponding to the highlight subtitle text.
In an embodiment, matching processing of the highlight subtitle text and the subtitle text in the subtitle file can be achieved through a regular expression mode.
After determining the corresponding time stamps of each highlight text, the corresponding primary video segments can be extracted from the target video according to the time stamps.
It will be appreciated that the number of extracted primary video segments is the same as the number of highlight subtitle texts.
As an example, if the determined time stamp is 0-1 min, 3-5 min, and 20-22 min, then video clips of 0-1 min, 3-5 min, and 20-22 min may be extracted from the target video as the initial video clips.
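The lookup itself can be sketched as follows, assuming each subtitle entry is a (start_seconds, end_seconds, text) tuple as in the parsing sketch above; matching is done here by simple substring containment, while the embodiment above also mentions regular expressions.

```python
from typing import List, Tuple

def locate_primary_segments(entries: List[Tuple[float, float, str]],
                            highlight_lines: List[str]) -> List[Tuple[float, float]]:
    """Map each highlight subtitle text back to the (start, end) timestamp of the
    matching subtitle entry; those timestamps bound the primary video segments."""
    segments = []
    for line in highlight_lines:
        for start, end, text in entries:
            if line.strip() and line.strip() in text:
                segments.append((start, end))
                break  # one matched entry per highlight line
    return segments

# Each (start, end) pair can then be cut from the target video, for example with ffmpeg:
#   ffmpeg -ss <start> -to <end> -i target.mp4 -c copy segment.mp4
```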
And S23, scoring the at least one primary video segment through the trained aesthetic model to obtain the aesthetic scoring value of each primary video segment.
In particular, the aesthetic model is a model that scores the aesthetics of an image using visual aesthetic algorithms.
Aesthetic evaluation of video frames plays an important role in measuring the visual appeal and highlight level of a video; therefore, by aesthetically scoring each primary video segment, primary video segments with lower aesthetic scores can be filtered out.
In an exemplary embodiment, referring to fig. 4, scoring the at least one preliminary video segment with the trained aesthetic model to obtain the aesthetic scoring value of each preliminary video segment includes:
and S40, respectively performing frame extraction processing on each primary selected video segment to obtain at least one key frame.
Specifically, before each primary selected video segment is scored by using the aesthetic model, frame extraction processing is performed on each primary selected video segment, so that key frames after frame extraction processing are only required to be scored later, and all video frames in each primary selected video segment are not required to be scored, thereby reducing data processing amount and improving scoring efficiency.
In an embodiment, when frame extraction is performed, frame extraction may be performed according to a preset frame extraction frequency, that is, an equidistant frame extraction manner is used to extract frames from the initial video segment. For example, the interval is 2 frames, and when frame extraction is performed, the 1 st frame, the 3 rd frame, the 5 th frame, the 7 th frame, and the like are extracted from each initial video segment as key frames.
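A sketch of such equidistant frame extraction with OpenCV (assumed available) follows; the interval of 2 is only the example value given above.

```python
import cv2  # OpenCV, assumed to be available

def extract_key_frames(segment_path: str, interval: int = 2) -> list:
    """Take every `interval`-th frame of a primary video segment as a key frame."""
    cap = cv2.VideoCapture(segment_path)
    key_frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % interval == 0:  # frames 1, 3, 5, 7, ... in 1-based counting
            key_frames.append(frame)
        index += 1
    cap.release()
    return key_frames
```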
And S41, scoring the key frames corresponding to the primary selected video clips through the trained aesthetic model to obtain aesthetic scoring values of the key frames of the primary selected video clips.
Specifically, after the frame extraction process is completed on each primary selected video segment, the key frames corresponding to each primary selected video segment are input into the aesthetic model, so that the aesthetic model can score each key frame and output the aesthetic scoring value of each key frame.
In one particular scenario, the aesthetic score value may be set between 0 and 10. It will be appreciated that in other scenarios, the aesthetic score value may be set between 0 and 100, and is not limited in this embodiment.
Step S42, the aesthetic grading value of each primary video segment is determined according to the aesthetic grading value of the key frame of each primary video segment.
Specifically, after the aesthetic score values of all the key frames of each primary video segment are obtained, the average of the aesthetic score values of all the key frames of a primary video segment may be taken as the aesthetic score value of that primary video segment.
As an example, the initial video clip 1 has 3 key frames, and the aesthetic score values of the 3 key frames are a, b, and c, respectively, then the aesthetic score value of the initial video clip 1 is: (a+b+c)/3.
In another embodiment, the median of the aesthetic score values of all key frames of each of the primary video segments may also be taken as the aesthetic score value of that primary video segment.
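The aggregation step can be sketched as follows; the key-frame scores would come from the trained aesthetic model, which is not reproduced here.

```python
from statistics import mean, median

def segment_aesthetic_score(key_frame_scores: list, use_median: bool = False) -> float:
    """Aggregate the per-key-frame aesthetic scores into one score for the segment."""
    return median(key_frame_scores) if use_median else mean(key_frame_scores)

# Example from the text: key frame scores a, b, c give segment score (a + b + c) / 3.
print(segment_aesthetic_score([6.0, 7.5, 8.1]))  # 7.2
```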
And S24, screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments.
Specifically, after the aesthetic score value of each primary video segment is obtained, the primary video segments can be screened according to the aesthetic score value, so that the video segments with better aesthetic effects can be screened out as video highlight segments.
In one embodiment, the primary video segments whose aesthetic score values rank in the top N may be used as the video highlight segments.
In another embodiment, the primary video segments having an aesthetic score value greater than a preset aesthetic score threshold may also be used as the video highlight segments.
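Both screening rules can be sketched as follows; N and the threshold are illustrative values.

```python
from typing import Dict, List

def screen_top_n(scores: Dict[str, float], n: int = 3) -> List[str]:
    """Keep the primary segments whose aesthetic score ranks in the top N."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def screen_by_threshold(scores: Dict[str, float], threshold: float = 7.0) -> List[str]:
    """Keep every primary segment whose aesthetic score exceeds the threshold."""
    return [name for name, score in scores.items() if score > threshold]
```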
According to the method, the subtitle file of the target video is obtained, and the subtitle file comprises a subtitle text and a timestamp corresponding to the subtitle text; performing content recognition on the caption text through the trained large language model to obtain a content recognition result, wherein the content recognition result comprises at least one highlight caption text; determining a time stamp corresponding to the at least one highlight subtitle text according to the subtitle file, and extracting at least one primary video segment from the target video according to the determined time stamp; scoring the at least one primary video segment through the trained aesthetic model to obtain aesthetic scoring values of the primary video segments; and screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments. According to the video highlight extraction method, global analysis is carried out on the basis of the video subtitle files through the trained large language model, so that highlight text is positioned, then a primary video segment is extracted based on a timestamp corresponding to the highlight text, finally, the picture characteristics of the primary video segment are analyzed according to the aesthetic model, and therefore the final video highlight is screened out. The video highlight extraction method can accurately and rapidly extract highlight video clips without manual watching.
In an exemplary embodiment, referring to fig. 5, the screening the primary video segments according to the aesthetic score values of the respective primary video segments, to obtain the video highlight segments includes:
in step S50, it is detected whether the aesthetic score value lower than the preset score value exists in the aesthetic score values of the key frames of the respective initially selected video clips.
In particular, in practical application scenarios, cluttered or blurry frames tend to impair the overall visual experience, so video segments containing such frames are not suitable for inclusion in the category of highlights. Therefore, to effectively prevent such frames from being included, after the aesthetic score value of each key frame is obtained, it is determined whether any key frame's aesthetic score value is lower than a preset score value.
The preset score value may be set and modified according to actual situations, for example, the preset score value is 2 points.
In step S51, if there is an aesthetic score value lower than the preset score value in the aesthetic score values of the key frames of the current primary video clip, the current primary video clip is deleted.
Specifically, if an aesthetic score value lower than the preset score value exists among the aesthetic score values of the key frames of the current primary video segment, this indicates that the current primary video segment contains frames of poor visual quality. Such frames are not suitable for inclusion in the category of highlights, so the current primary video segment is deleted, thereby eliminating video segments with poor visual quality.
Step S52, screening the primary video segments according to the aesthetic score values of the key frames of other primary video segments to obtain video highlight segments, wherein the other primary video segments are the primary video segments with no aesthetic score value lower than the preset score value in the aesthetic score values of the key frames of the primary video segments.
Specifically, after eliminating the video segments with poor visual quality, further screening is performed on other initially selected video segments, so that the final video highlight segments are screened out.
In this embodiment, when the primary video segments are screened according to the aesthetic score values of the key frames of the other primary video segments, the aesthetic score value of each of these primary video segments is likewise determined from the aesthetic score values of all of its key frames. After the aesthetic score values of all the other primary video segments have been determined, the primary video segments are screened according to the determined aesthetic score values, so as to obtain the final video highlight segments.
It should be noted that, how to determine the aesthetic score value of each primary video segment according to the aesthetic score value of the key frame of each primary video segment, and how to screen the primary video segment according to the determined aesthetic score value, so as to obtain the final video highlight segment is described in detail in the above embodiments, which are not repeated in this embodiment.
In this embodiment, the video clips including the key frames with significantly lower scores are eliminated, so that the aesthetic effect of the video highlight clips obtained by final screening is better.
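A sketch of this elimination step, assuming each primary segment is represented by the list of its key-frame aesthetic scores and using the example preset score value of 2 mentioned above:

```python
from typing import Dict, List

def drop_low_quality_segments(segments: Dict[str, List[float]],
                              preset_score: float = 2.0) -> Dict[str, List[float]]:
    """Remove any primary segment containing a key frame that scores below the preset value."""
    return {name: scores for name, scores in segments.items()
            if all(score >= preset_score for score in scores)}

# The surviving segments are then screened by their aggregated aesthetic scores as before.
remaining = drop_low_quality_segments({"clip_a": [6.5, 1.2, 7.0], "clip_b": [5.8, 6.1]})
print(remaining)  # {'clip_b': [5.8, 6.1]} -- clip_a is dropped because 1.2 < 2.0
```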
In an exemplary embodiment, referring to fig. 6, the method further includes:
step S60, detecting the scene switching frequency of the at least one primary selected video segment to obtain the scene switching frequency of each primary selected video segment.
Specifically, the scene switching frequency of each primary video segment can be detected by a shot boundary detection algorithm, so as to obtain the scene switching frequency of each primary video segment.
The scene switching frequency refers to the number of times that each primary video clip has scene switching in unit time.
It should be noted that, when scene-switch detection is performed, the shot boundary detection algorithm determines whether a scene cut occurs based on whether the feature difference between consecutive frames is greater than a preset threshold, that is, when the difference between the features of the current frame and the features of the previous frame is greater than the preset threshold, a scene cut is deemed to have occurred. After scene-switch detection is completed on all the video frames of a primary video segment, the number of scene switches N can be counted. Finally, the scene switching frequency of the current primary video segment can be determined from the counted number of scene switches N and the duration T of the current primary video segment, that is, scene switching frequency = N/T.
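A simplified sketch of this counting scheme follows; the per-frame feature used here is a normalized grayscale histogram and the difference measure is a plain absolute difference, which is only one of many possible shot-boundary features and not necessarily the one used in practice.

```python
import cv2
import numpy as np

def scene_switch_frequency(segment_path: str, diff_threshold: float = 0.5) -> float:
    """Count scene cuts where the inter-frame feature difference exceeds a preset
    threshold, then divide by the segment duration T to get N / T (switches per second)."""
    cap = cv2.VideoCapture(segment_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back to 25 fps if metadata is missing
    prev_hist, cuts, frames = None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames += 1
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None and float(np.abs(hist - prev_hist).sum()) > diff_threshold:
            cuts += 1
        prev_hist = hist
    cap.release()
    duration = frames / fps if fps else 0.0
    return cuts / duration if duration else 0.0
```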
Step S61, determining the dynamic performance scoring value of each primary selected video segment according to the scene switching frequency of each primary selected video segment.
Specifically, a mapping table of each scene switching frequency range and the dynamic performance score value may be pre-established, so that after the scene switching frequency of the initially selected video segment is obtained, the dynamic performance score value of the current initially selected video segment may be determined by querying the mapping table.
It should be noted that the higher the dynamic performance score value, the more dynamic and lively the video segment is.
In an embodiment, the product of the scene switching frequency and the preset constant may be directly used as the dynamic performance score value of the initially selected video segment, for example, if the preset constant is b and the scene switching frequency is f, the dynamic performance score value of the initially selected video segment=b×f.
In another embodiment, the dynamic performance score value of each primary selected video segment may also be calculated according to the scene switching frequency corresponding to each primary selected video segment and other dynamic performance scoring algorithms.
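Both variants, the mapping table over frequency ranges and the product with a preset constant b, can be sketched as follows; the constant and the table boundaries are illustrative.

```python
def dynamic_score_linear(switch_frequency: float, b: float = 10.0) -> float:
    """Dynamic performance score = b * f, with b a preset constant (illustrative value)."""
    return b * switch_frequency

def dynamic_score_lookup(switch_frequency: float) -> float:
    """Map scene-switch frequency ranges to dynamic performance scores via a preset table."""
    table = [(0.05, 2.0), (0.2, 5.0), (0.5, 8.0)]  # (upper bound of range, score), illustrative
    for upper, score in table:
        if switch_frequency < upper:
            return score
    return 10.0  # highest score for very frequent scene switching
```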
Correspondingly, the step of screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments comprises the following steps: and screening the primary selected video segments according to the aesthetic score value and the dynamic performance score value of each primary selected video segment to obtain video highlight segments.
Specifically, after the aesthetic score value and the dynamic performance score value of each primary video segment are obtained, the primary video segments can be screened by combining the aesthetic score value and the dynamic performance score value of each primary video segment, so that the video segments with excellent aesthetic effects and full vitality and dynamic sense are screened out from the primary video segments to serve as video highlight segments.
In one embodiment, the primary video segments whose sum of aesthetic score value and dynamic performance score value ranks in the top N may be used as the video highlight segments.
In another embodiment, the first selected video segment with the sum value of the aesthetic score value and the dynamic performance score value being greater than the preset score threshold value may also be used as the highlight video segment.
In another embodiment, the aesthetic score value and the weight value of the dynamic performance score value may be obtained first, and then the total score value of each primary video segment may be calculated according to the aesthetic score value, the dynamic performance score value, the weight value corresponding to the aesthetic score value, and the weight value corresponding to the dynamic performance score value of each primary video segment. After the total score value of each primary video segment is calculated, the primary video segment with the total score value being ranked in the first N bits can be used as a highlight video segment, or the primary video segment with the total score value being greater than the preset score threshold value can be used as a highlight video segment.
In this embodiment, because the viewer is usually more interested in the picture full of vitality and dynamic, the dynamic performance of the picture directly affects the attraction and the viewing rate of the video, and thus the initially selected video clips are screened by combining the aesthetic score value and the dynamic performance score value of each initially selected video clip in this embodiment, so that the finally obtained video highlight clip is more accurate.
In an exemplary embodiment, referring to fig. 7, the method further includes:
step S70, obtaining the number of barrages corresponding to each primary selected video clip.
Specifically, since each bullet screen includes time stamp information, the number of bullet screens sent by the user in these time periods can be obtained from the bullet screen system according to the time stamp corresponding to each primary selected video clip.
As an example, if the timestamp corresponding to the primary video segment A is 5-8 minutes, the number of all bullet screens sent by users within those 5-8 minutes can be obtained from the bullet screen system.
Step S71, determining the heat scoring value of each primary selected video segment according to the bullet screen quantity corresponding to each primary selected video segment.
Specifically, a mapping table between ranges of the number of bullet screens and heat score values can be preset, so that after the number of bullet screens is obtained, the heat score value of the current primary video segment can be obtained by querying the mapping table.
In an embodiment, the product of the number of bullet screens and a preset constant may be used as the heat score value of the primary video segment; for example, if the preset constant is c and the number of bullet screens is n, the heat score value of the primary video segment = c×n.
In another embodiment, the heat score value of each primary selected video segment may also be calculated according to the number of barrages corresponding to each primary selected video segment and other heat scoring algorithms.
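The heat score can be sketched analogously: the bullet screens whose timestamps fall inside a segment's time range are counted, and the count is converted to a score either through a mapping table or by multiplying by a preset constant c (illustrative values throughout).

```python
from typing import List

def count_bullet_screens(comment_timestamps: List[float], start: float, end: float) -> int:
    """Count bullet screens whose timestamps fall within the primary segment's time range."""
    return sum(start <= t <= end for t in comment_timestamps)

def heat_score(bullet_screen_count: int, c: float = 0.1) -> float:
    """Heat score = c * n, with c a preset constant (illustrative value)."""
    return c * bullet_screen_count
```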
Correspondingly, the step of screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments comprises the following steps: and screening the primary selected video segments according to the aesthetic score value and the heat score value of each primary selected video segment to obtain video highlight segments.
Specifically, after the aesthetic score value and the heat score value of each primary video segment are obtained, the primary video segments can be screened by combining the aesthetic score value and the heat score value of each primary video segment, so that the video segments with excellent aesthetic effects and high heat can be screened out as video highlight segments.
In one embodiment, the primary video segments whose sum of aesthetic score value and heat score value ranks in the top N may be used as the video highlight segments.
In another embodiment, the first selected video segment with the sum value of the aesthetic score value and the heat score value larger than the preset score threshold value can be used as the highlight video segment.
In another embodiment, the aesthetic score value and the weight value of the heat score value may be obtained first, and then the total score value of each primary video segment may be calculated according to the aesthetic score value, the heat score value, the weight value corresponding to the aesthetic score value, and the weight value corresponding to the heat score value of each primary video segment. After the total score value of each primary video segment is calculated, the primary video segment with the total score value being ranked in the first N bits can be used as a highlight video segment, or the primary video segment with the total score value being greater than the preset score threshold value can be used as a highlight video segment.
In this embodiment, because viewers tend to send bullet screens when watching the more exciting parts of a video, the primary video segments are screened by combining the aesthetic score value and the heat score value of each primary video segment, so that the finally obtained video highlight segments are more accurate.
In an exemplary embodiment, when determining the dynamic performance score value and the heat score value of each of the first selected video segments, the selecting the first selected video segments according to the aesthetic score values of each of the first selected video segments, to obtain the video highlight segments includes: and screening the primary selected video segments according to the aesthetic score value, the dynamic performance score value and the heat score value of each primary selected video segment to obtain video highlight segments.
Specifically, after the aesthetic score value, the dynamic performance score value and the heat score value of each primary video segment are obtained, the primary video segment can be screened by combining the aesthetic score value, the dynamic performance score value and the heat score value of each primary video segment, so that the video segment with better aesthetic effect, higher heat and full of dynamic and dynamic pictures can be screened as the video highlight segment.
In one embodiment, the primary video segments whose sum of aesthetic score value, dynamic performance score value and heat score value ranks in the top N may be used as the video highlight segments.
In another embodiment, the first selected video segment with the sum of the aesthetic score value, the dynamic performance score value and the heat score value being greater than the preset score threshold value can be used as the highlight video segment.
In another exemplary embodiment, referring to fig. 8, the filtering the primary selected video segments according to the aesthetic score value, the dynamic performance score value and the heat score value of each primary selected video segment, to obtain the video highlight segment includes:
and S80, acquiring the aesthetic grading value, the dynamic performance grading value and the weight value corresponding to the heat grading value.
Specifically, an aesthetic score value, the dynamic performance score value, and a weight value corresponding to the heat score value may be preset. The aesthetic grading value, the dynamic performance grading value and the weight value corresponding to the heat grading value can be set and modified according to actual conditions. For example, the aesthetic score value corresponds to a weight value of d1, the dynamic performance score value corresponds to a weight value of d2, and the heat score value corresponds to a weight value of d3.
Step S81, calculating the total score value of each primary selected video segment according to the aesthetic score value, the dynamic performance score value, the heat score value, the weight value corresponding to the aesthetic score value, the weight value corresponding to the dynamic performance score value and the weight value corresponding to the heat score value of each primary selected video segment.
And S82, selecting the initially selected video segments with total score values meeting preset conditions as video highlight segments.
Specifically, the preset condition is a preset condition for determining whether the initially selected video clip can be used as a video highlight clip, and can be set according to actual situations. For example, the preset condition is that the first selected video segments with the total score value ranked in the first N bits are selected as the wonderful video segments, or the first selected video segments with the total score value larger than the preset score threshold value are selected as the wonderful video segments.
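The weighted combination and the two forms of the preset condition can be sketched as follows; the weight values d1, d2, d3, N and the threshold are illustrative.

```python
from typing import Dict, List, Optional

def total_score(aesthetic: float, dynamic: float, heat: float,
                d1: float = 0.5, d2: float = 0.3, d3: float = 0.2) -> float:
    """Total score = d1*aesthetic + d2*dynamic + d3*heat (illustrative weights)."""
    return d1 * aesthetic + d2 * dynamic + d3 * heat

def pick_highlights(totals: Dict[str, float], top_n: int = 3,
                    threshold: Optional[float] = None) -> List[str]:
    """Select the primary segments whose total score meets the preset condition."""
    if threshold is not None:
        return [name for name, score in totals.items() if score > threshold]
    return sorted(totals, key=totals.get, reverse=True)[:top_n]
```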
In this embodiment, by giving different weight values to the aesthetic score value, the dynamic performance score value and the heat score value, the calculated total score value is more accurate, and the screened video highlight fragment meets the requirements of the user.
In an exemplary embodiment, referring to fig. 9, the obtaining the aesthetic score value, the dynamic performance score value, and the weight value corresponding to the heat score value includes:
step S90, obtaining the type information of the target video.
Specifically, the type information is used to determine the type to which the target video belongs, and the type information may include a documentary, a television show, a movie, and the like.
And S91, determining the aesthetic grading value, the dynamic performance grading value and the weight value corresponding to the heat grading value according to the type information.
Specifically, aesthetic score values, dynamic performance score values and weight values corresponding to heat score values corresponding to different types of videos are preset, so that after the type information of the target video is obtained later, the aesthetic score values, the dynamic performance score values and the weight values corresponding to the heat score values can be determined according to the type information.
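A sketch of such a preset per-type weight table follows; the video types and weight values are illustrative and not taken from the application.

```python
from typing import Tuple

# Illustrative per-type weights (d1: aesthetic, d2: dynamic performance, d3: heat).
WEIGHTS_BY_TYPE = {
    "documentary": (0.6, 0.1, 0.3),  # low weight on scene-switch dynamics
    "tv_series":   (0.4, 0.3, 0.3),
    "movie":       (0.4, 0.4, 0.2),
}

def weights_for(video_type: str) -> Tuple[float, float, float]:
    """Return (d1, d2, d3) for the target video's type, with a neutral fallback."""
    return WEIGHTS_BY_TYPE.get(video_type, (1 / 3, 1 / 3, 1 / 3))
```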
In this embodiment, it is considered that in some types of video, dynamic battle scenes are especially able to capture the attention of the audience. However, in other types of video, such as documentaries, the overall scene-cut frequency is low; in this case, if the video segments were screened with reference to the dynamic performance score value, a satisfactory video highlight might not be obtained. Therefore, to solve the above problem, the present application sets different weight values for the aesthetic score value, the dynamic performance score value, and the heat score value for different types of videos, so that accurate highlight video segments can be screened out regardless of the type of video.
In an exemplary embodiment, referring to fig. 10, the method further includes:
and step S100, inputting the caption text corresponding to the video highlight into the large language model.
And step S101, generating a content description corresponding to the video highlight through the large language model.
Specifically, in this embodiment, after obtaining the highlight video clip, the subtitle text corresponding to the video highlight clip may be input into the large language model, so that the content description of the video highlight clip is generated by using the large language model.
Wherein the content description is a content summary of the description of the video highlight.
The generated content description can be adjusted by adjusting the prompt word (prompt) entered into the large language model.
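A minimal sketch of the description step, again with a hypothetical prompt and a placeholder for the model call:

```python
from typing import List

def build_description_prompt(highlight_subtitles: List[str]) -> str:
    """Ask the large language model for a content summary of the highlight segment,
    based only on its subtitle text."""
    return ("Summarize the following video highlight in two or three sentences, "
            "based only on its subtitles:\n" + "\n".join(highlight_subtitles))

# Hypothetical usage:
# description = call_large_language_model(build_description_prompt(clip_subtitles))
```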
Referring to fig. 11, a block diagram of a video highlight extracting apparatus 110 according to an embodiment of the application is shown.
In this embodiment, the video highlight extraction apparatus 110 includes a series of computer program instructions stored on a memory, which when executed by a processor, implement the video highlight extraction functions of the embodiments of the present application. In some embodiments, based on the particular operations implemented by the portions of the computer program instructions, the video highlight extraction apparatus 110 may be divided into one or more modules, which may be specifically divided as follows:
an obtaining module 111, configured to obtain a subtitle file of a target video, where the subtitle file includes a subtitle text and a timestamp corresponding to the subtitle text;
the recognition module 112 is configured to perform content recognition on the caption text through the trained large language model, so as to obtain a content recognition result, where the content recognition result includes at least one highlight caption text;
A determining module 113, configured to determine a timestamp corresponding to the at least one highlighted subtitle text according to the subtitle file, and extract at least one primary video segment from the target video according to the determined timestamp;
the scoring module 114 is configured to score the at least one primary video segment through a trained aesthetic model, so as to obtain an aesthetic score value of each primary video segment;
and the screening module 115 is used for screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments.
In an exemplary embodiment, the video highlight extraction apparatus 110 further includes a detection module and a dynamic performance score value confirmation module.
The detection module is used for detecting the scene switching frequency of the at least one primary selected video segment to obtain the scene switching frequency of each primary selected video segment.
And the dynamic performance score value confirmation module is used for determining the dynamic performance score value of each primary selected video segment according to the scene switching frequency of each primary selected video segment.
The screening module 115 is further configured to screen the primary selected video segments according to the aesthetic score value and the dynamic performance score value of each primary selected video segment, so as to obtain video highlight segments.
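The application does not fix a particular scene-change detection algorithm for the detection module; as a non-limiting sketch, the scene switching frequency could be estimated by comparing colour histograms of consecutive frames with OpenCV and mapping the resulting cut rate to a dynamic performance score value:

```python
import cv2

def scene_switch_frequency(video_path: str, threshold: float = 0.5) -> float:
    """Return estimated scene cuts per second, based on histogram differences."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    prev_hist, cuts, frames = None, 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames += 1
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # Low correlation between consecutive histograms suggests a cut.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                cuts += 1
        prev_hist = hist
    cap.release()
    duration = frames / fps
    return cuts / duration if duration else 0.0

def dynamic_performance_score(frequency: float, max_freq: float = 1.0) -> float:
    """Map cuts-per-second to a 0..1 score; max_freq is an assumed normaliser."""
    return min(frequency / max_freq, 1.0)
```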
In an exemplary embodiment, the video highlight extraction apparatus 110 further includes a barrage acquisition module and a heat score value confirmation module.
The barrage acquisition module is used for acquiring the number of barrages corresponding to each primary selected video segment.
The heat score value confirmation module is used for determining the heat score value of each primary selected video segment according to the number of barrages corresponding to each primary selected video segment.
The screening module 115 is further configured to screen the primary selected video segments according to the aesthetic score value and the heat score value of each primary selected video segment, so as to obtain video highlight segments.
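As a simple, assumed normalisation (the application does not prescribe one), the heat score value of each primary selected video segment could be obtained by scaling its bullet screen count by the largest count among the candidates:

```python
def heat_scores(bullet_counts):
    """Normalise per-segment bullet screen (barrage) counts into 0..1 heat score values."""
    peak = max(bullet_counts) if bullet_counts else 0
    if peak == 0:
        return [0.0 for _ in bullet_counts]
    return [count / peak for count in bullet_counts]

# Example: three primary selected segments with 120, 45 and 300 bullet screens.
# heat_scores([120, 45, 300]) -> [0.4, 0.15, 1.0]
```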
In an exemplary embodiment, the screening module 115 is further configured to screen the primary selected video segments according to the aesthetic score value, the dynamic performance score value and the heat score value of each primary selected video segment, so as to obtain a video highlight segment.
In an exemplary embodiment, the screening module 115 is further configured to obtain the weight values corresponding to the aesthetic score value, the dynamic performance score value and the heat score value; calculate the total score value of each primary selected video segment according to its aesthetic score value, dynamic performance score value, heat score value and the corresponding weight values; and select the primary selected video segments whose total score values meet a preset condition as video highlight segments.
In an exemplary embodiment, the screening module 115 is further configured to obtain type information of the target video, and determine the weight values corresponding to the aesthetic score value, the dynamic performance score value and the heat score value according to the type information.
In an exemplary embodiment, the recognition module 112 is further configured to input a prompt word into the large language model, where the prompt word includes character information, the subtitle text, an output guideline, and an output example; and carrying out content recognition on the caption text through the large language model to obtain a content recognition result.
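The exact prompt wording is not specified here; the following sketch only illustrates how the four parts named above (character information, the subtitle text, output guidance and an output example) might be assembled into a single prompt word:

```python
def build_highlight_prompt(character_info: str, subtitle_text: str) -> str:
    # All wording below is an illustrative assumption, not a prompt fixed by this application.
    output_guidance = (
        "Identify the most exciting plot points and return, for each one, "
        "the subtitle lines that belong to it."
    )
    output_example = '[{"highlight": "...", "subtitle_lines": ["...", "..."]}]'
    return (
        f"Main characters: {character_info}\n"
        f"Subtitles:\n{subtitle_text}\n"
        f"Task: {output_guidance}\n"
        f"Output format example: {output_example}\n"
    )
```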
In an exemplary embodiment, the scoring module 114 is further configured to perform frame extraction processing on each primary selected video segment to obtain at least one key frame; score the key frames corresponding to each primary selected video segment through the trained aesthetic model to obtain the aesthetic score values of the key frames of each primary selected video segment; and determine the aesthetic score value of each primary selected video segment based on the aesthetic score values of its key frames.
In an exemplary embodiment, the screening module 115 is further configured to detect whether any aesthetic score value of the key frames of each primary selected video segment is lower than a preset score value; if an aesthetic score value lower than the preset score value exists among the aesthetic score values of the key frames of a current primary selected video segment, delete the current primary selected video segment; and screen the primary selected video segments according to the aesthetic score values of the key frames of the other primary selected video segments to obtain video highlight segments, where the other primary selected video segments are the primary selected video segments whose key frame aesthetic score values contain no value lower than the preset score value.
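A minimal sketch of this key-frame based scoring and screening is given below; `aesthetic_model` stands in for whatever trained aesthetic model is deployed, and the uniform sampling interval and preset score value are assumed figures:

```python
import cv2

def sample_key_frames(video_path: str, every_n_frames: int = 30):
    """Uniformly sample frames as key frames (an assumed, simple sampling strategy)."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n_frames == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

def segment_aesthetic_score(video_path: str, aesthetic_model, preset_score: float = 0.3):
    """Return the segment's aesthetic score value, or None if any key frame falls below the preset."""
    scores = [aesthetic_model(frame) for frame in sample_key_frames(video_path)]
    if not scores or min(scores) < preset_score:
        return None                       # segment is deleted from the candidate set
    return sum(scores) / len(scores)      # e.g. average the key-frame score values
```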
In an exemplary embodiment, the video highlight extraction apparatus 110 further includes an input module and a generation module.
The input module is used for inputting the caption text corresponding to the video highlight into the large language model.
And the generation module is used for generating the content description corresponding to the video highlight through the large language model.
According to the video highlight extraction scheme, a subtitle file of a target video is obtained, wherein the subtitle file comprises a subtitle text and a timestamp corresponding to the subtitle text; content recognition is performed on the subtitle text through the trained large language model to obtain a content recognition result, wherein the content recognition result comprises at least one highlight subtitle text; a timestamp corresponding to the at least one highlight subtitle text is determined according to the subtitle file, and at least one primary selected video segment is extracted from the target video according to the determined timestamp; the at least one primary selected video segment is scored through the trained aesthetic model to obtain the aesthetic score value of each primary selected video segment; and the primary selected video segments are screened according to their aesthetic score values to obtain video highlight segments. In this method, the trained large language model performs a global analysis of the video subtitle file to locate the highlight subtitle text, the primary selected video segments are then extracted based on the timestamps corresponding to the highlight subtitle text, and finally the picture characteristics of the primary selected video segments are analysed by the aesthetic model, so that the final video highlight segments are screened out. The method can thus extract highlight video segments accurately and rapidly without manual viewing.
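To make the first steps of the scheme concrete, the following sketch parses an SRT-style subtitle file into timestamped entries and cuts a primary selected video segment out of the target video; the SRT layout and the ffmpeg invocation are assumptions for illustration rather than requirements of the scheme:

```python
import re
import subprocess

SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),\d{3} --> (\d{2}:\d{2}:\d{2}),\d{3}")

def parse_srt(path: str):
    """Yield (start, end, text) tuples from a simple SRT subtitle file."""
    with open(path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    for block in blocks:
        lines = block.splitlines()
        if len(lines) >= 3:
            m = SRT_TIME.match(lines[1])
            if m:
                yield m.group(1), m.group(2), " ".join(lines[2:])

def cut_segment(video_path: str, start: str, end: str, out_path: str):
    """Cut one primary selected video segment with ffmpeg (stream copy, no re-encode)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ss", start, "-to", end, "-c", "copy", out_path],
        check=True,
    )
```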
Fig. 12 schematically shows a hardware architecture diagram of a computer device 12 adapted to implement the video highlight extraction method according to an embodiment of the application. In the present embodiment, the computer device 12 is a device capable of automatically performing numerical calculation and/or information processing in accordance with preset or stored instructions. As shown in fig. 12, the computer device 12 includes, at least but not limited to, a memory 120, a processor 121 and a network interface 122, which may be communicatively linked to each other by a system bus. Wherein:
The memory 120 includes at least one type of computer-readable storage medium, which may be volatile or non-volatile; specifically, readable storage media include flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 120 may be an internal storage module of the computer device 12, such as a hard disk or memory of the computer device 12. In other embodiments, the memory 120 may also be an external storage device of the computer device 12, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card provided on the computer device 12. Of course, the memory 120 may also include both internal storage modules of the computer device 12 and external storage devices. In this embodiment, the memory 120 is typically used to store the operating system installed on the computer device 12 and various types of application software, such as the program code of the video highlight extraction method. In addition, the memory 120 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 121 may be a central processing unit (Central Processing Unit, simply CPU), controller, microcontroller, microprocessor, or other video highlight extraction chip in some embodiments. The processor 121 is typically used to control the overall operation of the computer device 12, such as performing control and processing related to data interaction or communication with the computer device 12, and the like. In this embodiment, the processor 121 is configured to execute program codes or process data stored in the memory 120.
The network interface 122 may include a wireless network interface or a wired network interface, the network interface 122 typically being used to establish a communication link between the computer device 12 and other computer devices. For example, the network interface 122 is used to connect the computer device 12 to an external terminal through a network, establish a data transmission channel and a communication link between the computer device 12 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, abbreviated as GSM), wideband code division multiple access (Wideband Code Division Multiple Access, abbreviated as WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, etc.
It should be noted that fig. 12 only shows a computer device having components 120-122, but it should be understood that not all of the illustrated components are required to be implemented, and that more or fewer components may be implemented instead.
In this embodiment, the video highlight extraction method stored in the memory 120 may be divided into one or more program modules and executed by one or more processors (the processor 121 in this embodiment) to complete the present application.
The embodiment of the application provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the video highlight extraction method in the embodiments.
In this embodiment, the computer-readable storage medium includes a flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the computer readable storage medium may be an internal storage unit of a computer device, such as a hard disk or a memory of the computer device. In other embodiments, the computer readable storage medium may also be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card), etc. that are provided on the computer device. Of course, the computer-readable storage medium may also include both internal storage units of a computer device and external storage devices. In this embodiment, the computer readable storage medium is typically used to store an operating system installed on a computer device and various types of application software, such as program codes of the video highlight extraction method in the embodiment, and the like. Furthermore, the computer-readable storage medium may also be used to temporarily store various types of data that have been output or are to be output.
The apparatus embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over at least two network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the application. Those of ordinary skill in the art can understand and implement the present application without undue burden.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general-purpose hardware platform, or may be implemented by hardware. Those skilled in the art will appreciate that all or part of the processes implementing the methods of the above embodiments may be completed by a computer program instructing relevant hardware, where the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
Claims (13)
1. A method of video highlight extraction, the method comprising:
acquiring a subtitle file of a target video, wherein the subtitle file comprises a subtitle text and a timestamp corresponding to the subtitle text;
performing content recognition on the caption text through the trained large language model to obtain a content recognition result, wherein the content recognition result comprises at least one highlight caption text;
determining a time stamp corresponding to the at least one highlight subtitle text according to the subtitle file, and extracting at least one primary video segment from the target video according to the determined time stamp;
scoring the at least one primary video segment through the trained aesthetic model to obtain aesthetic scoring values of the primary video segments;
And screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments.
2. The video highlight extraction method according to claim 1, characterized in that the method further comprises:
detecting the scene switching frequency of the at least one primary selected video segment to obtain the scene switching frequency of each primary selected video segment;
determining the dynamic performance scoring value of each primary selected video segment according to the scene switching frequency of each primary selected video segment;
the step of screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments comprises the following steps:
and screening the primary selected video segments according to the aesthetic score value and the dynamic performance score value of each primary selected video segment to obtain video highlight segments.
3. The video highlight extraction method according to claim 1, characterized in that the method further comprises:
acquiring the number of barrages corresponding to each primary video segment;
determining the heat scoring value of each primary selected video segment according to the bullet screen quantity corresponding to each primary selected video segment;
the step of screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments comprises the following steps:
And screening the primary selected video segments according to the aesthetic score value and the heat score value of each primary selected video segment to obtain video highlight segments.
4. The video highlight extraction method according to claim 2, characterized in that the method further comprises:
acquiring the number of barrages corresponding to each primary video segment;
determining the heat scoring value of each primary selected video segment according to the bullet screen quantity corresponding to each primary selected video segment;
the step of screening the primary selected video segments according to the aesthetic grading values of the primary selected video segments to obtain video highlight segments comprises the following steps:
and screening the primary selected video segments according to the aesthetic score value, the dynamic performance score value and the heat score value of each primary selected video segment to obtain video highlight segments.
5. The video highlight extraction method according to claim 4, wherein the screening the primary selected video segments according to the aesthetic score value, the dynamic performance score value and the heat score value of each primary selected video segment to obtain video highlight segments comprises:
acquiring the weight values corresponding to the aesthetic score value, the dynamic performance score value and the heat score value;
calculating the total score value of each primary selected video segment according to the aesthetic score value, the dynamic performance score value, the heat score value, the weight value corresponding to the aesthetic score value, the weight value corresponding to the dynamic performance score value and the weight value corresponding to the heat score value of each primary selected video segment;
And selecting the initially selected video clips with total score values meeting preset conditions as video highlight clips.
6. The method of claim 5, wherein the obtaining the weight values corresponding to the aesthetic score value, the dynamic performance score value and the heat score value comprises:
acquiring type information of the target video;
and determining the weight values corresponding to the aesthetic score value, the dynamic performance score value and the heat score value according to the type information.
7. The method for extracting a video highlight according to any one of claims 1 to 6, wherein the performing content recognition on the subtitle text through the trained large language model to obtain a content recognition result includes:
inputting a prompt word into the large language model, wherein the prompt word comprises character information, the subtitle text, output guidance and output examples;
and carrying out content recognition on the caption text through the large language model to obtain a content recognition result.
8. The method of claim 1, wherein the scoring the at least one primary selected video segment through the trained aesthetic model to obtain the aesthetic score value of each primary selected video segment comprises:
Respectively performing frame extraction processing on each primary selected video segment to obtain at least one key frame;
scoring key frames corresponding to each primary video segment through the trained aesthetic model to obtain aesthetic scoring values of the key frames of each primary video segment;
the aesthetic score value of each primary video segment is determined based on the aesthetic score values of the key frames of each primary video segment.
9. The method of claim 8, wherein the screening the primary video segments according to the aesthetic score values of the respective primary video segments to obtain the video highlight segments comprises:
detecting whether an aesthetic score value lower than a preset score value exists among the aesthetic score values of the key frames of each primary selected video segment;
if the aesthetic score value lower than the preset score value exists in the aesthetic score values of the key frames of the current primary video clips, deleting the current primary video clips;
and screening the primary selected video segments according to the aesthetic score values of the key frames of the other primary selected video segments to obtain video highlight segments, wherein the other primary selected video segments are the primary selected video segments whose key frame aesthetic score values contain no aesthetic score value lower than the preset score value.
10. The video highlight extraction method according to claim 1, characterized in that the method further comprises:
inputting caption text corresponding to the video highlight into the large language model;
and generating the content description corresponding to the video highlight through the large language model.
11. A video highlight extraction apparatus, characterized in that the video highlight extraction apparatus comprises:
the acquisition module is used for acquiring a subtitle file of a target video, wherein the subtitle file comprises a subtitle text and a timestamp corresponding to the subtitle text;
the recognition module is used for carrying out content recognition on the caption text through the trained large language model to obtain a content recognition result, wherein the content recognition result comprises at least one highlight caption text;
the determining module is used for determining a time stamp corresponding to the at least one highlight subtitle text according to the subtitle file, and extracting at least one primary video segment from the target video according to the determined time stamp;
the scoring module is used for scoring the at least one primary video segment through the trained aesthetic model to obtain aesthetic scoring values of the primary video segments;
And the screening module is used for screening the primary selected video clips according to the aesthetic grading values of the primary selected video clips to obtain video highlight clips.
12. A computer device, characterized in that it comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 10 when executing the computer program.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311186901.XA CN117237843A (en) | 2023-09-13 | 2023-09-13 | Video highlight extraction method and device, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311186901.XA CN117237843A (en) | 2023-09-13 | 2023-09-13 | Video highlight extraction method and device, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117237843A true CN117237843A (en) | 2023-12-15 |
Family
ID=89094155
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311186901.XA Pending CN117237843A (en) | 2023-09-13 | 2023-09-13 | Video highlight extraction method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117237843A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118381969A (en) * | 2024-06-26 | 2024-07-23 | 浙江核新同花顺网络信息股份有限公司 | Virtual person audio and video editing method, device, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102082816B1 (en) | Method for improving the resolution of streaming files | |
CN110020437B (en) | Emotion analysis and visualization method combining video and barrage | |
US11308993B2 (en) | Short video synthesis method and apparatus, and device and storage medium | |
CN107169430B (en) | Reading environment sound effect enhancement system and method based on image processing semantic analysis | |
CN110008378B (en) | Corpus collection method, device, equipment and storage medium based on artificial intelligence | |
CN109583443B (en) | Video content judgment method based on character recognition | |
CN112183334B (en) | Video depth relation analysis method based on multi-mode feature fusion | |
CN106162223A (en) | A kind of news video cutting method and device | |
CN112860943A (en) | Teaching video auditing method, device, equipment and medium | |
CN110019880A (en) | Video clipping method and device | |
CN109508406A (en) | A kind of information processing method, device and computer readable storage medium | |
CN113850162A (en) | Video auditing method and device and electronic equipment | |
KR101082073B1 (en) | Apparatus and method for summarizing video contents using video scraps | |
CN112686165A (en) | Method and device for identifying target object in video, electronic equipment and storage medium | |
CN117237843A (en) | Video highlight extraction method and device, computer equipment and storage medium | |
CN112002328A (en) | Subtitle generating method and device, computer storage medium and electronic equipment | |
CN116361510A (en) | Method and device for automatically extracting and retrieving scenario segment video established by utilizing film and television works and scenario | |
CN111144360A (en) | Multimode information identification method and device, storage medium and electronic equipment | |
CN113705300A (en) | Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium | |
CN114363695B (en) | Video processing method, device, computer equipment and storage medium | |
CN113361462B (en) | Method and device for video processing and caption detection model | |
CN109213974B (en) | Electronic document conversion method and device | |
CN117750158A (en) | Video cover selection method, device, computer equipment and storage medium | |
CN113591491B (en) | Speech translation text correction system, method, device and equipment | |
CN117789723B (en) | Video content processing method and system based on artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||