
CN110837579A - Video classification method, device, computer and readable storage medium - Google Patents

Video classification method, device, computer and readable storage medium Download PDF

Info

Publication number
CN110837579A
Authority
CN
China
Prior art keywords
video
frame image
image
key
target video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911071940.9A
Other languages
Chinese (zh)
Other versions
CN110837579B (en)
Inventor
王瑞琛
王晓利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911071940.9A priority Critical patent/CN110837579B/en
Publication of CN110837579A publication Critical patent/CN110837579A/en
Priority to PCT/CN2020/114389 priority patent/WO2021088510A1/en
Application granted granted Critical
Publication of CN110837579B publication Critical patent/CN110837579B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/732 Query formulation
    • G06F16/7328 Query by example, e.g. a complete video frame or video sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiment of the application discloses a video classification method, which comprises the following steps: acquiring a key frame image from a target video; inputting the key frame image into an image search engine to obtain description information of the key frame image, and determining a key phrase of the key frame image according to the description information; acquiring a text content feature corresponding to the key phrase; and determining the video type label of the target video according to the text content feature. With the method and device, the text content feature of the target video can be determined from the description information that an image search engine returns for key frame images selected among the frame images forming the target video, so that the video type label of the target video is obtained and video classification efficiency is improved.

Description

Video classification method, device, computer and readable storage medium
Technical Field
The present application relates to the field of computing technologies, and in particular, to a video classification method, apparatus, computer, and readable storage medium.
Background
As video types and the number of available videos grow, the videos people can watch and the video playing applications they use become increasingly diverse, and the types of videos different people enjoy vary. Searching for a desired video among a large number of videos therefore consumes a great deal of time and may even cause viewers to lose interest in watching.
Disclosure of Invention
The embodiment of the application provides a video classification method and device, which can improve the efficiency of video classification.
A first aspect of an embodiment of the present application provides a video classification method, including:
acquiring a key frame image from a target video;
inputting the key frame image into an image search engine to obtain description information of the key frame image, and determining a key phrase of the key frame image according to the description information;
acquiring text content characteristics corresponding to the key phrases;
and determining the video type label of the target video according to the text content characteristics.
Wherein the method further comprises:
acquiring video content characteristics corresponding to the target video according to the content of each frame of image in the target video;
determining the video type tag of the target video according to the text content features comprises:
splicing the text content characteristics and the video content characteristics to obtain first fusion characteristics;
and inputting the first fusion characteristics into a classification model to obtain a video type label of the target video.
The obtaining of the video content characteristics corresponding to the target video according to the content of each frame of image in the target video includes:
acquiring at least one image pair in the target video, wherein each image pair comprises two adjacent frames of images in the target video;
acquiring an optical flow graph between two frames of images in the at least one image pair, and forming the optical flow graph corresponding to the at least one image pair into an optical flow graph sequence of the target video;
and inputting the frame image sequence of the target video and the optical flow graph sequence into a video classification model to obtain the video content characteristics corresponding to the target video, wherein the frame image sequence is obtained by sequentially arranging all frame images forming the target video.
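The optical-flow branch can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example assuming OpenCV (cv2) is available; the Farneback dense optical flow and the commented-out two-stream classifier call are stand-ins for whatever flow estimator and video classification model an implementation actually uses, not the model defined by this application.

```python
import cv2

def build_flow_sequence(frame_images):
    """Compute a dense optical-flow map for each adjacent frame pair.

    frame_images: list of BGR frames (numpy arrays) in playback order.
    Returns a list of len(frame_images) - 1 flow maps of shape (H, W, 2).
    """
    flow_sequence = []
    prev_gray = cv2.cvtColor(frame_images[0], cv2.COLOR_BGR2GRAY)
    for frame in frame_images[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Farneback dense optical flow between two adjacent frames of the image pair.
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flow_sequence.append(flow)
        prev_gray = gray
    return flow_sequence

# Hypothetical usage: both sequences would feed a two-stream video classifier.
# frames = split_video_into_frames(target_video)                      # assumed helper
# flows = build_flow_sequence(frames)
# video_content_feature = video_classification_model(frames, flows)   # assumed model
```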
Wherein the method further comprises:
acquiring audio information of the target video, and inputting the audio information into a voice classification model to obtain voice content characteristics corresponding to the audio information;
the determining the video type label of the target video according to the text content features comprises:
splicing the text content features and the voice content features to obtain second fusion features;
and inputting the second fusion characteristics into a classification model to obtain a video type label of the target video.
Wherein the method further comprises:
identifying image characters in the key frame image, and acquiring subtitle information corresponding to the key frame image;
the determining a key phrase of the key frame image according to the description information includes:
and determining a key phrase of the key frame image according to the description information, the image characters and the subtitle information.
Wherein, the determining the key phrase of the key frame image according to the description information, the image text and the caption information includes:
adding the phrases in the description information, the phrases in the image characters and the phrases in the caption information into a phrase set;
determining an evaluation value corresponding to each phrase in the phrase set according to the occurrence frequency and the type weight of each phrase in the phrase set; the type weights comprise a weight corresponding to the description information, a weight corresponding to the image characters, and a weight corresponding to the subtitle information;
and sequencing each phrase in the phrase set according to the evaluation value, and determining a key phrase of the key frame image from the phrase set according to a sequencing result.
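As a rough illustration of this weighting scheme, the following sketch scores each phrase as its occurrence count multiplied by a per-source type weight and ranks the result; the weight values and the top-N cutoff are illustrative assumptions, not values prescribed by the application.

```python
from collections import Counter

def rank_key_phrases(desc_phrases, image_text_phrases, subtitle_phrases,
                     weights=(1.0, 1.5, 1.2), top_n=5):
    """Score each phrase as occurrence count * type weight and return the top-N.

    weights: (description weight, image-character weight, subtitle weight) -- illustrative.
    """
    sources = [
        (Counter(desc_phrases), weights[0]),
        (Counter(image_text_phrases), weights[1]),
        (Counter(subtitle_phrases), weights[2]),
    ]
    scores = {}
    for counts, weight in sources:
        for phrase, count in counts.items():
            # Evaluation value: occurrence count multiplied by the type weight.
            scores[phrase] = scores.get(phrase, 0.0) + count * weight
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [phrase for phrase, _ in ranked[:top_n]]
```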
The acquiring of the key frame image from the target video includes:
acquiring a plurality of frame images forming the target video, inputting the frame images into a feature extraction layer in a key frame determination model, and obtaining the image features of each frame image;
inputting the image features of each frame image into a key value determination layer in the key frame determination model, wherein the key value of each frame image is determined based on an attention mechanism in the key value determination layer;
and determining the key frame image in the target video according to the key value of each frame image.
Wherein the determining, in the key value determination layer, the key value of each frame image based on an attention mechanism includes:
determining the correlation degree between the image characteristics of the ith frame image and the image characteristics of a comparison image in the plurality of frame images based on the attention mechanism in the key value determination layer, and obtaining the key value of the ith frame image according to the correlation degree between the image characteristics of the ith frame image and the image characteristics of the comparison image; the comparison image is a frame image except the ith frame image in the plurality of frame images forming the target video, i is a positive integer and is not more than the number of the plurality of frame images;
and when the ith frame image is the last frame image in the plurality of frame images, obtaining a key value of each frame image.
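A simplified way to read this step: each frame's key value summarizes how strongly its image feature correlates with the features of every other frame. The NumPy sketch below uses plain cosine similarity as the correlation measure and one minus the mean similarity as the key value (so less-redundant frames score higher); both choices are assumptions made for illustration, not the exact formulation of the key value determination layer's attention mechanism.

```python
import numpy as np

def frame_key_values(frame_features):
    """frame_features: (M, D) array, one feature vector per frame image."""
    # Normalize so that dot products behave like cosine similarities.
    feats = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    sim = feats @ feats.T                      # (M, M) pairwise correlation degrees
    np.fill_diagonal(sim, 0.0)                 # ignore each frame's self-similarity
    mean_sim = sim.sum(axis=1) / (len(feats) - 1)
    # A frame weakly correlated with the others carries more unique content,
    # so give it a larger key value (illustrative scoring rule).
    return 1.0 - mean_sim
```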
Wherein, the determining the key phrase of the key frame image according to the description information includes:
and counting the number of occurrences of each phrase contained in the description information, and determining the phrases whose occurrence count in the description information is greater than a count threshold as the key phrases of the key frame image.
The acquiring of the text content features corresponding to the keyword group includes:
inputting the key phrase into a text classification model, and extracting initial text characteristics corresponding to the key phrase;
matching the initial text features with a plurality of to-be-matched type features in the text classification model to obtain matching values;
and determining the type feature to be matched with the maximum matching value as the text content feature corresponding to the key phrase.
Wherein, the splicing the text content feature and the voice content feature to obtain a second fusion feature comprises:
adding a default characteristic value at a first designated position in the text content characteristics to obtain the text content characteristics with a first designated length;
adding the default characteristic value at a second designated position in the voice content characteristics to obtain voice content characteristics with a second designated length;
splicing the text content features with the first specified length and the voice content features with the second specified length to obtain second fusion features;
inputting the second fusion feature into a classification model to obtain a video type label of the target video, including:
and inputting the second fusion characteristics into the classification model, and obtaining the video type label of the target video based on a classification weight matrix in the classification model.
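The padding-and-splicing step can be sketched as follows. The zero default feature value, the "append at the end" padding positions, and the softmax over the classification weight matrix are illustrative assumptions; the application only requires that both features be brought to fixed specified lengths before being spliced and classified.

```python
import numpy as np

def pad_to_length(feature, length, pad_value=0.0):
    """Pad (or truncate) a 1-D feature vector to the specified length."""
    feature = np.asarray(feature, dtype=np.float32)[:length]
    return np.pad(feature, (0, length - feature.shape[0]),
                  constant_values=pad_value)

def classify_fused(text_feature, speech_feature, weight_matrix,
                   text_len=256, speech_len=128):
    """Splice padded text and speech features and score video type labels.

    weight_matrix: (num_labels, text_len + speech_len) classification weight matrix.
    """
    fused = np.concatenate([pad_to_length(text_feature, text_len),
                            pad_to_length(speech_feature, speech_len)])
    logits = weight_matrix @ fused
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(np.argmax(probs)), probs        # index of the predicted video type label
```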
Wherein the method further comprises:
adding the target video to a video classification corresponding to the video type label based on the video type label of the target video; or,
and pushing the target video to a target terminal, wherein the target terminal is a terminal marked with the video type label.
A second aspect of the embodiments of the present application provides a video classification apparatus, including:
the first acquisition module is used for acquiring a key frame image from a target video;
the first determining module is used for inputting the key frame image into an image search engine to obtain the description information of the key frame image and determining the key phrase of the key frame image according to the description information;
the second acquisition module is used for acquiring text content characteristics corresponding to the key phrases;
and the second determining module is used for determining the video type label of the target video according to the text content characteristics.
Wherein the apparatus further comprises:
the third acquisition module is used for acquiring video content characteristics corresponding to the target video according to the content of each frame of image in the target video;
the second determining module includes:
the splicing unit is used for splicing the text content characteristics and the video content characteristics to obtain first fusion characteristics;
and the first training unit is used for inputting the first fusion characteristics into a classification model to obtain a video type label of the target video.
Wherein, the third obtaining module includes:
the first acquisition unit is used for acquiring at least one image pair in the target video, wherein each image pair comprises two adjacent frames of images in the target video;
a second acquisition unit, configured to acquire an optical flow graph between two frames of images in the at least one image pair, and compose the optical flow graph corresponding to the at least one image pair into an optical flow graph sequence of the target video;
and the second training unit is used for inputting the frame image sequence of the target video and the optical flow graph sequence into a video classification model to obtain the video content characteristics corresponding to the target video, wherein the frame image sequence is obtained by sequentially arranging all the frame images forming the target video.
Wherein the apparatus further comprises:
the fourth acquisition module is used for acquiring the audio information of the target video and inputting the audio information into a voice classification model to obtain the voice content characteristics corresponding to the audio information;
the second determining module includes:
the splicing unit is further configured to splice the text content features and the voice content features to obtain second fusion features;
the first training unit is further configured to input the second fusion feature into a classification model to obtain a video type tag of the target video.
Wherein the apparatus further comprises:
a fifth obtaining module, configured to identify image characters in the key frame image, and obtain subtitle information corresponding to the key frame image;
in the aspect of determining the key phrase of the key frame image according to the description information, the first determining module is specifically configured to:
and determining a key phrase of the key frame image according to the description information, the image characters and the subtitle information.
Wherein the first determining module comprises:
an adding unit, configured to add a phrase in the description information, a phrase in the image text, and a phrase in the subtitle information to a phrase set;
the first determining unit is used for determining an evaluation value corresponding to each phrase in the phrase set according to the occurrence frequency and the type weight of each phrase in the phrase set; the type weights comprise a weight corresponding to the description information, a weight corresponding to the image characters, and a weight corresponding to the subtitle information;
and the second determining unit is used for sequencing each phrase in the phrase set according to the evaluation value and determining the key phrase of the key frame image from the phrase set according to a sequencing result.
Wherein, the first obtaining module comprises:
a third obtaining unit, configured to obtain a plurality of frame images that constitute the target video, input the plurality of frame images into a feature extraction layer in a key frame determination model, and obtain an image feature of each frame image;
a third determining unit, configured to input the image feature of each frame image into a key value determination layer in the key frame determination model, where the key value of each frame image is determined based on an attention mechanism;
and the fourth determining unit is used for determining the key frame image in the target video according to the key value of each frame image.
The third determining unit is specifically configured to:
determining the correlation degree between the image characteristics of the ith frame image and the image characteristics of a comparison image in the plurality of frame images based on the attention mechanism in the key value determination layer, and obtaining the key value of the ith frame image according to the correlation degree between the image characteristics of the ith frame image and the image characteristics of the comparison image; the comparison image is a frame image except the ith frame image in the plurality of frame images forming the target video, i is a positive integer and is not more than the number of the plurality of frame images;
and when the ith frame image is the last frame image in the plurality of frame images, obtaining a key value of each frame image.
Wherein, in said determining a key phrase of the key frame image according to the description information, the first determining module comprises:
and the counting unit is used for counting the number of occurrences of each phrase contained in the description information and determining the phrases whose occurrence count in the description information is greater than a count threshold as the key phrases of the key frame image.
Wherein the second obtaining module includes:
the extraction unit is used for inputting the key phrase into a text classification model and extracting initial text characteristics corresponding to the key phrase;
the matching unit is used for matching the initial text features with a plurality of to-be-matched type features in the text classification model to obtain matching values;
and the fifth determining unit is used for determining the type feature to be matched with the maximum matching value as the text content feature corresponding to the keyword group.
In the aspect of obtaining a second fusion feature by splicing the text content feature and the voice content feature, the splicing unit includes:
the first generation subunit is used for adding a default characteristic value at a first specified position in the text content characteristics to obtain the text content characteristics with a first specified length;
the first generating subunit is further configured to add the default feature value to a second specified position in the voice content feature to obtain a voice content feature with a second specified length;
the second generating subunit is configured to splice the text content feature with the first specified length and the voice content feature with the second specified length to obtain the second fusion feature;
the first training unit is specifically configured to:
and inputting the second fusion characteristics into the classification model, and obtaining the video type label of the target video based on a classification weight matrix in the classification model.
Wherein the apparatus further comprises:
the adding module is used for adding the target video to the video classification corresponding to the video type label based on the video type label of the target video; or,
and the sending module is used for pushing the target video to a target terminal, and the target terminal is a terminal marked with the video type label.
A third aspect of the embodiments of the present application provides a computer, including a processor, a memory, and an input/output interface;
the processor is connected to the memory and the input/output interface, respectively, where the input/output interface is used for inputting data and outputting data, the memory is used for storing program codes, and the processor is used for calling the program codes to execute the video classification method according to the first aspect of the embodiment of the present application.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions that, when executed by a processor, perform a video classification method as described in the first aspect of embodiments of the present application.
The embodiment of the application has the following beneficial effects:
according to the method and the device for searching the key frame images, the key frame images are obtained from the target video and input into an image search engine to obtain the description information of the key frame images, the key phrases of the key frame images are determined according to the description information, the text content characteristics corresponding to the key phrases are obtained, and the video type labels of the target video are determined according to the text content characteristics. According to the video classification method and device, the target video is classified based on the text information of the target video, and the situation that manual video classification consumes a large amount of time is avoided by realizing automatic execution of a video classification process, so that the video classification efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a diagram of a video classification architecture provided in an embodiment of the present application;
fig. 2 is a schematic view of a scene for text-based video classification according to an embodiment of the present application;
fig. 3 is a flowchart of a video classification method according to an embodiment of the present application;
fig. 4 is a scene schematic diagram of a keyword group obtaining method provided in the embodiment of the present application;
fig. 5 is a schematic diagram of a text content feature determination scenario provided in an embodiment of the present application;
fig. 6 is a schematic view of a specific flow of video classification according to an embodiment of the present application;
fig. 7a is a schematic view of a key frame image capturing scene according to an embodiment of the present disclosure;
FIG. 7b is a schematic diagram of a correlation matrix according to an embodiment of the present application;
fig. 8 is a diagram of a text content feature acquisition architecture according to an embodiment of the present application;
fig. 9 is a schematic diagram of determining characteristics of video content according to an embodiment of the present application;
fig. 10 is a schematic diagram of a video type tag determination process provided in an embodiment of the present application;
fig. 11 is a schematic view of a video classification apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a computer according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Specifically, referring to fig. 1, fig. 1 is a video classification architecture diagram provided in an embodiment of the present application. As shown in fig. 1, an embodiment of the present application may include a user terminal 101, a classification server 102, a plurality of receiving servers 103, and a plurality of target terminals 104. When the user terminal 101 detects a target video that needs to be classified, it sends the target video to the classification server 102. The classification server 102 receives the target video, splits it into the plurality of frame images that constitute the target video, obtains a key frame image from the plurality of frame images, and uses the key frame image as a representative constituent image of the target video. The classification server 102 then inputs the key frame image into an image search engine to obtain description information of the key frame image, where the description information is composed of a plurality of phrases, determines a key phrase of the key frame image according to the description information, acquires the text content feature corresponding to the key phrase, and determines the video type label of the target video according to the text content feature. Optionally, after determining the video type label of the target video, the classification server 102 may send the target video to a receiving server 103 or a target terminal 104 based on the video type label, so that the receiving server 103 can add the target video to the corresponding category based on the video type label. The target terminal 104 is a terminal marked with the video type label; a user of such a terminal can be assumed to be interested in videos related to that label, and pushing the target video to such terminals after its video type label has been determined improves the intelligence of video management. The user terminal 101 and the target terminal 104 may be electronic devices including, but not limited to, a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palm computer, a Mobile Internet Device (MID), and a wearable device (e.g., a smart watch or a smart band), and the receiving server 103 may be a server corresponding to a video playing application.
In this application, features of the target video are extracted from the text information related to the target video, and the classification of the target video is determined based on those features, which increases the amount of training data available for video classification. Because the related text information of the target video is obtained through an image search engine, the shortage of training data that arises when few text descriptions of the target video can be obtained directly is avoided; with a sufficient amount of training data, the target video is classified based on its related text information, which improves the accuracy and effectiveness of video classification.
Further, referring to fig. 2, fig. 2 is a schematic view of a scene for text-based video classification according to an embodiment of the present application. As shown in fig. 2, a target video 201 is obtained by sequentially playing a plurality of frame images, that is, the plurality of frame images constitute the target video 201. After the target video 201 is obtained, it is split to obtain the plurality of frame images 202 constituting it, each frame image 202 being one picture of the target video 201. A key frame image 203 of the target video 201 is then obtained from the plurality of frame images 202 based on a key frame determination model. The acquired key frame images 203 are sequentially input into an image search engine 204 to obtain description information of the key frame images 203. The description information is composed of a plurality of phrases and includes, for each frame image among the key frame images 203, the phrase descriptions returned by the image search engine 204. A key phrase 205 of the key frame images is determined according to the description information, the text content feature corresponding to the key phrase 205 is obtained, and a video type tag 206 of the target video 201 is determined according to the text content feature. The target video is transmitted to a target server 207 and/or a target terminal 208 based on the video type tag 206, so that the application corresponding to the target server 207 can add the target video to the corresponding category based on the video type tag 206; the target terminal 208 is a terminal marked with any one of the video type tags 206. For example, assume the video type tags 206 of the target video 201 include two tags, a first video type tag and a third video type tag. After the video type tags 206 are obtained, the target video 201 and its video type tags 206 are sent to the target server 207; if the target server 207 is the server of a first video playing application, then after receiving the video type tags 206 the target server 207 adds the target video 201 to the category of the first video playing application corresponding to the first video type tag and to the category corresponding to the third video type tag. Alternatively, after the video type tags 206 are obtained, the terminals marked with any one of the video type tags 206, that is, marked with the first video type tag or the third video type tag, are obtained as the target terminals 208, and the target video 201 and its video type tags 206 are sent to the target terminals 208. If a target terminal 208 is marked with the first video type tag and the second video type tag, that target terminal 208 displays the target video 201 under the first video type tag in its recommendation page.
Further, please refer to fig. 3, wherein fig. 3 is a flowchart of a video classification method according to an embodiment of the present application. As shown in fig. 3, the video classification process includes the following steps:
step S301, a key frame image is acquired from the target video.
Specifically, after the target video is acquired, it is split to obtain the plurality of frame images forming the target video, and a key frame image is acquired from the plurality of frame images. The key frame image may be determined based on significant changes between adjacent images. A key frame image is a frame image in which a key action in the motion or change of a character or object in the video is located; the key frame images of a video contain little redundant information and can represent the key content of the video. Because a video is composed of many frame images and consecutive frames may differ very little in content, processing every frame image would cause unnecessary waste of resources. Therefore, a subset of pictures can be extracted from the video as key frame images, the key frame images serve as representative images of the video, and the result for the video is determined according to the processing result of the key frame images.
Step S302, inputting the key frame image into an image search engine to obtain the description information of the key frame image, and determining the key phrase of the key frame image according to the description information.
Specifically, the key frame image is input into an image search engine for retrieval to obtain the description information of the key frame image returned by the image search engine, and the key phrase of the key frame image is determined according to the description information. The description information includes a plurality of phrases: the key frame image is input into the image search engine for searching, and the search result, composed of related description sentences for the key frame image, is obtained; each description sentence includes a plurality of phrases. The number of occurrences of each phrase in the description information is counted, and the key phrase of the key frame image is determined based on these counts. For example, the phrases whose occurrence count is greater than a count threshold may be obtained and determined as the key phrases of the key frame image. Alternatively, the phrases appearing in the description information may be sorted by occurrence count: when sorted from high to low, the first N phrases of the sorted phrases are taken as the key phrases of the key frame image; when sorted from low to high, the last N phrases are taken as the key phrases, where N is a positive integer and is a preset number of phrases to acquire. When counting the occurrences of each phrase in the description information, the phrases being counted are screened, for example according to their part of speech or according to other sentence analysis methods, to avoid determining phrases without practical meaning as key phrases and thereby reduce the interference of such phrases with video classification. Phrases without practical meaning generally serve to emphasize or supplement the description and have little influence on the main content of the video, yet such modifier phrases usually appear frequently when describing an image, which could otherwise cause them to be determined as key phrases.
An image search engine is a search-engine-based system for retrieving images: it receives an image and, by identifying it, obtains image data and related text descriptions of the image. The key frame image is input into the image search engine to obtain image data of the key frame image and related text descriptions, and these text descriptions are the description information of the key frame image. Optionally, after the key phrases of the key frame image have been determined from the description information, if the number of key phrases appearing in a certain piece of related information is greater than an appearance count threshold, the web page corresponding to that piece of related information may be acquired and key phrases of the key frame image extracted from it, so that complete web page content strongly correlated with the key frame image is used as part of the description information of the key frame image, enriching the features of the key frame image.
If the description information is English, performing lexical analysis on the English sentence based on a sentence pattern of the English sentence to obtain a plurality of phrases, and determining a key phrase from the plurality of phrases; if the description information is Chinese, performing lexical analysis on the Chinese sentence based on the format of the Chinese sentence to obtain a plurality of phrases forming the Chinese sentence, and determining a key phrase of the key frame image from the plurality of phrases; if the description information comprises Chinese description information and English description information, performing lexical analysis on a mixed sentence composed of Chinese and English based on a sentence pattern of the Chinese sentence to obtain a plurality of phrases forming the mixed sentence, and determining a key phrase of a key frame image from the plurality of phrases.
Specifically, refer to fig. 4, where fig. 4 is a scene schematic diagram of a keyword group obtaining method provided in the embodiment of the present application. As shown in fig. 4, taking one key frame image of the target video as an example, when the key frame image 401 is determined, it is input into an image search engine for retrieval to obtain a retrieval display page 402. The retrieval display page 402 displays the related information for the key frame image 401 retrieved by the image search engine. The related information displayed in the retrieval display page 402 is extracted; this related information is the description information 403 of the key frame image 401. The phrases in the description information 403 are counted in sequence to obtain the number of occurrences of each phrase, the description information 403 is screened according to these counts, and the phrases whose occurrence count in the description information 403 is greater than the count threshold are determined as the key phrases recorded in the key phrase statistical information 404 of the key frame image. The key phrase statistical information 404 records each determined key phrase and its number of occurrences in the description information 403. Taking fig. 4 as an example, the description information 403 is statistically sorted to determine a plurality of keyword groups, and the keyword group statistical information 404 is generated from these keyword groups and their occurrence counts; for instance, the keyword group statistical information may include 4 occurrences of the keyword group "related", 8 occurrences of the keyword group "XXX", 5 occurrences of the keyword group "xxxxxxxx", 6 occurrences of the keyword group "XX", and so on.
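The counting described around fig. 4 can be condensed into a short sketch. Tokenization is language dependent; the snippet below assumes whitespace splitting for English sentences and the third-party jieba library for Chinese segmentation, and uses a small illustrative stop-word list in place of the part-of-speech screening mentioned above. All of these choices are assumptions for illustration.

```python
from collections import Counter

import jieba  # third-party Chinese word segmentation library (assumed available)

STOP_WORDS = {"the", "a", "of", "is", "的", "了", "是"}   # illustrative filter only

def count_description_phrases(description_sentences, count_threshold=3):
    """Count phrase occurrences in search-engine description sentences and keep
    those whose count exceeds the threshold as candidate key phrases."""
    counts = Counter()
    for sentence in description_sentences:
        if any("\u4e00" <= ch <= "\u9fff" for ch in sentence):
            tokens = jieba.lcut(sentence)          # Chinese sentence
        else:
            tokens = sentence.lower().split()      # English sentence
        counts.update(t for t in tokens if t.strip() and t not in STOP_WORDS)
    return {phrase: c for phrase, c in counts.items() if c > count_threshold}
```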
Step S303, acquiring text content characteristics corresponding to the key phrases.
Specifically, the text content feature corresponding to the keyword group is acquired as follows: the keyword group is input into a text classification model, and the initial text feature corresponding to the keyword group is extracted; the initial text feature is matched with a plurality of to-be-matched type features in the text classification model to obtain matching values; and the to-be-matched type feature with the largest matching value is determined as the text content feature corresponding to the keyword group. Optionally, a preset number of the to-be-matched type features with larger matching values may also be used as the text content features corresponding to the keyword group. For example, if the preset number is 3, when the matching values corresponding to the multiple to-be-matched type features are obtained, the features are sorted by matching value, and the first 3 features in descending order (equivalently, the last 3 in ascending order) are determined as the text content features corresponding to the keyword group.
Referring to fig. 5 in particular, fig. 5 is a schematic diagram of a text content feature determination scenario provided in an embodiment of the present application. As shown in fig. 5, the keyword groups 501 are converted into keyword group vectors 502; for example, in fig. 5 the keyword groups 501 include keyword group 1, keyword group 2, ..., and keyword group m, which are sequentially converted into the keyword group vectors 502, namely keyword group vector 1, keyword group vector 2, ..., and keyword group vector m. The keyword group vectors 502 are input into a text classification model 503, the initial text feature 5031 corresponding to the keyword group vectors 502 is extracted through the text classification model 503, a plurality of to-be-matched type features in the text classification model 503 are obtained, the initial text feature 5031 is matched with the plurality of to-be-matched type features to obtain a matching value 5032 for each to-be-matched type feature, and, according to the matching value 5032 of each to-be-matched type feature, the to-be-matched type feature with the largest matching value is determined as the text content feature 504 corresponding to the keyword group. The connection between the keyword group vectors 502 and the initial text feature 5031 is the parameter for extracting features from the keyword group vectors; the connection between the initial text feature 5031 and the matching values 5032 is the parameter, including but not limited to a weight matrix, for determining the matching values between the to-be-matched type features and the input content according to the features of the input content; and the connection between the matching values 5032 and the text content feature 504 corresponding to the keyword group is a selection relationship used for sorting by matching value to obtain the to-be-matched type feature with the largest matching value.
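A reduced sketch of the matching step in fig. 5 is given below: the initial text feature is compared against each to-be-matched type feature and the best match is returned. Cosine similarity as the matching value and externally supplied type-feature vectors are assumptions made only for illustration; they are not the parameters of the text classification model described above.

```python
import numpy as np

def match_text_feature(initial_text_feature, type_features):
    """initial_text_feature: (D,) vector; type_features: (num_types, D) matrix.

    Returns the index of the to-be-matched type feature with the largest
    matching value, together with all matching values (cosine similarity here).
    """
    x = initial_text_feature / np.linalg.norm(initial_text_feature)
    t = type_features / np.linalg.norm(type_features, axis=1, keepdims=True)
    matching_values = t @ x
    return int(np.argmax(matching_values)), matching_values
```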
And step S304, determining the video type label of the target video according to the text content characteristics.
Specifically, the video type label of the target video is determined according to the text content features: after the text content features are obtained, they are used as the video type label of the target video.
In the embodiments of the application, the key frame image is acquired from the target video and input into an image search engine to obtain the description information of the key frame image, the key phrase of the key frame image is determined according to the description information, the text content feature corresponding to the key phrase is acquired, and the video type label of the target video is determined according to the text content feature. Because the related text information of the target video is used as a basis for classifying it, the text information of video samples can be added as training samples during training, which increases the amount of trainable data and improves the accuracy of video classification. Moreover, because the text information is obtained through an image search engine, and the text label an image search engine attaches to an image is an explanatory description of that image, the related text information of the target video is a description of its content; determining the video type label of the target video from this related text information further improves the effectiveness and accuracy of video classification.
Further, referring to fig. 6, fig. 6 is a schematic view of a specific flow of video classification according to an embodiment of the present application. As shown in fig. 6, step S601, step S603, and step S604 are three parallel steps, used respectively to obtain the text content feature, the video content feature, and the voice content feature of the target video. The present application imposes no order on their execution: the three steps may be executed asynchronously or synchronously, and the order of synchronous execution is not limited. The video classification method includes the following steps:
step S601, a key frame image is acquired from the target video.
Specifically, the target video is split to obtain the plurality of frame images forming the target video, and the key frame images are obtained from the plurality of frame images. Specifically, the plurality of frame images forming the target video are obtained and input into a feature extraction layer in a key frame determination model to obtain the image feature of each frame image; the image feature of each frame image is input into a key value determination layer in the key frame determination model, and the key value of each frame image is determined based on the attention mechanism in the key value determination layer; and the key frame images in the target video are determined according to the key value of each frame image. Based on the attention mechanism in the key value determination layer, the degree of correlation between the image feature of the ith frame image and the image features of the comparison images among the plurality of frame images is determined, and the key value of the ith frame image is obtained from these degrees of correlation; a comparison image is any frame image other than the ith frame image among the plurality of frame images forming the target video, and i is a positive integer not greater than the number of frame images. When the ith frame image is the last of the plurality of frame images, the key value of every frame image has been obtained, and the key frame images in the target video are determined according to the key value of each frame image.
Optionally, the degree of correlation between the image feature of each frame image and the image features of its comparison images may be buffered in a correlation matrix, and the key value of each frame image determined based on that matrix. Specifically, an empty correlation matrix is created; the correlation matrix is a two-dimensional matrix of size M × M, where M is the number of frame images forming the target video. The degrees of correlation between the first frame image and the second through Mth frame images are acquired and recorded in the first row and first column of the correlation matrix, representing the correlation between the image feature of the first frame image and the image features of the second through Mth frame images, and vice versa. The degrees of correlation between the image feature of the second frame image and the image features of the third through Mth frame images are then obtained and recorded in positions [2][3] to [2][M] of the second row and [3][2] to [M][2] of the second column, and so on, until the degree of correlation at position [M][M] of the correlation matrix is obtained. The key value of each frame image is obtained based on the correlation matrix; for example, the plurality of frame images are grouped based on the correlation matrix, the frame images in each group can be regarded as similar images, and according to the relative positions of each group's frame images in the target video, the frame image with the earliest relative position is determined as the key frame image corresponding to that group, thereby obtaining the key frame images of the target video.
Determining the key frame image through the correlation matrix is one possible key frame determination method; the key frame image may also be determined through other methods, such as a key frame obtaining model or a key frame obtaining application, which is not limited herein.
Specifically, referring to fig. 7a, fig. 7a is a schematic view of a key frame image capturing scene provided in the embodiment of the present application. As shown in fig. 7a, the key frame determination model includes a feature extraction layer 703 and a key value determination layer 704. After the target video 701 is acquired, it is split to obtain the plurality of frame images 702 forming the target video 701. The plurality of frame images 702 are input into the feature extraction layer 703 to obtain the image feature corresponding to each frame image, and the image features are input into the key value determination layer 704, where the image feature of each frame image is compared in turn with the image features of its comparison images to determine the correlations between them, and the key value of each frame image is obtained based on these correlations. The positions 705 of the key frame images in the target video are determined according to the key values obtained in the key value determination layer 704, and the key frame images 706 are acquired from the target video according to those positions 705.
Alternatively, if the key frame images are determined based on the correlation matrix in the key value determination layer 704, assume that the correlation matrix shown in fig. 7b is obtained; the key value of each frame image is then obtained from the correlations between its image feature and the image features of its comparison images. Specifically, based on the correlation matrix shown in fig. 7b, the key value of the frame image corresponding to each row is obtained from the degrees of correlation in that row. If every degree of correlation is smaller than the minimum similarity threshold, the frame image is considered to have little similarity to the other frame images, to be an individual piece of content, and therefore to be a key frame image. If there are degrees of correlation larger than the maximum similarity threshold, the frame images corresponding to those degrees of correlation are placed into one group, and in each resulting group the frame image whose relative position in the target video is earliest is determined as the key frame image of the group. Assume the minimum similarity threshold is 0.3 and the maximum similarity threshold is 0.7. Based on the correlation matrix in fig. 7b, every degree of correlation corresponding to the first frame image is less than 0.3, so the first frame image is determined as a key frame image; every degree of correlation corresponding to the second frame image is also less than 0.3, so the second frame image is determined as a key frame image; the degrees of correlation among the third, fourth, fifth, and sixth frame images are all greater than 0.7, so these frame images are placed into one group; grouping then continues from the seventh frame image until the Mth frame image has been processed. The first frame image, being in a single-frame group, is a key frame image; the second frame image, also in a single-frame group, is a key frame image; the third frame image, the earliest of the multi-frame group (third, fourth, fifth, and sixth frame images), is a key frame image; and so on, until all key frame images of the target video are obtained.
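The grouping walk-through above can be condensed into the sketch below. The cosine-similarity correlation matrix and the 0.3 / 0.7 thresholds mirror the example around fig. 7b, and taking each group's earliest frame as its key frame follows the description; the greedy left-to-right grouping is an illustrative simplification, not the only way to read the embodiment.

```python
import numpy as np

def select_key_frames(frame_features, min_sim=0.3, max_sim=0.7):
    """frame_features: (M, D) matrix of per-frame image features.
    Returns the indices of the frames chosen as key frames."""
    feats = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    corr = feats @ feats.T                     # M x M correlation matrix
    np.fill_diagonal(corr, 0.0)

    m = len(feats)
    assigned = [False] * m
    key_frames = []
    for i in range(m):
        if assigned[i]:
            continue
        row = corr[i]
        if np.all(row < min_sim):
            # Weakly related to every other frame: a standalone key frame.
            key_frames.append(i)
            assigned[i] = True
            continue
        # Group frame i with every unassigned frame it strongly resembles;
        # the earliest frame of the group (i itself) becomes the group's key frame.
        group = [i] + [j for j in range(m) if not assigned[j] and row[j] > max_sim]
        for j in group:
            assigned[j] = True
        key_frames.append(i)
    return key_frames
```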
Optionally, after the degree of correlation between the image feature of each frame image and the image features of the other frame images is obtained, the key value of each frame image may be obtained based on these correlations and a key weight matrix. The key weight matrix may be configured to increase the key value of the currently processed frame image when its correlation with the preceding frame images is small, and to decrease the key value when that correlation is large. After the key value of each frame image is obtained, the frame images are sorted by key value, and the leading frame images in the sorted order are determined as the key frame images of the target video. The leading frame images are those at the front when the frame images are sorted by key value from large to small: either K frame images are selected from the front as key frame images, where K is a key frame count threshold, or a specified proportion of the frame images is selected as key frame images. For example, if the key frame count threshold is 10, the 10 frame images whose key values are larger than those of the other frame images are selected as key frame images; if the specified proportion is 10%, one tenth of the frame images, taken from the front of the descending order, are selected as key frame images.
Step S602, determining a key phrase corresponding to the key frame image, and determining the text content characteristics of the target video according to the key phrase.
Specifically, the key phrase corresponding to the key frame image is determined, and the text content feature of the target video is determined according to the key phrase: the key phrase corresponding to the key frame image is determined and input into a text classification model, the initial text feature corresponding to the key phrase is extracted, and the initial text feature is matched with the plurality of to-be-matched type features in the text classification model to obtain matching values; the to-be-matched type feature with the largest matching value is determined as the text content feature corresponding to the key phrase.
The key phrase may be determined by several methods obtained by combining the description information with the image characters and the subtitle information in different ways. Specifically, refer to fig. 8, which is a text content feature acquisition architecture diagram provided in an embodiment of the present application. As shown in fig. 8, a key phrase determination method is obtained by combining any of the branches leading from the key frame image to the key phrase. The branches are: inputting the key frame image into an image search engine to obtain the description information of the key frame image, and obtaining a key phrase based on the description information; identifying the key frame image to obtain the image characters in the key frame image, and obtaining a key phrase based on the image characters; and extracting the subtitle information in the key frame image, and obtaining a key phrase based on the subtitle information. The methods for determining the key phrase are specifically as follows:
In the first key phrase determination method, the key frame image is input into an image search engine to obtain the description information of the key frame image, and the key phrase of the key frame image is determined according to the description information. Specifically, the occurrence frequency of each phrase contained in the description information is counted, and the phrases whose occurrence frequency is greater than a statistical frequency threshold are determined as key phrases of the key frame image; this process may refer to the specific description in fig. 4 and is not repeated here.
In the second key phrase determination method, the key frame image is input into an image search engine to obtain the description information of the key frame image, the image text in the key frame image is identified, and the key phrase of the key frame image is determined according to the description information and the image text. Specifically, the key frame image is input into an image search engine to obtain the description information, and into an image text extraction tool to identify the image text; the phrases in the description information and the phrases in the image text are added to a phrase set; an evaluation value corresponding to each phrase in the phrase set is determined according to the occurrence frequency of the phrase, the weight corresponding to the description information and the weight corresponding to the image text; the phrases in the phrase set are sorted according to their evaluation values, and the key phrases of the key frame image are determined from the phrase set according to the sorting result. As shown in fig. 8, the description information and the image text are both composed of phrases. When the phrases are added to the phrase set, the occurrence frequency of each phrase is counted, so the final phrase set contains two parts: the phrases from the description information with their occurrence frequencies, and the phrases from the image text with their occurrence frequencies. Each phrase from the description information is weighted by its occurrence frequency and weight 1 of the description information to obtain its evaluation value, and each phrase from the image text is weighted by its occurrence frequency and weight 2 of the image text to obtain its evaluation value. The phrases in the phrase set are then sorted by evaluation value, and the key phrases are determined from the phrase set according to the sorting result. The evaluation value of a phrase may be obtained by multiplying its occurrence frequency in the phrase set by the type weight corresponding to the phrase.
In the third key phrase determination method, the key frame image is input into an image search engine to obtain the description information of the key frame image, the subtitle information corresponding to the key frame image is acquired, and the key phrase of the key frame image is determined according to the description information and the subtitle information. Specifically, the phrases in the description information and the phrases in the subtitle information are added to a phrase set; an evaluation value corresponding to each phrase in the phrase set is determined according to the occurrence frequency of the phrase, the weight corresponding to the description information and the weight corresponding to the subtitle information; the phrases in the phrase set are sorted according to their evaluation values, and the key phrases of the key frame image are determined from the phrase set according to the sorting result. As shown in fig. 8, the description information and the subtitle information are both composed of phrases. When the phrases are added to the phrase set, the occurrence frequency of each phrase is counted, so the final phrase set contains two parts: the phrases from the description information with their occurrence frequencies, and the phrases from the subtitle information with their occurrence frequencies. Each phrase from the description information is weighted by its occurrence frequency and weight 1 of the description information to obtain its evaluation value, and each phrase from the subtitle information is weighted by its occurrence frequency and weight 3 of the subtitle information to obtain its evaluation value. The phrases in the phrase set are then sorted by evaluation value, and the key phrases are determined from the phrase set according to the sorting result. The evaluation value of a phrase may likewise be obtained by multiplying its occurrence frequency in the phrase set by the type weight corresponding to the phrase.
In the fourth key phrase determination method, the key frame image is input into an image search engine to obtain the description information of the key frame image, the image text in the key frame image is identified, the subtitle information corresponding to the key frame image is acquired, and the key phrase of the key frame image is determined according to the description information, the image text and the subtitle information. Specifically, the phrases in the description information, the phrases in the image text and the phrases in the subtitle information are added to a phrase set; an evaluation value corresponding to each phrase in the phrase set is determined according to the occurrence frequency and the type weight of the phrase; the phrases in the phrase set are sorted according to their evaluation values, and the key phrases of the key frame image are determined from the phrase set according to the sorting result. The type weight includes the weight corresponding to the description information, the weight corresponding to the image text and the weight corresponding to the subtitle information. As shown in fig. 8, the evaluation value of each phrase from the description information is obtained based on its occurrence frequency and weight 1 corresponding to the description information, the evaluation value of each phrase from the image text is obtained based on its occurrence frequency and weight 2 corresponding to the image text, and the evaluation value of each phrase from the subtitle information is obtained based on its occurrence frequency and weight 3 corresponding to the subtitle information; the phrases in the phrase set are sorted by evaluation value, and the key phrases of the key frame image are determined from the phrase set according to the sorting result.
When the key phrase of the key frame image of the target video is determined by combining the description information with the image text and/or the subtitle information, different type weights may be assigned to the description information, the image text and the subtitle information according to their importance; for example, the weight of the description information may be set greater than the weight of the image text, which in turn may be set greater than the weight of the subtitle information.
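The phrase evaluation described in the second to fourth methods can be sketched as follows; the concrete weight values, the aggregation of a phrase that appears in several sources, and the number of key phrases kept are illustrative assumptions, not values fixed by this embodiment.

```python
from collections import Counter

def select_key_phrases(description, image_text, subtitles,
                       w_desc=3.0, w_img=2.0, w_sub=1.0, top_n=5):
    # each argument is the list of phrases extracted from that source; the
    # evaluation value is occurrence frequency x type weight, summed per phrase
    scores = Counter()
    for phrases, weight in ((description, w_desc), (image_text, w_img), (subtitles, w_sub)):
        for phrase, count in Counter(phrases).items():
            scores[phrase] += count * weight
    return [phrase for phrase, _ in scores.most_common(top_n)]
```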
After the key phrases of the key frame image are determined, each key phrase is converted into a key phrase vector, and the key phrase vectors are input into a text classification model to obtain the text content features of the target video. The process of determining the text content features may refer to the specific description of the text content feature determination process in fig. 5 and is not repeated here.
Step S603, obtaining video content characteristics corresponding to the target video.
Specifically, a plurality of frame images forming the target video are acquired, and the video content features corresponding to the target video are obtained according to the content of each frame image in the target video. Specifically, at least one image pair in the target video is acquired, each image pair containing two adjacent frame images of the target video; an optical flow graph between the two frame images of each image pair is acquired, and the optical flow graphs corresponding to the image pairs form the optical flow graph sequence of the target video; the frame image sequence and the optical flow graph sequence of the target video are then input into a video classification model to obtain the video content features corresponding to the target video, the frame image sequence being obtained by sequentially arranging the frame images forming the target video. Optical flow is the instantaneous velocity, on the observation imaging plane, of the pixel motion of an object moving in space; an optical flow method uses the change of pixels in the time domain of an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby computes the motion information of objects between adjacent frames.
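One way to build the optical flow graph sequence from adjacent frame pairs is dense optical flow as provided by OpenCV; the Farneback algorithm and the parameter values below are one possible choice, not a requirement of this embodiment.

```python
import cv2
import numpy as np

def optical_flow_sequence(frames):
    # frames: list of BGR frame images in their order within the target video
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        nxt = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)          # (H, W, 2): per-pixel motion (dx, dy)
        prev = nxt
    return np.stack(flows)          # the optical flow graph sequence
```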
Specifically, referring to fig. 9, fig. 9 is a schematic view for determining video content features according to an embodiment of the present application. As shown in fig. 9, the plurality of frame images forming the target video are acquired and sequentially combined into a frame image sequence 901; at least one image pair among the frame images is acquired, an optical flow graph is obtained for each image pair, and the optical flow graphs are sequentially combined into an optical flow graph sequence 902 according to their relative positions in the target video; the frame image sequence 901 and the optical flow graph sequence 902 are then input into a video classification model 903 for learning to obtain the video content features of the target video. Specifically, the frame image sequence 901 is input into the spatial convolution layer of the video classification model 903, and features are extracted from it to obtain the spatial features of the target video; the optical flow graph sequence 902 is input into the temporal convolution layer of the video classification model 903, and features are extracted from it to obtain the temporal features of the target video; the spatial features and the temporal features are spliced, and the spliced features are processed by the video classification model to obtain the video content features 9031. In other words, a video is divided into consecutive frame images, an optical flow graph is computed for every two adjacent frame images, two three-dimensional convolutional neural networks (the spatial convolution layer and the temporal convolution layer) extract features from the frame image sequence 901 and the optical flow graph sequence 902 respectively, the two extracted features are spliced, and finally classification is performed to obtain the video content features of the target video. The spliced features are processed by the classification head (header) structure of the video classification model to obtain the video content features; in a video classification neural network structure, the classification head generally consists of a fully connected layer and a softmax layer. The video classification model uses a three-dimensional convolution model (3D ConvNet) for feature extraction, and the three-dimensional convolution kernels used in the video classification model can be constructed by "inflating" two-dimensional convolutions.
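A minimal PyTorch sketch of this two-branch structure: a spatial 3D-convolution branch over the frame image sequence and a temporal 3D-convolution branch over the optical flow graph sequence, whose outputs are spliced and passed to a classification head. The layer sizes and the pooling-based head are illustrative and much smaller than a practical 3D ConvNet.

```python
import torch
import torch.nn as nn

class TwoStreamSketch(nn.Module):
    def __init__(self, feat_dim=128, num_classes=10):
        super().__init__()
        def branch(in_channels):
            return nn.Sequential(
                nn.Conv3d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.spatial = branch(3)     # frame image sequence: (B, 3, T, H, W)
        self.temporal = branch(2)    # optical flow sequence: (B, 2, T-1, H, W)
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, num_classes),
                                  nn.Softmax(dim=1))   # classification head

    def forward(self, frames, flows):
        # spliced spatial + temporal features serve as the video content feature
        video_feature = torch.cat([self.spatial(frames), self.temporal(flows)], dim=1)
        return video_feature, self.head(video_feature)
```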
Step S604, obtaining the audio information of the target video, and obtaining the voice content feature corresponding to the audio information based on the audio information.
Specifically, the audio information in the target video is acquired, and the voice content features corresponding to the audio information are obtained based on the audio information. Specifically, after the audio information in the target video is acquired, the audio information is input into a speech classification model to obtain the voice content features corresponding to the audio information; alternatively, the audio information is converted into an image, and the converted image is input into a related image classification model for feature extraction. The speech classification model may be an existing speech classification model, such as a Deep Feedforward Sequential Memory Network (DFSMN). An existing speech classification model may be acquired and trained on video audio samples and video category samples to obtain a speech classification model suitable for video classification; specifically, the audio sample features in each video and the category features of the video are acquired, and the speech classification model is trained with the audio sample features and category features of each video, so that the finally obtained speech classification model outputs the corresponding category features with the highest probability when it receives the audio sample.
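For the alternative of converting the audio information into an image before feature extraction, a log-mel spectrogram is one common choice; the sketch below assumes the librosa library and uses illustrative parameter values.

```python
import librosa
import numpy as np

def audio_to_spectrogram(audio_path, sr=16000, n_mels=64):
    # load the audio track of the target video and turn it into a 2-D
    # log-mel spectrogram that an image classification model can consume
    waveform, _ = librosa.load(audio_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```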
And step S605, obtaining the fusion characteristics of the target video.
Specifically, the fusion feature of the target video is obtained according to the text content features obtained in step S602, the video content features obtained in step S603 and the voice content features obtained in step S604. In a first possible fusion feature obtaining method, the text content features and the video content features are spliced to obtain a first fusion feature; in a second possible fusion feature obtaining method, the text content features and the voice content features are spliced to obtain a second fusion feature; in a third possible fusion feature obtaining method, the text content features, the video content features and the voice content features are spliced to obtain a third fusion feature. The following description uses the third possible fusion feature obtaining method as an example, in which the text content features, the video content features and the voice content features are spliced to obtain the third fusion feature.
The text content feature, the video content feature and the voice content feature are all vectors, and the lengths of the three vectors may differ; splicing the three vectors yields a long feature vector whose dimension equals the sum of the dimensions of the three vectors. Since not every video contains audio information, when the target video does not contain audio information, the voice content feature may be represented by a preset vector, which is a fixed vector such as an all-zero vector of fixed length.
Optionally, a default feature value is added at a first specified position in the text content feature to obtain a text content feature with a first specified length; the default feature value is added at a second specified position in the voice content feature to obtain a voice content feature with a second specified length; the default feature value is added at a third specified position in the video content feature to obtain a video content feature with a third specified length; and the text content feature with the first specified length, the voice content feature with the second specified length and the video content feature with the third specified length are spliced to obtain the third fusion feature. In this optional case, since the text content feature length, the voice content feature length and the video content feature length are preset fixed lengths, the features can be padded with the default feature value directly, regardless of whether the target video contains audio information.
For example, suppose a text content feature with a length of 11, a voice content feature with a length of 3 and a video content feature with a length of 9 are obtained, the first specified length is 15, the second specified length is 5, the third specified length is 12, and the default feature value is 0; the dimension of the long feature vector is then 32. Under these assumptions, four default feature values of 0 are appended to the text content feature to obtain a (11+4)-dimensional text content feature, two default feature values of 0 are appended to the voice content feature to obtain a (3+2)-dimensional voice content feature, and three default feature values of 0 are appended to the video content feature to obtain a (9+3)-dimensional video content feature; the (11+4)-dimensional text content feature, the (3+2)-dimensional voice content feature and the (9+3)-dimensional video content feature are spliced to obtain a third fusion feature with a dimension of 32, whose values at dimensions 12 to 15, 19 to 20 and 30 to 32 are 0.
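The padding and splicing in this example can be sketched as follows, using the assumed lengths 15, 5 and 12; appending the default value at the end of each feature matches the example above, though the specified positions could differ.

```python
import numpy as np

def pad_to(vec, length, fill=0.0):
    out = np.full(length, fill, dtype=np.float32)   # default feature values
    out[:len(vec)] = vec                            # original feature in front
    return out

def third_fusion_feature(text_feat, voice_feat, video_feat,
                         len_text=15, len_voice=5, len_video=12):
    if voice_feat is None:          # target video without audio information
        voice_feat = []
    return np.concatenate([pad_to(text_feat, len_text),
                           pad_to(voice_feat, len_voice),
                           pad_to(video_feat, len_video)])   # dimension 32
```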
And step S606, determining the video type label of the target video according to the fusion characteristics.
Specifically, the third fusion feature is input into the classification model to obtain the video type label of the target video. Specifically, after the long feature vector is obtained in the above step, the long feature vector may be used to represent the target video, and the long feature vector is used as the input of the classification model to classify the target video and obtain the video type label of the target video. The classification model may be a conventional machine learning classification model, such as a Support Vector Machine (SVM) model or a logistic regression model, or an end-to-end structure based on a neural network model, such as a combination of several fully connected layers and a softmax layer. After the video type label of the target video is obtained, step S607 and step S608 are executed to apply the video type label of the target video.
Optionally, if the third fusion feature includes the text content feature with the first specified length, the voice content feature with the second specified length and the video content feature with the third specified length, the weight matrix in the classification model assigns different weights to the text content feature, the voice content feature and the video content feature based on their importance to the target video. Based on the first specified length, the second specified length and the third specified length, weights are assigned to the different features in different dimension ranges of the weight matrix in the classification model, so that the weight matrix contains three dimension-range weight parts that are respectively calculated with the text content feature, the voice content feature and the video content feature. For example, because the audio information of a video is uncertain, the weight part corresponding to the text content feature in the weight matrix of the classification model can be made larger than the weight part corresponding to the video content feature, which in turn can be made larger than the weight part corresponding to the voice content feature, so as to increase the influence of the text content feature on the classification of the target video and further improve the accuracy of video classification.
For example, assuming that the first specified length is 15, the second specified length is 5 and the third specified length is 12, then based on the example in step S605, the third fusion feature with a dimension of 32 is input into the classification model, and the video type label of the target video is obtained based on the weight matrix in the classification model. The weight matrix in the classification model can be regarded as a 32 x 1 matrix: the weight part in dimensions 1 to 15 is calculated with the text content feature, the weight part in dimensions 16 to 20 is calculated with the voice content feature, and the weight part in dimensions 21 to 32 is calculated with the video content feature. In this way the classification model places an appropriate emphasis on each feature of the video when performing video type classification, and the weight matrix in the classification model can be adjusted as required to tune the influence of each feature on the video classification result and further improve the accuracy of video classification.
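The dimension-range weighting in this example can be pictured as below; the random initial values stand in for weights that would actually be learned, and a practical classifier would hold one such column per candidate video type, followed by a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)
w_text  = rng.normal(size=15)    # dimensions 1-15: weight part for the text content feature
w_voice = rng.normal(size=5)     # dimensions 16-20: weight part for the voice content feature
w_video = rng.normal(size=12)    # dimensions 21-32: weight part for the video content feature
weight_matrix = np.concatenate([w_text, w_voice, w_video]).reshape(32, 1)

# score of the target video for one video type, given the 32-dimensional
# third fusion feature obtained in step S605
fused = rng.normal(size=32)      # placeholder for the real third fusion feature
score = float(fused @ weight_matrix)
```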
Step S607, add the target video to the video category corresponding to the video type tag.
Specifically, based on the video type tag of the target video, the target video is added to the video category corresponding to the video type tag. Optionally, the video type tag of the target video may be sent to a target server, where the target server is used to manage a corresponding application program; the application program may add the target video to the category corresponding to the video type tag based on the video type tag received by the target server, so that a user can find the target video in that category when using the application program. In this way, the management efficiency of the application program on videos can be improved. The application program may be a system tool, such as a video classification labeling tool, a video recommendation system, or a video search system. Taking a video classification labeling tool as an example, the method can be used to pre-judge the video to be labeled and then provide candidate answers to labeling personnel based on the video type label of the video to be labeled, thereby improving labeling efficiency; for a video recommendation system, videos are first classified, and accurate recommendation can then be performed for users according to the video types and the user profiles.
Step S608, the target video is pushed to the target terminal.
Specifically, the target video is pushed to the target terminal based on the video type tag of the target video, the target terminal being a terminal marked with the video type tag. The target terminal may be a personal terminal of a user; the user of the target terminal maintains an attention list based on the videos the user is interested in watching, and the attention list comprises a plurality of video type labels. After the video type label of the target video is obtained, at least one target terminal marked with that video type label is acquired, and the target video is sent to the at least one target terminal.
For example, suppose user A is interested in videos related to comedy and swordsman themes, user B is interested in videos related to horror and reasoning, and user C is interested in videos related to idols and variety shows; if the video type label of the target video is "comedy", the target video is pushed to the terminal of user A, whose attention list contains that video type label.
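A trivial sketch of matching the video type label against the attention lists of candidate terminals; the data layout (a mapping from terminal identifier to its attention list) is assumed for illustration.

```python
def target_terminals(video_type_tag, attention_lists):
    # attention_lists: e.g. {"terminal_A": ["comedy", "swordsman"],
    #                        "terminal_B": ["horror", "reasoning"]}
    return [terminal for terminal, tags in attention_lists.items()
            if video_type_tag in tags]

# pushing a video tagged "comedy" would select terminal_A here
```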
Specifically, referring to fig. 10, fig. 10 is a schematic diagram of a video type tag determination process provided in an embodiment of the present application. As shown in fig. 10, the feature extraction process in the video type tag generation process is divided into three branches. The first branch is based on the key frame image of the target video: the text information in the key frame image is obtained, the key phrases in the text information are extracted, and the key phrases are input into the text classification model to obtain the text content features; the specific implementation of this process may refer to steps S601 to S602 shown in fig. 6. The second branch obtains the video content features of the target video based on the frame images forming the target video and the video classification model; the specific implementation of this process may refer to step S603 shown in fig. 6. The third branch obtains the audio information of the target video and inputs the audio information into the speech classification model to obtain the voice content features of the target video; the specific implementation of this process may refer to step S604 shown in fig. 6. Different video classification methods are obtained by combining these three branches in different ways.
In the first video classification method, after the text content features of the target video are obtained, the video type tag of the target video is determined according to the text content features, and a specific implementation manner of the method may refer to descriptions of each step in fig. 3, which is not described herein again.
The second video classification method: the text content features and the video content features of the target video are acquired, the text content features and the video content features are spliced to obtain the first fusion feature, and the first fusion feature is input into the classification model to obtain the video type label of the target video. Optionally, a default feature value is added at a first specified position in the text content feature to obtain a text content feature with a first specified length; a default feature value is added at a third specified position in the video content feature to obtain a video content feature with a third specified length; the text content feature with the first specified length and the video content feature with the third specified length are spliced to obtain the first fusion feature; and the first fusion feature is input into the classification model, and the video type label of the target video is obtained based on the classification weight matrix in the classification model. Adding the video content features to the text content features as the basis for classifying the target video makes the features extracted for the target video more comprehensive and improves the accuracy of the classification result. The text content features are a third-party description of the target video, while the video content features are extracted from the pictures of the target video; for any video, the pictures are a necessary component, since every type of video is composed of pictures, and the pictures are the frame images forming the video and displaying its content. Combining the text content features extracted for the target video with the video content features extracted from the target video therefore captures the composition of the target video as comprehensively as possible. Moreover, the extraction of the video content features attends to the change relationship between the frame images of the target video and can thus represent the overall change of the target video; processing the target video from the features of single frame images up to the change features between frame images can further improve the accuracy of video classification.
The third video classification method: the text content features and the voice content features of the target video are acquired, the text content features and the voice content features are spliced to obtain the second fusion feature, and the second fusion feature is input into the classification model to obtain the video type label of the target video. Optionally, a default feature value is added at a first specified position in the text content feature to obtain a text content feature with a first specified length; a default feature value is added at a second specified position in the voice content feature to obtain a voice content feature with a second specified length; the text content feature with the first specified length and the voice content feature with the second specified length are spliced to obtain the second fusion feature; and the second fusion feature is input into the classification model, and the video type label of the target video is obtained based on the classification weight matrix in the classification model. Adding the voice content features to the text content features as the basis for classifying the target video makes the features extracted for the target video more comprehensive and improves the accuracy of the classification result. The text content features are a conventional description of the target video, while the voice content features are extracted from the audio information of the target video; the audio information is generally a related introduction to the target video, including monologue, lines and the like, and is a description of the target video that may not be reflected in its pictures. Combining the text content features with the voice content features therefore obtains the features of the frame images forming the target video from the text content features and constrains the classification result to the content of the target video through the voice content features, making the classification of the target video more accurate. For example, the pictures of the target video tell a story of swordsmen acting chivalrously; extracting only the text information of the target video yields text content features that can only produce the classification "swordsman". However, a piece of audio information at the beginning of the target video narrates the background of the story, such as "in the XX dynasty, beacon fires rose on all sides, the people were displaced by war, and local tyrants bullied and oppressed the common people, ……"; from this audio information the time and place in which the story of the target video occurs can be extracted, yielding another classification of the target video, "history", so that the classification result of the target video is more comprehensive and accurate.
And the fourth video classification method comprises the steps of obtaining the text content characteristics, the video content characteristics and the voice content characteristics of the target video, splicing the text content characteristics, the video content characteristics and the voice content characteristics to obtain third fusion characteristics, and inputting the third fusion characteristics into a classification model to obtain the video type label of the target video. The specific implementation manner of the method may refer to the description of each step in fig. 6, which is not described herein again.
After the video classification tag of the target video is obtained by the above video classification method, step S607 and step S608 in fig. 6 may be executed to apply the video classification tag of the target video.
According to the method and the device, the key frame image is acquired from the target video and input into an image search engine to obtain the description information of the key frame image, the key phrases of the key frame image are determined according to the description information, the text content features corresponding to the key phrases are acquired, and the video type label of the target video is determined according to the text content features. Because the related text information of the target video is used as a basis for classifying the target video, the text information of video samples can be added as training samples during training, which increases the trainable data volume and improves the accuracy of video classification. Moreover, because the text information is obtained based on an image search engine, and the text labels an image search engine assigns to an image are an explanatory description of that image, the related text information of the target video is a content description of the target video; determining the video type label of the target video through this related text information further improves the effectiveness and accuracy of video classification. Meanwhile, in the application, multi-modal classification of videos is realized by combining the features of the different dimensional contents forming the target video; specifically, the target video is classified based on combinations of dimensional contents such as the text information, the video frame images and the audio information of the target video. With more features available for video classification, the features used when classifying the target video are more comprehensive, and the advantages of combining the text content features with the video content features and of combining the text content features with the voice content features are both obtained, so that the accuracy and effectiveness of video classification are improved. In addition, during training, the models used for video classification are trained based on combinations of dimensional content samples such as the text information, video frame images and audio information of the target video; this multi-modal training effectively increases the trainable data volume and further improves the prediction accuracy of the models used for video classification.
Referring to fig. 11, fig. 11 is a schematic view of a video classification apparatus provided in an embodiment of the present application, and as shown in fig. 11, the video classification apparatus 110 may be used in the computer in the embodiment corresponding to fig. 3 or fig. 6, specifically, the video classification apparatus 110 may include: the device comprises a first acquisition module 11, a first determination module 12, a second acquisition module 13 and a second determination module 14.
A first obtaining module 11, configured to obtain a key frame image from a target video;
a first determining module 12, configured to input the key frame image into an image search engine, obtain description information of the key frame image, and determine a keyword group of the key frame image according to the description information;
a second obtaining module 13, configured to obtain text content features corresponding to the keyword group;
and a second determining module 14, configured to determine a video type tag of the target video according to the text content feature.
Wherein the apparatus 110 further comprises:
a third obtaining module 15, configured to obtain, according to content of each frame of image in the target video, video content features corresponding to the target video;
the second determination module 14 includes:
a splicing unit 141, configured to splice the text content feature and the video content feature to obtain a first fusion feature;
and the first training unit 142 is configured to input the first fusion feature into a classification model to obtain a video type tag of the target video.
Wherein, the third obtaining module 15 includes:
a first obtaining unit 151, configured to obtain at least one image pair in the target video, where each image pair includes two adjacent frames of images in the target video;
a second obtaining unit 152, configured to obtain an optical flow graph between two frames of images in the at least one image pair, and compose the optical flow graph corresponding to the at least one image pair into an optical flow graph sequence of the target video;
the second training unit 153 is configured to input the frame image sequence of the target video and the optical flow graph sequence into a video classification model to obtain video content features corresponding to the target video, where the frame image sequence is obtained by sequentially arranging the frame images that constitute the target video.
Wherein the apparatus 110 further comprises:
a fourth obtaining module 16, configured to obtain audio information of the target video, and input the audio information into a speech classification model to obtain a speech content feature corresponding to the audio information;
the second determining module 14 includes:
the splicing unit 141 is further configured to splice the text content feature and the voice content feature to obtain a second fusion feature;
the first training unit 142 is further configured to input the second fusion feature into a classification model to obtain a video type tag of the target video.
Wherein the apparatus 110 further comprises:
a fifth obtaining module 17, configured to identify image characters in the key frame image, and obtain subtitle information corresponding to the key frame image;
in the aspect of determining the key phrase of the key frame image according to the description information, the first determining module 12 is specifically configured to:
and determining a key phrase of the key frame image according to the description information, the image characters and the subtitle information.
Wherein the first determining module 12 includes:
an adding unit 121, configured to add a phrase in the description information, a phrase in the image text, and a phrase in the subtitle information to a phrase set;
a first determining unit 122, configured to determine, according to the occurrence frequency and the type weight of each phrase in the phrase set, an evaluation value corresponding to each phrase in the phrase set; the type weight comprises a weight corresponding to the description information, a weight corresponding to the image and the text and a weight corresponding to the subtitle information;
a second determining unit 123, configured to rank each phrase in the phrase set according to the evaluation value, and determine a keyword phrase of the key frame image from the phrase set according to a ranking result.
Wherein, the first obtaining module 11 includes:
a third obtaining unit 111, configured to obtain a plurality of frame images forming the target video, input the plurality of frame images into a feature extraction layer in a key frame determination model, and obtain an image feature of each frame image;
a third determining unit 112, configured to input the image feature of each frame image into a key value determination layer in the key frame determination model, where a key value of each frame image is determined based on an attention mechanism;
a fourth determining unit 113, configured to determine the key frame image in the target video according to the key value of each frame image.
The third determining unit 112 is specifically configured to:
determining the correlation degree between the image characteristics of the ith frame image and the image characteristics of a comparison image in the plurality of frame images based on the attention mechanism in the key value determination layer, and obtaining the key value of the ith frame image according to the correlation degree between the image characteristics of the ith frame image and the image characteristics of the comparison image; the comparison image is a frame image except the ith frame image in the plurality of frame images forming the target video, i is a positive integer and is not more than the number of the plurality of frame images;
and when the ith frame image is the last frame image in the plurality of frame images, obtaining a key value of each frame image.
Wherein, in said determining a key phrase of the key frame image according to the description information, the first determining module 12 comprises:
a counting unit 124, configured to count the occurrence frequency of each phrase in the phrases included in the description information, and determine the phrase whose occurrence frequency is greater than a threshold value of the statistical frequency in the description information as a key phrase of the key frame image.
Wherein, the second obtaining module 13 includes:
the extracting unit 131 is configured to input the keyword group into a text classification model, and extract an initial text feature corresponding to the keyword group;
a matching unit 132, configured to match the initial text feature with multiple features of types to be matched in the text classification model, so as to obtain a matching value;
a fifth determining unit 133, configured to determine the type feature to be matched with the maximum matching value as a text content feature corresponding to the keyword group.
In the aspect of obtaining a second fusion feature by splicing the text content feature and the speech content feature, the splicing unit 141 includes:
a first generating subunit 1411, configured to add a default feature value to a first specified position in the text content feature, so as to obtain a text content feature with a first specified length;
the first generating subunit 1411 is further configured to add the default feature value to a second specified position in the voice content feature to obtain a voice content feature with a second specified length;
a second generating subunit 1412, configured to splice the text content feature with the first specified length and the voice content feature with the second specified length to obtain the second fusion feature;
the first training unit 142 is specifically configured to:
and inputting the second fusion characteristics into the classification model, and obtaining the video type label of the target video based on a classification weight matrix in the classification model.
Wherein the apparatus 110 further comprises:
an adding module 18, configured to add the target video to a video category corresponding to the video type tag based on the video type tag of the target video; or,
a sending module 19, configured to push the target video to a target terminal, where the target terminal is a terminal marked with the video type tag.
The embodiment of the application provides a video classification device, which obtains a key frame image from a target video, inputs the key frame image into an image search engine to obtain description information of the key frame image, determines a key phrase of the key frame image according to the description information, obtains text content characteristics corresponding to the key phrase, and determines a video type label of the target video according to the text content characteristics. According to the method and the device, the related text information of the target video is used as a basis for classifying the target video, so that the text information of the video sample can be increased during training to be used as a training sample, the trainable data volume is increased, and the accuracy of video classification is improved. And because the text information is obtained based on the image search engine, the text label of the image by the image search engine is an explanation description of the image, so that the related text information of the target video is a content description of the target video, and the video type label of the target video is determined through the related text information, thereby further improving the effectiveness and accuracy of video classification.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a computer according to an embodiment of the present disclosure. As shown in fig. 12, the computer in the embodiment of the present application may include: one or more processors 1201, memory 1202, and input-output interface 1203. The processor 1201, the memory 1202, and the input/output interface 1203 are connected by a bus 1204. The memory 1202 is used for storing a computer program including program instructions, and the input/output interface 1203 is used for inputting and outputting data, specifically for inputting and outputting data in each model used in each application; the processor 1201 is configured to execute the program instructions stored in the memory 1202 to perform the following operations:
acquiring a key frame image from a target video;
inputting the key frame image into an image search engine to obtain description information of the key frame image, and determining a key phrase of the key frame image according to the description information;
acquiring text content characteristics corresponding to the key phrases;
and determining the video type label of the target video according to the text content characteristics.
In some possible embodiments, the processor 1201 may be a Central Processing Unit (CPU), or another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 1202 may include both read-only memory and random access memory, and provides instructions and data to the processor 1201 and the input output interface 1203. A portion of the memory 1202 may also include non-volatile random access memory. For example, memory 1202 may also store device type information.
In a specific implementation, the computer may execute, through each built-in functional module, the implementation manner provided in each step in fig. 3 or fig. 6, which may be specifically referred to the implementation manner provided in each step in fig. 3 or fig. 6, and is not described herein again.
The embodiment of the present application provides a computer, including: the video classification method comprises a processor, an input/output interface and a memory, wherein the processor acquires computer instructions in the memory, and executes the steps of the method shown in the figure 3 or the figure 6 to perform the video classification operation. With computer instructions in the memory, the processor performs the steps of: acquiring a key frame image from a target video, inputting the key frame image into an image search engine to obtain description information of the key frame image, determining a key phrase of the key frame image according to the description information, acquiring text content characteristics corresponding to the key phrase, and determining a video type label of the target video according to the text content characteristics. According to the method and the device, the related text information of the target video is used as a basis for classifying the target video, so that the text information of the video sample can be increased during training to be used as a training sample, the trainable data volume is increased, and the accuracy of video classification is improved. And because the text information is obtained based on the image search engine, the text label of the image by the image search engine is an explanation description of the image, so that the related text information of the target video is a content description of the target video, and the video type label of the target video is determined through the related text information, thereby further improving the effectiveness and accuracy of video classification.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a processor, the video classification method provided in each step in fig. 3 or fig. 6 is implemented, which may specifically refer to the implementation manner provided in each step in fig. 3 or fig. 6, and details of the implementation manner are not described herein again.
The computer-readable storage medium may be the video classification apparatus provided in any of the foregoing embodiments or an internal storage unit of the computer, such as a hard disk or a memory of the computer. The computer-readable storage medium may also be an external storage device of the computer, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer. The computer-readable storage medium is used for storing the computer program and other programs and data required by the computer, and may also be used to temporarily store data that has been output or is to be output.
The terms "first," "second," and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, apparatus, product, or apparatus that comprises a list of steps or elements is not limited to the listed steps or modules, but may alternatively include other steps or modules not listed or inherent to such process, method, apparatus, product, or apparatus.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that the components and steps of the examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure describes only the preferred embodiments of the present application and is not intended to limit the scope of the present application; equivalent variations and modifications made in accordance with the claims of the present application therefore still fall within the scope of the present application.

Claims (15)

1. A method for video classification, the method comprising:
acquiring a key frame image from a target video;
inputting the key frame image into an image search engine to obtain description information of the key frame image, and determining a key phrase of the key frame image according to the description information;
acquiring text content characteristics corresponding to the key phrases;
and determining the video type label of the target video according to the text content characteristics.
2. The method of claim 1, wherein the method further comprises:
acquiring video content characteristics corresponding to the target video according to the content of each frame of image in the target video;
determining the video type tag of the target video according to the text content features comprises:
splicing the text content characteristics and the video content characteristics to obtain first fusion characteristics;
and inputting the first fusion characteristics into a classification model to obtain a video type label of the target video.
3. The method of claim 2, wherein the obtaining the video content characteristics corresponding to the target video according to the content of each frame of image in the target video comprises:
acquiring at least one image pair in the target video, wherein each image pair comprises two adjacent frames of images in the target video;
acquiring an optical flow graph between two frames of images in the at least one image pair, and forming the optical flow graph corresponding to the at least one image pair into an optical flow graph sequence of the target video;
and inputting the frame image sequence of the target video and the optical flow graph sequence into a video classification model to obtain video content characteristics corresponding to the target video, wherein the frame image sequence is obtained by sequentially arranging all frame images forming the target video.
4. The method of claim 1, wherein the method further comprises:
acquiring audio information of the target video, and inputting the audio information into a voice classification model to obtain voice content characteristics corresponding to the audio information;
the determining the video type label of the target video according to the text content features comprises:
splicing the text content features and the voice content features to obtain second fusion features;
and inputting the second fusion characteristics into a classification model to obtain a video type label of the target video.
5. The method of claim 1, wherein the method further comprises:
identifying image characters in the key frame image, and acquiring subtitle information corresponding to the key frame image;
the determining the key phrase of the key frame image according to the description information comprises:
and determining the key phrase of the key frame image according to the description information, the image characters, and the subtitle information.
6. The method of claim 5, wherein the determining the key phrase of the key frame image according to the description information, the image characters, and the subtitle information comprises:
adding phrases in the description information, phrases in the image characters, and phrases in the subtitle information into a phrase set;
determining an evaluation value corresponding to each phrase in the phrase set according to an occurrence frequency and a type weight of each phrase in the phrase set, wherein the type weight comprises a weight corresponding to the description information, a weight corresponding to the image characters, and a weight corresponding to the subtitle information;
and ranking the phrases in the phrase set according to the evaluation values, and determining the key phrase of the key frame image from the phrase set according to a ranking result.
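A sketch of the evaluation-value computation in claim 6. The type weights (1.0 / 0.8 / 0.6) are illustrative values only, not taken from the disclosure:

```python
from collections import Counter

def rank_key_phrases(description_phrases, image_text_phrases, subtitle_phrases,
                     weights=(1.0, 0.8, 0.6), top_k=5):
    """Claim 6 sketch: pool phrases from the description information, the image
    characters and the subtitle information, score each phrase by
    occurrence count x type weight, and keep the top-ranked phrases."""
    w_desc, w_img, w_sub = weights
    scores = Counter()
    for phrase, count in Counter(description_phrases).items():
        scores[phrase] += count * w_desc
    for phrase, count in Counter(image_text_phrases).items():
        scores[phrase] += count * w_img
    for phrase, count in Counter(subtitle_phrases).items():
        scores[phrase] += count * w_sub
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [phrase for phrase, _ in ranked[:top_k]]
```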
7. The method of claim 1, wherein the acquiring the key frame image from the target video comprises:
acquiring a plurality of frame images forming the target video, inputting the plurality of frame images into a feature extraction layer in a key frame determination model, and obtaining image features of each frame image;
inputting the image features of each frame image into a key value determination layer in the key frame determination model, wherein a key value of each frame image is determined based on an attention mechanism in the key value determination layer;
and determining the key frame image in the target video according to the key value of each frame image.
8. The method of claim 7, wherein the determining the key value of each frame image based on the attention mechanism in the key value determination layer comprises:
determining a correlation degree between the image features of an ith frame image and the image features of a comparison image in the plurality of frame images based on the attention mechanism in the key value determination layer, and obtaining the key value of the ith frame image according to the correlation degree between the image features of the ith frame image and the image features of the comparison image, wherein the comparison image is a frame image other than the ith frame image among the plurality of frame images forming the target video, and i is a positive integer not greater than the number of the plurality of frame images;
and when the ith frame image is the last frame image in the plurality of frame images, obtaining the key value of each frame image.
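A simplified, training-free stand-in for the attention mechanism of claims 7-8: the key value of each frame is its average correlation (cosine similarity here) with the comparison images, i.e. all other frames. A trained key value determination layer would learn this weighting rather than use a fixed similarity:

```python
import numpy as np

def frame_key_values(frame_features: np.ndarray) -> np.ndarray:
    """Claims 7-8 sketch: frame_features has shape (num_frames, dim). The key
    value of frame i is its average correlation degree with the comparison
    images (every other frame), using cosine similarity."""
    norms = np.linalg.norm(frame_features, axis=1, keepdims=True) + 1e-8
    feats = frame_features / norms
    sim = feats @ feats.T                          # pairwise correlation degrees
    n = sim.shape[0]
    np.fill_diagonal(sim, 0.0)                     # compare only with other frames
    return sim.sum(axis=1) / max(n - 1, 1)         # one key value per frame

# key_frame_index = int(np.argmax(frame_key_values(features)))
```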
9. The method of claim 1, wherein the determining the key phrase of the key frame image according to the description information comprises:
counting occurrence times of each phrase contained in the description information, and determining a phrase whose occurrence times are greater than a statistical threshold as the key phrase of the key frame image.
10. The method of claim 1, wherein the acquiring the text content features corresponding to the key phrase comprises:
inputting the key phrase into a text classification model, and extracting initial text features corresponding to the key phrase;
matching the initial text features with a plurality of to-be-matched type features in the text classification model to obtain matching values;
and determining the to-be-matched type feature with the maximum matching value as the text content features corresponding to the key phrase.
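A sketch of the matching step in claim 10, using a dot product as the matching value between the initial text feature and each to-be-matched type feature (the disclosure does not fix a particular similarity measure):

```python
import numpy as np

def text_content_feature(initial_feat: np.ndarray,
                         type_feats: np.ndarray) -> np.ndarray:
    """Claim 10 sketch: match the initial text feature against every
    to-be-matched type feature and return the best-matching one.
    type_feats has shape (num_types, dim)."""
    matching_values = type_feats @ initial_feat          # one value per type
    best = int(np.argmax(matching_values))               # maximum matching value
    return type_feats[best]
```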
11. The method of claim 4, wherein the concatenating the text content features and the voice content features to obtain the second fusion feature comprises:
adding a default feature value at a first designated position in the text content features to obtain text content features with a first designated length;
adding the default feature value at a second designated position in the voice content features to obtain voice content features with a second designated length;
concatenating the text content features with the first designated length and the voice content features with the second designated length to obtain the second fusion feature;
and the inputting the second fusion feature into the classification model to obtain the video type label of the target video comprises:
inputting the second fusion feature into the classification model, and obtaining the video type label of the target video based on a classification weight matrix in the classification model.
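A sketch of claim 11, padding each feature with a default value of 0.0 appended at the end (one possible choice of designated position) to illustrative lengths of 128, then scoring the concatenated second fusion feature with a classification weight matrix:

```python
import numpy as np

def pad_to(feat: np.ndarray, length: int, default: float = 0.0) -> np.ndarray:
    """Append the default feature value until the feature has the given length."""
    out = np.full(length, default)
    out[:len(feat)] = feat[:length]
    return out

def classify_with_fusion(text_feat, voice_feat, class_weight_matrix, labels,
                         text_len=128, voice_len=128):
    """Claim 11 sketch: pad both features to their designated lengths,
    concatenate them into the second fusion feature, and obtain the label
    from a classification weight matrix."""
    fused = np.concatenate([pad_to(text_feat, text_len),
                            pad_to(voice_feat, voice_len)])
    scores = class_weight_matrix @ fused                 # (num_labels,)
    return labels[int(np.argmax(scores))]
```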
12. The method of claim 1, wherein the method further comprises:
adding the target video to a video classification corresponding to the video type label based on the video type label of the target video; or
pushing the target video to a target terminal, wherein the target terminal is a terminal marked with the video type label.
13. An apparatus for video classification, the apparatus comprising:
the first acquisition module is used for acquiring a key frame image from a target video;
the first determining module is used for inputting the key frame image into an image search engine to obtain the description information of the key frame image and determining the key phrase of the key frame image according to the description information;
the second acquisition module is used for acquiring text content features corresponding to the key phrase;
and the second determining module is used for determining the video type label of the target video according to the text content features.
14. A computer, comprising a processor, a memory, and an input/output interface;
wherein the processor is connected to the memory and the input/output interface, respectively; the input/output interface is used for inputting and outputting data; the memory is used for storing program code; and the processor is used for calling the program code to execute the method according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 12.
CN201911071940.9A 2019-11-05 2019-11-05 Video classification method, apparatus, computer and readable storage medium Active CN110837579B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911071940.9A CN110837579B (en) 2019-11-05 2019-11-05 Video classification method, apparatus, computer and readable storage medium
PCT/CN2020/114389 WO2021088510A1 (en) 2019-11-05 2020-09-10 Video classification method and apparatus, computer, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911071940.9A CN110837579B (en) 2019-11-05 2019-11-05 Video classification method, apparatus, computer and readable storage medium

Publications (2)

Publication Number Publication Date
CN110837579A true CN110837579A (en) 2020-02-25
CN110837579B CN110837579B (en) 2024-07-23

Family

ID=69576324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911071940.9A Active CN110837579B (en) 2019-11-05 2019-11-05 Video classification method, apparatus, computer and readable storage medium

Country Status (2)

Country Link
CN (1) CN110837579B (en)
WO (1) WO2021088510A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114625918A (en) * 2022-03-18 2022-06-14 腾讯科技(深圳)有限公司 Video recommendation method, device, equipment, storage medium and program product
CN115114480A (en) * 2022-04-26 2022-09-27 腾讯科技(深圳)有限公司 Data processing method, device, equipment, readable storage medium and program product

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763325A (en) * 2018-05-04 2018-11-06 北京达佳互联信息技术有限公司 A kind of network object processing method and processing device
CN109359592A (en) * 2018-10-16 2019-02-19 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of video frame
CN109753985A (en) * 2017-11-07 2019-05-14 北京京东尚科信息技术有限公司 Video classification methods and device
CN109947991A (en) * 2017-10-31 2019-06-28 腾讯科技(深圳)有限公司 A kind of extraction method of key frame, device and storage medium
CN110399526A (en) * 2019-07-26 2019-11-01 腾讯科技(深圳)有限公司 Generation method, device and the computer readable storage medium of video title

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162669B (en) * 2019-04-04 2021-07-02 腾讯科技(深圳)有限公司 Video classification processing method and device, computer equipment and storage medium
CN110287788A (en) * 2019-05-23 2019-09-27 厦门网宿有限公司 A kind of video classification methods and device
CN110837579B (en) * 2019-11-05 2024-07-23 腾讯科技(深圳)有限公司 Video classification method, apparatus, computer and readable storage medium


Cited By (65)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021088510A1 (en) * 2019-11-05 2021-05-14 腾讯科技(深圳)有限公司 Video classification method and apparatus, computer, and readable storage medium
WO2021169209A1 (en) * 2020-02-27 2021-09-02 平安科技(深圳)有限公司 Method, apparatus and device for recognizing abnormal behavior on the basis of voice and image features
CN113365102A (en) * 2020-03-04 2021-09-07 阿里巴巴集团控股有限公司 Video processing method and device and label processing method and device
CN113365102B (en) * 2020-03-04 2022-08-16 阿里巴巴集团控股有限公司 Video processing method and device and label processing method and device
CN111538896A (en) * 2020-03-12 2020-08-14 成都云帆数联科技有限公司 Intelligent extraction method of news video fine-grained labels based on deep learning
CN111538896B (en) * 2020-03-12 2021-04-27 成都云帆数联科技有限公司 Intelligent extraction method of news video fine-grained labels based on deep learning
CN111274442B (en) * 2020-03-19 2023-10-27 聚好看科技股份有限公司 Method for determining video tag, server and storage medium
CN111274442A (en) * 2020-03-19 2020-06-12 聚好看科技股份有限公司 Method for determining video label, server and storage medium
CN111444878A (en) * 2020-04-09 2020-07-24 Oppo广东移动通信有限公司 Video classification method and device and computer readable storage medium
CN111556377A (en) * 2020-04-24 2020-08-18 珠海横琴电享科技有限公司 Short video labeling method based on machine learning
CN111695422B (en) * 2020-05-06 2023-08-18 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN111695422A (en) * 2020-05-06 2020-09-22 Oppo(重庆)智能科技有限公司 Video tag acquisition method and device, storage medium and server
CN111611262B (en) * 2020-05-24 2023-09-15 山东三宏信息科技有限公司 Garbage classification and identification system based on text decoupling and image processing
CN111611262A (en) * 2020-05-24 2020-09-01 济南欣格信息科技有限公司 Garbage classification recognition system based on text decoupling and image processing
CN113746874B (en) * 2020-05-27 2024-04-05 百度在线网络技术(北京)有限公司 Voice package recommendation method, device, equipment and storage medium
KR102684502B1 (en) 2020-05-27 2024-07-11 바이두 온라인 네트웍 테크놀러지 (베이징) 캄파니 리미티드 Voice packet recommendation methods, devices, facilities and storage media
CN113746874A (en) * 2020-05-27 2021-12-03 百度在线网络技术(北京)有限公司 Voice packet recommendation method, device, equipment and storage medium
KR20210090273A (en) * 2020-05-27 2021-07-19 바이두 온라인 네트웍 테크놀러지 (베이징) 캄파니 리미티드 Voice packet recommendation method, device, equipment and storage medium
CN111611436B (en) * 2020-06-24 2023-07-11 深圳市雅阅科技有限公司 Label data processing method and device and computer readable storage medium
CN111767726A (en) * 2020-06-24 2020-10-13 北京奇艺世纪科技有限公司 Data processing method and device
CN111767726B (en) * 2020-06-24 2024-02-06 北京奇艺世纪科技有限公司 Data processing method and device
CN111611436A (en) * 2020-06-24 2020-09-01 腾讯科技(深圳)有限公司 Label data processing method and device and computer readable storage medium
CN111797765A (en) * 2020-07-03 2020-10-20 北京达佳互联信息技术有限公司 Image processing method, image processing apparatus, server, and storage medium
CN111797765B (en) * 2020-07-03 2024-04-16 北京达佳互联信息技术有限公司 Image processing method, device, server and storage medium
CN113919338B (en) * 2020-07-09 2024-05-24 腾讯科技(深圳)有限公司 Method and device for processing text data
CN113919338A (en) * 2020-07-09 2022-01-11 腾讯科技(深圳)有限公司 Method and device for processing text data
CN111914682B (en) * 2020-07-13 2024-01-05 完美世界控股集团有限公司 Teaching video segmentation method, device and equipment containing presentation file
CN111914682A (en) * 2020-07-13 2020-11-10 完美世界控股集团有限公司 Teaching video segmentation method, device and equipment containing presentation file
CN112149653B (en) * 2020-09-16 2024-03-29 北京达佳互联信息技术有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN112149653A (en) * 2020-09-16 2020-12-29 北京达佳互联信息技术有限公司 Information processing method, information processing device, electronic equipment and storage medium
CN112183275A (en) * 2020-09-21 2021-01-05 北京达佳互联信息技术有限公司 Video description information generation method and device and server
CN114625922A (en) * 2020-12-10 2022-06-14 北京达佳互联信息技术有限公司 Word stock construction method and device, electronic equipment and storage medium
CN114648712A (en) * 2020-12-18 2022-06-21 北京字节跳动网络技术有限公司 Video classification method and device, electronic equipment and computer-readable storage medium
CN114648712B (en) * 2020-12-18 2023-07-28 抖音视界有限公司 Video classification method, device, electronic equipment and computer readable storage medium
CN112580607A (en) * 2021-01-05 2021-03-30 上海忘梦堂网络科技有限公司 Video data structuring method
CN113011254A (en) * 2021-02-04 2021-06-22 腾讯科技(深圳)有限公司 Video data processing method, computer equipment and readable storage medium
CN113011254B (en) * 2021-02-04 2023-11-07 腾讯科技(深圳)有限公司 Video data processing method, computer equipment and readable storage medium
CN113569091A (en) * 2021-02-08 2021-10-29 腾讯科技(深圳)有限公司 Video data processing method and device
CN112801017A (en) * 2021-02-09 2021-05-14 成都视海芯图微电子有限公司 Visual scene description method and system
CN112801017B (en) * 2021-02-09 2023-08-04 成都视海芯图微电子有限公司 Visual scene description method and system
CN115082930B (en) * 2021-03-11 2024-05-28 腾讯科技(深圳)有限公司 Image classification method, device, electronic equipment and storage medium
CN115082930A (en) * 2021-03-11 2022-09-20 腾讯科技(深圳)有限公司 Image classification method and device, electronic equipment and storage medium
CN113127663A (en) * 2021-04-01 2021-07-16 深圳力维智联技术有限公司 Target image searching method, device, equipment and computer readable storage medium
CN113127663B (en) * 2021-04-01 2024-02-27 深圳力维智联技术有限公司 Target image searching method, device, equipment and computer readable storage medium
CN112989117A (en) * 2021-04-14 2021-06-18 北京世纪好未来教育科技有限公司 Video classification method and device, electronic equipment and computer storage medium
CN113821675A (en) * 2021-06-30 2021-12-21 腾讯科技(北京)有限公司 Video identification method and device, electronic equipment and computer readable storage medium
CN113821675B (en) * 2021-06-30 2024-06-07 腾讯科技(北京)有限公司 Video identification method, device, electronic equipment and computer readable storage medium
CN113723259A (en) * 2021-08-24 2021-11-30 罗家泳 Monitoring video processing method and device, computer equipment and storage medium
CN114003769A (en) * 2021-11-04 2022-02-01 Oook(北京)教育科技有限责任公司 Recorded and broadcast video retrieval method, device, medium and electronic equipment
CN116150428A (en) * 2021-11-16 2023-05-23 腾讯科技(深圳)有限公司 Video tag acquisition method and device, electronic equipment and storage medium
CN116150428B (en) * 2021-11-16 2024-06-07 腾讯科技(深圳)有限公司 Video tag acquisition method and device, electronic equipment and storage medium
CN114238690A (en) * 2021-12-08 2022-03-25 腾讯科技(深圳)有限公司 Video classification method, device and storage medium
CN114494982B (en) * 2022-04-08 2022-12-20 华夏文广传媒集团股份有限公司 Live video big data accurate recommendation method and system based on artificial intelligence
CN114494982A (en) * 2022-04-08 2022-05-13 北京嘉沐安科技有限公司 Live video big data accurate recommendation method and system based on artificial intelligence
CN114880496A (en) * 2022-04-28 2022-08-09 国家计算机网络与信息安全管理中心 Multimedia information topic analysis method, device, equipment and storage medium
CN114996514A (en) * 2022-05-31 2022-09-02 北京达佳互联信息技术有限公司 Text generation method and device, computer equipment and medium
WO2024012289A1 (en) * 2022-07-14 2024-01-18 维沃移动通信有限公司 Video generation method and apparatus, electronic device and medium
WO2024067276A1 (en) * 2022-09-30 2024-04-04 华为技术有限公司 Video tag determination method and apparatus, device and medium
CN116112763A (en) * 2022-11-15 2023-05-12 国家计算机网络与信息安全管理中心 Method and system for automatically generating short video content labels
CN116112763B (en) * 2022-11-15 2024-10-22 国家计算机网络与信息安全管理中心 Method and system for automatically generating short video content labels
CN115935008A (en) * 2023-02-16 2023-04-07 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment
WO2024183487A1 (en) * 2023-03-09 2024-09-12 腾讯科技(深圳)有限公司 Video generation method and apparatus, electronic device, computer-readable storage medium, and computer program product
CN117708376A (en) * 2023-07-17 2024-03-15 荣耀终端有限公司 Video processing method, readable storage medium and electronic device
CN117115565A (en) * 2023-10-19 2023-11-24 南方科技大学 Autonomous perception-based image classification method and device and intelligent terminal
CN117115565B (en) * 2023-10-19 2024-07-23 南方科技大学 Autonomous perception-based image classification method and device and intelligent terminal

Also Published As

Publication number Publication date
WO2021088510A1 (en) 2021-05-14
CN110837579B (en) 2024-07-23

Similar Documents

Publication Publication Date Title
CN110837579B (en) Video classification method, apparatus, computer and readable storage medium
CN109117777B (en) Method and device for generating information
CN111143610B (en) Content recommendation method and device, electronic equipment and storage medium
US20210224601A1 (en) Video sequence selection method, computer device, and storage medium
CN110781347A (en) Video processing method, device, equipment and readable storage medium
CN109862397B (en) Video analysis method, device, equipment and storage medium
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN111611436A (en) Label data processing method and device and computer readable storage medium
CN111126514A (en) Image multi-label classification method, device, equipment and medium
CN110737783A (en) method, device and computing equipment for recommending multimedia content
CN111372141B (en) Expression image generation method and device and electronic equipment
CN111783712A (en) Video processing method, device, equipment and medium
CN110347866B (en) Information processing method, information processing device, storage medium and electronic equipment
CN113392236A (en) Data classification method, computer equipment and readable storage medium
CN111597446B (en) Content pushing method and device based on artificial intelligence, server and storage medium
CN113761253A (en) Video tag determination method, device, equipment and storage medium
CN112257452A (en) Emotion recognition model training method, device, equipment and storage medium
CN110991149A (en) Multi-mode entity linking method and entity linking system
CN113704507B (en) Data processing method, computer device and readable storage medium
CN112163560A (en) Video information processing method and device, electronic equipment and storage medium
CN115114439A (en) Method and device for multi-task model reasoning and multi-task information processing
CN112995690B (en) Live content category identification method, device, electronic equipment and readable storage medium
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN115759293A (en) Model training method, image retrieval device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40021059

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant