
CN113094549A - Video classification method and device, electronic equipment and storage medium - Google Patents

Video classification method and device, electronic equipment and storage medium

Info

Publication number
CN113094549A
Authority
CN
China
Prior art keywords
video
information
classification
target video
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110645773.5A
Other languages
Chinese (zh)
Inventor
李林科
李大海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhizhe Sihai Beijing Technology Co Ltd
Original Assignee
Zhizhe Sihai Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhizhe Sihai Beijing Technology Co Ltd filed Critical Zhizhe Sihai Beijing Technology Co Ltd
Priority to CN202110645773.5A
Publication of CN113094549A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a video classification method and apparatus, an electronic device and a storage medium. The method comprises: acquiring text information, video information and audio information of a target video; and inputting the text information, video information and audio information of the target video into a video classification model to obtain a video classification result output by the video classification model, wherein the video classification result is obtained by performing feature fusion on the features corresponding to the text information, video information and audio information of the target video according to an attention mechanism used by the video classification model. With the multi-modal, attention-based video classification model, the prediction of the model used for video classification is better and more accurate and its generalization capability is stronger.

Description

Video classification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a video classification method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Websites or platforms that host video services generally need to classify the video entities in their domain, assigning each video to categories such as science and technology or entertainment, so as to provide basic data support for downstream services such as content models, user models and item recommendation.
A common current video classification method is to classify and label videos according to the text information attached to the uploaded video, such as the video description and title. However, because video information is diverse and the title captures only a part of the video, much title information cannot effectively represent the video content.
Disclosure of Invention
In view of this, an object of an embodiment of the present invention is to provide a video classification method, an apparatus, an electronic device, and a storage medium, which specifically include:
in a first aspect, an embodiment of the present invention provides a video classification method, where the method includes:
acquiring text information, video information and audio information of a target video;
inputting text information, video information and audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training;
and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
Optionally, the video classification model includes a feature fusion layer, and the feature fusion layer is specifically configured to:
receiving characteristics corresponding to text information, video information and audio information of the target video input to the characteristic fusion layer;
splicing the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain multi-modal characteristics corresponding to the target video;
according to the multi-modal characteristics corresponding to the target video, determining the attention weight corresponding to the target video;
and determining fusion characteristics corresponding to the target video according to the multi-modal characteristics corresponding to the target video and the attention weight, and using the fusion characteristics as the output of the characteristic fusion layer.
Optionally, the video classification model further includes a classification output layer located after the feature fusion layer, and the classification output layer is specifically configured to:
receiving fusion characteristics corresponding to the target video output by the characteristic fusion layer;
and determining a video classification result corresponding to the target video according to the fusion characteristics corresponding to the target video.
Optionally, the video classification result includes one or more classification tags corresponding to the target video, each classification tag corresponds to a preset classification level, and the number of the classification levels is the same as the number of classification sub-layers in the classification output layer;
correspondingly, the determining a video classification result corresponding to the target video according to the fusion feature corresponding to the target video specifically includes:
and respectively inputting the fusion characteristics corresponding to the target video to each classification sublayer in the classification output layer to obtain a video classification result corresponding to the target video consisting of the classification labels output by each classification sublayer.
Optionally, when the number of the classification levels is 2, a loss function corresponding to a training process of the video classification model is as follows:
$$L = -\sum_{i=1}^{c_1} y_i^{(1)} \log \hat{y}_i^{(1)} - \sum_{i=1}^{c_2} y_i^{(2)} \log \hat{y}_i^{(2)}$$
wherein c1 and c2 represent the number of samples, $y_i^{(1)}$ represents the true value of the first label of the i-th sample, $\hat{y}_i^{(1)}$ represents the predicted value of the first label, $y_i^{(2)}$ represents the true value of the second label of the i-th sample, and $\hat{y}_i^{(2)}$ represents the predicted value of the second label.
Optionally, the splicing the features corresponding to the text information, the video information, and the audio information of the target video to obtain the multi-modal features corresponding to the target video specifically includes:
performing one-dimensional expansion on the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain one-dimensional representation of the characteristics corresponding to the text information, the video information and the audio information of the target video;
and splicing the one-dimensional representations of the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain the multi-modal characteristics corresponding to the target video.
Optionally, the calculation formula of the attention weight score and the fusion feature is:
$$f_1 = \mathrm{ReLU}(\mathrm{BN}(w_1 F + b_1))$$
$$score = \mathrm{sigmoid}(\mathrm{BN}(w_2 f_1 + b_2))$$
$$F_a = F \odot score$$
wherein BN represents the batch normalization operation, ReLU and sigmoid are the activation functions, $w_1$, $w_2$, $b_1$ and $b_2$ are trainable model parameters, $f_1$ is the intermediate feature obtained after the first activation function, and F is the multi-modal feature.
Optionally, the video classification model further includes a feature extraction layer located before the feature fusion layer, where the feature extraction layer is specifically configured to:
extracting the characteristics of the video information and the audio information of the target video by using a Nextvlad model according to the video information and the audio information of the target video;
and extracting the characteristics of the text information of the target video by using a Bert model according to the text information of the target video.
Optionally, the Nextvlad model is obtained by a pre-training step, where the pre-training step specifically includes:
extracting the characteristics of video information and audio information of the sample video through an Inception model and a Vggish model respectively;
inputting the characteristics of the video information and the audio information of the sample video into a Nextvlad model for pre-training to obtain a pre-trained Nextvlad model;
wherein the pre-training step uses the following loss function:
$$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$
wherein k represents the number of samples, $y_i$ represents the true value of the i-th sample, and $\hat{y}_i$ represents the predicted value of the i-th sample.
Optionally, the extracting, according to the text information of the target video, the feature of the text information of the target video by using a Bert model specifically includes:
coding title information and video description information contained in the text information of the target video to obtain coded text information;
and inputting the coded text information into a Bert model to obtain the characteristics of the text information of the target video.
In a second aspect, an embodiment of the present invention provides a video classification apparatus, where the apparatus includes:
the information acquisition module is used for acquiring text information, video information and audio information of the target video;
the video classification module is used for inputting the text information, the video information and the audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training;
and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having stored thereon executable instructions, which when executed by a processor, cause the processor to perform the method according to the first aspect.
According to the video classification method, the video classification device, the electronic equipment and the storage medium, the model prediction effect for video classification is better and more accurate and the generalization capability is stronger through the multi-mode video classification model based on the attention mechanism.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments are briefly described below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative work. The foregoing and other objects, features and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily drawn to scale, emphasis instead being placed on illustrating the subject matter of the present application.
Fig. 1 shows a schematic flow chart of a video classification method provided according to an embodiment of the present disclosure.
Fig. 2 shows a schematic flow chart of a feature fusion method provided according to an embodiment of the present disclosure.
Fig. 3 shows a flow chart of a video classification method provided according to an embodiment of the present disclosure.
Fig. 4 shows a schematic flow chart of a feature extraction method provided according to an embodiment of the present disclosure.
Fig. 5 shows a flowchart of a pretraining method of a Nextvlad model provided according to an embodiment of the present disclosure.
Fig. 6 shows a schematic structural diagram of a video classification apparatus provided according to an embodiment of the present disclosure.
Fig. 7 shows a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The singular forms "a", "an" and "the" as used herein are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, the terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Websites or platforms that host video services generally need to classify the uploaded video entities by field, assigning each video to categories such as science and technology or entertainment, so as to provide basic data support for downstream services such as content models, user models and item recommendation.
A common current video classification method is to classify and label videos according to the text information attached to the uploaded video, such as the video description and title. However, because video information is diverse and the title captures only a part of the video, much title information cannot effectively represent the video content.
In view of this, an object of the embodiments of the present disclosure is to provide a video classification method, apparatus, electronic device and storage medium, which, through a multi-modal video classification model based on an attention mechanism, make the prediction effect of the model used for video classification better and more accurate and its generalization capability stronger.
The present disclosure is described in detail below with reference to the attached drawings.
Fig. 1 shows a video classification method provided in an embodiment of the present invention, which includes the following specific steps:
step S110: and acquiring text information, video information and audio information of the target video.
The target video described in the embodiment of the present invention may be a video entity in a website or platform including a video service, such as a video uploaded by a user in a video website, a video resource of the video website itself, and the like. The text information of the target video can be the text information in the video title or the text information in the video description information. The acquisition of the text information of the target video can be realized by reading a title field or a video description information field corresponding to the target video.
The video information and the audio information of the target video are basic contents constituting a target video entity, and the video information and the audio information of the target video are acquired. For example, when video information is acquired, video frames with a specific frame number may be extracted as the video information, and the video frames may be extracted at fixed intervals or by using a key frame technique.
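As a minimal illustration (not part of the patent text), fixed-interval frame sampling could be implemented as follows; the OpenCV-based helper, the function name and the frame count of 32 are assumptions made for the example.

```python
# Sketch only: uniformly sample a fixed number of frames from a video file.
# The choice of 32 frames is an illustrative assumption.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 32) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Frame indices spread at fixed intervals over the whole video.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)
```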
Step S120: and inputting the text information, the video information and the audio information of the target video into a video classification model to obtain a video classification result output by the video classification model.
The video classification model used in the embodiment of the invention is a multi-modal video classification model, which simultaneously takes the text information, video information, audio information and other information of the target video as the input of the model. In contrast, a single-modality video classification model, for example one that simply classifies the title information based on the Bert model, or one that extracts frames from the video, extracts features with a CNN and then classifies, can achieve a good effect when the video data are uniform in form; but when the video information contains many different modalities, a single modality has difficulty extracting data features effectively enough to achieve the desired effect.
The video classification model in the embodiment of the invention is obtained by taking the text information, video information and audio information of sample videos as training samples and taking the classification results of the sample videos as labels for training. A sample video is a video serving as a training sample, and its classification result is obtained through manual labeling or from a third-party data set such as YouTube-8M.
Further, the video classification result is obtained by performing feature fusion on the features corresponding to the text information, video information and audio information of the target video according to the attention mechanism used by the video classification model. When the embodiment of the invention uses multiple modalities for video classification, it does not simply and directly integrate the output results of the multiple modalities; instead, it performs feature fusion on the information of the multiple modalities. An attention mechanism is used during feature fusion, so that the characteristics of, and the relations between, the contents of the different modalities are fully utilized, and the importance of the different modal features is learned by the network, so that the model used for video classification achieves a better and more accurate prediction effect and a stronger generalization capability.
On the basis of the foregoing embodiment, fig. 2 shows a feature fusion method related to a feature fusion layer included in a video classification model according to another embodiment of the present invention, which includes the specific steps of:
step S210: receiving characteristics corresponding to text information, video information and audio information of the target video input to the characteristic fusion layer.
The video classification model used in the embodiment of the invention mainly comprises a feature extraction layer, a feature fusion layer and a feature output layer in sequence. The feature fusion layer is an important component in the video classification model used in the embodiment of the present invention, and fuses the features of multiple modes based on an attention mechanism. The feature fusion layer first receives input information input to the layer, i.e., features corresponding to text information, video information, and audio information of the target video. It is understood that the above features may be obtained by feature extraction of text information, video information, and audio information of the target video.
Step S220: and splicing the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain the multi-modal characteristics corresponding to the target video.
The features corresponding to the text information, the video information and the audio information of the target video obtained in the previous step are extracted independently of one another; when the three features are fused, they are first spliced to obtain the corresponding multi-modal feature.
Due to the different characteristics of text information, video information and audio information, the dimensions of the extracted features are usually different. Therefore, first, the features corresponding to the text information, the video information, and the audio information of the target video need to be expanded in one dimension to obtain a one-dimensional representation of the features corresponding to the text information, the video information, and the audio information of the target video. And then, splicing the one-dimensional representations of the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain the multi-modal characteristics corresponding to the target video. It will be appreciated that the multi-modal features are also represented in a one-dimensional form.
For example, given the feature $F_{video}$ corresponding to the video information, the feature $F_{audio}$ corresponding to the audio information and the feature $F_{text}$ corresponding to the text information, $F_{video}$ and $F_{audio}$ are first expanded into one-dimensional representations and then spliced together with $F_{text}$ to obtain the multi-modal feature F.
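The following is a minimal sketch (assuming PyTorch tensors, not taken from the patent) of flattening each modality feature to one dimension and concatenating them into the multi-modal feature F; the function name is an illustrative assumption.

```python
# Sketch only: flatten each modality feature and concatenate into one
# one-dimensional multi-modal feature F.
import torch

def build_multimodal_feature(video_feat: torch.Tensor,
                             audio_feat: torch.Tensor,
                             text_feat: torch.Tensor) -> torch.Tensor:
    parts = [f.flatten() for f in (video_feat, audio_feat, text_feat)]
    return torch.cat(parts, dim=0)  # multi-modal feature F
```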
Step S230: and determining the attention weight corresponding to the target video according to the multi-modal characteristics corresponding to the target video.
The formula for calculating the attention weight score is as follows:
$$f_1 = \mathrm{ReLU}(\mathrm{BN}(w_1 F + b_1))$$
$$score = \mathrm{sigmoid}(\mathrm{BN}(w_2 f_1 + b_2))$$
wherein BN represents the batch normalization operation, ReLU and sigmoid are the activation functions, $w_1$, $w_2$, $b_1$ and $b_2$ are trainable model parameters, $f_1$ is the intermediate feature obtained after the first activation function, and F is the multi-modal feature.
In particular, the calculated attention weight score characterizes the importance of the different modal features within the multi-modal feature. After training, the model parameters $w_1$, $w_2$, $b_1$ and $b_2$ can effectively produce a robust attention weight, whose dimension is the same as that of the multi-modal feature.
Step S240: and determining fusion characteristics corresponding to the target video according to the multi-modal characteristics corresponding to the target video and the attention weight, and using the fusion characteristics as the output of the characteristic fusion layer.
The calculation formula of the fusion feature $F_a$ is:
$$F_a = F \odot score$$
after the multi-modal features and the attention weights are fused, original information in the multi-modal features is converted into weight information carrying the attention mechanism, and fused features output by a feature fusion layer are formed. The fusion characteristics are used for video classification, so that the model can effectively distinguish which information in the multi-modal characteristics is more important, and the video classification result is accurately predicted.
For example, the text information of a target video may be sparse: the title is often short or even absent, so it contains little effective content information. As another example, some target videos, such as plain-text videos, have highly redundant visual information, and the large span of video durations challenges the generalization capability of prior-art models, which cannot model videos of different durations at the same time. In the embodiment of the invention, the fused feature output by the feature fusion layer not only fuses the multi-modal features corresponding to the target video but also takes into account the importance of the different modal features, so that the model used for video classification achieves a better and more accurate prediction effect and a stronger generalization capability.
On the basis of the foregoing embodiment, fig. 3 shows a video classification method related to a classification output layer included in a video classification model according to another embodiment of the present invention, which includes the specific steps of:
step S310: and receiving the fusion characteristics corresponding to the target video output by the characteristic fusion layer.
The video classification model used in the embodiment of the invention mainly comprises a feature extraction layer, a feature fusion layer and a feature output layer in sequence. The classification output layer is the next layer of the feature fusion layer, and first needs to receive the fusion features corresponding to the target video output by the feature fusion layer as input data of the classification output layer.
Step S320: and determining a video classification result corresponding to the target video according to the fusion characteristics corresponding to the target video.
And the classification output layer is used as the last layer of the whole video classification model, and is used for determining a video classification result corresponding to the target video according to the fusion characteristics and is also used for outputting the whole video classification model.
Specifically, the video classification result includes one or more classification tags corresponding to the target video, and each classification tag corresponds to a preset classification level. For example, when the number of classification levels is 1, the target video is given a single-level classification, such as music, sports, education or science and technology. When the number of classification levels is 2, the target video is given a secondary classification: for example, under the first-level class music, the target video is further classified into pop, rock, light music and the like; under the first-level class sports, the target video is further classified into football, basketball, tennis and the like.
The number of the classification labels represents the refinement degree of the video classification method for classifying the target video, and the number of the classification labels can be preset according to actual requirements.
Further, the number of classification levels is the same as the number of classification sublayers in the classification output layer. Each classification sub-layer in the classification output layer is used for determining a certain level of classification label of the target video. For example, when the number of classification levels is 2, the classification output layer includes two classification sublayers, a first classification sublayer is used for determining a first-level classification of the target video, and a second classification sublayer is used for further determining a second-level classification of the target video.
Correspondingly, the determining a video classification result corresponding to the target video according to the fusion feature corresponding to the target video specifically includes:
and respectively inputting the fusion characteristics corresponding to the target video to each classification sublayer in the classification output layer to obtain a video classification result corresponding to the target video consisting of the classification labels output by each classification sublayer.
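A minimal sketch (assuming PyTorch, with illustrative names and class counts) of a classification output layer with one sub-layer per classification level is shown below; each sub-layer receives the same fused feature.

```python
# Sketch only: two classification sub-layers, one per classification level.
import torch
import torch.nn as nn

class ClassificationOutput(nn.Module):
    def __init__(self, feat_dim: int, num_level1: int, num_level2: int):
        super().__init__()
        self.level1 = nn.Linear(feat_dim, num_level1)  # first-level labels
        self.level2 = nn.Linear(feat_dim, num_level2)  # second-level labels

    def forward(self, fused: torch.Tensor):
        # Each sub-layer outputs the logits for one classification level.
        return self.level1(fused), self.level2(fused)
```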
The number of classification levels also has an impact on the entire training process of the video classification model. In particular, the number of classification levels will determine the form of expression of the loss function in the model training process. For example, when the number of classification levels is 2, the loss function corresponding to the training process of the video classification model is:
$$L = -\sum_{i=1}^{c_1} y_i^{(1)} \log \hat{y}_i^{(1)} - \sum_{i=1}^{c_2} y_i^{(2)} \log \hat{y}_i^{(2)}$$
wherein c1 and c2 represent the number of samples, $y_i^{(1)}$ represents the true value of the first label of the i-th sample, $\hat{y}_i^{(1)}$ represents the predicted value of the first label, $y_i^{(2)}$ represents the true value of the second label of the i-th sample, and $\hat{y}_i^{(2)}$ represents the predicted value of the second label.
It is understood that the two terms on the right-hand side of the above formula represent the loss term of the first label and the loss term of the second label respectively, and together they form the loss function of the whole training process. A person skilled in the art can reasonably infer that the loss function L has a corresponding expression when the number of classification levels takes other values.
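Under the assumption of integer class labels and PyTorch, a minimal sketch of this two-level loss (not taken verbatim from the patent) could look like:

```python
# Sketch only: the total loss is the sum of one cross-entropy term per
# classification level, matching the two-term loss above.
import torch
import torch.nn.functional as F

def two_level_loss(logits1: torch.Tensor, labels1: torch.Tensor,
                   logits2: torch.Tensor, labels2: torch.Tensor) -> torch.Tensor:
    loss1 = F.cross_entropy(logits1, labels1)  # first-level label term
    loss2 = F.cross_entropy(logits2, labels2)  # second-level label term
    return loss1 + loss2
```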
On the basis of the foregoing embodiment, fig. 4 shows a feature extraction method related to a feature extraction layer included in a video classification model according to another embodiment of the present invention, which includes the specific steps of:
step S410, extracting the characteristics of the video information and the audio information of the target video by using a Nextvlad model according to the video information and the audio information of the target video;
the video classification model used in the embodiment of the invention mainly comprises a feature extraction layer, a feature fusion layer and a feature output layer in sequence. The feature extraction layer is a layer above the feature fusion layer and is used for extracting features of video information, audio information and text information of the target video.
The Nextvlad model is currently one of the better-performing video classification models, and the embodiment of the invention uses a single Nextvlad model to extract features from both the video information and the audio information.
And step S420, extracting the characteristics of the text information of the target video by using a Bert model according to the text information of the target video.
Specifically, title information and video description information contained in the text information of the target video are encoded to obtain encoded text information; and then inputting the coded text information into a Bert model to obtain the characteristics of the text information of the target video.
The Bert model is a commonly used text feature extraction model. For the title of a video and the corresponding description information, the input can be encoded in the form "[CLS] title content [SEP] description content [SEP]", which is then fed into the Bert model to obtain the text features of the target video.
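A minimal sketch using the Hugging Face transformers library is given below; the checkpoint name "bert-base-chinese" and the truncation length are assumptions, not specified by the patent.

```python
# Sketch only: encode "[CLS] title [SEP] description [SEP]" as a sentence pair
# and take the [CLS] representation as the text feature.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def text_feature(title: str, description: str):
    inputs = tokenizer(title, description, return_tensors="pt",
                       truncation=True, max_length=256)
    outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] token representation
```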
On the basis of the above embodiment, fig. 5 shows a pretraining method of a Nextvlad model according to another embodiment of the present invention, which includes the following specific steps:
step S510, extracting the characteristics of the video information and the audio information of the sample video through an inclusion model and a Vggish model respectively;
step S520, inputting the characteristics of the video information and the audio information of the sample video into a Nextvlad model for pre-training to obtain a pre-trained Nextvlad model;
the embodiment of the invention can train the model by utilizing the youtu-8m data, and the youtu-8m data contains millions of videos and 3386 labels. A pre-training model is trained by using the data, so that the trained Nextvlad model has stronger generalization capability.
Specifically, for an input video, up to 300 frames can be uniformly sampled, and the features of these frames are extracted with the Inception model, where n = 300 is the maximum number of frames. Similarly, Vggish can be used to extract the audio features.
The extracted audio features and video features are input into the Nextvlad model for training, wherein the loss function of the training process is:
$$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$
wherein k represents the number of samples, $y_i$ represents the true value of the i-th sample, and $\hat{y}_i$ represents the predicted value of the i-th sample.
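As an illustration only (assuming PyTorch; `nextvlad_model`, `inception_features` and `vggish_features` are hypothetical names standing in for a Nextvlad implementation and pre-extracted frame and audio features), one pre-training step matching the loss above could look like:

```python
# Sketch only: one pre-training step for the video/audio classification model.
import torch

def pretrain_step(nextvlad_model, optimizer,
                  inception_features, vggish_features, labels):
    logits = nextvlad_model(inception_features, vggish_features)
    probs = torch.sigmoid(logits)
    # -sum(y * log(y_hat)) averaged over the batch, matching the loss formula.
    loss = -(labels * torch.log(probs.clamp_min(1e-8))).sum() / labels.size(0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```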
After the training of the model is completed, the Nextvlad model can be used to extract the visual features and the audio features of a video.
Fig. 6 shows a video classification apparatus according to another embodiment of the present invention, which specifically includes:
the information acquisition module 610 is used for acquiring text information, video information and audio information of a target video;
the video classification module 620 is configured to input text information, video information, and audio information of the target video into a video classification model, so as to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training;
and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
According to the video classification device provided by the invention, the model prediction effect for video classification is better and more accurate and the generalization capability is stronger through the multi-mode video classification model based on the attention mechanism.
Fig. 7 shows a schematic physical structure diagram illustrating an electronic device, which may include, as shown in fig. 7: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may call logic instructions in memory 730 to perform the following method: acquiring text information, video information and audio information of a target video; inputting text information, video information and audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training; and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented by a processor to perform the method provided by the foregoing embodiments, for example, including: acquiring text information, video information and audio information of a target video; inputting text information, video information and audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training; and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A method for video classification, the method comprising:
acquiring text information, video information and audio information of a target video;
inputting text information, video information and audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training;
and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
2. The video classification method according to claim 1, characterized in that the video classification model comprises a feature fusion layer, which is specifically configured to:
receiving characteristics corresponding to text information, video information and audio information of the target video input to the characteristic fusion layer;
splicing the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain multi-modal characteristics corresponding to the target video;
according to the multi-modal characteristics corresponding to the target video, determining the attention weight corresponding to the target video;
and determining fusion characteristics corresponding to the target video according to the multi-modal characteristics corresponding to the target video and the attention weight, and using the fusion characteristics as the output of the characteristic fusion layer.
3. The video classification method according to claim 2, wherein the video classification model further comprises a classification output layer located after the feature fusion layer, the classification output layer being specifically configured to:
receiving fusion characteristics corresponding to the target video output by the characteristic fusion layer;
and determining a video classification result corresponding to the target video according to the fusion characteristics corresponding to the target video.
4. The video classification method according to claim 3, wherein the video classification result includes one or more classification labels corresponding to the target video, each classification label corresponds to a preset classification level, and the number of the classification levels is the same as the number of classification sub-layers in the classification output layer;
correspondingly, the determining a video classification result corresponding to the target video according to the fusion feature corresponding to the target video specifically includes:
and respectively inputting the fusion characteristics corresponding to the target video to each classification sublayer in the classification output layer to obtain a video classification result corresponding to the target video consisting of the classification labels output by each classification sublayer.
5. The video classification method according to claim 4, wherein when the number of classification levels is 2, the loss function corresponding to the training process of the video classification model is:
$$L = -\sum_{i=1}^{c_1} y_i^{(1)} \log \hat{y}_i^{(1)} - \sum_{i=1}^{c_2} y_i^{(2)} \log \hat{y}_i^{(2)}$$
wherein c1 and c2 represent the number of samples, $y_i^{(1)}$ represents the true value of the first label of the i-th sample, $\hat{y}_i^{(1)}$ represents the predicted value of the first label, $y_i^{(2)}$ represents the true value of the second label of the i-th sample, and $\hat{y}_i^{(2)}$ represents the predicted value of the second label.
6. The video classification method according to claim 2, wherein the splicing the features corresponding to the text information, the video information, and the audio information of the target video to obtain the multi-modal features corresponding to the target video specifically comprises:
performing one-dimensional expansion on the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain one-dimensional representation of the characteristics corresponding to the text information, the video information and the audio information of the target video;
and splicing the one-dimensional representations of the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain the multi-modal characteristics corresponding to the target video.
7. The video classification method according to claim 2, wherein the attention weight score and the fused feature are calculated by the formula:
$$f_1 = \mathrm{ReLU}(\mathrm{BN}(w_1 F + b_1))$$
$$score = \mathrm{sigmoid}(\mathrm{BN}(w_2 f_1 + b_2))$$
$$F_a = F \odot score$$
wherein BN represents the batch normalization operation, ReLU and sigmoid are the activation functions, $w_1$, $w_2$, $b_1$ and $b_2$ are trainable model parameters, $f_1$ is the intermediate feature obtained after the first activation function, and F is the multi-modal feature.
8. The video classification method according to claim 2, wherein the video classification model further comprises a feature extraction layer located before the feature fusion layer, the feature extraction layer being specifically configured to:
extracting the characteristics of the video information and the audio information of the target video by using a Nextvlad model according to the video information and the audio information of the target video;
and extracting the characteristics of the text information of the target video by using a Bert model according to the text information of the target video.
9. The video classification method according to claim 8, wherein the Nextvlad model is obtained by a pre-training step, the pre-training step specifically comprising:
extracting the characteristics of video information and audio information of the sample video through an Inception model and a Vggish model respectively;
inputting the characteristics of the video information and the audio information of the sample video into a Nextvlad model for pre-training to obtain a pre-trained Nextvlad model;
wherein the pre-training step uses the following loss function:
$$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$
wherein k represents the number of samples, $y_i$ represents the true value of the i-th sample, and $\hat{y}_i$ represents the predicted value of the i-th sample.
10. The video classification method according to claim 7, wherein the extracting, according to the text information of the target video, the feature of the text information of the target video by using a Bert model specifically includes:
coding title information and video description information contained in the text information of the target video to obtain coded text information;
and inputting the coded text information into a Bert model to obtain the characteristics of the text information of the target video.
11. An apparatus for video classification, the apparatus comprising:
the information acquisition module is used for acquiring text information, video information and audio information of the target video;
the video classification module is used for inputting the text information, the video information and the audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training;
and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
12. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-10.
13. A computer readable storage medium having stored thereon executable instructions, which when executed by a processor, cause the processor to perform the method of any one of claims 1 to 10.
CN202110645773.5A 2021-06-10 2021-06-10 Video classification method and device, electronic equipment and storage medium Pending CN113094549A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110645773.5A CN113094549A (en) 2021-06-10 2021-06-10 Video classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110645773.5A CN113094549A (en) 2021-06-10 2021-06-10 Video classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113094549A true CN113094549A (en) 2021-07-09

Family

ID=76665127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110645773.5A Pending CN113094549A (en) 2021-06-10 2021-06-10 Video classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113094549A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343936A (en) * 2021-07-15 2021-09-03 北京达佳互联信息技术有限公司 Training method and training device for video representation model
CN113822382A (en) * 2021-11-22 2021-12-21 平安科技(深圳)有限公司 Course classification method, device, equipment and medium based on multi-mode feature representation
CN114282058A (en) * 2021-08-10 2022-04-05 腾讯科技(深圳)有限公司 Method, device and equipment for model training and video theme prediction
CN114282055A (en) * 2021-08-12 2022-04-05 腾讯科技(深圳)有限公司 Video feature extraction method, device and equipment and computer storage medium
CN114443899A (en) * 2022-01-28 2022-05-06 腾讯科技(深圳)有限公司 Video classification method, device, equipment and medium
CN114926203A (en) * 2021-11-03 2022-08-19 特赞(上海)信息科技有限公司 Advertisement short video label processing method, device, equipment and storage medium
CN115660036A (en) * 2022-09-22 2023-01-31 北京百度网讯科技有限公司 Model pre-training and task processing method and device, electronic equipment and storage medium
CN116030295A (en) * 2022-10-13 2023-04-28 中电金信软件(上海)有限公司 Article identification method, apparatus, electronic device and storage medium
CN116028668A (en) * 2021-10-27 2023-04-28 腾讯科技(深圳)有限公司 Information processing method, apparatus, computer device, and storage medium
CN116701708A (en) * 2023-07-27 2023-09-05 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526785A (en) * 2017-07-31 2017-12-29 广州市香港科大霍英东研究院 File classification method and device
CN107995535A (en) * 2017-11-28 2018-05-04 百度在线网络技术(北京)有限公司 A kind of method, apparatus, equipment and computer-readable storage medium for showing video
CN110399841A (en) * 2019-07-26 2019-11-01 北京达佳互联信息技术有限公司 A kind of video classification methods, device and electronic equipment
CN111259215A (en) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 Multi-modal-based topic classification method, device, equipment and storage medium
CN111489095A (en) * 2020-04-15 2020-08-04 腾讯科技(深圳)有限公司 Risk user management method and device, computer equipment and storage medium
CN111611436A (en) * 2020-06-24 2020-09-01 腾讯科技(深圳)有限公司 Label data processing method and device and computer readable storage medium
CN111711869A (en) * 2020-06-24 2020-09-25 腾讯科技(深圳)有限公司 Label data processing method and device and computer readable storage medium
CN111753133A (en) * 2020-06-11 2020-10-09 北京小米松果电子有限公司 Video classification method, device and storage medium
CN112464857A (en) * 2020-12-07 2021-03-09 深圳市欢太科技有限公司 Video classification model training and video classification method, device, medium and equipment


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343936B (en) * 2021-07-15 2024-07-12 北京达佳互联信息技术有限公司 Training method and training device for video characterization model
CN113343936A (en) * 2021-07-15 2021-09-03 北京达佳互联信息技术有限公司 Training method and training device for video representation model
CN114282058A (en) * 2021-08-10 2022-04-05 腾讯科技(深圳)有限公司 Method, device and equipment for model training and video theme prediction
CN114282055A (en) * 2021-08-12 2022-04-05 腾讯科技(深圳)有限公司 Video feature extraction method, device and equipment and computer storage medium
CN116028668B (en) * 2021-10-27 2024-07-19 腾讯科技(深圳)有限公司 Information processing method, apparatus, computer device, and storage medium
CN116028668A (en) * 2021-10-27 2023-04-28 腾讯科技(深圳)有限公司 Information processing method, apparatus, computer device, and storage medium
CN114926203A (en) * 2021-11-03 2022-08-19 特赞(上海)信息科技有限公司 Advertisement short video label processing method, device, equipment and storage medium
CN113822382A (en) * 2021-11-22 2021-12-21 平安科技(深圳)有限公司 Course classification method, device, equipment and medium based on multi-mode feature representation
CN114443899A (en) * 2022-01-28 2022-05-06 腾讯科技(深圳)有限公司 Video classification method, device, equipment and medium
CN115660036A (en) * 2022-09-22 2023-01-31 北京百度网讯科技有限公司 Model pre-training and task processing method and device, electronic equipment and storage medium
CN115660036B (en) * 2022-09-22 2024-05-24 北京百度网讯科技有限公司 Model pre-training and task processing method and device, electronic equipment and storage medium
CN116030295A (en) * 2022-10-13 2023-04-28 中电金信软件(上海)有限公司 Article identification method, apparatus, electronic device and storage medium
CN116701708B (en) * 2023-07-27 2023-11-17 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment
CN116701708A (en) * 2023-07-27 2023-09-05 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN113094549A (en) Video classification method and device, electronic equipment and storage medium
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN110083729B (en) Image searching method and system
CN110737783A (en) method, device and computing equipment for recommending multimedia content
CN111741330A (en) Video content evaluation method and device, storage medium and computer equipment
CN112100375B (en) Text information generation method, device, storage medium and equipment
CN116415017B (en) Advertisement sensitive content auditing method and system based on artificial intelligence
CN111783712A (en) Video processing method, device, equipment and medium
CN111105013A (en) Optimization method of countermeasure network architecture, image description generation method and system
CN113688951A (en) Video data processing method and device
CN115470488A (en) Target risk website detection method, device and storage medium
CN113987274A (en) Video semantic representation method and device, electronic equipment and storage medium
CN114339450A (en) Video comment generation method, system, device and storage medium
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN117216535A (en) Training method, device, equipment and medium for recommended text generation model
CN115269781A (en) Modal association degree prediction method, device, equipment, storage medium and program product
CN113297525B (en) Webpage classification method, device, electronic equipment and storage medium
CN111767726B (en) Data processing method and device
CN113408282A (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN115129902B (en) Media data processing method, device, equipment and storage medium
US11232325B2 (en) Data analysis system, method for controlling data analysis system, and recording medium
CN113569091B (en) Video data processing method and device
CN116956915A (en) Entity recognition model training method, device, equipment, storage medium and product
CN113888216A (en) Advertisement information pushing method and device, electronic equipment and storage medium
CN116028617B (en) Information recommendation method, apparatus, device, readable storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20210709)