
CN113094549A - Video classification method and device, electronic equipment and storage medium - Google Patents

Video classification method and device, electronic equipment and storage medium

Info

Publication number
CN113094549A
Authority
CN
China
Prior art keywords
video
information
classification
target video
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110645773.5A
Other languages
Chinese (zh)
Inventor
李林科
李大海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhizhe Sihai Beijing Technology Co Ltd
Original Assignee
Zhizhe Sihai Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhizhe Sihai Beijing Technology Co Ltd filed Critical Zhizhe Sihai Beijing Technology Co Ltd
Priority to CN202110645773.5A
Publication of CN113094549A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a video classification method and apparatus, an electronic device and a storage medium. The method comprises: acquiring text information, video information and audio information of a target video; and inputting the text information, video information and audio information of the target video into a video classification model to obtain a video classification result output by the video classification model, wherein the video classification result is obtained by performing feature fusion on the features corresponding to the text information, video information and audio information of the target video according to an attention mechanism used by the video classification model. With the multi-modal, attention-based video classification model, the prediction of the model used for video classification is better and more accurate and its generalization capability is stronger.

Description

Video classification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a video classification method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Websites or platforms that host video services generally need to classify the video entities in their domain, assigning each video to categories such as science and technology or entertainment, so as to provide basic data support for downstream services such as content models, user models and item recommendation.
A common current video classification method is to classify and label videos according to the text information attached to the uploaded video, such as the video description and title. However, because video information is diverse and the title captures only a part of the video, much title information cannot effectively represent the video content.
Disclosure of Invention
In view of this, an object of an embodiment of the present invention is to provide a video classification method, an apparatus, an electronic device, and a storage medium, which specifically include:
in a first aspect, an embodiment of the present invention provides a video classification method, where the method includes:
acquiring text information, video information and audio information of a target video;
inputting text information, video information and audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training;
and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
Optionally, the video classification model includes a feature fusion layer, and the feature fusion layer is specifically configured to:
receiving characteristics corresponding to text information, video information and audio information of the target video input to the characteristic fusion layer;
splicing the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain multi-modal characteristics corresponding to the target video;
according to the multi-modal characteristics corresponding to the target video, determining the attention weight corresponding to the target video;
and determining fusion characteristics corresponding to the target video according to the multi-modal characteristics corresponding to the target video and the attention weight, and using the fusion characteristics as the output of the characteristic fusion layer.
Optionally, the video classification model further includes a classification output layer located after the feature fusion layer, and the classification output layer is specifically configured to:
receiving fusion characteristics corresponding to the target video output by the characteristic fusion layer;
and determining a video classification result corresponding to the target video according to the fusion characteristics corresponding to the target video.
Optionally, the video classification result includes one or more classification tags corresponding to the target video, each classification tag corresponds to a preset classification level, and the number of the classification levels is the same as the number of classification sub-layers in the classification output layer;
correspondingly, the determining a video classification result corresponding to the target video according to the fusion feature corresponding to the target video specifically includes:
and respectively inputting the fusion characteristics corresponding to the target video to each classification sublayer in the classification output layer to obtain a video classification result corresponding to the target video consisting of the classification labels output by each classification sublayer.
Optionally, when the number of the classification levels is 2, a loss function corresponding to a training process of the video classification model is as follows:
$$L = -\sum_{i=1}^{c_1} y_i^{(1)} \log \hat{y}_i^{(1)} - \sum_{i=1}^{c_2} y_i^{(2)} \log \hat{y}_i^{(2)}$$
wherein c1 and c2 represent the number of samples, $y_i^{(1)}$ represents the true value of the first label of the i-th sample, $\hat{y}_i^{(1)}$ represents the predicted value of the first label, $y_i^{(2)}$ represents the true value of the second label of the i-th sample, and $\hat{y}_i^{(2)}$ represents the predicted value of the second label.
Optionally, the splicing the features corresponding to the text information, the video information, and the audio information of the target video to obtain the multi-modal features corresponding to the target video specifically includes:
performing one-dimensional expansion on the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain one-dimensional representation of the characteristics corresponding to the text information, the video information and the audio information of the target video;
and splicing the one-dimensional representations of the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain the multi-modal characteristics corresponding to the target video.
Optionally, the calculation formula of the attention weight score and the fusion feature is:
$$f_1 = \mathrm{ReLU}(\mathrm{BN}(w_1 F + b_1))$$
$$score = \mathrm{sigmoid}(\mathrm{BN}(w_2 f_1 + b_2))$$
$$F_a = F \odot score$$
wherein BN represents the batch normalization operation, ReLU and sigmoid are the activation functions, $w_1$, $w_2$, $b_1$ and $b_2$ are trainable model parameters, $f_1$ is the intermediate feature obtained after the first activation function, and F is the multi-modal feature.
Optionally, the video classification model further includes a feature extraction layer located before the feature fusion layer, where the feature extraction layer is specifically configured to:
extracting the characteristics of the video information and the audio information of the target video by using a Nextvlad model according to the video information and the audio information of the target video;
and extracting the characteristics of the text information of the target video by using a Bert model according to the text information of the target video.
Optionally, the Nextvlad model is obtained by a pre-training step, where the pre-training step specifically includes:
extracting the characteristics of video information and audio information of the sample video through an Inception model and a Vggish model respectively;
inputting the characteristics of the video information and the audio information of the sample video into a Nextvlad model for pre-training to obtain a pre-trained Nextvlad model;
wherein the pre-training step uses the following loss function:
$$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$
wherein k represents the number of samples, $y_i$ represents the true value of the i-th sample, and $\hat{y}_i$ represents the predicted value of the i-th sample.
Optionally, the extracting, according to the text information of the target video, the feature of the text information of the target video by using a Bert model specifically includes:
coding title information and video description information contained in the text information of the target video to obtain coded text information;
and inputting the coded text information into a Bert model to obtain the characteristics of the text information of the target video.
In a second aspect, an embodiment of the present invention provides a video classification apparatus, where the apparatus includes:
the information acquisition module is used for acquiring text information, video information and audio information of the target video;
the video classification module is used for inputting the text information, the video information and the audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training;
and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having stored thereon executable instructions, which when executed by a processor, cause the processor to perform the method according to the first aspect.
According to the video classification method, the video classification device, the electronic equipment and the storage medium, the model prediction effect for video classification is better and more accurate and the generalization capability is stronger through the multi-mode video classification model based on the attention mechanism.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments are briefly described below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative work. The foregoing and other objects, features and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not necessarily drawn to scale, emphasis instead being placed on illustrating the subject matter of the present application.
Fig. 1 shows a schematic flow chart of a video classification method provided according to an embodiment of the present disclosure.
Fig. 2 shows a schematic flow chart of a feature fusion method provided according to an embodiment of the present disclosure.
Fig. 3 shows a flow chart of a video classification method provided according to an embodiment of the present disclosure.
Fig. 4 shows a schematic flow chart of a feature extraction method provided according to an embodiment of the present disclosure.
Fig. 5 shows a flowchart of a pretraining method of a Nextvlad model provided according to an embodiment of the present disclosure.
Fig. 6 shows a schematic structural diagram of a video classification apparatus provided according to an embodiment of the present disclosure.
Fig. 7 shows a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The singular forms "a", "an" and "the" as used herein are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, the terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Websites or platforms that host video services generally need to classify the uploaded video entities by field, assigning each video to categories such as science and technology or entertainment, so as to provide basic data support for downstream services such as content models, user models and item recommendation.
A common current video classification method is to classify and label videos according to the text information attached to the uploaded video, such as the video description and title. However, because video information is diverse and the title captures only a part of the video, much title information cannot effectively represent the video content.
In view of this, an object of the embodiments of the present disclosure is to provide a video classification method, apparatus, electronic device and storage medium, which, through a multi-modal video classification model based on an attention mechanism, make the prediction effect of the model used for video classification better and more accurate and its generalization capability stronger.
The present disclosure is described in detail below with reference to the attached drawings.
Fig. 1 shows a video classification method provided in an embodiment of the present invention, which includes the following specific steps:
step S110: and acquiring text information, video information and audio information of the target video.
The target video described in the embodiment of the present invention may be a video entity in a website or platform including a video service, such as a video uploaded by a user in a video website, a video resource of the video website itself, and the like. The text information of the target video can be the text information in the video title or the text information in the video description information. The acquisition of the text information of the target video can be realized by reading a title field or a video description information field corresponding to the target video.
The video information and the audio information of the target video are basic contents constituting a target video entity, and the video information and the audio information of the target video are acquired. For example, when video information is acquired, video frames with a specific frame number may be extracted as the video information, and the video frames may be extracted at fixed intervals or by using a key frame technique.
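As a minimal illustration (not part of the patent text), fixed-interval frame sampling could be implemented as follows; the OpenCV-based helper, the function name and the frame count of 32 are assumptions made for the example.

```python
# Sketch only: uniformly sample a fixed number of frames from a video file.
# The choice of 32 frames is an illustrative assumption.
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 32) -> np.ndarray:
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Frame indices spread at fixed intervals over the whole video.
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)
```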
Step S120: and inputting the text information, the video information and the audio information of the target video into a video classification model to obtain a video classification result output by the video classification model.
The video classification model used in the embodiment of the invention is a multi-modal video classification model, which simultaneously takes the text information, video information, audio information and other information of the target video as the input of the model. In contrast, a single-modality video classification model, for example one that simply classifies the title information based on the Bert model, or one that extracts frames from the video, extracts features with a CNN and then classifies, can achieve a good effect when the video data are uniform in form; but when the video information contains many different modalities, a single modality has difficulty extracting data features effectively enough to achieve the desired effect.
The video classification model in the embodiment of the invention is obtained by taking the text information, video information and audio information of sample videos as training samples and taking the classification results of the sample videos as labels for training. A sample video is a video serving as a training sample, and its classification result is obtained through manual labeling or from a third-party data set such as YouTube-8M.
Further, the video classification result is obtained by performing feature fusion on the features corresponding to the text information, video information and audio information of the target video according to the attention mechanism used by the video classification model. When the embodiment of the invention uses multiple modalities for video classification, it does not simply and directly integrate the output results of the multiple modalities; instead, it performs feature fusion on the information of the multiple modalities. An attention mechanism is used during feature fusion, so that the characteristics of, and the relations between, the contents of the different modalities are fully utilized, and the importance of the different modal features is learned by the network, so that the model used for video classification achieves a better and more accurate prediction effect and a stronger generalization capability.
On the basis of the foregoing embodiment, fig. 2 shows a feature fusion method related to a feature fusion layer included in a video classification model according to another embodiment of the present invention, which includes the specific steps of:
step S210: receiving characteristics corresponding to text information, video information and audio information of the target video input to the characteristic fusion layer.
The video classification model used in the embodiment of the invention mainly comprises a feature extraction layer, a feature fusion layer and a feature output layer in sequence. The feature fusion layer is an important component in the video classification model used in the embodiment of the present invention, and fuses the features of multiple modes based on an attention mechanism. The feature fusion layer first receives input information input to the layer, i.e., features corresponding to text information, video information, and audio information of the target video. It is understood that the above features may be obtained by feature extraction of text information, video information, and audio information of the target video.
Step S220: and splicing the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain the multi-modal characteristics corresponding to the target video.
The features corresponding to the text information, the video information and the audio information of the target video obtained in the previous step are extracted independently of one another; when the three features are fused, they are first spliced to obtain the corresponding multi-modal feature.
Due to the different characteristics of text information, video information and audio information, the dimensions of the extracted features are usually different. Therefore, first, the features corresponding to the text information, the video information, and the audio information of the target video need to be expanded in one dimension to obtain a one-dimensional representation of the features corresponding to the text information, the video information, and the audio information of the target video. And then, splicing the one-dimensional representations of the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain the multi-modal characteristics corresponding to the target video. It will be appreciated that the multi-modal features are also represented in a one-dimensional form.
For example, given the feature $F_{video}$ corresponding to the video information, the feature $F_{audio}$ corresponding to the audio information and the feature $F_{text}$ corresponding to the text information, $F_{video}$ and $F_{audio}$ are first expanded into one-dimensional representations and then spliced together with $F_{text}$ to obtain the multi-modal feature F.
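The following is a minimal sketch (assuming PyTorch tensors, not taken from the patent) of flattening each modality feature to one dimension and concatenating them into the multi-modal feature F; the function name is an illustrative assumption.

```python
# Sketch only: flatten each modality feature and concatenate into one
# one-dimensional multi-modal feature F.
import torch

def build_multimodal_feature(video_feat: torch.Tensor,
                             audio_feat: torch.Tensor,
                             text_feat: torch.Tensor) -> torch.Tensor:
    parts = [f.flatten() for f in (video_feat, audio_feat, text_feat)]
    return torch.cat(parts, dim=0)  # multi-modal feature F
```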
Step S230: and determining the attention weight corresponding to the target video according to the multi-modal characteristics corresponding to the target video.
The formula for calculating the attention weight score is as follows:
$$f_1 = \mathrm{ReLU}(\mathrm{BN}(w_1 F + b_1))$$
$$score = \mathrm{sigmoid}(\mathrm{BN}(w_2 f_1 + b_2))$$
wherein BN represents the batch normalization operation, ReLU and sigmoid are the activation functions, $w_1$, $w_2$, $b_1$ and $b_2$ are trainable model parameters, $f_1$ is the intermediate feature obtained after the first activation function, and F is the multi-modal feature.
In particular, the calculated attention weight score characterizes the importance of the different modal features within the multi-modal feature. After training, the model parameters $w_1$, $w_2$, $b_1$ and $b_2$ can effectively produce a robust attention weight, whose dimension is the same as that of the multi-modal feature.
Step S240: and determining fusion characteristics corresponding to the target video according to the multi-modal characteristics corresponding to the target video and the attention weight, and using the fusion characteristics as the output of the characteristic fusion layer.
The calculation formula of the fusion feature $F_a$ is:
$$F_a = F \odot score$$
after the multi-modal features and the attention weights are fused, original information in the multi-modal features is converted into weight information carrying the attention mechanism, and fused features output by a feature fusion layer are formed. The fusion characteristics are used for video classification, so that the model can effectively distinguish which information in the multi-modal characteristics is more important, and the video classification result is accurately predicted.
For example, the text information of a target video may be sparse: the title is often short or even absent, so it contains little effective content information. As another example, some target videos, such as plain-text videos, have highly redundant visual information, and the large span of video durations challenges the generalization capability of prior-art models, which cannot model videos of different durations at the same time. In the embodiment of the invention, the fused feature output by the feature fusion layer not only fuses the multi-modal features corresponding to the target video but also takes into account the importance of the different modal features, so that the model used for video classification achieves a better and more accurate prediction effect and a stronger generalization capability.
On the basis of the foregoing embodiment, fig. 3 shows a video classification method related to a classification output layer included in a video classification model according to another embodiment of the present invention, which includes the specific steps of:
step S310: and receiving the fusion characteristics corresponding to the target video output by the characteristic fusion layer.
The video classification model used in the embodiment of the invention mainly comprises a feature extraction layer, a feature fusion layer and a feature output layer in sequence. The classification output layer is the next layer of the feature fusion layer, and first needs to receive the fusion features corresponding to the target video output by the feature fusion layer as input data of the classification output layer.
Step S320: and determining a video classification result corresponding to the target video according to the fusion characteristics corresponding to the target video.
And the classification output layer is used as the last layer of the whole video classification model, and is used for determining a video classification result corresponding to the target video according to the fusion characteristics and is also used for outputting the whole video classification model.
Specifically, the video classification result includes one or more classification tags corresponding to the target video, and each classification tag corresponds to a preset classification level. For example, when the number of classification levels is 1, the target video is given a single-level classification, such as music, sports, education or science and technology. When the number of classification levels is 2, the target video is given a secondary classification: for example, under the first-level class music, the target video is further classified into pop, rock, light music and the like; under the first-level class sports, the target video is further classified into football, basketball, tennis and the like.
The number of the classification labels represents the refinement degree of the video classification method for classifying the target video, and the number of the classification labels can be preset according to actual requirements.
Further, the number of classification levels is the same as the number of classification sublayers in the classification output layer. Each classification sub-layer in the classification output layer is used for determining a certain level of classification label of the target video. For example, when the number of classification levels is 2, the classification output layer includes two classification sublayers, a first classification sublayer is used for determining a first-level classification of the target video, and a second classification sublayer is used for further determining a second-level classification of the target video.
Correspondingly, the determining a video classification result corresponding to the target video according to the fusion feature corresponding to the target video specifically includes:
and respectively inputting the fusion characteristics corresponding to the target video to each classification sublayer in the classification output layer to obtain a video classification result corresponding to the target video consisting of the classification labels output by each classification sublayer.
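A minimal sketch (assuming PyTorch, with illustrative names and class counts) of a classification output layer with one sub-layer per classification level is shown below; each sub-layer receives the same fused feature.

```python
# Sketch only: two classification sub-layers, one per classification level.
import torch
import torch.nn as nn

class ClassificationOutput(nn.Module):
    def __init__(self, feat_dim: int, num_level1: int, num_level2: int):
        super().__init__()
        self.level1 = nn.Linear(feat_dim, num_level1)  # first-level labels
        self.level2 = nn.Linear(feat_dim, num_level2)  # second-level labels

    def forward(self, fused: torch.Tensor):
        # Each sub-layer outputs the logits for one classification level.
        return self.level1(fused), self.level2(fused)
```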
The number of classification levels also has an impact on the entire training process of the video classification model. In particular, the number of classification levels will determine the form of expression of the loss function in the model training process. For example, when the number of classification levels is 2, the loss function corresponding to the training process of the video classification model is:
$$L = -\sum_{i=1}^{c_1} y_i^{(1)} \log \hat{y}_i^{(1)} - \sum_{i=1}^{c_2} y_i^{(2)} \log \hat{y}_i^{(2)}$$
wherein c1 and c2 represent the number of samples, $y_i^{(1)}$ represents the true value of the first label of the i-th sample, $\hat{y}_i^{(1)}$ represents the predicted value of the first label, $y_i^{(2)}$ represents the true value of the second label of the i-th sample, and $\hat{y}_i^{(2)}$ represents the predicted value of the second label.
It is understood that the two terms on the right-hand side of the above formula represent the loss term of the first label and the loss term of the second label respectively, and together they form the loss function of the whole training process. A person skilled in the art can reasonably infer that the loss function L has a corresponding expression when the number of classification levels takes other values.
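Under the assumption of integer class labels and PyTorch, a minimal sketch of this two-level loss (not taken verbatim from the patent) could look like:

```python
# Sketch only: the total loss is the sum of one cross-entropy term per
# classification level, matching the two-term loss above.
import torch
import torch.nn.functional as F

def two_level_loss(logits1: torch.Tensor, labels1: torch.Tensor,
                   logits2: torch.Tensor, labels2: torch.Tensor) -> torch.Tensor:
    loss1 = F.cross_entropy(logits1, labels1)  # first-level label term
    loss2 = F.cross_entropy(logits2, labels2)  # second-level label term
    return loss1 + loss2
```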
On the basis of the foregoing embodiment, fig. 4 shows a feature extraction method related to a feature extraction layer included in a video classification model according to another embodiment of the present invention, which includes the specific steps of:
step S410, extracting the characteristics of the video information and the audio information of the target video by using a Nextvlad model according to the video information and the audio information of the target video;
the video classification model used in the embodiment of the invention mainly comprises a feature extraction layer, a feature fusion layer and a feature output layer in sequence. The feature extraction layer is a layer above the feature fusion layer and is used for extracting features of video information, audio information and text information of the target video.
The Nextvlad model is currently one of the better-performing video classification models, and the embodiment of the invention uses a single Nextvlad model to extract features from both the video information and the audio information.
And step S420, extracting the characteristics of the text information of the target video by using a Bert model according to the text information of the target video.
Specifically, title information and video description information contained in the text information of the target video are encoded to obtain encoded text information; and then inputting the coded text information into a Bert model to obtain the characteristics of the text information of the target video.
The Bert model is a commonly used text feature extraction model. For the title of a video and the corresponding description information, the input can be encoded in the form "[CLS] title content [SEP] description content [SEP]", which is then fed into the Bert model to obtain the text features of the target video.
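A minimal sketch using the Hugging Face transformers library is given below; the checkpoint name "bert-base-chinese" and the truncation length are assumptions, not specified by the patent.

```python
# Sketch only: encode "[CLS] title [SEP] description [SEP]" as a sentence pair
# and take the [CLS] representation as the text feature.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def text_feature(title: str, description: str):
    inputs = tokenizer(title, description, return_tensors="pt",
                       truncation=True, max_length=256)
    outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]  # [CLS] token representation
```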
On the basis of the above embodiment, fig. 5 shows a pretraining method of a Nextvlad model according to another embodiment of the present invention, which includes the following specific steps:
step S510, extracting the characteristics of the video information and the audio information of the sample video through an inclusion model and a Vggish model respectively;
step S520, inputting the characteristics of the video information and the audio information of the sample video into a Nextvlad model for pre-training to obtain a pre-trained Nextvlad model;
the embodiment of the invention can train the model by utilizing the youtu-8m data, and the youtu-8m data contains millions of videos and 3386 labels. A pre-training model is trained by using the data, so that the trained Nextvlad model has stronger generalization capability.
Specifically, for an input video, up to 300 frames can be uniformly sampled, and the features of these frames are extracted with the Inception model, where n = 300 is the maximum number of frames. Similarly, Vggish can be used to extract the audio features.
The extracted audio features and video features are input into the Nextvlad model for training, wherein the loss function of the training process is:
$$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$
wherein k represents the number of samples, $y_i$ represents the true value of the i-th sample, and $\hat{y}_i$ represents the predicted value of the i-th sample.
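As an illustration only (assuming PyTorch; `nextvlad_model`, `inception_features` and `vggish_features` are hypothetical names standing in for a Nextvlad implementation and pre-extracted frame and audio features), one pre-training step matching the loss above could look like:

```python
# Sketch only: one pre-training step for the video/audio classification model.
import torch

def pretrain_step(nextvlad_model, optimizer,
                  inception_features, vggish_features, labels):
    logits = nextvlad_model(inception_features, vggish_features)
    probs = torch.sigmoid(logits)
    # -sum(y * log(y_hat)) averaged over the batch, matching the loss formula.
    loss = -(labels * torch.log(probs.clamp_min(1e-8))).sum() / labels.size(0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```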
After the training of the model is completed, the Nextvlad model can be used to extract the visual features and the audio features of a video.
Fig. 6 shows a video classification apparatus according to another embodiment of the present invention, which specifically includes:
the information acquisition module 610 is used for acquiring text information, video information and audio information of a target video;
the video classification module 620 is configured to input text information, video information, and audio information of the target video into a video classification model, so as to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training;
and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
According to the video classification device provided by the invention, the model prediction effect for video classification is better and more accurate and the generalization capability is stronger through the multi-mode video classification model based on the attention mechanism.
Fig. 7 shows a schematic physical structure diagram illustrating an electronic device, which may include, as shown in fig. 7: a processor (processor)710, a communication Interface (Communications Interface)720, a memory (memory)730, and a communication bus 740, wherein the processor 710, the communication Interface 720, and the memory 730 communicate with each other via the communication bus 740. Processor 710 may call logic instructions in memory 730 to perform the following method: acquiring text information, video information and audio information of a target video; inputting text information, video information and audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training; and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
In addition, the logic instructions in the memory 730 can be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented by a processor to perform the method provided by the foregoing embodiments, for example, including: acquiring text information, video information and audio information of a target video; inputting text information, video information and audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training; and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. A method for video classification, the method comprising:
acquiring text information, video information and audio information of a target video;
inputting text information, video information and audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training;
and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
2. The video classification method according to claim 1, characterized in that the video classification model comprises a feature fusion layer, which is specifically configured to:
receiving characteristics corresponding to text information, video information and audio information of the target video input to the characteristic fusion layer;
splicing the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain multi-modal characteristics corresponding to the target video;
according to the multi-modal characteristics corresponding to the target video, determining the attention weight corresponding to the target video;
and determining fusion characteristics corresponding to the target video according to the multi-modal characteristics corresponding to the target video and the attention weight, and using the fusion characteristics as the output of the characteristic fusion layer.
3. The video classification method according to claim 2, wherein the video classification model further comprises a classification output layer located after the feature fusion layer, the classification output layer being specifically configured to:
receiving fusion characteristics corresponding to the target video output by the characteristic fusion layer;
and determining a video classification result corresponding to the target video according to the fusion characteristics corresponding to the target video.
4. The video classification method according to claim 3, wherein the video classification result includes one or more classification labels corresponding to the target video, each classification label corresponds to a preset classification level, and the number of the classification levels is the same as the number of classification sub-layers in the classification output layer;
correspondingly, the determining a video classification result corresponding to the target video according to the fusion feature corresponding to the target video specifically includes:
and respectively inputting the fusion characteristics corresponding to the target video to each classification sublayer in the classification output layer to obtain a video classification result corresponding to the target video consisting of the classification labels output by each classification sublayer.
5. The video classification method according to claim 4, wherein when the number of classification levels is 2, the loss function corresponding to the training process of the video classification model is:
$$L = -\sum_{i=1}^{c_1} y_i^{(1)} \log \hat{y}_i^{(1)} - \sum_{i=1}^{c_2} y_i^{(2)} \log \hat{y}_i^{(2)}$$
wherein c1 and c2 represent the number of samples, $y_i^{(1)}$ represents the true value of the first label of the i-th sample, $\hat{y}_i^{(1)}$ represents the predicted value of the first label, $y_i^{(2)}$ represents the true value of the second label of the i-th sample, and $\hat{y}_i^{(2)}$ represents the predicted value of the second label.
6. The video classification method according to claim 2, wherein the splicing the features corresponding to the text information, the video information, and the audio information of the target video to obtain the multi-modal features corresponding to the target video specifically comprises:
performing one-dimensional expansion on the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain one-dimensional representation of the characteristics corresponding to the text information, the video information and the audio information of the target video;
and splicing the one-dimensional representations of the characteristics corresponding to the text information, the video information and the audio information of the target video to obtain the multi-modal characteristics corresponding to the target video.
7. The video classification method according to claim 2, wherein the attention weight score and the fused feature are calculated by the formula:
$$f_1 = \mathrm{ReLU}(\mathrm{BN}(w_1 F + b_1))$$
$$score = \mathrm{sigmoid}(\mathrm{BN}(w_2 f_1 + b_2))$$
$$F_a = F \odot score$$
wherein BN represents the batch normalization operation, ReLU and sigmoid are the activation functions, $w_1$, $w_2$, $b_1$ and $b_2$ are trainable model parameters, $f_1$ is the intermediate feature obtained after the first activation function, and F is the multi-modal feature.
8. The video classification method according to claim 2, wherein the video classification model further comprises a feature extraction layer located before the feature fusion layer, the feature extraction layer being specifically configured to:
extracting the characteristics of the video information and the audio information of the target video by using a Nextvlad model according to the video information and the audio information of the target video;
and extracting the characteristics of the text information of the target video by using a Bert model according to the text information of the target video.
9. The video classification method according to claim 8, wherein the Nextvlad model is obtained by a pre-training step, the pre-training step specifically comprising:
extracting the characteristics of video information and audio information of the sample video through an Inception model and a Vggish model respectively;
inputting the characteristics of the video information and the audio information of the sample video into a Nextvlad model for pre-training to obtain a pre-trained Nextvlad model;
wherein the pre-training step uses the following loss function:
$$L = -\sum_{i=1}^{k} y_i \log \hat{y}_i$$
wherein k represents the number of samples, $y_i$ represents the true value of the i-th sample, and $\hat{y}_i$ represents the predicted value of the i-th sample.
10. The video classification method according to claim 7, wherein the extracting, according to the text information of the target video, the feature of the text information of the target video by using a Bert model specifically includes:
coding title information and video description information contained in the text information of the target video to obtain coded text information;
and inputting the coded text information into a Bert model to obtain the characteristics of the text information of the target video.
11. An apparatus for video classification, the apparatus comprising:
the information acquisition module is used for acquiring text information, video information and audio information of the target video;
the video classification module is used for inputting the text information, the video information and the audio information of the target video into a video classification model to obtain a video classification result output by the video classification model; the video classification model is obtained by taking text information, video information and audio information of a sample video as training samples and taking a classification result of the sample video as a label for training;
and the video classification result is obtained by performing feature fusion on the characteristics corresponding to the text information, the video information and the audio information of the target video according to the attention mechanism used by the video classification model.
12. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-10.
13. A computer readable storage medium having stored thereon executable instructions, which when executed by a processor, cause the processor to perform the method of any one of claims 1 to 10.
CN202110645773.5A 2021-06-10 2021-06-10 Video classification method and device, electronic equipment and storage medium Pending CN113094549A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110645773.5A CN113094549A (en) 2021-06-10 2021-06-10 Video classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110645773.5A CN113094549A (en) 2021-06-10 2021-06-10 Video classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113094549A true CN113094549A (en) 2021-07-09

Family

ID=76665127

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110645773.5A Pending CN113094549A (en) 2021-06-10 2021-06-10 Video classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113094549A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343936A (en) * 2021-07-15 2021-09-03 北京达佳互联信息技术有限公司 Training method and training device for video representation model
CN113822382A (en) * 2021-11-22 2021-12-21 平安科技(深圳)有限公司 Course classification method, device, equipment and medium based on multi-mode feature representation
CN114282058A (en) * 2021-08-10 2022-04-05 腾讯科技(深圳)有限公司 Method, device and equipment for model training and video theme prediction
CN114282055A (en) * 2021-08-12 2022-04-05 腾讯科技(深圳)有限公司 Video feature extraction method, device and equipment and computer storage medium
CN114443899A (en) * 2022-01-28 2022-05-06 腾讯科技(深圳)有限公司 Video classification method, device, equipment and medium
CN114926203A (en) * 2021-11-03 2022-08-19 特赞(上海)信息科技有限公司 Advertisement short video label processing method, device, equipment and storage medium
CN115660036A (en) * 2022-09-22 2023-01-31 北京百度网讯科技有限公司 Model pre-training and task processing method and device, electronic equipment and storage medium
CN116030295A (en) * 2022-10-13 2023-04-28 中电金信软件(上海)有限公司 Article identification method, apparatus, electronic device and storage medium
CN116028668A (en) * 2021-10-27 2023-04-28 腾讯科技(深圳)有限公司 Information processing method, apparatus, computer device, and storage medium
CN116701708A (en) * 2023-07-27 2023-09-05 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526785A (en) * 2017-07-31 2017-12-29 广州市香港科大霍英东研究院 File classification method and device
CN107995535A (en) * 2017-11-28 2018-05-04 百度在线网络技术(北京)有限公司 A kind of method, apparatus, equipment and computer-readable storage medium for showing video
CN110399841A (en) * 2019-07-26 2019-11-01 北京达佳互联信息技术有限公司 A kind of video classification methods, device and electronic equipment
CN111259215A (en) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 Multi-modal-based topic classification method, device, equipment and storage medium
CN111489095A (en) * 2020-04-15 2020-08-04 腾讯科技(深圳)有限公司 Risk user management method and device, computer equipment and storage medium
CN111611436A (en) * 2020-06-24 2020-09-01 腾讯科技(深圳)有限公司 Label data processing method and device and computer readable storage medium
CN111711869A (en) * 2020-06-24 2020-09-25 腾讯科技(深圳)有限公司 Label data processing method and device and computer readable storage medium
CN111753133A (en) * 2020-06-11 2020-10-09 北京小米松果电子有限公司 Video classification method, device and storage medium
CN112464857A (en) * 2020-12-07 2021-03-09 深圳市欢太科技有限公司 Video classification model training and video classification method, device, medium and equipment


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343936B (en) * 2021-07-15 2024-07-12 北京达佳互联信息技术有限公司 Training method and training device for video characterization model
CN113343936A (en) * 2021-07-15 2021-09-03 北京达佳互联信息技术有限公司 Training method and training device for video representation model
CN114282058A (en) * 2021-08-10 2022-04-05 腾讯科技(深圳)有限公司 Method, device and equipment for model training and video theme prediction
CN114282055A (en) * 2021-08-12 2022-04-05 腾讯科技(深圳)有限公司 Video feature extraction method, device and equipment and computer storage medium
CN116028668B (en) * 2021-10-27 2024-07-19 腾讯科技(深圳)有限公司 Information processing method, apparatus, computer device, and storage medium
CN116028668A (en) * 2021-10-27 2023-04-28 腾讯科技(深圳)有限公司 Information processing method, apparatus, computer device, and storage medium
CN114926203A (en) * 2021-11-03 2022-08-19 特赞(上海)信息科技有限公司 Advertisement short video label processing method, device, equipment and storage medium
CN113822382A (en) * 2021-11-22 2021-12-21 平安科技(深圳)有限公司 Course classification method, device, equipment and medium based on multi-mode feature representation
CN114443899A (en) * 2022-01-28 2022-05-06 腾讯科技(深圳)有限公司 Video classification method, device, equipment and medium
CN115660036A (en) * 2022-09-22 2023-01-31 北京百度网讯科技有限公司 Model pre-training and task processing method and device, electronic equipment and storage medium
CN115660036B (en) * 2022-09-22 2024-05-24 北京百度网讯科技有限公司 Model pre-training and task processing method and device, electronic equipment and storage medium
CN116030295A (en) * 2022-10-13 2023-04-28 中电金信软件(上海)有限公司 Article identification method, apparatus, electronic device and storage medium
CN116701708B (en) * 2023-07-27 2023-11-17 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment
CN116701708A (en) * 2023-07-27 2023-09-05 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN113094549A (en) Video classification method and device, electronic equipment and storage medium
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN110083729B (en) Image searching method and system
CN110737783A (en) method, device and computing equipment for recommending multimedia content
CN111741330A (en) Video content evaluation method and device, storage medium and computer equipment
CN112100375B (en) Text information generation method, device, storage medium and equipment
CN116415017B (en) Advertisement sensitive content auditing method and system based on artificial intelligence
CN111783712A (en) Video processing method, device, equipment and medium
CN111105013A (en) Optimization method of countermeasure network architecture, image description generation method and system
CN113688951A (en) Video data processing method and device
CN115470488A (en) Target risk website detection method, device and storage medium
CN113987274A (en) Video semantic representation method and device, electronic equipment and storage medium
CN114339450A (en) Video comment generation method, system, device and storage medium
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN117216535A (en) Training method, device, equipment and medium for recommended text generation model
CN115269781A (en) Modal association degree prediction method, device, equipment, storage medium and program product
CN113297525B (en) Webpage classification method, device, electronic equipment and storage medium
CN111767726B (en) Data processing method and device
CN113408282A (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN115129902B (en) Media data processing method, device, equipment and storage medium
US11232325B2 (en) Data analysis system, method for controlling data analysis system, and recording medium
CN113569091B (en) Video data processing method and device
CN116956915A (en) Entity recognition model training method, device, equipment, storage medium and product
CN113888216A (en) Advertisement information pushing method and device, electronic equipment and storage medium
CN116028617B (en) Information recommendation method, apparatus, device, readable storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20210709)