
CN113627536B - Model training, video classification method, device, equipment and storage medium - Google Patents

Model training, video classification method, device, equipment and storage medium

Info

Publication number
CN113627536B
CN113627536B (application CN202110924849.8A)
Authority
CN
China
Prior art keywords: video, frame, salient, classification, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110924849.8A
Other languages
Chinese (zh)
Other versions
CN113627536A
Inventor
吴文灏
夏博洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110924849.8A priority Critical patent/CN113627536B/en
Publication of CN113627536A publication Critical patent/CN113627536A/en
Application granted granted Critical
Publication of CN113627536B publication Critical patent/CN113627536B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides model training and video classification methods, devices, equipment and storage media, relates to the field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, and can be used in smart city and intelligent traffic scenarios. The specific implementation scheme is as follows: determining a video classification result of each video frame in a sample video by using a pre-trained video classification model; determining salient frames in the sample video according to the video classification result of each video frame and the annotation classification result of each video frame in the sample video; and training a salient frame determination model based on the salient frames. This implementation can select salient frames from videos and improve the accuracy of subsequent video classification.

Description

Model training, video classification method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the field of computer vision and deep learning technologies, and more particularly to a method, apparatus, device, and storage medium for model training and video classification, which are particularly applicable in smart cities and intelligent traffic scenarios.
Background
Efficient video recognition refers to assigning a category to given video content while balancing accuracy, speed, and computational cost. As one of the more active research topics in the computer vision community, efficient video recognition is widely used in video surveillance, video recommendation, retrieval, and other scenarios. Although video recognition has made great progress in recognition accuracy, reducing the computational cost while maintaining accuracy remains a non-trivial task.
Disclosure of Invention
The present disclosure provides a model training method, a video classification method, a device, equipment and a storage medium.
According to a first aspect, there is provided a model training method comprising: determining a video classification result of each video frame in the sample video by utilizing a pre-trained video classification model; determining a salient frame in the sample video according to the video classification result of each video frame and the annotation classification result of each video frame in the sample video; based on the salient frames, a salient frame determination model is trained.
According to a second aspect, there is provided a video classification method comprising: acquiring a target video; determining a salient frame of the target video by using a salient frame determination model trained by the model training method as described in the first aspect; and determining the classification result of the target video according to the salient frames of the target video and the pre-trained video classification model.
According to a third aspect, there is provided a model training apparatus comprising: a sample video classification unit configured to determine a video classification result of each video frame in a sample video by using a pre-trained video classification model; a first salient frame determining unit configured to determine salient frames in the sample video according to the video classification results of the video frames and the annotation classification results of the video frames in the sample video; and a model training unit configured to train a salient frame determination model based on the salient frames.
According to a fourth aspect, there is provided a video classification apparatus comprising: a target video acquisition unit configured to acquire a target video; a second salient frame determination unit configured to determine salient frames of the target video using the salient frame determination model trained by the model training method as described in the first aspect; and the video classification unit is configured to determine the classification result of the target video according to the salient frames of the target video and the pre-trained video classification model.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect or the method as described in the second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described in the first aspect or the method as described in the second aspect.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described in the first aspect or the method as described in the second aspect.
According to the technology of the present disclosure, salient frames can be selected from a video, improving the accuracy of subsequent video classification.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a model training method according to the present disclosure;
FIG. 3 is a flow chart of another embodiment of a model training method according to the present disclosure;
FIG. 4 is a flow chart of yet another embodiment of a model training method according to the present disclosure;
FIG. 5 is a flow chart of one embodiment of a video classification method according to the present disclosure;
FIG. 6 is a schematic diagram of one application scenario of the model training method, video classification method according to the present disclosure;
FIG. 7 is a schematic diagram of the structure of one embodiment of a model training apparatus according to the present disclosure;
FIG. 8 is a schematic diagram of a structure of one embodiment of a video classification apparatus according to the disclosure;
fig. 9 is a block diagram of an electronic device for implementing the model training method, video classification method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates an exemplary system architecture 100 to which the model training methods, video classification methods, or embodiments for model training devices, video classification devices of the present disclosure may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a video play class application, a video classification class application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smartphones, tablets, car-mounted computers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, for example a background server providing model support for the applications on the terminal devices 101, 102, 103. The background server may train a model with training samples to obtain a target model, and feed the target model back to the terminal devices 101, 102, 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. When server 105 is software, it may be implemented as a plurality of software or software modules (e.g., to provide distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that, the model training method provided in the embodiment of the present disclosure is generally performed by the server 105, and the video classification method may be performed by the terminal devices 101, 102, 103, or may be performed by the server 105. Accordingly, the model training apparatus is generally disposed in the server 105, and the video classification apparatus may be disposed in the terminal devices 101, 102, 103, or may be disposed in the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a model training method according to the present disclosure is shown. The model training method of the embodiment comprises the following steps:
step 201, determining a video classification result of each video frame in the sample video by using a pre-trained video classification model.
In this embodiment, the execution subject of the model training method (e.g., the server 105 shown in fig. 1) may first acquire a sample video. The sample video may include a plurality of video frames. The execution subject may input each video frame into a pre-trained video classification model to obtain a video classification result for each video frame. Here, the video classification model is used to characterize the correspondence between video frames and video classification results; it may be any of various deep learning models, such as a convolutional neural network. The video classification result may include classifications such as running, jumping, and so on. The video classification model may output confidence levels corresponding to the different classifications to obtain a confidence vector. In particular, the execution subject may employ different identifiers to represent different classifications.
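For illustration only, a minimal sketch of this per-frame classification step follows; it is not part of the original disclosure, and the PyTorch framing, the helper name classify_frames, and the tensor shapes are assumptions.

    import torch
    import torch.nn.functional as F

    def classify_frames(video_classifier: torch.nn.Module, frames: torch.Tensor) -> torch.Tensor:
        """Return one confidence vector per video frame.

        frames: tensor of shape (num_frames, 3, H, W) holding the sampled video frames.
        The returned tensor has shape (num_frames, num_classes); each row is the
        softmax confidence vector of one frame.
        """
        video_classifier.eval()
        with torch.no_grad():
            logits = video_classifier(frames)          # (num_frames, num_classes)
            confidences = F.softmax(logits, dim=-1)    # one confidence vector per frame
        return confidences

Each row of the returned tensor is the confidence vector mentioned above; the identifier of the highest-confidence classification can be read off with an argmax.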
Step 202, determining a salient frame in the sample video according to the video classification result of each video frame and the labeling classification result of each video frame in the sample video.
After the executing body obtains the video classification result of each video frame, it can combine these results with the labeling classification result of each video frame to determine the salient frames in the sample video. Here, a salient frame may be a video frame containing salient feature information. Specifically, for each video frame, the executing body may compare the output result of the video classification model with the labeling classification result, and if the two classifications are consistent, consider the video frame a salient frame. Alternatively, the execution subject may first take, as the classification of each video frame, the classification corresponding to the greatest confidence in the confidence vector of that video frame, then select the video frames whose confidence is greater than a preset threshold, and finally compare the classification of the selected video frames with the labeling classification results; the video frames whose classification is consistent with the labeling classification result are considered salient frames. In this way, the accuracy of the salient frames can be further improved.
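A hedged sketch of this selection rule, following the optional confidence-threshold variant described above, might look as follows; the function name, the threshold value, and the tensor layout are assumptions.

    import torch

    def select_salient_frames(confidences: torch.Tensor,
                              annotated_labels: torch.Tensor,
                              threshold: float = 0.5) -> torch.Tensor:
        """Return a boolean mask marking the salient frames.

        confidences: (num_frames, num_classes) softmax outputs of the video classification model.
        annotated_labels: (num_frames,) annotated classification identifiers.
        A frame is salient when its top prediction matches the annotation and its
        top confidence exceeds the preset threshold.
        """
        top_conf, predicted = confidences.max(dim=-1)
        return (predicted == annotated_labels) & (top_conf > threshold)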
Step 203, training a salient frame determination model based on the salient frames.
In this embodiment, after obtaining the salient frames, the execution subject may train the salient frame determination model based on them. Specifically, the execution subject may take each video frame of the sample video as input to the salient frame determination model, take the salient frames as the desired output, and train the salient frame determination model. During training, the parameters of the salient frame determination model may be iteratively adjusted until a training termination condition is met. The training termination conditions may include, but are not limited to: the loss function converges, or the number of iterations reaches a preset threshold. It will be appreciated that the smaller the loss function value, the better the performance of the trained salient frame determination model.
According to the model training method provided by the embodiment of the disclosure, the model can be trained by using the salient frames, so that the trained model can determine the salient frames from the video, and the accuracy of subsequent video classification is improved.
In some alternative implementations of the present embodiment, the foregoing step 203 may specifically implement training of the salient frame determination model through the following steps not shown in fig. 2: determining a target loss function based on the salient frames and labeling classification results of the salient frames; the salient frame determination model is trained according to the objective loss function.
In this embodiment, the execution body may input each video frame into the salient frame determination model, and determine the target loss function according to the output of the salient frame determination model, the salient frames, and the labeling classification results of the salient frames. The salient frame determination model is then trained based on the objective loss function. Specifically, the execution body may continuously adjust the parameters of the salient frame determination model until the objective loss function value converges.
With continued reference to fig. 3, a flow 300 of another embodiment of a model training method according to the present disclosure is shown. As shown in fig. 3, the method of the present embodiment may include the steps of:
Step 301, determining a video classification result of each video frame in the sample video by using a pre-trained video classification model.
Step 302, for each video frame, determining that the video frame is a salient frame in response to determining that the video classification result of the video frame is consistent with the annotation classification result of the video frame.
In this embodiment, the execution subject may use the classification with the highest confidence in the video classification result output by the video classification model as the classification result of the video frame, and compare this video classification result with the labeling classification result. If the two are identical, the executing body may consider the video frame a salient frame.
Step 303, updating the labeling classification identification of each video frame in the sample video according to the salient frames.
In this embodiment, after determining the salient frames, the execution subject may update the labeling classification identifiers of the video frames in the sample video. Here, in order to highlight the salient frames, the labeling classification identifiers of the non-salient frames may all be set to the same preset value (e.g., 0), so that the annotation classification of non-salient frames is understood as background. In order to distinguish the labeling classification identifiers of the salient frames from the identifier corresponding to the background, the labeling classification identifiers of the salient frames may be updated at the same time (for example, each labeling classification identifier is increased by a preset value). If, instead, the execution subject sets the labeling classification identifier of the non-salient frames to a value that does not collide with any existing labeling classification identifier, the labeling classification identifiers of the salient frames need not be updated. For example, if the annotation classification identifiers range from 0 to 99, the execution subject may set the identifier of each non-salient frame to 0 and treat the classification corresponding to 0 as background. To distinguish the classification originally labeled 0 from the background, the labeling classification identifier of each salient frame can be increased by 1, so the identifiers are updated from 0-99 to 0-100. However, if the execution subject directly sets the labeling classification identifier of the non-salient frames to 100, the labeling classification identifiers of the salient frames do not need to be updated.
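The 0-99 example above can be sketched as the following re-mapping; this is illustrative only, and the function name and tensor types are assumptions.

    import torch

    def update_annotation_ids(labels: torch.Tensor, salient_mask: torch.Tensor) -> torch.Tensor:
        """Re-map per-frame annotation identifiers as in the 0-99 example above.

        Non-salient frames are mapped to a background identifier of 0; salient frames
        keep their original identifier shifted up by 1 so that the original class 0
        is not confused with the background class.
        """
        updated = labels + 1            # original identifiers 0..99 become 1..100
        updated[~salient_mask] = 0      # non-salient frames become "background"
        return updated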
In some alternative implementations of the present embodiment, the execution entity may update the annotation classification identifiers by the following steps, not shown in fig. 3: setting the labeling classification identifiers of the non-salient frames in the sample video to a preset value; and in response to determining that the preset value duplicates one of the labeling classification identifiers used before updating, updating the labeling classification identifiers of part or all of the salient frames so as to avoid the duplication.
In this implementation manner, the execution body may set the labeling classification identifiers of the non-salient frames to a preset value. If the preset value duplicates one of the original labeling classification identifiers, the labeling classification identifiers of the salient frames can be updated again; if it does not, the labeling classification identifiers of the salient frames need not be updated. When updating the labeling classification identifiers of the salient frames, either only part of them or all of them may be updated. For example, if the labeling classification identifiers range from 0 to 99 and the execution subject sets the labeling classification identifier of the non-salient frames to 50, the execution subject may update identifiers 50 through 99 to 51 through 100. If the execution subject sets the annotation classification identifier of the non-salient frames to 100, the annotation classification identifiers of the salient frames do not need to be updated.
Step 304, each video frame in the sample video is input into the salient frame determination model, and the prediction classification identifier of each video frame is determined according to the output of the salient frame determination model.
The execution body may input each video frame in the sample video into the salient frame determination model to obtain a prediction result for each video frame, where the prediction result may represent a non-salient frame by a first preset identifier and a salient frame by a second preset identifier.
Step 305, determining the target loss function according to the prediction classification identifier and the updated labeling classification identifier.
The execution subject can determine the target loss function according to the prediction classification identifiers and the updated labeling classification identifiers. In particular, the execution body may calculate a cross-entropy loss between the prediction classification identifier and the updated labeling classification identifier of each video frame, and then sum the cross-entropy losses of all video frames to obtain the target loss function.
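A minimal sketch of this summed per-frame cross entropy, assuming the salient frame determination model emits one logit vector per frame over the updated identifier space (including the background identifier), is:

    import torch
    import torch.nn.functional as F

    def frame_level_loss(frame_logits: torch.Tensor, updated_labels: torch.Tensor) -> torch.Tensor:
        """Sum of per-frame cross-entropy losses against the updated identifiers.

        frame_logits: (num_frames, num_classes + 1) outputs of the salient frame
        determination model, where the extra class is the background identifier.
        """
        per_frame = F.cross_entropy(frame_logits, updated_labels, reduction="none")
        return per_frame.sum()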
Step 306, training a salient frame determination model according to the objective loss function.
Finally, the execution body may train the salient frame determination model according to the objective loss function. In particular, the execution body may iteratively adjust the parameters of the salient frame determination model so that the value of the objective loss function converges. When the objective loss function value is smaller than a preset threshold, training can be stopped, yielding a trained salient frame determination model.
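Putting the pieces together, a simple training loop of the kind described above might be sketched as follows; it reuses the frame_level_loss sketch from the previous snippet, and the optimizer choice, learning rate, and stopping threshold are assumptions.

    import torch

    def train_salient_frame_model(model, frames, updated_labels,
                                  loss_threshold=1e-3, max_iters=1000, lr=1e-4):
        """Iteratively adjust the model parameters until the target loss is small enough."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(max_iters):
            optimizer.zero_grad()
            logits = model(frames)                           # one logit vector per frame
            loss = frame_level_loss(logits, updated_labels)
            loss.backward()
            optimizer.step()
            if loss.item() < loss_threshold:                 # stop once the loss falls below the threshold
                break
        return model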
With the model training method provided by this embodiment of the disclosure, the salient frame determination model focuses on salient frames during training rather than on the classification of non-salient frames, which improves the accuracy with which the salient frame determination model screens out salient frames.
With continued reference to fig. 4, a flow 400 of yet another embodiment of a model training method according to the present disclosure is shown. As shown in fig. 4, the method of the present embodiment may include the steps of:
step 401, determining a video classification result of each video frame in the sample video by using a pre-trained video classification model.
Step 402, for each video frame, determining that the video frame is a salient frame in response to determining that the video classification result of the video frame is consistent with the annotation classification result of the video frame.
Step 403, determining an aliasing frame according to each video frame in the sample video.
In this embodiment, the execution body may alias a salient frame with a non-salient frame in the sample video, alias a salient frame with another salient frame, or alias a non-salient frame with another non-salient frame. Here, aliasing refers to adding the pixel values of two video frames or adding their feature values. The weights of the two video frames may be set at the time of aliasing.
In some optional implementations of the present embodiment, the executing body may determine the aliased frame by:
step 4031, for each video frame in the sample video, selecting a video frame from the sample video and aliasing the video frame to obtain an aliased frame.
In this implementation manner, for each video frame in the sample video, the execution body may select one video frame from the sample video and alias it with that video frame to obtain an aliased frame. In a particular application, the execution body may make a copy of the sample video, shuffle the video frames in the copy, and then alias the video frames before shuffling with the shuffled video frames in one-to-one correspondence to obtain a plurality of aliased frames. In this way, aliasing can be performed in the feature space, so that more of the original information of the video frames is retained.
Step 404, determining an aliasing classification result corresponding to the aliasing frame according to the labeling classification result of the aliasing video frame.
The execution body may further determine the aliasing classification result corresponding to an aliased frame according to the labeling classification results of the two video frames from which the aliased frame was obtained. Specifically, the execution body may add the labeling classification results of the two video frames by weight. For example, when the feature map of video frame A is aliased with the feature map of video frame B, the feature map of the aliased frame C = λ × (feature map of video frame A) + (1 - λ) × (feature map of video frame B), where λ lies in the range (0, 1). Correspondingly, the annotation classification identifier of the aliased frame C = λ × (annotation classification identifier of video frame A) + (1 - λ) × (annotation classification identifier of video frame B).
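The weighted combination above can be sketched as a mixup-style operation over a shuffled copy of the frames; the one-hot label representation, the fixed λ value, and the function name are assumptions made for illustration.

    import torch

    def alias_frames(features: torch.Tensor, labels_onehot: torch.Tensor, lam: float = 0.7):
        """Alias each frame with a randomly chosen frame from the same sample video.

        features: (num_frames, ...) per-frame feature maps (or pixel values).
        labels_onehot: (num_frames, num_classes) one-hot form of the (updated) identifiers.
        Implements C = lam * A + (1 - lam) * B for both features and labels,
        with 0 < lam < 1, by pairing the frames with a shuffled copy of themselves.
        """
        perm = torch.randperm(features.size(0))
        aliased_features = lam * features + (1.0 - lam) * features[perm]
        aliased_labels = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
        return aliased_features, aliased_labels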
In some optional implementations of this embodiment, the execution body may determine the aliasing classification result corresponding to the aliasing frame by:
step 4041, updating the label classification identifier corresponding to the label classification result of each video frame.
Step 4042, determining an aliasing classification result corresponding to the aliasing frame according to the updated labeling classification identifier of the video frame of the obtained aliasing frame.
In this implementation manner, the labeling classification identifiers may be updated in the same way as described in step 303, which is not repeated here. The execution body can determine the aliasing classification result corresponding to the aliased frame according to the updated labeling classification identifiers of the two video frames participating in the aliasing. Considering that the two aliased video frames may both be salient frames, each with its own labeling classification identifier, the classification of the resulting aliased frame should differ from the labeling classification identifier of either salient frame, so the labeling classification identifiers are aliased as well. The labeling classification identifier may also be called a label, so this implementation performs aliasing in the label space.
Step 405, determining a target loss function according to the salient frame, the labeling classification result of the salient frame, the aliasing frame and the aliasing classification result corresponding to the aliasing frame.
After determining the aliased frames and their corresponding aliasing classification results, the execution body may combine them with the salient frames and the labeling classification results of the salient frames to determine the target loss function. Specifically, the execution body may generate a first loss function according to the salient frames and their labeling classification results, generate a second loss function according to the aliased frames and their corresponding aliasing classification results, and then determine the target loss function from the first loss function and the second loss function. The manner of generating the first loss function may be the same as in step 305 and is not repeated here; the principle of generating the second loss function is similar.
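A hedged sketch of combining the two terms, assuming a hard-label cross entropy for the salient frames, a soft-label cross entropy for the aliased frames, and an assumed weighting factor alpha between them, is:

    import torch.nn.functional as F

    def target_loss(salient_logits, salient_labels, aliased_logits, aliased_soft_labels, alpha=1.0):
        """Combine the salient-frame loss and the aliased-frame loss."""
        first_loss = F.cross_entropy(salient_logits, salient_labels)            # salient frames, hard labels
        log_probs = F.log_softmax(aliased_logits, dim=-1)
        second_loss = -(aliased_soft_labels * log_probs).sum(dim=-1).mean()     # aliased frames, soft labels
        return first_loss + alpha * second_loss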
In some alternative implementations of the present embodiment, the execution body may select only a portion of the aliased frames from the plurality of aliased frames when calculating the target loss function. Since the aliased frames are additionally added information, only a portion of them are used, so that excessive aliased-frame information does not interfere with the original information of the sample video.
Step 406, training a salient frame determination model according to the objective loss function.
According to the model training method provided by the embodiment of the disclosure, the significant frames and the aliasing frames can be utilized to train the significant frame determination model, so that the overfitting of the model can be avoided, and the generalization of the model is improved.
Referring to fig. 5, a flow 500 of one embodiment of a video classification method according to the present disclosure is shown. As shown in fig. 5, the method of the present embodiment may include the steps of:
step 501, a target video is acquired.
In this embodiment, the execution subject may first acquire the target video. Here, the target video is a video to be processed.
Step 502, determining a salient frame of the target video by using a salient frame determination model obtained through training by a model training method.
In this embodiment, the executing body may input each video frame of the target video into the salient frame determination model obtained by the model training method described in any of the embodiments corresponding to fig. 2 to fig. 4, to determine the salient frames of the target video.
Step 503, determining the classification result of the target video according to the salient frames of the target video and the pre-trained video classification model.
After obtaining the salient frames, the execution subject can input the salient frames into a pre-trained video classification model, so that a classification result of the target video can be obtained. The classification results may include classifications for each video frame, which may include running, jumping, and so on, for example.
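For illustration, inference on a target video might be sketched as follows; treating identifier 0 as background and averaging per-frame confidences into a video-level result are choices made for this sketch, not requirements of the disclosure.

    import torch
    import torch.nn.functional as F

    def classify_video(salient_frame_model, video_classifier, frames: torch.Tensor) -> int:
        """Classify a target video using only its salient frames.

        frames: (num_frames, 3, H, W). The salient frame determination model is assumed
        to output one logit vector per frame whose argmax separates salient frames
        from the background class 0.
        """
        with torch.no_grad():
            frame_scores = salient_frame_model(frames)                # (num_frames, num_classes + 1)
            salient_mask = frame_scores.argmax(dim=-1) != 0           # identifier 0 assumed to be background
            salient_frames = frames[salient_mask] if salient_mask.any() else frames
            logits = video_classifier(salient_frames)                 # classify only the salient frames
            confidence = F.softmax(logits, dim=-1).mean(dim=0)        # average over the salient frames
        return int(confidence.argmax())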
According to the video classification method provided by the embodiment of the disclosure, the trained salient frame determination model can be utilized to extract salient frames from the target video, so that the accuracy of subsequent video classification can be improved.
With continued reference to fig. 6, a schematic diagram of one application scenario of the model training method, video classification method according to the present disclosure is shown. In the application scenario of fig. 6, the server 601 obtains a trained salient frame determination model using steps 201-203. The salient frame determination model described above is then sent to the terminal 602. The terminal 602 may use the salient frame determination model to classify the video, so as to obtain a classification result of the video.
With further reference to fig. 7, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a model training apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 7, the model training apparatus 700 of the present embodiment includes: a sample video classification unit 701, a first salient frame determination unit 702, and a model training unit 703.
The sample video classification unit 701 is configured to determine a video classification result of each video frame in the sample video using a pre-trained video classification model.
The first salient frame determining unit 702 is configured to determine salient frames in the sample video according to the video classification result of each video frame and the annotation classification result of each video frame in the sample video.
The model training unit 703 is configured to train the salient frame determination model based on the salient frames.
In some optional implementations of the present embodiment, the first salient frame determination unit 702 may be further configured to: for each video frame, determine that the video frame is a salient frame in response to determining that the video classification result of the video frame is consistent with the annotation classification result of the video frame.
In some alternative implementations of the present embodiment, the model training unit 703 may further include a loss function determination module and a model training module (not shown in fig. 7).
The loss function determination module is configured to determine an objective loss function based on the salient frames and labeling classification results of the salient frames.
The model training module is configured to train the salient frame determination model according to the objective loss function.
In some optional implementations of the present embodiment, the loss function determination module is further configured to: updating the labeling classification identification of each video frame in the sample video according to the salient frames; inputting each video frame in the sample video into a significant frame determination model, and determining the prediction classification identification of each video frame according to the output of the significant frame determination model; and determining the target loss function according to the prediction classification identifier and the updated labeling classification identifier.
In some optional implementations of the present embodiment, the loss function determination module is further configured to: setting the labeling classification identification of the non-salient frames in the sample video as a preset numerical value; and in response to determining that the preset numerical value and the labeling classification identifier before updating are repeated, updating the labeling classification identifiers of part or all of the salient frames so as to avoid the repetition.
In some optional implementations of the present embodiment, the loss function determination module is further configured to: determining an aliasing frame according to each video frame in the sample video; determining an aliasing classification result corresponding to the aliasing frame according to the labeling classification result of the video frame of the obtained aliasing frame; and determining a target loss function according to the salient frame, the labeling classification result of the salient frame, the aliasing frame and the aliasing classification result corresponding to the aliasing frame.
In some optional implementations of the present embodiment, the loss function determination module is further configured to: for each video frame in the sample video, selecting one video frame from the sample video and carrying out aliasing on the video frame to obtain an aliasing frame.
In some optional implementations of the present embodiment, the loss function determination module is further configured to: updating the annotation classification identifiers corresponding to the annotation classification results of each video frame; and determining an aliasing classification result corresponding to the aliasing frame according to the updated labeling classification identifier of the video frame of the obtained aliasing frame.
In some optional implementations of the present embodiment, the loss function determination module is further configured to: determining a first loss function according to the salient frames and the labeling classification results of the salient frames; determining a second loss function according to the aliasing frame and the aliasing classification result corresponding to the aliasing frame; and determining a target loss function according to the first loss function and the second loss function.
It should be understood that the units 701 to 703 described in the model training apparatus 700 correspond to the respective steps in the method described with reference to fig. 2. Thus, the operations and features described above with respect to the model training method are equally applicable to the apparatus 700 and the units contained therein, and are not described in detail herein.
With further reference to fig. 8, as an implementation of the method shown in fig. 5, the present disclosure provides an embodiment of a video classification apparatus, which corresponds to the method embodiment shown in fig. 5, and which is particularly applicable to various electronic devices.
As shown in fig. 8, the video classification apparatus 800 of the present embodiment includes: a target video acquisition unit 801, a second salient frame determination unit 802, and a video classification unit 803.
The target video acquisition unit 801 is configured to acquire a target video.
A second salient frame determination unit 802 configured to determine salient frames of the target video using salient frame determination models trained by the model training method described in any of the embodiments of fig. 2 to 4.
The video classification unit 803 is configured to determine a classification result of the target video according to the salient frames of the target video and the pre-trained video classification model.
It should be understood that the units 801 to 803 described in the video classification apparatus 800 correspond to the respective steps in the method described with reference to fig. 5, respectively. Thus, the operations and features described above with respect to the video classification method are equally applicable to the apparatus 800 and the units contained therein, and are not described in detail herein.
In the technical solution of the present disclosure, the acquisition, storage, and application of the user personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a block diagram of an electronic device 900 that performs a model training method, a video classification method, according to an embodiment of the disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a processor 901 that can perform various suitable actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a memory 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An I/O (input/output) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; memory 908, such as a magnetic disk, optical disk, etc.; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
Processor 901 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of processor 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 901 performs the various methods and processes described above, such as the model training method and the video classification method. For example, in some embodiments, the model training method and the video classification method may be implemented as a computer software program tangibly embodied on a machine-readable storage medium, such as the memory 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the processor 901, one or more steps of the model training method and the video classification method described above may be performed. Alternatively, in other embodiments, the processor 901 may be configured to perform the model training method and the video classification method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code described above may be packaged into a computer program product. These program code or computer program product may be provided to a processor or controller of a general purpose computer, special purpose computer or other programmable data processing apparatus such that the program code, when executed by the processor 901, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server", or simply "VPS") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (16)

1. A model training method, comprising:
determining a video classification result of each video frame in the sample video by utilizing a pre-trained video classification model;
determining a significant frame in the sample video according to the video classification result of each video frame and the annotation classification result of each video frame in the sample video;
training a salient frame determination model based on the salient frames, comprising: determining a target loss function based on the salient frame and a labeling classification result of the salient frame; training a significant frame determination model according to the target loss function;
Wherein the determining a target loss function based on the salient frame and the labeling classification result of the salient frame comprises: carrying out aliasing according to each video frame in the sample video, and determining an aliasing frame, wherein the aliasing refers to adding pixel values of two video frames; determining an aliasing classification result corresponding to the aliasing frame according to the labeling classification result of the video frame of the aliasing frame; determining a first loss function according to the salient frames and the labeling classification results of the salient frames; determining a second loss function according to the aliasing frame and the aliasing classification result corresponding to the aliasing frame; determining the target loss function according to the first loss function and the second loss function; and training the salient frame determination model according to the target loss function.
2. The method of claim 1, wherein the determining salient frames in the sample video based on the video classification results for each video frame and the annotation classification results for each video frame in the sample video comprises:
for each video frame, determining that the video frame is a salient frame in response to determining that the video classification result of the video frame is consistent with the annotation classification result of the video frame.
3. The method of claim 1, wherein the determining a target loss function based on the salient frame and the labeling classification result of the salient frame comprises:
updating the labeling classification identification of each video frame in the sample video according to the salient frames;
inputting each video frame in the sample video into the salient frame determining model, and determining the prediction classification identification of each video frame according to the output of the salient frame determining model;
and determining a target loss function according to the prediction classification identifier and the updated labeling classification identifier.
4. The method of claim 3, wherein updating the annotation classification identifications for each video frame in the sample video based on the salient frames comprises:
setting the labeling classification identification of the non-salient frames in the sample video as a preset numerical value;
and in response to determining that the preset numerical value is repeated with the labeling classification identifier before updating, updating the labeling classification identifier of part or all of the salient frames so as to avoid repetition.
5. The method of claim 1, wherein the determining an aliased frame from each video frame in the sample video comprises:
and for each video frame in the sample video, selecting one video frame from the sample video and carrying out aliasing on the video frame to obtain an aliasing frame.
6. The method of claim 1, wherein the determining the aliasing classification result corresponding to the aliased frame according to the labeling classification result of the video frame from which the aliased frame is obtained comprises:
updating the annotation classification identifiers corresponding to the annotation classification results of each video frame;
and determining an aliasing classification result corresponding to the aliasing frame according to the updated labeling classification identifier of the video frame of the obtained aliasing frame.
7. A method of video classification, comprising:
acquiring a target video;
determining a salient frame of the target video using a salient frame determination model trained by the model training method of any one of claims 1-6;
and determining a classification result of the target video according to the salient frames of the target video and a pre-trained video classification model.
8. A model training apparatus comprising:
the sample video classification unit is configured to determine a video classification result of each video frame in the sample video by using a pre-trained video classification model;
the first salient frame determining unit is configured to determine salient frames in the sample video according to video classification results of all video frames and annotation classification results of all video frames in the sample video;
a model training unit configured to train a salient frame determination model based on the salient frames;
the model training unit includes: a loss function determination module configured to determine a target loss function based on the salient frame and a labeling classification result of the salient frame; a model training module configured to train a salient frame determination model based on the objective loss function;
wherein the loss function determination module is further configured to: carrying out aliasing according to each video frame in the sample video, and determining an aliasing frame, wherein the aliasing refers to adding pixel values of two video frames; determining an aliasing classification result corresponding to the aliasing frame according to the labeling classification result of the video frame of the aliasing frame; determining a first loss function according to the salient frames and the labeling classification results of the salient frames; determining a second loss function according to the aliasing frame and the aliasing classification result corresponding to the aliasing frame; determining the target loss function according to the first loss function and the second loss function; and training the salient frame determination model according to the target loss function.
9. The apparatus of claim 8, wherein the first salient frame determination unit is further configured to:
determine, for each video frame, that the video frame is a salient frame in response to determining that the video classification result of the video frame is consistent with the annotation classification result of the video frame.
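The selection rule of claim 9 amounts to a per-frame agreement check, sketched below; the list-based representation is an illustrative assumption:

```python
# Hypothetical sketch of claim 9: a frame is salient when the video
# classification result agrees with the annotation classification result.
def select_salient_frames(predicted_labels, annotated_labels):
    return [i for i, (p, a) in enumerate(zip(predicted_labels, annotated_labels)) if p == a]

print(select_salient_frames([1, 2, 2, 0], [1, 0, 2, 0]))  # [0, 2, 3]
```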
10. The apparatus of claim 8, wherein the loss function determination module is further configured to:
update the annotation classification identifier of each video frame in the sample video according to the salient frames;
input each video frame in the sample video into the salient frame determination model, and determine a prediction classification identifier of each video frame according to an output of the salient frame determination model;
and determine the target loss function according to the prediction classification identifiers and the updated annotation classification identifiers.
11. The apparatus of claim 10, wherein the loss function determination module is further configured to:
set the annotation classification identifier of each non-salient frame in the sample video to a preset value;
and in response to determining that the preset value duplicates an annotation classification identifier used before the updating, update the annotation classification identifiers of some or all of the salient frames so as to avoid the duplication.
12. The apparatus of claim 8, wherein the loss function determination module is further configured to:
and for each video frame in the sample video, select a further video frame from the sample video and alias the selected video frame with the video frame to obtain an aliased frame.
13. The apparatus of claim 8, wherein the loss function determination module is further configured to:
update the annotation classification identifiers corresponding to the annotation classification results of each video frame;
and determine the aliasing classification result corresponding to the aliased frame according to the updated annotation classification identifiers of the video frames from which the aliased frame is obtained.
14. A video classification apparatus comprising:
a target video acquisition unit configured to acquire a target video;
a second salient frame determination unit configured to determine salient frames of the target video using a salient frame determination model trained by the model training method of any one of claims 1 to 6;
and a video classification unit configured to determine a classification result of the target video according to the salient frames of the target video and a pre-trained video classification model.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6 or the method of claim 7.
16. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6 or the method of claim 7.
CN202110924849.8A 2021-08-12 2021-08-12 Model training, video classification method, device, equipment and storage medium Active CN113627536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110924849.8A CN113627536B (en) 2021-08-12 2021-08-12 Model training, video classification method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113627536A CN113627536A (en) 2021-11-09
CN113627536B true CN113627536B (en) 2024-01-16

Family

ID=78384886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110924849.8A Active CN113627536B (en) 2021-08-12 2021-08-12 Model training, video classification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113627536B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114064973B (en) * 2022-01-11 2022-05-03 人民网科技(北京)有限公司 Video news classification model establishing method, classification method, device and equipment
CN114882334B (en) * 2022-04-29 2023-04-28 北京百度网讯科技有限公司 Method for generating pre-training model, model training method and device
CN115131709B (en) * 2022-06-30 2023-07-21 北京百度网讯科技有限公司 Video category prediction method, training method and device for video category prediction model


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11756291B2 (en) * 2018-12-18 2023-09-12 Slyce Acquisition Inc. Scene and user-input context aided visual search

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN111177460A (en) * 2019-12-20 2020-05-19 腾讯科技(深圳)有限公司 Method and device for extracting key frame
CN111160191A (en) * 2019-12-23 2020-05-15 腾讯科技(深圳)有限公司 Video key frame extraction method and device and storage medium
CN113139415A (en) * 2020-10-22 2021-07-20 西安天和防务技术股份有限公司 Video key frame extraction method, computer device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Video segment extraction method based on user interest; 邹玲; 俞璜悦; 王晗; 中国科技论文 (02); full text *
Action recognition based on key frame extraction via sequential verification; 张舟; 吴克伟; 高扬; 智能计算机与应用 (03); full text *


Similar Documents

Publication Publication Date Title
CN113627536B (en) Model training, video classification method, device, equipment and storage medium
CN113033537B (en) Method, apparatus, device, medium and program product for training a model
CN112560874B (en) Training method, device, equipment and medium for image recognition model
CN113657289B (en) Training method and device of threshold estimation model and electronic equipment
CN114020950B (en) Training method, device, equipment and storage medium for image retrieval model
CN112580733B (en) Classification model training method, device, equipment and storage medium
CN112949767A (en) Sample image increment, image detection model training and image detection method
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN113641829B (en) Training and knowledge graph completion method and device for graph neural network
CN115690443B (en) Feature extraction model training method, image classification method and related devices
CN115456167B (en) Lightweight model training method, image processing device and electronic equipment
CN110633717A (en) Training method and device for target detection model
CN113378855A (en) Method for processing multitask, related device and computer program product
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN113792876B (en) Backbone network generation method, device, equipment and storage medium
CN113033408B (en) Data queue dynamic updating method and device, electronic equipment and storage medium
CN112784102B (en) Video retrieval method and device and electronic equipment
CN113657248A (en) Training method and device for face recognition model and computer program product
CN113657468A (en) Pre-training model generation method and device, electronic equipment and storage medium
CN116402914B (en) Method, device and product for determining stylized image generation model
CN114758130B (en) Image processing and model training method, device, equipment and storage medium
CN113627354B (en) Model training and video processing method, apparatus, device, and storage medium
CN114627296B (en) Training method and device for image segmentation model, electronic equipment and storage medium
CN113361575B (en) Model training method and device and electronic equipment
CN113361621B (en) Method and device for training model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant