Disclosure of Invention
One or more embodiments of the present specification describe a method and apparatus for target annotation that can improve the accuracy of damage identification.
According to a first aspect, there is provided a method for target annotation based on a video stream, the method comprising: acquiring a current key frame, wherein the current key frame is one of a plurality of key frames determined from the image frames of the video stream; performing target labeling on the current key frame by using a pre-trained labeling model to obtain a labeling result for the current key frame, wherein the labeling model is used for labeling, through a target frame, a region of a picture containing a predetermined target; and performing target labeling on non-key frames following the current key frame in the video stream based on the labeling result for the current key frame.
In one embodiment, the initial plurality of key frames is extracted by any one of:
selecting a plurality of image frames from the video stream as key frames according to a preset time interval;
and inputting the video stream into a pre-trained frame extraction model, and determining a plurality of key frames according to an output result of the frame extraction model.
In one embodiment, the video stream is a vehicle video, the target is a vehicle damage, and the annotation model is trained by: acquiring a plurality of vehicle pictures, wherein each vehicle picture corresponds to a respective sample labeling result, and in the case that a vehicle picture includes vehicle damage, the corresponding sample labeling result includes at least one damage frame, each damage frame being a minimum rectangular frame surrounding a continuous damage area; and training the labeling model based on at least the plurality of vehicle pictures.
In one embodiment, in the video stream, adjacent image frames are respectively recorded as a first image frame and a second image frame; for the current key frame, the initial first image frame is the current key frame, and the initial second image frame is the frame next to the current key frame; the target labeling of the non-key frames after the current key frame in the video stream based on the labeling result for the current key frame comprises: after the first image frame is labeled, detecting whether the second image frame is a key frame; detecting a similarity of the second image frame to the first image frame in the case that the second image frame is not a key frame; if the similarity between the second image frame and the first image frame is greater than a preset similarity threshold, mapping the labeling result corresponding to the first image frame to the second image frame so as to obtain a labeling result corresponding to the second image frame; and updating the first image frame and the second image frame with the second image frame and the frame next to the second image frame respectively, and performing target labeling on the updated second image frame based on the labeling result of the updated first image frame.
In one embodiment, the second image frame is determined to be a key frame if the similarity of the second image frame to the first image frame is less than the similarity threshold.
In one embodiment, determining the similarity of the second image frame to the first image frame comprises: determining a reference region in the first image frame based on the labeling result of the first image frame; processing the reference region of the first image frame and the second image frame respectively with a predetermined convolutional neural network, obtaining a first convolution result and a second convolution result respectively; performing convolution processing on the second convolution result with the first convolution result as a convolution kernel to obtain a third convolution result, wherein, in the numerical array corresponding to the third convolution result, each numerical value describes the similarity between a corresponding region of the second image frame and the reference region of the first image frame; and determining the similarity of the second image frame to the first image frame based on the maximum numerical value in the numerical array corresponding to the third convolution result.
In one embodiment, in the case that the similarity between the second image frame and the first image frame is greater than the preset similarity threshold, the mapping of the labeling result corresponding to the first image frame to the second image frame to obtain the labeling result corresponding to the second image frame includes: labeling, according to the labeling result of the first image frame, the image region of the second image frame corresponding to the maximum numerical value.
In one embodiment, the determining of the reference region in the first image frame based on the labeling result of the first image frame comprises: in the case that the labeling result for the current key frame contains a target frame, determining the initial reference region as the region surrounded by the target frame; and in the case that the labeling result for the current key frame does not contain a target frame, determining the initial reference region as a region at a specified position in the current key frame.
In one embodiment, the current key frame further corresponds to a confidence flag, and the target labeling of non-key frames following the current key frame in the video stream based on the labeling result for the current key frame includes: determining the confidence flags of the non-key frames after the current key frame and before the next key frame to be consistent with the confidence flag corresponding to the labeling result of the current key frame.
In one embodiment, the confidence flags include a high-confidence flag and a low-confidence flag, wherein the high-confidence flag corresponds to the case where the output result of the annotation model for the corresponding key frame contains a target frame and the reference region indicates a predetermined target with high confidence, and the low-confidence flag corresponds to the case where the output result of the annotation model for the corresponding key frame does not contain a target frame and the reference region does not indicate a predetermined target; the method further comprises: adding the image frames corresponding to the high-confidence flag into a target labeling set.
According to a second aspect, there is provided an apparatus for target annotation based on a video stream, the apparatus comprising:
an acquisition unit configured to acquire a current key frame, the current key frame being one of a plurality of key frames determined from respective image frames of the video stream;
a first labeling unit configured to perform target labeling on the current key frame by using a pre-trained labeling model to obtain a labeling result for the current key frame, wherein the labeling model is used for labeling, through a target frame, a region of a picture containing a predetermined target;
and the second labeling unit is configured to perform target labeling on non-key frames after the current key frame in the video stream based on the labeling result for the current key frame.
According to a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
According to a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has stored therein executable code, and wherein the processor, when executing the executable code, implements the method of the first aspect.
According to the method and apparatus for target labeling provided by the embodiments of this specification, during target labeling only the key frames in the video stream are processed by the labeling model, while the non-key frames are labeled by means of the labeling results of the key frames, so that the amount of data processing is greatly reduced.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
For convenience of explanation, a specific application scenario of the embodiments of the present specification shown in fig. 1 is described. Fig. 1 shows a vehicle inspection scene, in which vehicle damage, such as damaged parts, damage types and damaged materials, is marked as the target. The vehicle inspection scene can be any scene in which the damage condition of a vehicle needs to be inspected. For example, when the vehicle is insured, vehicle inspection confirms that the vehicle has no damage, or when a vehicle insurance claim is settled, vehicle inspection determines the damage condition of the vehicle.
In this implementation scenario, a user may capture a live video of a vehicle through a terminal capable of capturing live information, such as a smart phone, a camera or a sensor. The live video may include one or more video streams, a video stream being a segment of video. The live video can be sent to a manual inspection platform, which determines the inspection purpose, sends a corresponding annotation request to the computing platform, and forwards the targeted live video to the computing platform. It should be noted that one annotation request may be sent per video stream, in units of video streams, or one annotation request may be sent for the one or more video streams belonging to a case, in units of cases. The computing platform carries out target labeling on the video stream according to the annotation request by the target labeling method described in this specification. In this implementation scenario, the target annotation may be fed back to the manual inspection platform as a pre-annotation result, so as to provide a reference for a human decision. The pre-annotation result can indicate whether the vehicle is undamaged or, if damaged, the damaged parts, the damage categories and the like. The pre-annotation result can be in the form of text, or of image frames containing the vehicle damage.
The manual inspection platform and the computing platform shown in fig. 1 may be integrated or may be separately provided. In the case of a separate setup, the computing platform may act as a server that serves multiple manual inspection platforms. The implementation scenario in fig. 1 is only an example; in some implementations, a manual inspection platform may not be provided, the terminal directly sends the video stream to the computing platform, and the computing platform feeds back the annotation result to the terminal, or feeds back the vehicle inspection result generated based on the annotation result to the terminal.
Specifically, in the method for target annotation under the framework of the embodiments of this specification, a plurality of key frames are determined from a video stream, and the key frames are processed in time order by a pre-trained annotation model. When a key frame is processed, its annotation result is obtained; the non-key frames between the current key frame and the next key frame are then annotated based on that result, after which the next key frame is processed. When processing the non-key frames after the current key frame, the labeling result of the current key frame is referred to, which reduces the amount of data processing. Optionally, when processing the non-key frames, an image frame satisfying a condition may be selected from the non-key frames according to the actual situation and added to the key frames. The newly selected key frame is processed with the labeling model to obtain a labeling result, and subsequent image frames are processed based on that result.
The method of object labeling is described in detail below.
FIG. 2 illustrates a flow diagram of a method of target annotation according to one embodiment. The execution subject of the method can be any system, device, apparatus, platform or server with computing and processing capabilities, such as the computing platform shown in fig. 1. The marked targets may be any objects in the relevant scene, such as various objects (e.g., kittens), regions with certain characteristics (e.g., oval leaves), and so on. In a vehicle inspection scene, the target to be marked can be a vehicle part, a vehicle damage and the like.
As shown in fig. 2, the method for target labeling may include the following steps: step 201, acquiring a current key frame, wherein the current key frame is one of a plurality of key frames determined from the video stream; step 202, performing target labeling on the current key frame by using a pre-trained labeling model to obtain a labeling result for the current key frame, wherein the labeling model is used for labeling, through a target frame, a region of a picture containing a target; step 203, performing target labeling on the non-key frames after the current key frame in the video stream based on the labeling result for the current key frame.
First, in step 201, a current key frame is obtained, the current key frame being one of a plurality of key frames determined from respective image frames of a video stream. Keyframes are typically image frames that may reflect changing characteristics of the video. By extracting key frames from the video stream and reflecting the change characteristics of the video stream by the processing result of the key frames, the data processing amount can be effectively reduced.
Before a video stream is processed, a plurality of key frames can be extracted in advance. These key frames may be used as initial key frames. The key frame extraction of the video stream can be performed in various reasonable manners.
In one embodiment, image frames may be selected from a video stream as key frames at predetermined time intervals. For example, a 30 second video stream may have 60 image frames extracted as key frames at 0.5 second intervals.
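By way of a non-limiting illustration, the fixed-interval selection may be sketched in Python as follows; the frame rate, the interval and the helper name are assumptions made only for this example and are not prescribed by the embodiments.

```python
# Illustrative sketch (not part of the specification): selecting key frames at
# a fixed time interval. Frame rate and interval are assumed example values.
def select_keyframes_by_interval(num_frames: int, fps: float = 24.0,
                                 interval_sec: float = 0.5) -> list[int]:
    """Return indices of image frames taken every `interval_sec` seconds."""
    step = max(1, round(fps * interval_sec))   # frames between key frames
    return list(range(0, num_frames, step))

# Example: a 30-second stream at 24 fps has 720 frames; a 0.5-second
# interval yields 60 key frames, matching the example in the text.
key_indices = select_keyframes_by_interval(num_frames=720)
assert len(key_indices) == 60
```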
In another embodiment, the frame-extracting model may be trained in advance, the video stream is input into the frame-extracting model, and each key frame of the video stream is determined according to the output result of the frame-extracting model. The frame extraction model is, as the name implies, a model for extracting key frames from a plurality of image frames of a video stream.
In an alternative implementation, the frame-extracting model may be trained by: acquiring a plurality of video streams as training samples, extracting image features (such as color features and component features) from each image frame of each video stream, and providing artificially labeled sample key frames for the corresponding video streams; for each training sample, the image features of its image frames are sequentially input into a selected model, such as a recurrent neural network (RNN) or a long short-term memory network (LSTM), and the model parameters are adjusted by comparing the output of the model with the sample key frames, so that the frame extraction model is trained. In this case, the video stream obtained in step 201 may further include a preprocessing result in which image features are extracted from each image frame, which is not described herein again.
In another alternative implementation, the frame-extracting model may also be trained by: acquiring a plurality of video streams as training samples, wherein each video stream corresponds to a plurality of image frames and artificially labeled sample key frames; sequentially inputting the image frames of each training sample into a selected model, such as a recurrent neural network (RNN) or an LSTM, letting the model automatically mine the features of the image frames and output a key frame extraction result, and then adjusting the model parameters by comparing the model output with the sample key frames, thereby training the frame extraction model.
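As a hedged sketch of such a frame extraction model, the following Python/PyTorch code scores every frame of a stream as key frame or non-key frame with a small CNN encoder followed by an LSTM; the network sizes, loss function and optimizer are illustrative assumptions rather than requirements of the embodiments, and pre-extracted image features could be fed to the LSTM directly instead of the CNN output.

```python
import torch
import torch.nn as nn

class FrameExtractionModel(nn.Module):
    """Scores every frame of a stream as key frame / non-key frame."""
    def __init__(self, hidden_dim: int = 128):
        super().__init__()
        # A small CNN mines per-frame features; a pre-extraction step could
        # replace it, as in the first implementation above.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(16 * 4 * 4, 64))
        self.lstm = nn.LSTM(64, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.lstm(feats)
        return self.head(hidden).squeeze(-1)        # per-frame logits

model = FrameExtractionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

frames = torch.rand(1, 24, 3, 64, 64)               # one short training stream
labels = torch.zeros(1, 24); labels[0, ::6] = 1.0   # manually labeled key frames
loss = loss_fn(model(frames), labels)
loss.backward(); optimizer.step()
```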
In more embodiments, the key frames may be extracted in more effective manners, which is not described herein.
For each key frame in the video stream to be processed, the processing may be performed sequentially through the target annotation process shown in fig. 2. In this step 201, the key frame acquired by the current process is the current key frame, and the current key frame may be any key frame in the video stream to be processed.
Next, in step 202, a pre-trained labeling model is used to perform target labeling on the current key frame, so as to obtain a labeling result for the current key frame. The labeling model is used for labeling an area containing a preset target from the picture through the target frame. In a vehicle inspection scenario, the predetermined target may be a vehicle component, a vehicle damage, or the like.
The labeling result of the labeling model can be in picture form or in text form. In the picture form, for example, the marked target is circled by the target frame on the original picture. The target frame may be a minimum bounding box of a predetermined shape, such as a minimum rectangular box or a minimum circular box, that surrounds the continuous target region. In the text form, for example, the target is described by textual features. In the case of a vehicle damage target, the annotation result in text form may be: damaged part + degree of damage, such as bumper scratch; damaged material + degree of damage, such as left front window shattered; and so on.
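For illustration only, a minimum rectangular target frame can be derived from a binary mask of a continuous target region as sketched below; the mask and the helper name are hypothetical, since the labeling model described here outputs such frames directly.

```python
# Illustrative sketch: the minimum rectangular box surrounding a continuous
# target region, given a binary mask of that region.
import numpy as np

def min_bounding_box(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Return (x_min, y_min, x_max, y_max) of the non-zero region of `mask`."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("mask contains no target region")
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((100, 100), dtype=np.uint8)
mask[40:60, 30:80] = 1                      # a continuous damage region
print(min_bounding_box(mask))               # (30, 40, 79, 59)
```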
According to one embodiment, in the case where the video stream is a vehicle video and the target is a vehicle impairment, the annotation model may be trained by:
acquiring a plurality of vehicle pictures, wherein each vehicle picture corresponds to a respective sample labeling result; in the case that a vehicle picture includes vehicle damage, its sample labeling result can include at least one damage frame (corresponding to at least one damage) on the original picture, the damage frame being a minimum rectangular frame (or, in other embodiments, a circular frame or the like) surrounding a continuous damage area; otherwise, the labeling result is empty, "no damage", or the original picture itself;
then, based on at least the plurality of vehicle pictures with the sample labeling result, a labeling model is trained.
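A hedged training sketch of the above is given below; the choice of torchvision's Faster R-CNN detector is an assumption made only for illustration, and the specification does not prescribe a particular detection architecture.

```python
# A minimal sketch (assumptions: torchvision's Faster R-CNN as the detector,
# two classes "background" and "damage", example box coordinates).
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, weights_backbone=None, num_classes=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
model.train()

# One hypothetical vehicle picture with its sample labeling result: a single
# damage frame given as a minimum rectangle (x_min, y_min, x_max, y_max).
image = torch.rand(3, 480, 640)
target = {"boxes": torch.tensor([[120.0, 200.0, 260.0, 300.0]]),
          "labels": torch.tensor([1])}

losses = model([image], [target])           # dict of detection losses
total_loss = sum(losses.values())
total_loss.backward()
optimizer.step()
```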
Thus, in step 202, since a key frame is itself a picture, the current key frame is input into the annotation model, and the output obtained from the annotation model can serve as the labeling result of the current key frame. When the current key frame does not include the predetermined target, the labeling result for the current key frame may be empty, a text representation such as "intact", or the original image itself.
Further, in step 203, target labeling is performed on non-key frames following the current key frame in the video stream based on the labeling result for the current key frame. Here, the non-key frame is an image frame that is not determined as a key frame. The non-key frame after the current key frame may be an image frame after the current key frame and before the next key frame. In the embodiment of the specification, the target labeling is performed on the key frame through the labeling model, and the target labeling is performed on the non-key frame by referring to the labeling result of the key frame, so that the data processing amount is reduced.
It is understood that the image frames in a video stream are usually captured continuously at a certain frequency (e.g., 24 frames per second), and the pictures of adjacent image frames may therefore have a certain similarity. Adjacent image frames may have multiple similar regions, i.e., a greater similarity. Conversely, if adjacent image frames are less similar, a sudden change of the picture may have occurred and the features of the image frames have changed. In this case, the image frame adjacent to the key frame can itself be used as a key frame to reflect the feature change of the video stream. Based on this, in the embodiments of this specification, target labeling may be performed on non-key frames based on the similarity between image frames.
In one embodiment, each image frame after the current key frame and before the next key frame may be sequentially compared with the current key frame to determine their similarity. And if the similarity is greater than a preset threshold value, marking the corresponding image frame by using the marking result of the current key frame. And if the similarity is less than a preset threshold value, taking the corresponding image frame as a key frame. According to the time sequence, the newly determined key frame is the next key frame of the current key frame in the current process, so that the newly determined key frame can be acquired as the current key frame next, and the process of target labeling shown in fig. 2 is executed. Further, non-key frames following the newly determined key frame are labeled with reference to the labeling result of the newly determined key frame.
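A minimal sketch of this variant follows; `similarity` and `copy_labels` are hypothetical helpers standing in for the similarity computation and the label mapping described in this section, and the threshold value is an assumed example.

```python
def propagate_from_keyframe(frames, key_idx, next_key_idx, key_labels,
                            similarity, copy_labels, threshold=0.8):
    """Compare every non-key frame with the current key frame itself."""
    labels = {key_idx: key_labels}
    new_key_idx = None
    for i in range(key_idx + 1, next_key_idx):
        if similarity(frames[key_idx], frames[i]) > threshold:
            labels[i] = copy_labels(key_labels, frames[i])   # reuse key-frame result
        else:
            new_key_idx = i        # picture changed: treat this frame as a key frame
            break
    return labels, new_key_idx
```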
In another embodiment, the current key frame and the non-key frame after the current key frame and before the next key frame may be compared between adjacent image frames to determine the similarity between the adjacent image frames. If the similarity of the adjacent image frames is higher, the marking result of the previous image frame is used for marking the next image frame, otherwise, the next image frame is used as a newly determined key frame, and the target marking process shown in fig. 2 is executed.
Specifically, two adjacent image frames may be referred to as a first image frame and a second image frame, respectively, and then for a current key frame, the current key frame is an initial first image frame, and a frame next to the current key frame is an initial second image frame. After the first image frame is labeled, whether the second image frame is a key frame is detected. In the case where the second image frame is a key frame, the flow shown in fig. 2 is executed with the second image frame as the current key frame. In the case where the second image frame is not a key frame, the similarity of the second image frame to the first image frame is detected.
If the similarity between the second image frame and the first image frame is smaller than the preset similarity threshold, the second image frame may be used as the current key frame, and the updated current key frame is processed by using the flow shown in fig. 2.
And if the similarity between the second image frame and the first image frame is greater than a preset similarity threshold, mapping the labeling result of the first image frame to the second image frame so as to obtain a labeling result corresponding to the second image frame. On the other hand, the first image frame and the second image frame are updated by the second image frame and the frame next to the second image frame, respectively, that is, the second image frame is used as a new first image frame, and the frame next to the second image frame is used as a new second image frame.
The above process is repeated. Until one of the following occurs:
the second image frame is the last frame of the video stream, and there is no next image frame (i.e., the step of updating the second image frame cannot be continued); or,
and detecting that the updated second image frame is a key frame in the process, and taking the second image frame as the current key frame to continue the subsequent processing.
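The adjacent-frame variant just described can be sketched as follows; the helpers `similarity` and `copy_labels` and the threshold value are illustrative assumptions, not part of the specification.

```python
# A hedged sketch: the first/second image frame pair slides forward through
# the non-key frames after the current key frame.
def propagate_adjacent(frames, key_frame_flags, start_idx, start_labels,
                       similarity, copy_labels, threshold=0.8):
    """Label frames after the key frame at `start_idx`; return the labels and
    the index of the next frame to treat as the current key frame (or None)."""
    labels = {start_idx: start_labels}
    first = start_idx                          # initial first image frame
    second = start_idx + 1                     # initial second image frame
    while second < len(frames):
        if key_frame_flags[second]:            # an existing key frame: stop here
            return labels, second
        if similarity(frames[first], frames[second]) < threshold:
            key_frame_flags[second] = True     # promote to key frame
            return labels, second
        labels[second] = copy_labels(labels[first], frames[second])
        first, second = second, second + 1     # update both image frames
    return labels, None                        # end of the video stream
```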
In other embodiments, the labeling result of the current key frame may also be used to perform target labeling on the non-key frames after the current key frame in other manners, for example by mapping the labeling result of the current key frame to other image frames through methods such as optical flow, which is not described herein again.
The similarity of two image frames can be determined, for example, by comparing shapes described by feature points. It can be understood that when a subsequent image frame is labeled using the labeling result of the previous image frame, the main purpose is to use the labeling result of the previous image frame as a reference in the target labeling of the subsequent image frame. Therefore, in an alternative embodiment, in order to reduce the amount of data processing, only one reference region may be taken from the first image frame for the similarity determination with the second image frame.
Taking the aforementioned adjacent first image frame and second image frame as an example, the reference region in the first image frame may first be determined based on the labeling result of the first image frame. Optionally, for the current key frame, the initial reference region is determined as the region surrounded by the target frame when the labeling result for the current key frame contains a target frame; and when the labeling result for the current key frame does not contain a target frame, the initial reference region is determined as a region at a specified position in the current key frame. The region at the specified position may be a pre-specified region containing a predetermined number of pixels, such as a 9 × 9 pixel region at the center of the first image frame, a 9 × 9 pixel region in the upper left corner of the first image frame, and so on. For subsequent image frames, the reference region of the first image frame is the region marked out in its labeling result.
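As a small illustrative sketch of this choice (the 9 × 9 centre patch is merely the example given above, and the helper name is hypothetical):

```python
import numpy as np

def reference_region(frame: np.ndarray, target_box=None, size: int = 9):
    """Target-frame area if the labeling result contains one, else a centre patch."""
    if target_box is not None:                     # labeling result has a box
        x0, y0, x1, y1 = target_box
        return frame[y0:y1 + 1, x0:x1 + 1]
    h, w = frame.shape[:2]                         # no box: pre-specified position
    cy, cx = h // 2, w // 2
    half = size // 2
    return frame[cy - half:cy + half + 1, cx - half:cx + half + 1]
```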
The similarity between the reference region of the first image frame and the corresponding regions of the second image frame may be determined by a method such as pixel value comparison, or may be determined by a similarity model, which is not limited herein.
Referring to fig. 3, the determination of the similarity between the reference region of the first image frame and the regions of the second image frame is described taking a similarity model as an example. Assume that the reference region determined based on the labeling result of the first image frame is the reference region z, and the second image frame is the image frame x. On the one hand, the reference region z (e.g., corresponding to a 127 × 127 × 3 pixel array) is processed with a predetermined convolutional neural network to obtain a first convolution result (e.g., a 6 × 6 × 128 feature array); on the other hand, the image frame x (e.g., corresponding to a 255 × 255 × 3 pixel array) is processed with the same convolutional neural network to obtain a second convolution result (e.g., a 22 × 22 × 128 feature array). Further, with the first convolution result as a convolution kernel, convolution processing is performed on the second convolution result to obtain a third convolution result (e.g., a 17 × 17 × 1 numerical array). It can be appreciated that when an array is convolved with a convolution kernel, the more similar the array is to the convolution kernel, the larger the resulting value. Therefore, in the numerical array corresponding to the third convolution result, each numerical value describes the similarity between the corresponding sub-array of the second convolution result and the array of the first convolution result. The second convolution result is obtained by processing the second image frame, and each sub-array in it corresponds to a region of the second image frame; meanwhile, each numerical value in the third convolution result corresponds to a sub-array of the second convolution result. The third convolution result can therefore be viewed as a distribution array of the similarities between the corresponding regions of the second image frame and the reference region of the first image frame: the larger a value, the greater the similarity of the corresponding region of the second image frame to the reference region of the first image frame. Labeling the second image frame based on the labeling result of the first image frame amounts to judging whether the second image frame contains a region corresponding to the reference region of the first image frame; in general, if such a region exists, it is the region of the second image frame most similar to the reference region. In this way, the similarity between the second image frame and the first image frame may be determined based on the maximum numerical value in the numerical array corresponding to the third convolution result, the maximum value corresponding to the region of the second image frame most similar to the reference region of the first image frame. The similarity between the second image frame and the first image frame may be the maximum value itself, or the fractional value corresponding to the maximum value after the values in the numerical array of the third convolution result are normalized.
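A minimal sketch of this similarity computation, in the spirit of a Siamese cross-correlation, is given below; the embedding network `embed` is a toy stand-in assumed for illustration, not the convolutional neural network of the embodiment.

```python
import torch
import torch.nn.functional as F

def frame_similarity(embed, reference_z: torch.Tensor,
                     frame_x: torch.Tensor) -> tuple[float, torch.Tensor]:
    """reference_z: (1, 3, Hz, Wz); frame_x: (1, 3, Hx, Wx)."""
    f_z = embed(reference_z)                 # first convolution result
    f_x = embed(frame_x)                     # second convolution result
    # Use the first result as a convolution kernel over the second one;
    # the output is the third convolution result (a map of scores).
    score_map = F.conv2d(f_x, f_z)
    # Each value scores how similar the corresponding region of the second
    # frame is to the reference region; the maximum gives the frame similarity.
    return score_map.max().item(), score_map

# Usage with a toy embedding network (an assumption, not the patent's network):
embed = torch.nn.Sequential(
    torch.nn.Conv2d(3, 128, kernel_size=11, stride=2),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(kernel_size=3, stride=2),
    torch.nn.Conv2d(128, 128, kernel_size=5, stride=2),
)
score, score_map = frame_similarity(embed, torch.rand(1, 3, 127, 127),
                                    torch.rand(1, 3, 255, 255))
print(score_map.shape)                       # torch.Size([1, 1, 17, 17]) here
```

The maximum of the score map can then be compared with the preset similarity threshold, optionally after normalizing the values of the map to fractions as mentioned above.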
A similarity threshold for the same region in both image frames may be preset, such that if the similarity determined by the above process is greater than the similarity threshold, it indicates that there is a region in the second image frame corresponding to the reference region in the first image frame, such as both left headlights. Otherwise, if the similarity determined by the above process is less than the similarity threshold, it indicates that the second image frame does not include a region corresponding to the reference region of the first image frame.
In the case where a region corresponding to the reference region of the first image frame is not included in the second image frame, there may be a sudden change in picture between the second image frame and the first image frame. Important information may be missed if the second image frame is not labeled. Thus, at this point, the second image frame may be added to the key frame of the video stream. And according to the time sequence, in the next flow, acquiring the second image frame as the current key frame for target marking.
It will be appreciated that, according to the above description, each image frame may correspond to a reference region, but the actual meaning of the reference region differs. For example, in a vehicle inspection scenario, if the labeling result of the labeling model for the current key frame contains a target frame, it generally indicates that a certain part or material of the vehicle has a high-confidence damage, and this result may be provided for manual reference or may affect a decision; the reference region in this case indicates a predetermined target with high confidence. A reference region obtained by the specified-position method may also be enclosed by a frame, but the enclosed region only serves as a reference for labeling subsequent image frames and does not indicate a predetermined target; in the vehicle inspection scene it contains no damage. Therefore, the labeling result of the current key frame may also correspond to a confidence flag. When the output result of the labeling model contains a target frame and the reference region indicates a predetermined target with high confidence, the confidence flag of the current key frame is a high-confidence flag. In the vehicle inspection scenario shown in fig. 1, the high-confidence flag represents a high likelihood of vehicle damage, and the corresponding image frame may be output for reference by the manual inspection platform. When the output result of the annotation model for the current key frame does not contain a target frame and the region at the specified position is determined as the reference region, the confidence flag of the current key frame may be a low-confidence flag. In the vehicle inspection scenario, the low-confidence flag corresponds to vehicle damage with a low confidence, or a confidence of 0.
As will be readily understood by those skilled in the art, since a non-key frame following a key frame is labeled with reference to the labeling result of that key frame, the confidence flag of the labeling result of the subsequent non-key frame can be made consistent with the confidence flag of the current key frame. At the time of the final decision, reference may be made to the confidence flag. The target labeling process shown in fig. 2 may therefore further include adding the image frames corresponding to the high-confidence flag into a target labeling set. The target labeling set is used for output to manual inspection or to an intelligent decision made by a computer.
For a clearer understanding of the technical concept of the embodiments of this specification, refer to fig. 4. In a specific implementation, as shown in fig. 4, key frames are first extracted for a received video stream. Then, one of the key frames is acquired as the current key frame in time order. The current key frame is processed by the labeling model to obtain its labeling result. Whether a target frame is output is judged from the labeling result. If so, the region within the target frame is taken as the reference region, a high-confidence flag is set for the current key frame (for example, the confidence flag is set to 1), and the current key frame is added to the pre-labeling result set. Otherwise, a low-confidence flag is set for the current key frame (for example, the confidence flag is set to 0). Then, the subsequent image frames are labeled based on the labeling result of the current key frame.
First, whether the next image frame is a key frame is judged. If so, that image frame is acquired as the current key frame, and the process continues. Otherwise, the current key frame serves as the current frame, and the similarity between the current frame and the next frame is detected. If the similarity is smaller than the preset similarity threshold, the next frame is added to the key frames, acquired as the current key frame, and the process continues. Otherwise, when the similarity is greater than the preset similarity threshold, the next frame is labeled using the labeling result of the current frame, and the next frame inherits the confidence flag of the current frame. Whether the confidence flag of the next frame is a high-confidence flag is then detected. If it is, the next frame is added to the pre-labeling result, the current frame and the next frame are updated with the next frame and the frame after it, and the process continues until a key frame is detected or the video stream ends. If it is not a high-confidence flag, the current frame and the next frame are likewise updated with the next frame and the frame after it, and the process continues until a key frame is detected or the video stream ends.
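The flow of fig. 4 can be summarized by the following hedged sketch; `label_model`, `similarity` and `copy_labels` are hypothetical stand-ins for the annotation model, the similarity computation and the label mapping described above, and the threshold is an assumed example.

```python
def annotate_stream(frames, key_frame_flags, label_model, similarity,
                    copy_labels, threshold=0.8):
    pre_labeled = []                 # target labeling set (high confidence)
    results = {}
    i = 0
    while i < len(frames):
        if not key_frame_flags[i]:
            i += 1
            continue
        # Current key frame: run the annotation model on it.
        boxes = label_model(frames[i])
        high_conf = len(boxes) > 0            # flag = 1 if a target frame exists
        results[i] = (boxes, high_conf)
        if high_conf:
            pre_labeled.append(i)
        # Propagate to the following non-key frames.
        cur, nxt = i, i + 1
        while nxt < len(frames) and not key_frame_flags[nxt]:
            if similarity(frames[cur], frames[nxt]) < threshold:
                key_frame_flags[nxt] = True   # sudden change: new key frame
                break
            labels = copy_labels(results[cur][0], frames[nxt])
            results[nxt] = (labels, high_conf)     # inherit the confidence flag
            if high_conf:
                pre_labeled.append(nxt)
            cur, nxt = nxt, nxt + 1
        i = nxt                                # continue from the next key frame
    return results, pre_labeled
```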
From the above description, it can be understood that, under the technical idea of this specification, the target labeling flow shown in fig. 2 is a flow in which steps may be skipped, rather than a flow that must be completely executed for every key frame. For example, when the image frame next to the current key frame is also a key frame, there is no non-key frame between the current key frame and the next key frame, and step 203, performing target labeling on the non-key frames after the current key frame in the video stream based on the labeling result for the current key frame, does not need to be performed.
Reviewing the above process: during target labeling, only the key frames in the video stream are processed by the labeling model, while the non-key frames are labeled by means of the labeling results of the key frames, so that the amount of data processing is greatly reduced. Furthermore, during the labeling of non-key frames, the labeling results can be migrated to image frames with higher similarity, while image frames with lower similarity can be treated as new key frames and re-labeled by the labeling model, so that more accurate labeling results are obtained. In this way, more efficient target annotation can be provided.
According to an embodiment of another aspect, an apparatus for target labeling is also provided. FIG. 5 shows a schematic block diagram of an apparatus for target annotation, according to one embodiment. As shown in fig. 5, the apparatus 500 for target labeling includes: an acquisition unit 51 configured to acquire a current key frame, which is one of a plurality of key frames determined from respective image frames of the video stream; a first labeling unit 52, configured to perform target labeling on the current key frame by using a pre-trained labeling model to obtain a labeling result for the current key frame, where the labeling model is used to label an area containing a predetermined target from the picture through a target frame; the second labeling unit 53 is configured to perform target labeling on a non-key frame subsequent to the current key frame in the video stream based on the labeling result for the current key frame.
According to one embodiment, the apparatus 500 further comprises an extraction unit (not shown) configured to extract the initial plurality of key frames by any one of:
selecting a plurality of image frames from a video stream as key frames according to a preset time interval;
and inputting the video stream into a pre-trained frame extraction model, and determining a plurality of key frames according to the output result of the frame extraction model.
In one implementation, the video stream is a vehicle video, the target is a vehicle damage, and the apparatus 500 may further include a training unit (not shown) configured to train the annotation model by:
acquiring a plurality of vehicle pictures, wherein each vehicle picture corresponds to a respective sample labeling result, and in the case that a vehicle picture includes vehicle damage, the sample labeling result includes at least one damage frame, each damage frame being a minimum rectangular frame surrounding a continuous damage area;
and training a labeling model at least based on a plurality of vehicle pictures.
According to one possible design, in the video stream, for convenience of description, adjacent image frames are respectively recorded as a first image frame and a second image frame, and for the current key frame, the initial first image frame is the current key frame, and the initial second image frame is the frame next to the current key frame;
the second labeling unit 53 is further configured to:
after the first image frame is marked, detecting whether a second image frame is a key frame;
in the case that the second image frame is not a key frame, detecting the similarity of the second image frame and the first image frame;
if the similarity between the second image frame and the first image frame is larger than a preset similarity threshold, mapping the labeling result corresponding to the first image frame to the second image frame so as to obtain a labeling result corresponding to the second image frame;
and respectively updating the first image frame and the second image frame by using the second image frame and the next frame of the second image frame, and carrying out target labeling on the updated second image frame based on the labeling result of the updated first image frame.
And if the similarity between the second image frame and the first image frame is less than a preset similarity threshold value, determining the second image frame as a key frame.
In a further embodiment, the second labeling unit 53 is further configured to determine the similarity of the second image frame to the first image frame by:
determining a reference area in the first image frame based on the labeling result of the first image frame;
respectively processing a reference area and a second image frame in the first image frame by using a predetermined convolutional neural network, and respectively obtaining a first convolution result and a second convolution result;
taking the first convolution result as a convolution kernel, performing convolution processing on the second convolution result to obtain a third convolution result, wherein each numerical value in a numerical value array corresponding to the third convolution result respectively describes each similarity of a corresponding area of the second image frame and a reference area of the first image frame;
and determining the similarity of the second image frame and the first image frame based on the maximum value in the value array corresponding to the third convolution result.
In one embodiment, in the case that the similarity between the second image frame and the first image frame is greater than a preset similarity threshold, the second labeling unit 53 is further configured to:
and according to the labeling result of the first image frame, labeling the image area of the second image frame corresponding to the maximum number.
In one embodiment, the second labeling unit 53 is further configured to:
determining the initial reference region as the region surrounded by the target frame in the case that the labeling result for the current key frame contains a target frame;
and determining the initial reference region as a region at a specified position in the current key frame in the case that the labeling result for the current key frame does not contain a target frame.
In a further embodiment, the current key frame further corresponds to a confidence flag, and the second labeling unit is further configured to:
and determining the confidence identifications of the non-key frames after the current key frame and before the next key frame, wherein the confidence identifications are consistent with the confidence identifications corresponding to the labeling result of the current key frame.
The confidence flags include a high-confidence flag and a low-confidence flag. The high-confidence flag corresponds to the case where the output result of the labeling model for the corresponding key frame contains a target frame and the reference region indicates a predetermined target with high confidence; the low-confidence flag corresponds to the case where the output result of the labeling model for the corresponding key frame does not contain a target frame and the reference region does not indicate a predetermined target.
At this time, the apparatus 500 may further include an annotation result determination unit (not shown) configured to:
and adding the image frames corresponding to the high-confidence-degree marks into the labeling result set.
It should be noted that the apparatus 500 shown in fig. 5 is an apparatus embodiment corresponding to the method embodiment shown in fig. 2, and the corresponding description in the method embodiment shown in fig. 2 is also applicable to the apparatus 500, and is not repeated herein.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 2.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 2.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of this specification may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The above-mentioned embodiments are intended to explain the technical idea, technical solutions and advantages of the present specification in further detail, and it should be understood that the above-mentioned embodiments are merely specific embodiments of the technical idea of the present specification, and are not intended to limit the scope of the technical idea of the present specification, and any modification, equivalent replacement, improvement, etc. made on the basis of the technical solutions of the embodiments of the present specification should be included in the scope of the technical idea of the present specification.