
CN113506316A - Method and device for segmenting video object and network model training method - Google Patents


Info

Publication number
CN113506316A
Authority
CN
China
Prior art keywords
frame image
current frame
historical
image
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110587943.9A
Other languages
Chinese (zh)
Inventor
熊鹏飞
王培森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Megvii Technology Co Ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd filed Critical Beijing Megvii Technology Co Ltd
Priority to CN202110587943.9A priority Critical patent/CN113506316A/en
Publication of CN113506316A publication Critical patent/CN113506316A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the application provide a method and a device for segmenting a video object and a network model training method. The method for segmenting the video object comprises the following steps: extracting the features of at least one historical frame image before the current frame image to obtain the feature pair of each historical frame image in the at least one historical frame image; extracting the features of the current frame image to obtain a feature pair of the current frame image; and obtaining a segmentation mask of an object of interest in the current frame image according to the feature pair of each historical frame image, the feature pair of the current frame image and a decoder; wherein the at least one historical frame image is one or more frames preceding the current frame image, and each feature pair includes a key matrix and a value matrix. According to the embodiments of the application, inter-frame tracking is realized through an enhanced short-time memory network, and the video object segmentation precision of the current frame image is remarkably improved.

Description

Method and device for segmenting video object and network model training method
Technical Field
The present application relates to the field of video object segmentation, and in particular, to a method and an apparatus for segmenting a video object, and a network model training method.
Background
Video object segmentation is an important topic in computer vision and is widely applied in scenarios such as video surveillance, object tracking and mobile-phone video processing. Video object segmentation comprises two parts: image segmentation and object tracking. It means that, given a video sequence and one or more predefined objects, each of these objects is accurately and separately segmented out of the subsequent video frames.
Existing video object segmentation is mainly limited in segmentation accuracy. On one hand, single-frame object segmentation faces many problems: within a video sequence, the same object can undergo very large changes in pose and shape across frames and often differs greatly from its predefined pose and shape, so segmenting the video object according to the predefined pose and shape yields low accuracy. On the other hand, when several similar objects exist in the video, video segmentation needs to distinguish the target object from the others and must not mistakenly segment it as another similar object. For these reasons, existing video object segmentation techniques are severely limited in accuracy.
Therefore, how to improve the precision of video object segmentation becomes a technical problem to be solved urgently.
Disclosure of Invention
Based on an enhanced short-time memory network and a network that fuses the image features of the current frame with the features of a cropped image, the application integrates object segmentation and tracking into a single neural network and supervises the segmentation result of the current frame image with the segmentation results of historical frames, thereby remarkably improving the segmentation accuracy of video objects.
In a first aspect, some embodiments of the present application provide a method of segmenting video objects, the method comprising: extracting the characteristics of at least one historical frame image before the current frame image to obtain the characteristic pair of each historical frame image in the at least one historical frame image; extracting the characteristics of the current frame image to obtain a characteristic pair of the current frame image; acquiring a segmentation mask of an interest target in the current frame image according to the feature pair of each historical frame image, the feature pair of the current frame image and a decoder; wherein the feature pairs comprise a key matrix and a value matrix.
According to the embodiments of the application, the characteristics of at least one historical frame image adjacent to the current frame image are extracted through an enhanced short-time memory network to realize inter-frame tracking, and the video object segmentation precision of the current frame image is remarkably improved.
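As a rough illustration, the overall flow can be sketched as follows in PyTorch-style code; the module names (memory_encoder, query_encoder, decoder) and the fuse_feature_pairs helper are hypothetical placeholders for the components described in this application, not names taken from it.

```python
import torch

def segment_current_frame(history_frames, history_masks, current_frame,
                          memory_encoder, query_encoder, decoder,
                          fuse_feature_pairs):
    """Sketch of the claimed flow: encode each historical frame (with its mask)
    into a key/value feature pair, encode the current frame into its own
    feature pair, fuse the pairs, and decode the segmentation mask."""
    history_pairs = []
    for frame, mask in zip(history_frames, history_masks):
        k_i, v_i = memory_encoder(frame, mask)   # feature pair of one historical frame
        history_pairs.append((k_i, v_i))

    k_t, v_t = query_encoder(current_frame)      # feature pair of the current frame image

    fused = fuse_feature_pairs(k_t, v_t, history_pairs)  # see the fusion formulas below
    return decoder(fused)                        # segmentation mask of the object of interest
```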
In some embodiments, the features of the current frame image are extracted by a current frame encoder, the current frame encoder comprising a convolution layer, a down-sampling layer and a feature similarity fusion module; the extracting the features of the current frame image by using the current frame encoder to obtain the feature pairs of the current frame image includes: extracting the characteristics of the current frame image by adopting the convolution layer and the down-sampling layer to obtain a current frame characteristic diagram; extracting features of a cut image by adopting the convolution layer and the downsampling layer to obtain a cut image feature map, wherein the cut image is obtained by cutting the interested target from a first frame image, and the first frame image is a frame of the video sequence in which the interested target appears for the first time; and fusing the current frame feature map and the cut image feature map according to the feature similarity fusion module to obtain a fused image, and obtaining a feature pair of the current frame image based on the fused image.
According to some embodiments of the application, similarity fusion is carried out on the extracted current frame image and the feature map of the cut image containing the interested target, so that the interested target needing to be segmented at present is better highlighted, and the segmentation precision is further improved.
In some embodiments, the feature similarity fusion module fuses the current frame feature map and the clipped image feature map through a short-term memory network to obtain the fused image.
Some embodiments of the present application provide a network structure for fusing extracted current frame feature maps and cut image feature maps, which can implement feature fusion of current frame images and cut images through a short-time memory network, thereby better highlighting an interested target.
In some embodiments, the feature similarity fusion module obtains the fused image according to the following formula:

feat_fused = concat(feat_t ⊙ feat_p, feat_t)

wherein feat_fused characterizes the matrix corresponding to the fused image, feat_t characterizes the matrix corresponding to the current frame feature map, feat_p characterizes the matrix corresponding to the feature map of the cropped image, ⊙ denotes element-wise multiplication, and concat denotes channel-wise concatenation.
Some embodiments of the present application provide a calculation formula for fusing the extracted features of the current frame image and the extracted features of the cropped image, so as to implement quantitative characterization of the fused features.
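A minimal PyTorch sketch of this fusion, assuming both feature maps have the same spatial size (e.g., the cropped-image features have been resized or pooled beforehand); this is an illustration of the concatenated element-wise product above, not code from the application:

```python
import torch

def similarity_fuse(feat_t: torch.Tensor, feat_p: torch.Tensor) -> torch.Tensor:
    """feat_t: current frame feature map, shape [B, C, H, W].
    feat_p: cropped image feature map, assumed broadcastable to the same shape.
    Returns the fused map: element-wise product concatenated with feat_t."""
    return torch.cat([feat_t * feat_p, feat_t], dim=1)   # shape [B, 2C, H, W]
```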
In some embodiments, the features of the at least one historical frame image are extracted through an enhanced short-time memory network, the enhanced short-time memory network comprises at least one encoder of a semantic segmentation network, the encoders in the at least one encoder are connected in parallel, and each encoder receives an input historical frame image and the segmentation result of that historical frame image.
According to some embodiments of the application, the characteristics of multi-frame historical images are mined by connecting a plurality of encoders in parallel, and then inter-frame tracking is better realized.
In some embodiments, the obtaining a segmentation mask of an object of interest in the current frame image according to the feature pair of the historical frame images, the feature pair of the current frame image, and the decoder includes: respectively carrying out fusion operation on the current key matrix included by the feature pairs of the current frame image and the historical key matrix included by the feature pairs of each historical frame image to obtain a fusion key matrix; and inputting the current value matrixes included by the fusion key matrix and the feature pairs of the current frame image into the decoder to obtain the segmentation mask.
Some embodiments of the present application provide a method for fusing features of a mined historical frame image and features of a mined current frame image, and a fused feature pair (i.e., a fused key matrix and a current value matrix included in the current frame feature pair) obtained by the fusion method can improve the accuracy of a segmentation mask obtained by decoding with a decoder.
In some embodiments, the performing a fusion operation on the current key matrix included in the feature pair of the current frame image and the historical key matrix included in the feature pair of each historical frame image to obtain a fusion key matrix includes: respectively acquiring first relevancy of the current key matrix and each historical key matrix in all the historical key matrices; and obtaining the fusion key matrix according to the first correlation.
Some embodiments of the application determine the fusion key matrix by calculating the correlation between the current frame feature map and the historical frame feature map of the input enhanced short-time memory network, and better acquire valuable feature information for segmenting an interested target from the historical frame image.
In some embodiments, the feature pair f_t input to the decoder is:

f_t = concat(k_fuse, v_t)

k_fuse = (1/Z) Σ_i R(k_t, k_i) v_i

wherein R(k_t, k_i) is the first correlation degree between the current key matrix and the ith historical key matrix, Z is the sum of all the first correlation degrees, i denotes the number of any historical frame image input into the enhanced short-time memory network, t denotes the number of the current frame image as well as the total number of the historical frame images and the current frame image, k_t characterizes the current key matrix included in the feature pair of the current frame image, k_i characterizes the ith historical key matrix included in the feature pair of the ith historical frame image, v_i characterizes the ith historical value matrix included in the feature pair of the ith historical frame image, v_t characterizes the value matrix included in the feature pair of the current frame image, and k_fuse characterizes the fusion key matrix.
Some embodiments of the present application provide a computational formula that quantifies fusing pairs of historical frame image features and pairs of features of a current frame image.
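A hedged PyTorch sketch of this fusion read, corresponding to the fuse_feature_pairs placeholder used in the earlier sketch; treating the key correlation as a per-location exponential of the key dot product (as given in the formula that follows) is an assumption of this sketch.

```python
import torch

def fuse_feature_pairs(k_t, v_t, history_pairs):
    """k_t: current key matrix [B, Ck, H, W]; v_t: current value matrix [B, Cv, H, W].
    history_pairs: list of (k_i, v_i) tensors with the same shapes.
    Computes R_i = exp(k_t . k_i), Z = sum_i R_i, k_fuse = (1/Z) sum_i R_i * v_i,
    and returns concat(k_fuse, v_t) as the decoder input."""
    correlations = [torch.exp((k_t * k_i).sum(dim=1, keepdim=True))   # [B, 1, H, W]
                    for k_i, _ in history_pairs]
    z = torch.stack(correlations, dim=0).sum(dim=0)                   # sum of first correlations
    k_fuse = sum(r_i * v_i for r_i, (_, v_i) in zip(correlations, history_pairs)) / z
    return torch.cat([k_fuse, v_t], dim=1)                            # feature pair fed to the decoder
```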
In some embodiments, the first correlation degree is calculated by the following formula:

R(k_t, k_i) = exp(k_t · k_i)

wherein k_t · k_i characterizes the dot product of the current key matrix and the ith historical key matrix.
Some embodiments of the present application use an e-exponential function to calculate a fusion result of a key matrix included in a feature pair of a current frame image and a key matrix included in a feature pair of each historical frame image.
In some embodiments, the total number of historical frame images received as input by the enhanced short-time memory network is 3.
Some embodiments of the application realize inter-frame tracking by setting 3 frames of historical frame images, the calculation amount is moderate, and the segmentation precision of the current frame image is effectively improved.
In some embodiments, the extracting features of at least one historical frame image adjacent to the current frame image to obtain a feature pair of each historical frame image in the at least one historical frame image includes: concatenating a second historical frame image with the segmentation mask of the second historical frame image, and inputting the concatenated result into the enhanced short-time memory network to obtain the feature pair of the second historical frame image, wherein the second historical frame image is any one of the at least one historical frame image.
Some embodiments of the application realize better inter-frame tracking by inputting each frame of historical image and the segmentation result of each frame of historical image into an enhanced short-time memory network at the same time, and finally improve the segmentation result of the interested target.
In a second aspect, some embodiments of the present application provide an apparatus for segmenting video objects, the apparatus comprising: the historical frame feature mining module is configured to extract features of at least one historical frame image before a current frame image to obtain feature pairs of each historical frame image in the at least one historical frame image; the current frame coding network module is configured to extract the characteristics of the current frame image to obtain a characteristic pair of the current frame image; the decoding network module is configured to acquire a segmentation mask image of an interested target in the current frame image according to the feature pairs of the historical frame images, the feature pairs of the current frame image and a decoder; wherein the feature pair comprises a key matrix and a value matrix.
In some embodiments, the current frame encoding network module is further configured to: extracting the characteristics of the current frame image by adopting a convolution layer and a down-sampling layer to obtain a current frame characteristic diagram; extracting features of a cut image by adopting the convolution layer and the downsampling layer to obtain a cut image feature map, wherein the cut image is obtained by cutting the interested target from a first frame image, and the first frame image is a frame of the video sequence in which the interested target appears for the first time; and fusing the current frame feature image and the cut image feature image according to a feature similarity fusion module to obtain a feature pair of the current frame image.
In a third aspect, some embodiments of the present application provide a method for training a network model for segmenting a video object, the method comprising: performing salient object segmentation training on a base network by using a data set, wherein the base network comprises a current frame encoder and a decoder; and training a video object segmentation network according to at least one historical frame image, a current frame image and a cropped image, wherein the video object segmentation network comprises the base network obtained after training is completed and an enhanced short-time memory network, the current frame encoder is further configured to extract and fuse the features of the current frame image and the cropped image, and the at least one historical frame image and the current frame image are from the same video sequence.
In a fourth aspect, some embodiments of the present application provide a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective methods according to the first aspect described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is one of the architecture diagrams of a video object segmentation network provided in an embodiment of the present application;
fig. 2 is a second architecture diagram of a video object segmentation network according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for segmenting video objects according to an embodiment of the present application;
fig. 4 is a second flowchart of a method for segmenting video objects according to an embodiment of the present application;
fig. 5 is a third architecture diagram of a video object segmentation network according to an embodiment of the present application;
fig. 6 is a block diagram illustrating a device for segmenting a video object according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
Related-art video object segmentation methods fall mainly into two categories. One category attempts to migrate the object segmentation result of the first frame to subsequent frames. For example, MaskTrack feeds the current frame image, the current frame object mask and the next frame image into an optical-flow method to learn the optical-flow change of the object mask of the next frame relative to the previous frame. However, the accuracy of this kind of method degrades significantly when the object changes greatly, and an erroneous segmentation result can seriously affect the segmentation accuracy of subsequent frames. The other category divides video segmentation into two parts, image segmentation and object tracking: instance segmentation is performed on each frame to distinguish the individual objects, and the objects are then matched across consecutive frames by methods such as re-identification. Such methods are mainly limited by the precision of instance segmentation, whose results are significantly worse than semantic segmentation on edges or small objects.
Unlike the related art described above, embodiments of the present application provide a more efficient method for video object segmentation based on deep learning. For example, in some embodiments of the present application, based on an enhanced short-time memory network and a network that fuses similarities of image features of a current frame and features of a cropped image, object segmentation and tracking are fused into a neural network, and a segmentation result of a historical frame is used to supervise a segmentation result of the current frame, so that the video object segmentation accuracy is significantly improved.
Referring to fig. 1, fig. 1 is an architecture diagram of a video object segmentation network model based on a deep learning method according to some embodiments of the present application. As shown in fig. 1, the architecture of the video object segmentation network model of some embodiments of the present application includes an enhanced short-time memory network 100, a current frame encoder 120, and a decoder 130, wherein the decoder 130 is connected with the enhanced short-time memory network 100 and the current frame encoder 120 directly or through the remaining functional modules.
The enhanced short-time memory network 100 includes one or more parallel historical frame encoders, where each historical frame encoder 110 is configured to receive an input historical frame image. As shown in fig. 1, the enhanced short-time memory network 100 includes N historical frame encoders 110; in fig. 1 the N historical frame encoders 110 respectively receive the input first historical frame image, the second historical frame image, ..., up to the Nth historical frame image, and each performs feature extraction on its input to obtain the feature pair corresponding to that historical frame image. Each historical frame encoder 110 includes a plurality of convolution layers and a plurality of down-sampling layers, and the feature extraction yields the first feature pair, the second feature pair, ..., up to the Nth feature pair, N being a natural number greater than 1 (for example, N takes the value 3 in some examples). It should be noted that the historical frame encoder 110 in fig. 1 is the encoder, or encoding network, of a semantic segmentation network. For example, each historical frame encoder 110 may be the encoder portion of a unet network (also referred to as a "U-shaped convolutional neural network", comprising an encoder and a decoder composed of convolution layers). By extracting features of each historical frame image at different scales, the historical frame encoder 110 obtains a first feature matrix and a second feature matrix corresponding to each historical frame image, wherein the first feature matrix may be named the historical frame key matrix and the second feature matrix may be named the historical frame value matrix.
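A hedged sketch of one such historical frame encoder in PyTorch; the channel sizes, layer counts and the two 1×1 heads are illustrative assumptions rather than values specified in this application.

```python
import torch
import torch.nn as nn

class HistoryFrameEncoder(nn.Module):
    """Conv + down-sampling backbone with two heads producing the historical
    frame key matrix and value matrix from a frame concatenated with its mask."""
    def __init__(self, in_channels=4, feat_channels=256, key_channels=64, value_channels=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, feat_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.key_head = nn.Conv2d(feat_channels, key_channels, kernel_size=1)
        self.value_head = nn.Conv2d(feat_channels, value_channels, kernel_size=1)

    def forward(self, frame, mask):
        x = torch.cat([frame, mask], dim=1)      # RGB frame + segmentation mask -> 4 channels
        feat = self.backbone(x)                  # multi-layer convolution and down-sampling
        return self.key_head(feat), self.value_head(feat)   # key matrix, value matrix
```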
The current frame encoder 120 of fig. 1 is configured to receive an input current frame image, and extract features of the current frame image to obtain a feature pair of the current frame. For example, the current frame encoder 120 belongs to an encoder comprised by a semantic segmentation network, and in particular, the current frame encoder may be an encoder portion comprised by a unet network. As an embodiment, the current frame encoder 120 includes: the method comprises the steps that a plurality of convolutional layers and a plurality of downsampling layers are used for extracting features of different scales of a current frame image, and a first feature matrix and a second feature matrix corresponding to the current frame image can be obtained, wherein the first feature matrix can be named as a current frame key matrix, and the second feature matrix can be named as a current frame value matrix.
It should be noted that the enhanced short-time memory network 100 in some embodiments of the present application further includes a feature pair connection module 115 for connecting the feature pairs of the respective historical frames extracted by each historical frame encoder. In other embodiments of the present application, the feature pair connection module 115 and the current frame encoder 120 are further connected to a historical feature fusion module 140, which fuses the feature pair of the current frame image with the feature pairs of the historical frames to obtain a fused feature pair (i.e., the fusion key matrix described below together with the current value matrix included in the feature pair of the current frame image); the fused feature pair is input into the decoder 130 and decoded, finally yielding the segmentation mask image of the object of interest in the current frame image. The fusion algorithm and process of the historical feature fusion module 140 are described in detail below and are not repeated here.
In order to further highlight the currently segmented object of interest to improve the video object segmentation accuracy, the architecture of the network model for video object segmentation according to some embodiments of the present application is shown in fig. 2.
Unlike fig. 1, the current frame encoder 120 of fig. 2 includes two parallel feature extraction units and a feature similarity fusion module 123. In some embodiments of the present application, each of the two parallel feature extraction units includes multiple convolution layers and multiple downsampling layers (i.e., the multi-layer convolution and downsampling network 121 of fig. 2). The current frame image is input into the multi-layer convolution and downsampling network 121 to obtain the current frame feature map, and the cropped image containing the object of interest is input into the multi-layer convolution and downsampling network 121 to obtain the cropped image feature map. The feature similarity fusion module 123 then fuses the current frame feature map and the cropped image feature map to obtain a fused feature, and the feature pair of the current frame image is obtained from the fused feature (e.g., a short-time memory network processes the fused feature to produce the key-value feature pair of the current frame image).
It will be appreciated that, in order to perform video object segmentation with the network model of fig. 1, some embodiments of the present application first require training that network model. In some embodiments, the method for training the video segmentation network model of fig. 1 includes: performing salient object segmentation training on the base network 20 of fig. 1 by using a data set, wherein the base network 20 comprises the current frame encoder 120 and the decoder 130; and training the video object segmentation network 10 (i.e., the whole network model of fig. 1) according to at least one historical frame image and the current frame image, wherein the video object segmentation network includes the base network obtained after training is completed and the enhanced short-time memory network 100.
In order to perform video object segmentation by using the network model of fig. 2, embodiments of the present application need to train the video object segmentation network model of fig. 2 first. In some embodiments of the present application, a method of training the network model of fig. 2 of some embodiments of the present application includes: performing salient object segmentation training on the base network 20 of fig. 2 by using a data set, wherein the base network 20 comprises a multi-layer convolution and down-sampling network (at least implementing partial functions of a current frame encoder) and a decoder 130; training a video object segmentation network 10 (i.e. the whole network model of fig. 2) according to at least one historical frame image, a current frame image and a cropped image, wherein the video object segmentation network 10 includes a base network obtained after training is completed, an enhanced short-time memory network and a model for obtaining feature similarity of the current frame image and the cropped image (specifically, including the multi-layer convolution and downsampling network 121 for receiving the input cropped image of fig. 2 and the feature similarity fusion module 123 of fig. 2), and the current frame encoder 120 is further configured to extract and fuse features of the current frame image and the cropped image.
That is, embodiments of the present application train and test on public data sets based on the network architecture of fig. 1 or fig. 2. Since the objects to be segmented are of many kinds and not known in advance, some embodiments of the present application first train salient object segmentation using the base network model. The goal of salient object segmentation is to segment the salient objects in an image without restricting the object types. For example, the embodiments of the application adopt data sets such as COCO and VOC, randomly select one object from an annotated image as the salient object, and discard the other objects. A base network for single-object segmentation is trained on the generated data set, and the video object segmentation network 10 of fig. 1 or fig. 2 is then trained using this base network as initialization. In some embodiments of the present application, the segmentation accuracy of the video object segmentation network 10 evaluated on the YouTube-VOS data set is better than that of the other methods participating in the evaluation.
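A hedged sketch of this two-stage schedule; the dataset iterables, loss function, optimizer factory, epoch counts and the load_base helper are hypothetical placeholders, not details from this application.

```python
def train_two_stage(base_network, video_segmentation_network,
                    salient_dataset, video_dataset, criterion, optimizer_fn,
                    stage1_epochs=20, stage2_epochs=40):
    # Stage 1: salient object segmentation pre-training of the base network
    opt = optimizer_fn(base_network.parameters())
    for _ in range(stage1_epochs):
        for image, mask in salient_dataset:          # single-object salient samples
            loss = criterion(base_network(image), mask)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: use the trained base network as initialization and train the
    # full video object segmentation network on video samples
    video_segmentation_network.load_base(base_network)   # hypothetical helper
    opt = optimizer_fn(video_segmentation_network.parameters())
    for _ in range(stage2_epochs):
        for history_frames, history_masks, current_frame, crop, target in video_dataset:
            pred = video_segmentation_network(history_frames, history_masks,
                                              current_frame, crop)
            loss = criterion(pred, target)
            opt.zero_grad(); loss.backward(); opt.step()
```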
It should be noted that, at least one historical frame image and the current frame image involved in training the video object segmentation network model according to the embodiment of the present application are both from the same video sequence, but each historical frame image in the at least one historical frame image is randomly selected from the video sequence, and these historical frame images and the current frame image are not necessarily adjacent images.
The following describes an exemplary process of the video object segmentation method performed by the trained video object segmentation network 10 in conjunction with fig. 3.
As shown in fig. 3, some embodiments of the present application provide a method of segmenting a video object, the method comprising: s101, extracting the characteristics of at least one historical frame image before a current frame image to obtain the characteristic pair of each historical frame image in the at least one historical frame image; s102, extracting the characteristics of the current frame image to obtain a characteristic pair of the current frame image; s103, acquiring a segmentation mask of an interested target in the current frame image according to the feature pair of each historical frame image, the feature pair of the current frame image and a decoder; wherein each of the at least one history frame image is a previous frame or a plurality of frames of the current frame image, and the feature pair includes a key matrix and a value matrix. According to the embodiments of the application, inter-frame tracking is realized through an enhanced short-time memory network, and the video object segmentation precision of the current frame image is remarkably improved. In some embodiments of the present application, S101 extracts features of at least one historical frame image according to an enhanced short-time memory network, or S102 extracts features of a current frame image using a current frame encoder.
It should be noted that the at least one historical frame image input to the enhanced short-time memory network and the current frame image input to the current frame encoder both come from the same video sequence, and each historical frame image included in the at least one historical frame image is one of the frames located before the current frame image in the video sequence. That is, in some embodiments of the present application, when performing video object segmentation on the current frame image with the trained video object segmentation network, the at least one historical frame image input to the enhanced short-time memory network and the current frame image input to the current frame encoder are a plurality of adjacent frame images. In other embodiments, the at least one historical frame image input to the enhanced short-time memory network is not adjacent to the current frame image input to the current frame encoder (for example, one historical frame image is taken every n frames before the current frame image). When the number of historical frames that the enhanced short-time memory network needs as input is larger than the total number of frames before the current frame image, the at least one historical frame may include two identical historical frames. For example, if the number of input nodes of the enhanced short-time memory network is 3, the 3 frames immediately preceding the current frame image can be taken when performing video object segmentation on it; if fewer than 3 such frames exist, the previous one or two frames are selected repeatedly, as sketched below.
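A sketch of this frame-selection rule; treating the very first frame of the sequence (which has no history) as returning an empty list is an assumption of this sketch.

```python
def select_history_indices(current_index: int, num_history: int = 3) -> list:
    """Return the indices of the historical frames fed to the enhanced
    short-time memory network for the frame at current_index."""
    if current_index == 0:
        return []                                  # no history available (assumption)
    indices = list(range(max(0, current_index - num_history), current_index))
    while len(indices) < num_history:
        indices.insert(0, indices[0])              # repeat earlier frames when fewer exist
    return indices
```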
The steps of fig. 3 are exemplarily set forth below.
In some embodiments of the present application, S101 comprises: any one (for example, the second historical frame image) historical frame image and a pre-acquired segmentation mask of the historical frame image are connected in series and then input into a historical frame encoder included in the enhanced short-time memory network to acquire a feature pair of the historical frame image. For example, when the second historical frame image is an RGB three-channel image, the information input into each historical frame encoder in the enhanced short-time memory network is a four-channel image including an R channel, a G channel, a B channel, and a segmentation mask of the corresponding historical frame image, and the output of the historical frame encoder is two feature maps, and the corresponding matrices of the two feature maps may be named as a key matrix and a value matrix.
For example, the enhanced short-term memory network of S101 includes at least one encoder of a semantic segmentation network (i.e., the historical frame encoder of fig. 1 or fig. 2), in which the encoders are connected in parallel, and each encoder receives an input historical frame image and a segmentation result of the historical frame image. And then each encoder extracts the characteristics of the input historical frame images, respectively obtains and outputs the characteristic pair results of each historical frame image.
That is, as described above, the enhanced short-term memory network structure adopted in S101 may be obtained by connecting in parallel encoders included in a semantic segmentation network such as a plurality of unets, and each encoder receives an input historical frame image to obtain a feature map corresponding to the frame image. For example, the encoder includes a plurality of convolutional layers and downsample layers.
In some embodiments of the present application, as shown in fig. 4, the current frame encoder in S102 includes a convolution layer, a down-sampling layer, and a feature similarity fusion module; wherein, S102 includes: s1021, extracting the features of the current frame image by adopting the convolution layer and the down-sampling layer to obtain a current frame feature map; s1022, extracting features of a clipped image by using the convolution layer and the downsampling layer to obtain a clipped image feature map, wherein the clipped image is obtained by clipping the interested target from a first frame image, and the first frame image is a frame in a video sequence in which the interested target appears for the first time; and S1023, fusing the current frame feature map and the cut image feature map according to a feature similarity fusion module to obtain a fusion image, and obtaining a feature pair of the current frame image based on the fusion image.
It should be noted that, the processes of obtaining the current frame feature map and obtaining the cropped image feature map may be performed simultaneously, or the cropped image feature map may be obtained first and then the current frame feature map is obtained, and the embodiment of the present application does not limit the sequence of the two steps of obtaining the current frame feature map at S1021 and obtaining the cropped image feature map at S1022. Thus, in some embodiments, S102 comprises: extracting features of a cut image by adopting the convolution layer and the downsampling layer to obtain a cut image feature map, wherein the cut image is obtained by cutting the interested target from a first frame image, and the first frame image is a frame of the video sequence in which the interested target appears for the first time; extracting the characteristics of the current frame image by adopting the convolution layer and the down-sampling layer to obtain a current frame characteristic diagram; and fusing the current frame feature map and the cut image feature map according to a feature similarity fusion module to obtain a fusion image, and obtaining a feature pair of the current frame image based on the fusion image.
In order to further highlight the object of interest being segmented, the feature similarity fusion module in S102 fuses the current frame feature map and the cropped image feature map through a short-time memory network to obtain the fused image, and obtains the feature pair of the current frame image based on the fused image. For example, the feature similarity fusion module obtains the fused image of S1023 according to the following formula:

feat_fused = concat(feat_t ⊙ feat_p, feat_t)

wherein feat_fused characterizes the matrix corresponding to the fused image, feat_t characterizes the matrix corresponding to the current frame feature map, and feat_p characterizes the matrix corresponding to the feature map of the cropped image.
In some embodiments of the present application, S103 comprises: respectively carrying out fusion operation on the current key matrix included by the feature pairs of the current frame image and the historical key matrix included by the feature pairs of each historical frame image to obtain a fusion key matrix; and inputting the current value matrix (namely the characteristic pair input into a decoder) included by the fusion key matrix and the current frame characteristic pair into the decoder to obtain the segmentation mask image. For example, the performing a fusion operation on the current key matrix included in the current frame feature pair and the historical key matrix included in the feature pairs of each historical frame image to obtain a fusion key matrix includes: acquiring a first correlation degree of the current key matrix and each historical key matrix in all the historical key matrices; and obtaining the fusion key matrix according to the first correlation. That is, the current value matrix included in the fusion key matrix and the current frame feature pair is used as the feature pair of the input decoder.
In particular, the feature pair f_t input to the decoder is:

f_t = concat(k_fuse, v_t)

k_fuse = (1/Z) Σ_i R(k_t, k_i) v_i

wherein R(k_t, k_i) is the first correlation degree between the current key matrix and the ith historical key matrix, Z is the sum of all the first correlation degrees, i denotes the number of any historical frame image input into the enhanced short-time memory network, t denotes the number of the current frame image as well as the total number of the historical frame images and the current frame image, k_t characterizes the current key matrix included in the feature pair of the current frame image, k_i characterizes the ith historical key matrix included in the feature pair of the ith historical frame image, v_i characterizes the ith historical value matrix included in the feature pair of the ith historical frame image, v_t characterizes the value matrix included in the feature pair of the current frame image, and k_fuse characterizes the fusion key matrix. In addition, the number i of the ith historical frame image is not the frame number of that image in the whole video sequence, but a number assigned according to the total number of frames input into the enhanced short-time memory network each time.
For example, the first correlation degree is calculated by the following formula:

R(k_t, k_i) = exp(k_t · k_i)

wherein k_t · k_i characterizes the dot product of the current key matrix and the ith historical key matrix. Some embodiments of the present application use the exponential function to compute the fusion result of the key matrix included in the feature pair of the current frame image and the key matrix included in the feature pair of each historical frame image. It should be noted that other function types can also be used to calculate the first correlation degree; for example, the similarity can be computed using the square root or the absolute value of the matrix dot product.
Some embodiments of the present application provide a method for fusing features of a mined historical frame image and features of a mined current frame image, and the accuracy of a segmentation mask obtained by decoding with a decoder can be improved by decoding a fused feature pair (i.e., a current value matrix included in a fused key matrix and a current frame feature pair) obtained by the fusion method.
The following exemplarily explains a video object segmentation method and a corresponding video segmentation network architecture according to some embodiments of the present application with reference to fig. 5 by taking 3 historical frame images as an example.
As shown in fig. 5, the video object segmentation network architecture for implementing video object segmentation of the present application includes a backbone network, i.e., a grid formed by the current frame encoder 120 and decoder 130 of fig. 5. As shown in the right part of fig. 5, the backbone network functions as a standard semantic segmentation network. There are a very wide variety of semantic segmentation networks. In the video segmentation network provided by the embodiment of the application, the backbone network can be any semantic segmentation network structure. For example, some embodiments of the present application employ a standard unet structure. The input unet network is an image, and features of different scales of the image are extracted through a multi-layer convolution and down-sampling feature network (i.e., the current frame encoder 120 of fig. 5), and then the features of different scales are fused through a decoding network (i.e., the decoder 130 of fig. 5) which continuously samples up, so as to output an object segmentation mask (i.e., the final output of the entire network model of fig. 5).
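One decoder stage of such a backbone might look like the following sketch (a generic UNet-style up-sampling block; the layer layout and channel sizes are assumptions, not details from this application):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Up-sample the coarse feature map, concatenate the skip feature of the
    matching scale from the encoder, and fuse them with a convolution."""
    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + skip_channels, out_channels, 3, padding=1)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return F.relu(self.conv(torch.cat([x, skip], dim=1)))
```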
As shown in fig. 5, the enhanced short-time memory network 100 of the embodiment of the present application is used to implement inter-frame tracking. For the current frame image I_t, its historical frames and historical segmentation results are [I_0, ..., I_{t-1}] and [M_0, ..., M_{t-1}], respectively. For each historical frame image, the enhanced short-time memory network 100 extracts a feature pair [k, v], i.e., the Key matrix and the Value matrix of fig. 5, obtained by the memory encoder Enc_M of fig. 5. For example, if each frame in the video sequence is an RGB image, the input of each historical frame encoder (Enc_M) included in the enhanced short-time memory network 100 is the 4-channel image obtained by concatenating I and M (i.e., the R, G and B channels plus the segmentation mask of that frame), and the output is two feature maps, or two feature matrices: the Key matrix and the Value matrix.
In order to ensure that the input and output of the video object segmentation network are consistent in actual use, during training 3 frames are randomly selected from the historical frames and input into the enhanced short-time memory network 100 to train the video object segmentation network. At inference time, the 3 frames immediately preceding the current frame image can be taken; if fewer than 3 exist, the previous one or two historical frames are selected repeatedly. For the current frame image, the backbone network also generates a [k, v] feature pair, i.e., the key matrix and the value matrix of fig. 5, while extracting features of different scales.
In some embodiments of the present application, when performing video object segmentation, it is necessary to fuse the extracted feature pair [ k, v ] of the three historical frame images and the extracted feature pair [ k, v ] of the current frame image to obtain a new fused feature map, and use the fused feature map as an input of a decoder 130 (or called a decoding network) in a subsequent backbone network. For example, a formula is adopted for a specific fusion method of a feature pair of a historical frame image and a feature pair of a current frame, wherein t is the number of the current frame image, i is the number of the historical frame image input into the enhanced short-time memory network, t is 3, and the value of i is 0, 1 and 2:
Figure BDA0003088341070000161
wherein, R is the correlation degree of two featmaps, and Z is the sum of the correlation degrees of all frames.
Figure BDA0003088341070000162
Figure BDA0003088341070000163
It should be noted that, the definitions of the relevant parameters related to the above formulas can refer to the foregoing descriptions, and are not described in detail herein to avoid repetition.
Further, in order to better highlight the currently segmented object, some embodiments of the present application further design a feature similarity module (i.e., the feature similarity fusion module 123 of fig. 5) that fuses the current frame image features and the cropped image features. The current frame encoder 120 of fig. 5 includes the multi-layer convolution and down-sampling network 121, the feature fusion module 123 and the short-time memory network 122, wherein the network 121 is used to acquire the features of the current frame image and the features of the cropped image, the feature fusion module 123 is used to fuse the current frame feature map and the cropped image feature map obtained through the network 121, and the features fused by the feature fusion module 123 are passed through the short-time memory network 122 to obtain the key-value pair consisting of the matrix k and the matrix v. For example, the cropped image containing the object of interest is input into the multi-layer convolution and down-sampling network 121 to obtain the cropped image feature map, the current frame image is input into the same network 121 to obtain the current frame feature map, the cropped image feature map and the current frame feature map extracted by the main network are fused by the feature fusion module 123 according to the following fusion formula, and the new feature map obtained after fusion enters the short-time memory network 122 to obtain the Key and Value matrices.
In some embodiments of the present application, the method for fusing the current frame feature map and the cropped image feature map by the feature fusion module 123 is to multiply the two feature maps, i.e., the current frame feature map and the cropped image feature map, and concatenate the original skeleton features (i.e., concatenate the current frame feature map), where a specific fusion formula is as follows, and the meaning of each parameter in the following fusion formula may refer to the above description to avoid repetition and is not described herein repeatedly.
feat_fused = concat(feat_t ⊙ feat_p, feat_t)
It should be noted that, in order to obtain the cropped image, some embodiments of the present application further need to crop the object of interest out of the first frame image (i.e., the frame in which the object of interest first appears in the same video sequence, which is not necessarily the absolute first frame of the video sequence) and set all positions outside the object mask to 0, so that only the texture information of the object is retained, yielding the cropped image that is input into the multi-layer convolution and down-sampling network 121 as shown in fig. 5. The reading module 160 of fig. 5 is configured to read the historical key-value pairs of the historical frame images and the key-value pair of the current frame in which the current frame image and the cropped image are fused, and the ASPP module (i.e., atrous spatial pyramid pooling) is configured to fuse the historical key-value pairs and the key-value pair of the current frame into a further feature.
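Preparing the cropped image described above can be sketched as follows; cropping to the mask's bounding box is an assumption of this sketch, since the application only states that positions outside the object mask are set to 0.

```python
import torch

def make_cropped_image(first_frame: torch.Tensor, object_mask: torch.Tensor) -> torch.Tensor:
    """first_frame: [B, 3, H, W]; object_mask: [B, 1, H, W] binary mask of the
    object of interest in the frame where it first appears.
    Keeps only the object's texture and crops to its bounding box."""
    masked = first_frame * object_mask                      # zero out everything outside the mask
    ys, xs = torch.nonzero(object_mask[0, 0], as_tuple=True)
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    return masked[:, :, y0:y1, x0:x1]
```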
In some embodiments of the present application, the current frame encoder 120 may employ a twin (Siamese) query encoder, where "twin" refers to parameter sharing between the two encoders.
Referring to fig. 6, fig. 6 shows an apparatus for segmenting a video object according to an embodiment of the present application, it should be understood that the apparatus corresponds to the above-mentioned embodiment of the method of fig. 3, and can perform the steps related to the above-mentioned embodiment of the method, and the specific functions of the apparatus can be referred to the above description, and detailed descriptions are appropriately omitted herein to avoid repetition. The device comprises at least one software functional module which can be stored in a memory in the form of software or firmware or solidified in an operating system of the device, and the device for segmenting the video object comprises: a historical frame feature mining module 101, configured to extract features of at least one historical frame image before a current frame image, to obtain a feature pair of each historical frame image in the at least one historical frame image; a current frame encoding network module 102, configured to extract features of the current frame image to obtain a feature pair of the current frame image; the decoding network module 103 is configured to obtain a segmentation mask image of an interest target in the current frame image according to the feature pairs of the historical frame images, the feature pairs of the current frame image and a decoder; wherein each of the at least one history frame image is a previous frame or a plurality of frames of the current frame image, and the feature pair includes a key matrix and a value matrix.
In some embodiments of the present application, the current frame encoding network module 102 is further configured to: extracting the characteristics of the current frame image by adopting a convolution layer and a down-sampling layer to obtain a current frame characteristic diagram; extracting features of a cut image by adopting the convolution layer and the downsampling layer to obtain a cut image feature map, wherein the cut image is obtained by cutting the interested target from a first frame image, and the first frame image is a frame of the video sequence in which the interested target appears for the first time; and fusing the current frame feature map and the cut image feature map according to a feature similarity fusion module to obtain a feature pair of the current frame.
Some embodiments of the present application provide a method for training a network model for segmenting a video object, the method comprising: performing salient object segmentation training on a base network by using a data set, wherein the base network comprises a current frame encoder and a decoder; and training a video object segmentation network according to at least one historical frame image, a current frame image and a cropped image, wherein the video object segmentation network comprises the base network obtained after training is completed and an enhanced short-time memory network, the current frame encoder is further configured to extract and fuse the features of the current frame image and the cropped image, and the at least one historical frame image and the current frame image are from the same video sequence.
Some embodiments of the present application provide a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations according to the respective methods described above in fig. 3.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only illustrative of specific embodiments of the present application and is not intended to limit its scope; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principle of the present application, and any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed herein, shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims. It should also be noted that like reference numbers and letters refer to like items in the figures; once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements does not consist only of those elements but may include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.

Claims (14)

1. A method of segmenting video objects, the method comprising:
extracting features of at least one historical frame image before a current frame image to obtain a feature pair of each historical frame image in the at least one historical frame image;
extracting features of the current frame image to obtain a feature pair of the current frame image; and
obtaining a segmentation mask of an object of interest in the current frame image according to the feature pair of each historical frame image, the feature pair of the current frame image and a decoder;
wherein each feature pair comprises a key matrix and a value matrix.
2. The method of claim 1, wherein the features of the current frame image are extracted by a current frame encoder, the current frame encoder comprising a convolution layer, a down-sampling layer and a feature similarity fusion module; wherein
the extracting features of the current frame image by using the current frame encoder to obtain the feature pair of the current frame image comprises:
extracting features of the current frame image by using the convolution layer and the down-sampling layer to obtain a current frame feature map;
extracting features of a cropped image by using the convolution layer and the down-sampling layer to obtain a cropped image feature map, wherein the cropped image is obtained by cropping the object of interest from a first frame image, and the first frame image is the frame of the video sequence in which the object of interest appears for the first time; and
fusing the current frame feature map and the cropped image feature map by using the feature similarity fusion module to obtain a fused image, and obtaining the feature pair of the current frame image based on the fused image.
3. The method of claim 2, wherein the feature similarity fusion module fuses the current frame feature map and the cropped image feature map through a short-time memory network to obtain the fused image.
4. The method of claim 2 or 3, wherein the feature similarity fusion module obtains the fused image according to the following formula:
[formula omitted: rendered as an image in the original publication]
wherein the symbol on the left-hand side, likewise rendered as an image, characterizes the matrix corresponding to the fused image, feat_t characterizes the matrix corresponding to the current frame feature map, and feat_p characterizes the matrix corresponding to the cropped image feature map.
5. The method according to any one of claims 1-4, wherein the features of the at least one historical frame image are extracted through an enhanced short-time memory network, the enhanced short-time memory network comprises at least one encoder of a semantic segmentation network, the encoders of the at least one encoder are connected in parallel, and each encoder receives as input a historical frame image and a segmentation result of the historical frame image.
6. The method according to any one of claims 1 to 4, wherein the obtaining a segmentation mask of the object of interest in the current frame image according to the feature pair of each historical frame image, the feature pair of the current frame image and a decoder comprises:
performing a fusion operation on the current key matrix included in the feature pair of the current frame image and the historical key matrix included in the feature pair of each historical frame image, respectively, to obtain a fusion key matrix; and
inputting the fusion key matrix and the current value matrix included in the feature pair of the current frame image into the decoder to obtain the segmentation mask.
7. The method of claim 6, wherein the performing a fusion operation on the current key matrix included in the feature pair of the current frame image and the historical key matrix included in the feature pair of each historical frame image to obtain the fusion key matrix comprises:
obtaining a first correlation between the current key matrix and each historical key matrix among all the historical key matrices; and
obtaining the fusion key matrix according to the first correlations.
8. The method of claim 6, wherein the feature pair f_t input to the decoder is obtained according to the following formulas:
[formulas omitted: rendered as images in the original publication; a plausible reconstruction is given after the claims]
wherein R is the first correlation between the current key matrix and each historical key matrix, Z is the sum of all the first correlations, i denotes the index of any one historical frame image input into the enhanced short-time memory network, and t denotes the index of the current frame image, which is also the total number of all the historical frame images and the current frame image taken together; the remaining symbols, likewise rendered as images, respectively characterize the current key matrix included in the feature pair of the current frame image, the ith historical key matrix included in the feature pair of the ith historical frame image, the ith historical value matrix included in the feature pair of the ith historical frame image, the value matrix included in the feature pair of the current frame image, and the fusion key matrix.
9. The method of claim 8, wherein the first correlation is calculated by the following formula:
[formula omitted: rendered as an image in the original publication; see the reconstruction after the claims]
wherein the formula characterizes a dot product operation between the current key matrix and the ith historical key matrix.
10. The method of claim 1, wherein the total number of the at least one historical frame image is 3.
11. The method according to any one of claims 1 to 10, wherein the extracting features of at least one historical frame image before the current frame image to obtain a feature pair of each historical frame image in the at least one historical frame image comprises: concatenating a second historical frame image and the segmentation mask of the second historical frame image, and inputting the concatenated result into an enhanced short-time memory network to obtain the feature pair of the second historical frame image, wherein the second historical frame image is any one of the at least one historical frame image.
12. An apparatus for segmenting video objects, the apparatus comprising:
a historical frame feature mining module, configured to extract features of at least one historical frame image before a current frame image to obtain a feature pair of each historical frame image in the at least one historical frame image;
a current frame encoding network module, configured to extract features of the current frame image to obtain a feature pair of the current frame image; and
a decoding network module, configured to obtain a segmentation mask of an object of interest in the current frame image according to the feature pair of each historical frame image, the feature pair of the current frame image and a decoder;
wherein each feature pair comprises a key matrix and a value matrix.
13. A method for training a network model for segmenting video objects, the method comprising:
performing salient object segmentation training on a basic network by using a data set, wherein the basic network comprises a current frame encoder and a decoder; and
training a video object segmentation network according to at least one historical frame image, a current frame image and a cropped image, wherein the video object segmentation network comprises the basic network obtained after the salient object segmentation training is completed and an enhanced short-time memory network, the current frame encoder is further configured to extract and fuse features of the current frame image and the cropped image, and the at least one historical frame image and the current frame image are from the same video sequence.
14. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 11.
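Editor's note: the formulas of claims 4, 8 and 9 are published as images and could not be extracted; the fusion formula of claim 4 cannot be reconstructed from the text alone. For claims 8 and 9, the symbol definitions (a dot-product correlation R between the current key matrix and each historical key matrix, its sum Z, a read over the historical value matrices, and a concatenation with the current value matrix before decoding) suggest the following plausible reconstruction, which is an editorial assumption rather than the published formulas:

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Plausible reconstruction (editorial assumption) of the claim 8/9 formulas.
% k_t, v_t: current key and value matrices; k_i, v_i: key and value matrices of
% the i-th historical frame; R_i: first correlation; Z: sum of the first
% correlations; \hat{k}_t: fusion key matrix; f_t: feature pair fed to the decoder.
\begin{align}
  R_i &= \exp\!\left(k_t \odot k_i\right)
      && \text{claim 9: dot-product correlation (the exponential is assumed)} \\
  Z &= \sum_{i=1}^{t-1} R_i
      && \text{sum of all first correlations} \\
  \hat{k}_t &= \sum_{i=1}^{t-1} \frac{R_i}{Z}\, v_i
      && \text{correlation-weighted read over the historical value matrices} \\
  f_t &= \bigl[\hat{k}_t,\; v_t\bigr]
      && \text{concatenation with the current value matrix, input to the decoder}
\end{align}
\end{document}
```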