
CN113506316A - Method and device for segmenting video object and network model training method - Google Patents

Method and device for segmenting video object and network model training method

Info

Publication number
CN113506316A
CN113506316A
Authority
CN
China
Prior art keywords
frame image
current frame
historical
image
feature
Prior art date
Legal status
Pending
Application number
CN202110587943.9A
Other languages
Chinese (zh)
Inventor
熊鹏飞
王培森
Current Assignee
Force Map New Chongqing Technology Co ltd
Original Assignee
Beijing Megvii Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Megvii Technology Co Ltd
Priority to CN202110587943.9A
Publication of CN113506316A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide a method and an apparatus for segmenting video objects, and a network model training method. The method for segmenting a video object includes: extracting features of at least one historical frame image preceding a current frame image to obtain a feature pair for each historical frame image; extracting features of the current frame image to obtain a feature pair for the current frame image; and obtaining a segmentation mask of a target of interest in the current frame image according to the feature pairs of the historical frame images, the feature pair of the current frame image, and a decoder. Each of the at least one historical frame image is one of the frames preceding the current frame image, and each feature pair includes a key matrix and a value matrix. Some embodiments of the present application implement inter-frame tracking through an enhanced short-term memory network, which significantly improves the video object segmentation accuracy for the current frame image.

Figure 202110587943

Description

Method and device for segmenting video objects, and network model training method

Technical field

The present application relates to the field of video object segmentation; in particular, the embodiments of the present application relate to a method and an apparatus for segmenting video objects and to a network model training method.

Background

Video object segmentation is an important topic in computer vision and is very widely used in scenarios such as video surveillance, object tracking, and mobile phone video processing. Video object segmentation comprises two parts: image segmentation and object tracking. The task refers to, given a video sequence and one or more predefined objects, accurately and separately segmenting all of those objects in the subsequent video frames.

Existing video object segmentation is limited mainly by segmentation accuracy. On the one hand, object segmentation on a single frame is itself problematic: within a video sequence, the same object undergoes very large pose and shape changes across frames and often differs greatly from its predefined pose and shape, so segmentation based on the predefined pose and shape has low accuracy. On the other hand, when multiple objects of the same category appear in the video, segmentation must distinguish the target object from the others without mistakenly segmenting it into other similar objects. For these reasons, the accuracy of existing video object segmentation techniques is severely limited.

Therefore, how to improve the accuracy of video object segmentation has become an urgent technical problem.

Summary of the invention

The purpose of the embodiments of the present application is to provide a method and an apparatus for segmenting video objects and a network model training method. The new deep-learning-based approach provided by some embodiments of the present application is built on an enhanced short-term memory network and a network that fuses the features of the current frame image with the features of a cropped image; it merges object segmentation and tracking into a single neural network and supervises the segmentation result of the current frame image with the segmentation results of historical frames, achieving a significant improvement in video object segmentation accuracy.

In a first aspect, some embodiments of the present application provide a method for segmenting a video object, the method comprising: extracting features of at least one historical frame image preceding the current frame image to obtain a feature pair for each historical frame image; extracting features of the current frame image to obtain a feature pair for the current frame image; and obtaining a segmentation mask of the target of interest in the current frame image according to the feature pairs of the historical frame images, the feature pair of the current frame image, and a decoder, where each feature pair includes a key matrix and a value matrix.

Some embodiments of the present application implement inter-frame tracking by extracting, through an enhanced short-term memory network, the features of at least one historical frame image adjacent to the current frame image, which significantly improves the video object segmentation accuracy for the current frame image.

In some embodiments, the features of the current frame image are extracted by a current frame encoder that includes convolution layers, downsampling layers, and a feature similarity fusion module. Extracting the features of the current frame image with the current frame encoder to obtain the feature pair of the current frame image includes: extracting the features of the current frame image with the convolution and downsampling layers to obtain a current frame feature map; extracting the features of a cropped image with the convolution and downsampling layers to obtain a cropped image feature map, where the cropped image is obtained by cropping the target of interest out of the first frame image, the first frame image being the frame in which the target of interest first appears in the video sequence; and fusing the current frame feature map and the cropped image feature map with the feature similarity fusion module to obtain a fused image, from which the feature pair of the current frame image is derived.

By performing similarity fusion between the extracted current frame feature map and the feature map of the cropped image containing the target of interest, some embodiments of the present application better highlight the target currently to be segmented and further improve segmentation accuracy.

In some embodiments, the feature similarity fusion module fuses the current frame feature map and the cropped image feature map through a short-term memory network to obtain the fused image.

Some embodiments of the present application provide a network structure for fusing the extracted current frame feature map and cropped image feature map; feature fusion of the current frame image and the cropped image through the short-term memory network better highlights the target of interest.

In some embodiments, the feature similarity fusion module obtains the fused image according to the following formula:

(formula shown as image BDA0003088341070000031 in the original publication)

where $\widetilde{\mathrm{feat}}_t$ represents the matrix corresponding to the fused image, $\mathrm{feat}_t$ represents the matrix corresponding to the current frame feature map, and $\mathrm{feat}_p$ represents the matrix corresponding to the cropped image feature map.

Some embodiments of the present application provide a calculation formula that fuses the extracted features of the current frame image with the extracted features of the cropped image, realizing a quantitative representation of the fused features.
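The exact fusion operator above is only available as an image in the original filing, so the sketch below is only one plausible reading: it assumes the fused map re-weights the current frame features by an exponential dot-product similarity to the cropped-image features, mirroring the exp(dot product) form the patent later uses for the first correlation of key matrices. The function name `fuse_features` and all tensor shapes are illustrative assumptions, not the patent's confirmed operator.

```python
import torch

def fuse_features(feat_t: torch.Tensor, feat_p: torch.Tensor) -> torch.Tensor:
    """Hypothetical feature similarity fusion of the current frame feature map
    feat_t with the cropped-image feature map feat_p, both shaped (C, H, W).

    Assumption: per-position similarity = exp(channel-wise dot product),
    normalized over all positions and used to re-weight feat_t."""
    sim = torch.exp((feat_t * feat_p).sum(dim=0, keepdim=True))  # (1, H, W)
    weights = sim / sim.sum()                                    # normalized similarity
    return feat_t * weights                                      # fused map, (C, H, W)
```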

In some embodiments, features are extracted through an enhanced short-term memory network that includes at least one encoder of a semantic segmentation network; the encoders are connected in parallel, and each encoder receives one input historical frame image together with the segmentation result of that historical frame image.

Some embodiments of the present application mine the features of multiple historical frames by connecting multiple encoders in parallel, thereby better realizing inter-frame tracking.

In some embodiments, obtaining the segmentation mask of the target of interest in the current frame image according to the feature pairs of the historical frame images, the feature pair of the current frame image, and the decoder includes: fusing the current key matrix included in the feature pair of the current frame image with the historical key matrices included in the feature pairs of the historical frame images to obtain a fusion key matrix; and inputting the fusion key matrix and the current value matrix included in the feature pair of the current frame image into the decoder to obtain the segmentation mask.

Some embodiments of the present application provide a method for fusing the mined features of the historical frame images with the mined features of the current frame image; the fused feature pair obtained in this way (i.e., the fusion key matrix and the current value matrix included in the current frame's feature pair) improves the accuracy of the segmentation mask produced by the decoder.

In some embodiments, fusing the current key matrix included in the feature pair of the current frame image with the historical key matrices included in the feature pairs of the historical frame images to obtain the fusion key matrix includes: obtaining a first correlation between the current key matrix and each of the historical key matrices; and deriving the fusion key matrix from these first correlations.

Some embodiments of the present application determine the fusion key matrix by computing the correlation between the current frame feature map and the historical frame feature maps input to the enhanced short-term memory network, thereby better extracting from the historical frame images the feature information that is valuable for segmenting the target of interest.

In some embodiments, the feature pair f_t input to the decoder is:

$$f_t = \left[\,\widetilde{k}_t,\ v_t\,\right]$$

$$\widetilde{k}_t = \frac{1}{Z}\sum_i R_i\, v_i$$

where $R_i$ is the first correlation between the current key matrix and the $i$-th historical key matrix, $Z$ is the sum of all the first correlations, $i$ denotes the index of any historical frame image input into the enhanced short-term memory network, $t$ denotes the index of the current frame image and the total number of all the historical frame images plus the current frame image, $k_t$ denotes the current key matrix included in the feature pair of the current frame image, $k_i$ denotes the $i$-th historical key matrix included in the feature pair of the $i$-th historical frame image, $v_i$ denotes the $i$-th historical value matrix included in the feature pair of the $i$-th historical frame image, $v_t$ denotes the value matrix included in the feature pair of the current frame image, and $\widetilde{k}_t$ denotes the fusion key matrix.

Some embodiments of the present application provide a calculation formula for quantitatively fusing the feature pairs of the historical frame images with the feature pair of the current frame image.
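A minimal PyTorch sketch of this read-out, following the formulas as reconstructed above; the per-frame scalar correlation and the flattened (C, HW) matrix shapes are illustrative assumptions, and `memory_read` is not a name from the patent.

```python
import torch

def memory_read(k_t, v_t, k_hist, v_hist):
    """Fuse historical feature pairs with the current frame's feature pair.

    k_t, v_t:       (C, HW) key/value matrices of the current frame.
    k_hist, v_hist: lists of (C, HW) key/value matrices, one per historical frame.

    Returns the decoder input [fusion key matrix, current value matrix].
    """
    # First correlation R_i = exp(k_t . k_i) for each historical frame.
    R = torch.stack([torch.exp((k_t * k_i).sum()) for k_i in k_hist])  # (N,)
    Z = R.sum()                                                        # sum of all R_i
    # Fusion key matrix: correlation-weighted sum of historical value matrices.
    k_fused = sum(r * v_i for r, v_i in zip(R, v_hist)) / Z            # (C, HW)
    return torch.cat([k_fused, v_t], dim=0)                           # (2C, HW)
```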

In some embodiments, the first correlation is calculated by the following formula:

$$R_i = \exp\!\left(k_t \cdot k_i\right)$$

where $k_t \cdot k_i$ denotes the dot product of the current key matrix and the $i$-th historical key matrix.

Some embodiments of the present application use the exponential function to compute the fusion result of the key matrix included in the feature pair of the current frame image with the key matrices included in the feature pairs of the historical frame images.

In some embodiments, the total number of historical frame images received by the enhanced short-term memory network is 3.

Some embodiments of the present application implement inter-frame tracking with 3 historical frame images, which keeps the computation moderate while effectively improving the segmentation accuracy of the current frame image.

In some embodiments, extracting the features of at least one historical frame image adjacent to the current frame image to obtain the feature pair of each historical frame image includes: concatenating a second historical frame image with the segmentation mask of the second historical frame image and inputting the result into the enhanced short-term memory network to obtain the feature pair of the second historical frame image, where the second historical frame image is any one of the at least one historical frame image.

Some embodiments of the present application achieve better inter-frame tracking, and ultimately a better segmentation result for the target of interest, by inputting each historical frame image together with its segmentation result into the enhanced short-term memory network.

In a second aspect, some embodiments of the present application provide an apparatus for segmenting video objects, the apparatus comprising: a historical frame feature mining module configured to extract features of at least one historical frame image preceding the current frame image to obtain the feature pair of each historical frame image; a current frame encoding network module configured to extract features of the current frame image to obtain the feature pair of the current frame image; and a decoding network module configured to obtain the segmentation mask image of the target of interest in the current frame image according to the feature pairs of the historical frame images, the feature pair of the current frame image, and a decoder, where each feature pair includes a key matrix and a value matrix.

In some embodiments, the current frame encoding network module is further configured to: extract the features of the current frame image with convolution and downsampling layers to obtain a current frame feature map; extract the features of a cropped image with the convolution and downsampling layers to obtain a cropped image feature map, where the cropped image is obtained by cropping the target of interest from the first frame image, the first frame image being the frame in which the target of interest first appears in the video sequence; and fuse the current frame feature map and the cropped image feature map with the feature similarity fusion module to obtain the feature pair of the current frame image.

In a third aspect, some embodiments of the present application provide a network model training method for segmenting video objects, comprising: performing salient object segmentation training on a basic network using a data set, where the basic network includes a current frame encoder and a decoder; and training a video object segmentation network with at least one historical frame image, a current frame image, and a cropped image, where the video object segmentation network includes the trained basic network and an enhanced short-term memory network, the current frame encoder is further configured to extract and fuse the features of the current frame image and the cropped image, and the at least one historical frame image and the current frame image come from the same video sequence.

In a fourth aspect, some embodiments of the present application provide a system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the corresponding method of the first aspect.

Description of drawings

To explain the technical solutions of the embodiments of the present application more clearly, the accompanying drawings required by the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting the scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

Figure 1 is the first architecture diagram of the video object segmentation network provided by an embodiment of the present application;

Figure 2 is the second architecture diagram of the video object segmentation network provided by an embodiment of the present application;

Figure 3 is the first flowchart of the method for segmenting a video object provided by an embodiment of the present application;

Figure 4 is the second flowchart of the method for segmenting a video object provided by an embodiment of the present application;

Figure 5 is the third architecture diagram of the video object segmentation network provided by an embodiment of the present application;

Figure 6 is a block diagram of the apparatus for segmenting a video object provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.

It should be noted that similar reference numbers and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it does not need to be further defined or explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description and cannot be understood as indicating or implying relative importance.

Video object segmentation methods in the related art fall mainly into two categories. One category attempts to transfer the object segmentation result of the first frame onto subsequent frames. For example, MaskTrack uses an optical flow method that takes the current frame image, the current frame's object mask, and the next frame's image as input in order to learn the change of the next frame's object mask relative to the previous frame's optical flow. However, the accuracy of such methods drops significantly when the object changes greatly, and wrong segmentation results severely affect the segmentation accuracy of subsequent frames. The other category splits video segmentation into image segmentation and object tracking: it first performs instance segmentation on each frame to distinguish each object, and then matches objects across consecutive frames through re-identification (re-id) and similar classification methods. Such methods are mainly limited by the accuracy of the instance segmentation, whose results on edges or small objects are significantly worse than those of semantic segmentation.

Different from the above related art, the embodiments of the present application provide a more effective deep-learning-based method for video object segmentation. For example, some embodiments of the present application fuse object segmentation and tracking into one neural network, based on an enhanced short-term memory network and a network that fuses the similarity of the current frame image features and the cropped image features, and supervise the segmentation result of the current frame with the segmentation results of historical frames, achieving a significant improvement in video object segmentation accuracy.

Please refer to Figure 1, which is an architecture diagram of the video object segmentation network model based on the deep learning method provided by some embodiments of the present application. As shown in Figure 1, the architecture includes an enhanced short-term memory network 100, a current frame encoder 120, and a decoder 130, where the decoder 130 is connected, directly or through other functional modules, to the enhanced short-term memory network 100 and the current frame encoder 120.

The enhanced short-term memory network 100 includes one or more parallel historical frame encoders, each historical frame encoder 110 receiving one input historical frame image. As shown in Figure 1, the enhanced short-term memory network 100 includes N historical frame encoders 110; the N encoders of Figure 1 respectively receive the first historical frame image, the second historical frame image, and so on up to the N-th historical frame image, and each performs feature extraction on its input image to obtain the corresponding feature pair. Each historical frame encoder 110 includes multiple convolution layers and multiple downsampling layers, through which it extracts the features of the input historical frame images to obtain the first feature pair, the second feature pair, and so on up to the N-th feature pair, where N is a natural number greater than 1 (for example, N is 3 in some examples). It should be noted that the historical frame encoder 110 of Figure 1 is an encoder (encoding network) of a semantic segmentation network. For example, each historical frame encoder 110 may be the encoder part of a unet network (also called a "U-shaped convolutional neural network", consisting of an encoder and a decoder built from convolution layers). By extracting features of each historical frame image at different scales, the historical frame encoder 110 yields a first feature matrix and a second feature matrix for each historical frame image, where the first feature matrix may be named the historical frame key matrix and the second the historical frame value matrix.
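A minimal sketch of one such historical frame encoder, assuming the input is an RGB frame concatenated with its single-channel mask (the 4-channel input described later for Figure 5) and that two 1x1 convolution heads emit the key and value matrices; the class name and layer sizes are illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class HistoryFrameEncoder(nn.Module):
    """Illustrative historical frame encoder: convolution and downsampling
    layers followed by two heads producing the key and value feature maps."""
    def __init__(self, in_ch: int = 4, feat_ch: int = 64, key_ch: int = 32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # downsampling layer
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.key_head = nn.Conv2d(feat_ch, key_ch, 1)    # key matrix head
        self.value_head = nn.Conv2d(feat_ch, feat_ch, 1)  # value matrix head

    def forward(self, frame_rgb, mask):
        x = torch.cat([frame_rgb, mask], dim=1)  # 4-channel input: RGB + mask
        f = self.backbone(x)
        return self.key_head(f), self.value_head(f)
```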

The current frame encoder 120 of Figure 1 receives the input current frame image and extracts its features to obtain the feature pair of the current frame. For example, the current frame encoder 120 is an encoder of a semantic segmentation network; specifically, it may be the encoder part of a unet network. As an embodiment, the current frame encoder 120 includes multiple convolution layers and multiple downsampling layers, through which features of the current frame image are extracted at different scales to obtain a first feature matrix and a second feature matrix, where the first may be named the current frame key matrix and the second the current frame value matrix.

It should be noted that, in some embodiments of the present application, the enhanced short-term memory network 100 further includes a feature pair connection module 115 for connecting the feature pairs of the corresponding historical frames extracted by the historical frame encoders. In other embodiments, the feature pair connection module 115 and the current frame encoder 120 are further connected to a historical feature fusion module 140, which fuses the feature pair of the current frame image with the feature pairs of the historical frames to obtain a fused feature pair (i.e., the fusion key matrix and the current value matrix of the current frame's feature pair described below); this fused pair is then input to the decoder 130 for decoding, finally obtaining the segmentation mask image of the target of interest in the current frame image. For the fusion algorithm and procedure of the historical feature fusion module 140, refer to the related description below; to avoid repetition, details are not given here.

To further highlight the target of interest being segmented and thus improve video object segmentation accuracy, the architecture of the network model for video object segmentation in some embodiments of the present application is shown in Figure 2.

Different from Figure 1, the current frame encoder 120 of Figure 2 includes two parallel feature extraction units and a feature similarity fusion module 123. In some embodiments of the present application, each of the two parallel feature extraction units consists of multiple convolution layers and multiple downsampling layers (i.e., the multi-layer convolution and downsampling network 121 of Figure 2). The current frame image is input into one multi-layer convolution and downsampling network 121 to obtain the current frame feature map, while the cropped image containing the target of interest is input into the other to obtain the cropped image feature map. The feature similarity fusion module 123 then fuses the current frame feature map with the cropped image feature map to obtain the fused features, from which the feature pair of the current frame image is derived (for example, a short-term memory network extracts the fused features to obtain the key-value pair of the current frame image).
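A minimal sketch of such a dual-branch current frame encoder, under the assumptions that the cropped image is resized to the current frame's resolution and that the fusion step uses the same hypothetical exp(dot product) weighting sketched earlier; names and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class CurrentFrameEncoder(nn.Module):
    """Illustrative dual-branch current frame encoder (cf. Figure 2): parallel
    conv/downsampling branches for the current frame and the cropped image,
    an assumed similarity fusion step, and key/value heads."""
    def __init__(self, feat_ch: int = 64, key_ch: int = 32):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, feat_ch, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            )
        self.frame_branch = branch()
        self.crop_branch = branch()
        self.key_head = nn.Conv2d(feat_ch, key_ch, 1)
        self.value_head = nn.Conv2d(feat_ch, feat_ch, 1)

    def forward(self, frame, crop):
        feat_t = self.frame_branch(frame)  # current frame feature map, (B, C, H, W)
        feat_p = self.crop_branch(crop)    # cropped image feature map, same shape
        # Assumed similarity fusion: exp(dot product) weighting of feat_t.
        sim = torch.exp((feat_t * feat_p).sum(dim=1, keepdim=True))   # (B, 1, H, W)
        fused = feat_t * (sim / sim.sum(dim=(2, 3), keepdim=True))    # fused features
        return self.key_head(fused), self.value_head(fused)
```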

It can be understood that, in order to perform video object segmentation with the network model of Figure 1, some embodiments of the present application first train the network model of Figure 1. In some embodiments, the method of training the video segmentation network model of Figure 1 includes: performing salient object segmentation training on the basic network 20 of Figure 1 with a data set, where the basic network 20 includes the current frame encoder 120 and the decoder 130; and training the video object segmentation network 10 (i.e., the entire network model of Figure 1) with at least one historical frame image and a current frame image, where the video object segmentation network includes the trained basic network and the enhanced short-term memory network 100.

To perform video object segmentation with the network model of Figure 2, the embodiments of the present application first train the video object segmentation network model of Figure 2. In some embodiments, the training method includes: performing salient object segmentation training on the basic network 20 of Figure 2 with a data set, where the basic network 20 includes the multi-layer convolution and downsampling network (implementing at least part of the current frame encoder) and the decoder 130; and training the video object segmentation network 10 (i.e., the entire network of Figure 2) with at least one historical frame image, a current frame image, and a cropped image, where the video object segmentation network 10 includes the trained basic network, the enhanced short-term memory network, and the model that obtains the feature similarity of the current frame image and the cropped image (specifically, the multi-layer convolution and downsampling network 121 of Figure 2 that receives the input cropped image, and the feature similarity fusion module 123 of Figure 2); the current frame encoder 120 is further configured to extract and fuse the features of the current frame image and the cropped image.

That is to say, based on the network architecture of Figure 1 or Figure 2, the embodiments of the present application were trained and tested on public data sets. Since the objects to be segmented are of many kinds and are not known in advance, some embodiments of the present application first train a salient object segmentation model on the basic network. The purpose of salient object segmentation is to segment the salient objects in an image, regardless of object category. For example, the embodiments of the present application use COCO, VOC, and other salient-object data sets, randomly select one object from each labeled image as the salient object, and discard the other objects. A basic network for single-object segmentation is trained on the generated data set, and this basic network is then used as the initialization for training the video object segmentation network 10 of Figure 1 or Figure 2. In some embodiments, the segmentation accuracy of the video object segmentation network 10, evaluated on the YouTube-VOS data set, is better than that of the other evaluated methods.
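A minimal sketch of this data generation step, assuming COCO-style per-instance binary masks; the helper name `make_salient_sample` and the mask format are assumptions made for illustration.

```python
import random
import numpy as np

def make_salient_sample(image: np.ndarray, instance_masks: list[np.ndarray]):
    """Turn a multi-object annotation into a single salient-object sample:
    randomly keep one labeled object as the salient target, drop the rest."""
    target = random.choice(instance_masks)        # pick one object as salient
    return image, (target > 0).astype(np.uint8)   # binary mask of that object only
```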

It should be noted that the at least one historical frame image and the current frame image involved in training the video object segmentation network model of the embodiments of the present application come from the same video sequence, but the historical frame images are randomly selected from that sequence; they are not necessarily adjacent to the current frame image.

The process of the video object segmentation method performed by the trained video object segmentation network 10 is described below by way of example with reference to Figure 3.

As shown in Figure 3, some embodiments of the present application provide a method for segmenting a video object, the method including: S101, extracting features of at least one historical frame image preceding the current frame image to obtain a feature pair for each historical frame image; S102, extracting features of the current frame image to obtain a feature pair for the current frame image; and S103, obtaining a segmentation mask of the target of interest in the current frame image according to the feature pairs of the historical frame images, the feature pair of the current frame image, and a decoder; where each historical frame image is one of the frames preceding the current frame image, and each feature pair includes a key matrix and a value matrix. Some embodiments of the present application implement inter-frame tracking through an enhanced short-term memory network, which significantly improves the video object segmentation accuracy for the current frame image. In some embodiments of the present application, S101 extracts the features of the at least one historical frame image with the enhanced short-term memory network, and S102 extracts the features of the current frame image with the current frame encoder.
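Tying S101 through S103 together, a rough end-to-end sketch reusing the illustrative helpers from earlier (`memory_read` and encoder/decoder modules of the kind sketched above); batch size 1 and equal frame resolutions are assumptions, not patent requirements.

```python
def segment_frame(frame, crop, history, enc_m, enc_q, decoder):
    """Illustrative S101-S103 pipeline for one current frame.

    history: list of (frame, mask) pairs for the historical frame images.
    enc_m:   historical frame encoder (returns a key/value pair per frame).
    enc_q:   current frame encoder (fuses frame and cropped-image features)."""
    hist_pairs = [enc_m(img, msk) for img, msk in history]  # S101: history feature pairs
    k_t, v_t = enc_q(frame, crop)                           # S102: current feature pair
    _, _, h, w = v_t.shape
    # S103: fuse keys across frames, then decode [fusion key, current value].
    fused = memory_read(
        k_t[0].flatten(1), v_t[0].flatten(1),
        [k[0].flatten(1) for k, _ in hist_pairs],
        [v[0].flatten(1) for _, v in hist_pairs],
    )
    return decoder(fused.view(1, -1, h, w))  # segmentation mask of the target
```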

It should be noted that the at least one historical frame image input to the enhanced short-term memory network and the current frame image input to the current frame encoder come from the same video sequence, and each historical frame image precedes the current frame image in that sequence. That is, in some embodiments, when the trained video object segmentation network performs video object segmentation on the current frame image, the historical frame images input to the enhanced short-term memory network and the current frame image are several adjacent frames. In other embodiments, the historical frame images and the current frame image are not adjacent (for example, one historical frame image is taken every n frames before the current frame image). When the number of historical frames required as input by the enhanced short-term memory network exceeds the total number of frames before the current frame image, the input may include duplicate historical frames. For example, if the enhanced short-term memory network has 3 input nodes, the 3 frames preceding the current frame image can be taken when segmenting the current frame; if fewer than 3 frames are available, the previous one or two frames are selected repeatedly.
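A minimal sketch of this frame-selection rule, assuming frames are indexed from 0; `pick_history_frames` is an illustrative helper name, not one used in the patent.

```python
def pick_history_frames(t: int, n: int = 3) -> list[int]:
    """Return the indices of the n frames immediately preceding frame t,
    repeating the earliest available frames when fewer than n exist."""
    if t == 0:
        return [0] * n  # no history yet: reuse the first frame
    idx = list(range(max(0, t - n), t))  # up to n preceding frames
    while len(idx) < n:
        idx.insert(0, idx[0])  # repeat the earliest frame to pad
    return idx
```

For example, `pick_history_frames(2)` returns `[0, 0, 1]`, and `pick_history_frames(5)` returns `[2, 3, 4]`.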

The steps of Figure 3 are described below by way of example.

In some embodiments of the present application, S101 includes: concatenating any one historical frame image (for example, the second historical frame image) with the previously obtained segmentation mask of that image and inputting the result into one historical frame encoder of the enhanced short-term memory network to obtain the feature pair of that historical frame image. For example, when the second historical frame image is an RGB three-channel image, the input to each historical frame encoder in the enhanced short-term memory network is a four-channel image consisting of the R channel, the G channel, the B channel, and the segmentation mask of the corresponding historical frame image, and the output of the historical frame encoder is two feature maps whose corresponding matrices may be named the key matrix and the value matrix.

For example, the enhanced short-term memory network of S101 includes at least one encoder of a semantic segmentation network (i.e., the historical frame encoders of Figure 1 or Figure 2); the encoders are connected in parallel, and each receives one input historical frame image together with the segmentation result of that image. Each encoder then extracts the features of its input historical frame image and outputs the resulting feature pair.

That is, as described above, the enhanced short-term memory network structure used in S101 can be obtained by connecting in parallel the encoders of several semantic segmentation networks such as unet, with each encoder receiving one input historical frame image and producing the feature map corresponding to that frame. In some embodiments of the present application, the feature map comprises two feature matrices, namely a key matrix and a value matrix. For example, each encoder includes multiple convolution layers and downsampling layers.

In some embodiments of the present application, as shown in Figure 4, the current frame encoder of S102 includes convolution layers, downsampling layers, and a feature similarity fusion module, and S102 includes: S1021, extracting the features of the current frame image with the convolution and downsampling layers to obtain a current frame feature map; S1022, extracting the features of the cropped image with the convolution and downsampling layers to obtain a cropped image feature map, where the cropped image is obtained by cropping the target of interest from the first frame image, the first frame image being the frame in which the target of interest first appears in the video sequence; and S1023, fusing the current frame feature map and the cropped image feature map with the feature similarity fusion module to obtain a fused image, and deriving the feature pair of the current frame image from the fused image.

It should be noted that the current frame feature map and the cropped image feature map may be obtained simultaneously, or the cropped image feature map may be obtained before the current frame feature map; the embodiments of the present application do not limit the order of steps S1021 and S1022. Therefore, in some embodiments, S102 includes: extracting the features of the cropped image with the convolution and downsampling layers to obtain the cropped image feature map, the cropped image being obtained by cropping the target of interest from the first frame image, i.e., the frame in which the target of interest first appears in the video sequence; extracting the features of the current frame image with the convolution and downsampling layers to obtain the current frame feature map; and fusing the current frame feature map and the cropped image feature map with the feature similarity fusion module to obtain the fused image, from which the feature pair of the current frame image is derived.

To further highlight the segmented target of interest, the feature similarity fusion module in S102 fuses the current frame feature map and the cropped image feature map through a short-term memory network to obtain the fused image, and the feature pair of the current frame image is derived from the fused image. For example, the feature similarity fusion module obtains the fused image of S1023 according to the following formula:

(formula shown as image BDA0003088341070000131 in the original publication)

where $\widetilde{\mathrm{feat}}_t$ represents the matrix corresponding to the fused image, $\mathrm{feat}_t$ the matrix corresponding to the current frame feature map, and $\mathrm{feat}_p$ the matrix corresponding to the cropped image feature map.

In some embodiments of the present application, S103 includes: fusing the current key matrix included in the feature pair of the current frame image with the historical key matrices included in the feature pairs of the historical frame images to obtain a fusion key matrix; and inputting the fusion key matrix and the current value matrix included in the current frame's feature pair (i.e., the feature pair input to the decoder) into the decoder to obtain the segmentation mask image. For example, the fusion operation includes: obtaining the first correlation between the current key matrix and each of the historical key matrices, and deriving the fusion key matrix from the first correlations. That is, the fusion key matrix and the current value matrix included in the current frame's feature pair serve as the feature pair input to the decoder.

Specifically, the feature pair f_t input to the decoder is:

$$f_t = \left[\,\widetilde{k}_t,\ v_t\,\right]$$

$$\widetilde{k}_t = \frac{1}{Z}\sum_i R_i\, v_i$$

where $R_i$ is the first correlation between the current key matrix and the $i$-th historical key matrix, $Z$ is the sum of all the first correlations, $i$ denotes the index of any historical frame image input into the enhanced short-term memory network, $t$ denotes the index of the current frame image and the total number of all the historical frame images plus the current frame image, $k_t$ denotes the current key matrix included in the feature pair of the current frame image, $k_i$ denotes the key matrix included in the feature pair of the $i$-th historical frame image, $v_i$ denotes the value matrix included in the feature pair of the $i$-th historical frame image, and $v_t$ denotes the value matrix included in the feature pair of the current frame image. It should be noted that the index $i$ of the $i$-th historical frame image is not the frame number of that image within the entire video sequence, but a number assigned according to the total number of images input into the enhanced short-term memory network each time; $\widetilde{k}_t$ denotes the fusion key matrix.

For example, the first correlation is calculated by the following formula:

$$R_i = \exp\!\left(k_t \cdot k_i\right)$$

where $k_t \cdot k_i$ denotes the dot product of the current key matrix and the $i$-th historical key matrix. Some embodiments of the present application use the exponential function to compute the fusion result of the current frame's key matrix with each historical frame's key matrix. It should be noted that other function types can also be used to calculate the first correlation; for example, the similarity can be calculated using the square root or the absolute value of the matrix dot product.

Some embodiments of the present application provide a method for fusing the mined features of the historical frame images with the mined features of the current frame image; decoding the fused feature pair obtained in this way (i.e., the fusion key matrix and the current value matrix included in the current frame's feature pair) improves the accuracy of the segmentation mask produced by the decoder.

Taking 3 historical frame images as an example, the video object segmentation method and the corresponding video segmentation network architecture of some embodiments of the present application are described below with reference to Figure 5.

As shown in FIG. 5, the video object segmentation network architecture used to implement the video object segmentation of the present application includes a backbone network, namely the network formed by the current frame encoder 120 and the decoder 130 of FIG. 5. As shown in the right part of FIG. 5, the backbone network plays the same role as a standard semantic segmentation network. There are many kinds of semantic segmentation networks, and in the video segmentation network provided by the embodiments of the present application the backbone may be any semantic segmentation network structure. For example, some embodiments of the present application adopt a standard UNet structure: the input to the network is an image; a feature network of stacked convolution and downsampling layers (i.e., the current frame encoder 120 of FIG. 5) extracts features of the image at different scales; a progressively upsampling decoding network (i.e., the decoder 130 of FIG. 5) then fuses the features of different scales and outputs an object segmentation mask (the final output of the entire network model of FIG. 5).
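To make the encoder-decoder idea concrete, here is a deliberately tiny UNet-style sketch in PyTorch. The layer widths, depths, and the class name `TinyUNet` are placeholder assumptions; as stated above, the embodiments may use any semantic segmentation backbone:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # The decoder fuses upsampled deep features with the skip connection
        # and produces 1-channel mask logits.
        self.dec = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 1))

    def forward(self, x):
        f1 = self.enc1(x)                # full-resolution features
        f2 = self.enc2(self.pool(f1))    # half-resolution features
        fused = torch.cat([self.up(f2), f1], dim=1)  # multi-scale fusion
        return self.dec(fused)           # segmentation mask logits

mask_logits = TinyUNet()(torch.randn(1, 3, 64, 64))  # -> [1, 1, 64, 64]
```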

As shown in FIG. 5, the enhanced short-term memory network 100 of this embodiment of the present application implements tracking between frames. For the current frame image $I_t$, the historical frames and historical segmentation results are $[I_0, \ldots, I_{t-1}]$ and $[M_0, \ldots, M_{t-1}]$, respectively. For each frame, the enhanced short-term memory network 100 extracts a $[k, v]$ feature pair, i.e., the key matrix (the Key matrix of FIG. 5) and the value matrix (the Value matrix of FIG. 5) obtained by the memory encoder Enc_m extracting features from each historical frame image. For example, if each frame of the video sequence is an RGB image, the input of each historical frame encoder (Enc_m) included in the enhanced short-term memory network 100 is the 4-channel image formed by concatenating $I$ and $M$ (i.e., the R, G, and B channels plus the segmentation mask of that frame), and the output is two feature maps, or equivalently two feature matrices: the key matrix Key and the value matrix Value. The video object segmentation network of FIG. 5 also includes a feature pair concatenation module Concat.
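A minimal sketch of how a historical frame and its mask could be packed into the 4-channel memory-encoder input just described; the tensor layout and the name `make_memory_input` are assumptions for illustration:

```python
import torch

def make_memory_input(frame_rgb, mask):
    """frame_rgb: [3, H, W] RGB frame; mask: [H, W] segmentation result.
    Returns the 4-channel (R, G, B, mask) input for the memory encoder."""
    return torch.cat([frame_rgb, mask.unsqueeze(0).float()], dim=0)  # [4, H, W]

x = make_memory_input(torch.rand(3, 64, 64), torch.zeros(64, 64))
# k_i, v_i = enc_m(x)  # 'enc_m' stands in for one historical frame encoder
```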

It should be noted that, to keep the input and output of the video object segmentation network consistent between training and actual use, 3 frames are randomly selected from the historical frames during training and input to the enhanced short-term memory network 100 to train the video object segmentation network. At inference time, the 3 frames preceding the current frame image are taken; if fewer than 3 exist, the previous historical frame image or the previous two historical frame images are selected repeatedly. For the current frame image, while extracting features at different scales, the backbone network also generates a $[k, v]$ feature pair, i.e., the key matrix and value matrix of FIG. 5.
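The frame-selection rule just described can be sketched as follows. Whether the random training-time selection samples with or without replacement is not specified, so sampling with replacement is assumed here:

```python
import random

def select_history_indices(t, training, k=3):
    """t: index of the current frame, so frames 0 .. t-1 are history."""
    if t == 0:
        return []                                   # no history yet
    if training:
        # Random selection (with replacement, so very short clips still work).
        return sorted(random.choices(range(t), k=k))
    idx = list(range(max(0, t - k), t))             # the k preceding frames
    while len(idx) < k:                             # repeat earlier frames
        idx.insert(0, idx[0])
    return idx

print(select_history_indices(t=2, training=False))  # [0, 0, 1]
print(select_history_indices(t=9, training=False))  # [6, 7, 8]
```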

When performing video object segmentation, some embodiments of the present application fuse the feature pairs $[k, v]$ extracted from the three historical frame images with the feature pair $[k, v]$ extracted from the current frame image to obtain a new fused feature map, which serves as the input of the decoder 130 (the decoding network) in the backbone. For example, the fusion of the feature pairs of the historical frame images with the feature pair of the current frame is characterized by the following formulas, where t is the index of the current frame image, i is the index of a historical frame image input to the enhanced short-term memory network, t = 3, and i takes the values 0, 1, and 2:

$$f_t = \left[\frac{1}{Z}\sum_{i=0}^{t-1} R\!\left(k_t, k_i\right) v_i,\ v_t\right]$$

where R is the correlation degree of the two feature maps and Z is the sum of the correlation degrees over all frames:

$$R\!\left(k_t, k_i\right) = e^{\,k_t \cdot k_i},\qquad Z = \sum_{i=0}^{t-1} R\!\left(k_t, k_i\right)$$

It should be noted that the definitions of the parameters involved in the above formulas can be found in the foregoing description; to avoid repetition, they are not restated here.

Further, to better highlight the currently segmented object, some embodiments of the present application also design a feature similarity module (the feature similarity fusion module 123 of FIG. 5) that fuses the current frame image features with the cropped image features. The current frame encoder 120 of FIG. 5 includes: an encoder 121 composed of stacked convolution layers and downsampling layers, the feature fusion module 123, and a short-term memory network 122. The encoder 121 obtains the features of the current frame image and of the cropped image; the feature fusion module 123 fuses the current frame feature map and the cropped image feature map produced by the encoder 121; and the fused features are passed through the short-term memory network 122 to obtain the key-value pair formed by matrix k and matrix v. For example, the cropped image containing the target of interest is input to the convolution and downsampling layers (encoder 121) to obtain a cropped image feature map; the current frame image is input to the same layers to obtain a current frame feature map; the two feature maps are fused by the feature fusion module 123 according to the fusion formula below; and the new feature map obtained after fusion enters the short-term memory network 122 to produce the key-value pair Key and Value.

In some embodiments of the present application, the feature fusion module 123 fuses the current frame feature map and the cropped image feature map by multiplying the two feature maps elementwise and concatenating the original backbone feature (i.e., the current frame feature map) onto the product. The specific fusion formula is as follows; the meaning of each parameter can be found in the description above and is not repeated here:

$$\widetilde{\mathrm{feat}} = \left[\mathrm{feat}_t \odot \mathrm{feat}_p,\ \mathrm{feat}_t\right]$$

where $\widetilde{\mathrm{feat}}$ is the matrix corresponding to the fused image, $\mathrm{feat}_t$ the matrix corresponding to the current frame feature map, and $\mathrm{feat}_p$ the matrix corresponding to the cropped image feature map.
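Read literally, the fusion formula above corresponds to the following sketch; identical spatial sizes of the two feature maps are assumed for simplicity, and the name `fuse_similarity` is illustrative:

```python
import torch

def fuse_similarity(feat_t, feat_p):
    """feat_t: [C, H, W] current-frame feature map;
    feat_p: [C, H, W] cropped-image feature map (same shape assumed)."""
    # The elementwise product highlights locations similar to the target;
    # the original backbone feature is then concatenated back on.
    return torch.cat([feat_t * feat_p, feat_t], dim=0)   # [2C, H, W]

fused = fuse_similarity(torch.randn(16, 32, 32), torch.randn(16, 32, 32))
```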

It should be noted that, to obtain the cropped image, some embodiments of the present application crop the target object of interest out of the first frame image (i.e., the frame in which the target of interest first appears in the video sequence, not necessarily the absolute first frame of the sequence), setting everything outside the object mask to 0 so that only the texture information of the object is retained; this yields the cropped image fed to the convolution and downsampling network 121 shown in FIG. 5. The reading module 160 of FIG. 5 is configured to read the historical key-value pairs of the historical frame images and the key-value pair of the current frame that fuses the current frame image and the cropped image, and the ASPP module (atrous spatial pyramid pooling) is configured to fuse the historical key-value pairs with the key-value pair of the current frame to obtain further features.
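A minimal sketch of building such a texture-only cropped image; cropping to the mask's bounding box is an assumed implementation detail, and the mask is assumed non-empty:

```python
import torch

def make_cropped_image(first_frame, mask):
    """first_frame: [3, H, W]; mask: [H, W] binary mask of the target."""
    masked = first_frame * mask.unsqueeze(0).float()   # zero outside the mask
    ys, xs = torch.nonzero(mask, as_tuple=True)        # pixels inside the mask
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    return masked[:, y0:y1, x0:x1]                     # texture-only crop

m = torch.zeros(64, 64); m[20:40, 10:30] = 1
crop = make_cropped_image(torch.rand(3, 64, 64), m)    # -> [3, 20, 20]
```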

In some embodiments of the present application, the current frame encoder 120 may be a Siamese query encoder, where "Siamese" means that the parameters of the two encoders are shared.

Please refer to FIG. 6, which shows an apparatus for segmenting video objects provided by an embodiment of the present application. It should be understood that the apparatus corresponds to the method embodiment of FIG. 3 and can perform the steps involved in that method embodiment; for its specific functions, reference may be made to the description above, and a detailed description is omitted here to avoid repetition. The apparatus includes at least one software function module that can be stored in memory in the form of software or firmware, or solidified in the operating system of the apparatus. The apparatus for segmenting video objects includes: a historical frame feature mining module 101, configured to extract features of at least one historical frame image preceding the current frame image to obtain a feature pair of each historical frame image in the at least one historical frame image; a current frame encoding network module 102, configured to extract features of the current frame image to obtain a feature pair of the current frame image; and a decoding network module 103, configured to obtain a segmentation mask image of the target of interest in the current frame image according to the feature pairs of the historical frame images, the feature pair of the current frame image, and a decoder. Each historical frame image in the at least one historical frame image is the previous frame of the current frame image or one of the frames preceding it, and a feature pair includes a key matrix and a value matrix.

In some embodiments of the present application, the current frame encoding network module 102 is further configured to: extract the features of the current frame image using a convolution layer and a downsampling layer to obtain a current frame feature map; extract the features of the cropped image using the same convolution layer and downsampling layer to obtain a cropped image feature map, where the cropped image is obtained by cropping the target of interest from the first frame image, the first frame image being the frame in which the target of interest first appears in the video sequence; and fuse the current frame feature map and the cropped image feature map via the feature similarity fusion module to obtain the feature pair of the current frame.

Some embodiments of the present application provide a network model training method for segmenting video objects. The method includes: performing salient object segmentation training on a basic network using a data set, where the basic network includes a current frame encoder and a decoder; and training a video object segmentation network according to at least one historical frame image, a current frame image, and a cropped image, where the video object segmentation network includes the basic network obtained after the training is completed and an enhanced short-term memory network, the current frame encoder is further configured to extract and fuse the features of the current frame image and the cropped image, and the at least one historical frame image and the current frame image come from the same video sequence.
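As a hedged illustration of this two-stage schedule, the sketch below pretrains a stand-in base network on toy static data; the network, loss, optimizer, and data are placeholders rather than the patent's actual training recipe:

```python
import torch
import torch.nn as nn

base_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 1, 1))     # stands in for encoder + decoder
opt = torch.optim.Adam(base_net.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

# Stage 1: salient-object segmentation pretraining on a static image dataset.
for _ in range(2):                               # toy iterations
    img = torch.rand(1, 3, 32, 32)
    gt = torch.randint(0, 2, (1, 1, 32, 32)).float()
    loss = bce(base_net(img), gt)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2 would then train the full video network (base_net plus the enhanced
# short-term memory network) on clips of 3 random history frames, the current
# frame, and the cropped image, with the same kind of mask loss.
```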

Some embodiments of the present application provide a system including one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the corresponding method described above with respect to FIG. 3.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings illustrate the architectures, functions, and operations of possible implementations of apparatuses, methods, and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by dedicated hardware-based systems that perform the specified functions or actions, or by combinations of dedicated hardware and computer instructions.

In addition, the functional modules in the embodiments of the present application may be integrated together to form an independent part, each module may exist separately, or two or more modules may be integrated to form an independent part.

If the functions are implemented in the form of software function modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above descriptions are merely embodiments of the present application and are not intended to limit its protection scope. For those skilled in the art, various modifications and changes may be made to the present application. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this application shall be included within its protection scope. It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it does not require further definition or explanation in subsequent figures.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present application shall be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes the element.

Claims (14)

1. A method of segmenting video objects, the method comprising:
extracting the features of at least one historical frame image before the current frame image to obtain a feature pair of each historical frame image in the at least one historical frame image;
extracting the features of the current frame image to obtain a feature pair of the current frame image;
acquiring a segmentation mask of a target of interest in the current frame image according to the feature pair of each historical frame image, the feature pair of the current frame image, and a decoder;
wherein the feature pair comprises a key matrix and a value matrix.
2. The method of claim 1, wherein the features of the current frame image are extracted by a current frame encoder, the current frame encoder comprising a convolutional layer, a downsampling layer, and a feature similarity fusion module; wherein,
the extracting the features of the current frame image by using the current frame encoder to obtain the feature pairs of the current frame image comprises the following steps:
extracting the characteristics of the current frame image by adopting the convolution layer and the down-sampling layer to obtain a current frame characteristic diagram;
extracting features of a cut image by adopting the convolution layer and the downsampling layer to obtain a cut image feature map, wherein the cut image is obtained by cutting the target of interest from a first frame image, and the first frame image is the frame of the video sequence in which the target of interest appears for the first time;
and fusing the current frame feature map and the cut image feature map according to the feature similarity fusion module to obtain a fused image, and obtaining a feature pair of the current frame image based on the fused image.
3. The method of claim 2, wherein the feature similarity fusion module fuses the current frame feature map and the cropped image feature map through a short-time memory network to obtain the fused image.
4. The method of claim 2 or 3, wherein the feature similarity fusion module obtains the fused image according to the following formula:
$$\widetilde{\mathrm{feat}} = \left[\mathrm{feat}_t \odot \mathrm{feat}_p,\ \mathrm{feat}_t\right]$$

wherein $\widetilde{\mathrm{feat}}$ characterizes the matrix to which the fused image corresponds, $\mathrm{feat}_t$ characterizes the matrix corresponding to the current frame feature map, and $\mathrm{feat}_p$ characterizes the matrix corresponding to the cut image feature map.
5. The method according to any one of claims 1-4, wherein the features of the at least one historical frame image are extracted through an enhanced short-time memory network, the enhanced short-time memory network comprises at least one encoder of a semantic segmentation network, the encoders of the at least one encoder are connected in parallel, and each encoder receives an input historical frame image and a segmentation result of the historical frame image.
6. The method according to any one of claims 1 to 4, wherein the obtaining a segmentation mask of the object of interest in the current frame image according to the feature pair of the history frame images, the feature pair of the current frame image and a decoder comprises:
respectively carrying out fusion operation on the current key matrix included by the feature pairs of the current frame image and the historical key matrix included by the feature pairs of each historical frame image to obtain a fusion key matrix;
and inputting the current value matrixes included by the fusion key matrix and the feature pairs of the current frame image into the decoder to obtain the segmentation mask.
7. The method of claim 6, wherein the performing a fusion operation on the current key matrix included in the feature pair of the current frame image and the historical key matrix included in the feature pair of each historical frame image to obtain a fused key matrix comprises:
respectively acquiring first relevancy of the current key matrix and each historical key matrix in all the historical key matrices;
and obtaining the fusion key matrix according to the first correlation.
8. The method of claim 6, wherein the feature pair $f_t$ input to the decoder is:

$$f_t = \left[\tilde{k}_t,\ v_t\right],\qquad \tilde{k}_t = \frac{1}{Z}\sum_{i=0}^{t-1} R\!\left(k_t, k_i\right) v_i$$

wherein R is the first correlation degree of the current key matrix and each historical key matrix, Z is the sum of all the first correlation degrees, i represents the index of any one historical frame image input into the enhanced short-time memory network, t represents the index of the current frame image and the total number of all the historical frame images and the current frame image, $k_t$ is the current key matrix included in the feature pair of the current frame image, $k_i$ is the i-th historical key matrix included in the feature pair of the i-th historical frame image, $v_i$ is the i-th historical value matrix included in the feature pair of the i-th historical frame image, $v_t$ is the value matrix included in the feature pair of the current frame image, and $\tilde{k}_t$ characterizes the fusion key matrix.
9. The method of claim 8, wherein the first degree of correlation is calculated by the formula:
$$R\!\left(k_t, k_i\right) = e^{\,k_t \cdot k_i}$$

wherein $k_t \cdot k_i$ characterizes a dot product operation of the current key matrix and the i-th historical key matrix.
10. The method of claim 1, wherein a total number of the at least one historical frame image is 3.
11. The method according to any one of claims 1 to 10, wherein the extracting features of at least one historical frame image adjacent to the current frame image to obtain a feature pair of each historical frame image in the at least one historical frame image comprises: concatenating a second historical frame image and the segmentation mask of the second historical frame image and inputting them into an enhanced short-time memory network to obtain a feature pair of the second historical frame image, wherein the second historical frame image is any one of the at least one historical frame image.
12. An apparatus for segmenting video objects, the apparatus comprising:
the historical frame feature mining module is configured to extract features of at least one historical frame image before a current frame image to obtain feature pairs of each historical frame image in the at least one historical frame image;
the current frame coding network module is configured to extract the features of the current frame image to obtain a feature pair of the current frame image;
the decoding network module is configured to acquire a segmentation mask image of an interested target in the current frame image according to the feature pairs of the historical frame images, the feature pairs of the current frame image and a decoder;
wherein the feature pair comprises a key matrix and a value matrix.
13. A network model training method for segmenting video objects is characterized by comprising the following steps:
carrying out salient object segmentation training on a basic network by adopting a data set, wherein the basic network comprises a current frame encoder and a decoder;
training a video object segmentation network according to at least one historical frame image, a current frame image and a cut image, wherein the video object segmentation network comprises the basic network obtained after the training is completed and an enhanced short-time memory network, the current frame encoder is further configured to extract and fuse the features of the current frame image and the cut image, and the at least one historical frame image and the current frame image are from the same video sequence.
14. A system comprising one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective methods of any of claims 1-11.
CN202110587943.9A 2021-05-27 2021-05-27 Method and device for segmenting video object and network model training method Pending CN113506316A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110587943.9A CN113506316A (en) 2021-05-27 2021-05-27 Method and device for segmenting video object and network model training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110587943.9A CN113506316A (en) 2021-05-27 2021-05-27 Method and device for segmenting video object and network model training method

Publications (1)

Publication Number Publication Date
CN113506316A true CN113506316A (en) 2021-10-15

Family

ID=78008565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110587943.9A Pending CN113506316A (en) 2021-05-27 2021-05-27 Method and device for segmenting video object and network model training method

Country Status (1)

Country Link
CN (1) CN113506316A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114092508A (en) * 2021-12-09 2022-02-25 京东科技信息技术有限公司 Image segmentation method based on video, training method and training device of segmentation model
CN114565637A (en) * 2022-01-14 2022-05-31 厦门理工学院 A Single Target Tracking Method Based on Feature Enhancement and Video History Frames
CN115100338A (en) * 2022-06-14 2022-09-23 咪咕文化科技有限公司 Video generation method, apparatus, device, and computer-readable storage medium
CN115147457A (en) * 2022-07-08 2022-10-04 河南大学 Memory-enhanced self-supervised tracking method and device based on spatiotemporal perception
CN116630869A (en) * 2023-07-26 2023-08-22 北京航空航天大学 Video target segmentation method
CN117095019A (en) * 2023-10-18 2023-11-21 腾讯科技(深圳)有限公司 Image segmentation method and related device
CN118314150A (en) * 2024-06-07 2024-07-09 山东君泰安德医疗科技股份有限公司 Bone CT image segmentation system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319972A (en) * 2018-01-18 2018-07-24 南京师范大学 A kind of end-to-end difference online learning methods for image, semantic segmentation
GB201911502D0 (en) * 2018-10-12 2019-09-25 Adobe Inc Space-time memory network for locating target object in video content
CN110659958A (en) * 2019-09-06 2020-01-07 电子科技大学 Clothing matching generation method based on generative adversarial network
US20200034971A1 (en) * 2018-07-27 2020-01-30 Adobe Inc. Image Object Segmentation Based on Temporal Information
CN111050219A (en) * 2018-10-12 2020-04-21 奥多比公司 Spatio-temporal memory network for locating target objects in video content
CN111915571A (en) * 2020-07-10 2020-11-10 云南电网有限责任公司带电作业分公司 Image change detection method, device, storage medium and equipment fusing residual error network and U-Net network
CN111914756A (en) * 2020-08-03 2020-11-10 北京环境特性研究所 Video data processing method and device
CN112215085A (en) * 2020-09-17 2021-01-12 云南电网有限责任公司昆明供电局 Power transmission corridor foreign matter detection method and system based on twin network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319972A (en) * 2018-01-18 2018-07-24 南京师范大学 A kind of end-to-end difference online learning methods for image, semantic segmentation
US20200034971A1 (en) * 2018-07-27 2020-01-30 Adobe Inc. Image Object Segmentation Based on Temporal Information
GB201911502D0 (en) * 2018-10-12 2019-09-25 Adobe Inc Space-time memory network for locating target object in video content
CN111050219A (en) * 2018-10-12 2020-04-21 奥多比公司 Spatio-temporal memory network for locating target objects in video content
CN110659958A (en) * 2019-09-06 2020-01-07 电子科技大学 Clothing matching generation method based on generative adversarial network
CN111915571A (en) * 2020-07-10 2020-11-10 云南电网有限责任公司带电作业分公司 Image change detection method, device, storage medium and equipment fusing residual error network and U-Net network
CN111914756A (en) * 2020-08-03 2020-11-10 北京环境特性研究所 Video data processing method and device
CN112215085A (en) * 2020-09-17 2021-01-12 云南电网有限责任公司昆明供电局 Power transmission corridor foreign matter detection method and system based on twin network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHISHAN ZHOU; LEJIAN REN; PENGFEI XIONG; YIFEI JI; PEISEN WANG; HAOQIANG FAN; SI LIU: "Enhanced Memory Network for Video Segmentation", IEEE, 5 March 2020 (2020-03-05), pages 689 - 692 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114092508A (en) * 2021-12-09 2022-02-25 京东科技信息技术有限公司 Image segmentation method based on video, training method and training device of segmentation model
CN114565637A (en) * 2022-01-14 2022-05-31 厦门理工学院 A Single Target Tracking Method Based on Feature Enhancement and Video History Frames
CN114565637B (en) * 2022-01-14 2024-05-14 厦门理工学院 Single-target tracking method based on feature enhancement and video history frames
CN115100338A (en) * 2022-06-14 2022-09-23 咪咕文化科技有限公司 Video generation method, apparatus, device, and computer-readable storage medium
CN115147457A (en) * 2022-07-08 2022-10-04 河南大学 Memory-enhanced self-supervised tracking method and device based on spatiotemporal perception
CN116630869A (en) * 2023-07-26 2023-08-22 北京航空航天大学 Video target segmentation method
CN116630869B (en) * 2023-07-26 2023-11-07 北京航空航天大学 A video target segmentation method
CN117095019A (en) * 2023-10-18 2023-11-21 腾讯科技(深圳)有限公司 Image segmentation method and related device
CN117095019B (en) * 2023-10-18 2024-05-10 腾讯科技(深圳)有限公司 Image segmentation method and related device
CN118314150A (en) * 2024-06-07 2024-07-09 山东君泰安德医疗科技股份有限公司 Bone CT image segmentation system

Similar Documents

Publication Publication Date Title
CN113506316A (en) Method and device for segmenting video object and network model training method
CN111050219B (en) Method and system for processing video content using a spatio-temporal memory network
CN112749666B (en) Training and action recognition method of action recognition model and related device
CN111738054B (en) Behavior anomaly detection method based on space-time self-encoder network and space-time CNN
CN113408398B (en) Remote sensing image cloud detection method based on channel attention and probability up-sampling
CN111027559A (en) A Point Cloud Semantic Segmentation Method Based on Dilated Point Convolution Spatial Pyramid Pooling
CN108734169A (en) One kind being based on the improved scene text extracting method of full convolutional network
CN112884802B (en) Attack resistance method based on generation
CN116343052A (en) An attention-based and multi-scale change detection network for bitemporal remote sensing images
GB2579262A (en) Space-time memory network for locating target object in video content
CN116978011B (en) Image semantic communication method and system for intelligent target recognition
CN114387512A (en) Building extraction method from remote sensing images based on multi-scale feature fusion and enhancement
US8571255B2 (en) Scalable media fingerprint extraction
CN111242068A (en) Behavior recognition method and device based on video, electronic equipment and storage medium
CN111898461A (en) A method for generating time series behavior fragments
CN117115663A (en) Remote sensing image change detection system and method based on deep supervision network
CN118196468A (en) A hyperspectral image classification method, system and storage medium
CN111582284B (en) Privacy protection method and device for image recognition and electronic equipment
CN110659572B (en) Video Action Detection Method Based on Bidirectional Feature Pyramid
Zhang et al. HCGNet: A hybrid change detection network based on CNN and GNN
Zhang et al. Adaptive differentiation siamese fusion network for remote sensing change detection
Wei et al. Spatio-temporal feature fusion and guide aggregation network for remote sensing change detection
Feng et al. Two-level feature fusion network for remote sensing image change detection
CN117765363A (en) Image anomaly detection method and system based on lightweight memory bank
CN116310320A (en) Improved weak supervision semantic segmentation method of convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20250205

Address after: No. 257, 2nd Floor, Building 9, No. 2 Huizhu Road, Liangjiang New District, Yubei District, Chongqing, China 401123

Applicant after: Force Map New (Chongqing) Technology Co.,Ltd.

Country or region after: China

Address before: 316-318, block a, Rongke Information Center, No.2, South Road, Academy of Sciences, Haidian District, Beijing 100090

Applicant before: MEGVII (BEIJING) TECHNOLOGY Co.,Ltd.

Country or region before: China