CN112861880A - Weak supervision RGBD image saliency detection method and system based on image classification - Google Patents
Weak supervision RGBD image saliency detection method and system based on image classification
- Publication number
- CN112861880A (application CN202110245920.XA)
- Authority
- CN
- China
- Prior art keywords
- map
- network model
- image
- saliency
- depth
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24—Classification techniques
- G06F18/253—Fusion techniques of extracted features
- G06N3/045—Combinations of networks
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
Abstract
The invention relates to a weakly supervised RGBD image saliency detection method and system based on image classification. The method comprises the following steps. Step S1: for the images in a training data set, generate a class response map and an initial saliency map using a gradient-based class response mechanism and an RGBD salient object detection algorithm, respectively. Step S2: perform depth optimization on the class response map and the initial saliency map, and fuse them to generate an initial saliency pseudo label. Step S3: construct a network model and a hybrid loss function for RGBD image saliency detection; train the network model, learning its optimal parameters by minimizing the hybrid loss, to obtain the trained network model. Step S4: predict the saliency map of an RGBD image with the trained network model. The method and system help improve the accuracy of weakly supervised RGBD image saliency detection.
Description
Technical Field
The invention belongs to the field of image processing and computer vision, and particularly relates to a weakly supervised RGBD image saliency detection method and system based on image classification.
Background
Fully supervised saliency detection algorithms rely on pixel-level annotation, so the cost of manual labeling is very high. Therefore, in recent years, some scholars have studied weakly supervised saliency detection algorithms, which use low-cost labels such as image-level annotations or a single bounding box for supervised training of saliency detection. Parthipan Siva et al. propose a weakly supervised image saliency detection method with bounding box labels, which treats saliency detection as a sampling problem. Wang et al. use image-level labels for saliency detection for the first time; they combine the saliency detection task with the image classification task and use a multi-task architecture to achieve weakly supervised saliency detection. Zeng et al. propose a multi-source weakly supervised saliency detection framework to remedy the deficiencies of classification labels. Zhang et al., in recent work, propose a network structure for weakly supervised saliency detection based on scribble labels, together with a corresponding data set. However, these methods study weakly supervised saliency detection for pure RGB images only and rarely address weakly supervised saliency detection for RGBD images.
Disclosure of Invention
The invention aims to provide a method and a system for weakly supervised RGBD image saliency detection based on image classification, which help improve the saliency detection accuracy for weakly supervised RGBD images.
In order to achieve the above purpose, the invention adopts the following technical scheme: a weakly supervised RGBD image saliency detection method based on image classification, comprising the following steps:
step S1: for the images in the training data set, generate a class response map I_cam and an initial saliency map S_cdcp using a gradient-based class response mechanism and an RGBD salient object detection algorithm, respectively;
step S2: perform depth optimization on the class response map and the initial saliency map, and fuse them to generate an initial saliency pseudo label Y_noisy;
step S3: construct a network model and a hybrid loss function for RGBD image saliency detection; train the network model, learning its optimal parameters by minimizing the hybrid loss, to obtain the trained network model;
step S4: predict the saliency map of the RGBD image with the trained network model.
Further, the step S1 specifically includes the following steps:
step S11: scaling each color image and the corresponding depth image in the training data set together to ensure that the sizes of all RGBD images in the training data set are the same;
step S12: color map I after zoomingrgbInputting a pre-trained classification network model ResNet50 for image classification to obtain a final layer generation characteristic diagram set of ResNet50 convolutional layer, and defining the final layer generation characteristic diagram set as a matrix A belonging to RH×W×NWherein H, W represents the height and width of the feature map and N represents the number of channels; in the gradient-based class response mechanism, a feature map set A is linearly combined into a class response map, and the weight of the linear combination is determined by the partial derivative of the classification probability on the feature map; the method specifically comprises the following steps: first, the classification result y of the last layer is divided intocAnd the kth feature map A in the feature map setkPartial derivatives are calculated and are obtained through global average pooling to act on the feature mapLinear combining weights ofIt is formulated as:
wherein GAP (-) represents a global average pooling operator,represents a partial derivative operation;
secondly, linearly combining the feature graphs and generating a preliminary class response graph through Relu function filteringIt is formulated as:
wherein Relu (-) denotes a Relu activation function, and Σ denotes a summing operation;
finally, normalizing the preliminary class response graph to obtain a final class response graph IcamIt is formulated as:
wherein MaxPool represents maximum pooling;
step S13: color drawing IrgbAnd depth map IdepthMeanwhile, an initial saliency map S is generated through an RGBD image saliency detection algorithm based on central dark channel priorcdcpIt is formulated as:
Scdcp=functioncdcp(Irgb,Idepth)
wherein the functioncdcp(. represents a groupAnd performing a priori RGBD image saliency detection algorithm on the central dark channel.
Further, the step S2 specifically includes the following steps:
step S21: firstly, carrying out depth enhancement on the category response map Icam through a depth map Idepth to obtain a depth-enhanced category response mapThen carrying out deep optimization through a conditional random field to obtain an optimized class response mapIt is formulated as:
wherein,expressing pixel-by-pixel dot multiplication, CRF (-) expressing conditional random field optimization, and alpha expressing a hyperparameter larger than 1;
step S22: by depth map IdepthFor the initial saliency map ScdcpCarrying out depth enhancement to obtain a depth-enhanced saliency mapThen carrying out depth optimization through a conditional random field to obtain an optimized saliency mapIt is formulated as:
wherein,expressing pixel-by-pixel dot multiplication, CRF (-) expressing conditional random field optimization, and beta expressing a hyperparameter larger than 1;
step S23: the optimized class response graphAnd saliency mapFusing to a pseudo tag Y with lower noiseNoisyThe method is used for training the network model and is formulated as follows:
where x denotes a multiplier and δ denotes a parameter greater than 0 and less than 1.
Further, the step S3 specifically includes the following steps:
step S31: constructing a network model for RGBD image significance detection, wherein the network model consists of a feature fusion module and a full convolution neural network (FCN) module;
step S32: and constructing a mixed loss function comprising weighted cross entropy loss, conditional random field inference loss and edge loss, and training the network model by using the mixed loss function to obtain the network model with good robustness.
Further, the step S31 specifically includes the following steps:
step S311: constructing a characteristic fusion module which is formed by two 3 multiplied by 3 convolutions and used for inputting a color image I of the network modelrgbAnd depth map IdepthCarrying out feature fusion; firstly, carrying out channel splicing on an input color image and a depth image to generate a network model input with the size of (b, 4, h, w); this input is then convolved by two layers 3X 3 to obtain a feature X' of size (b, 3, h, w), which is expressed by the formula:
Input=Concat(Irgb,Idepth)
X=Conv3×3(Input)
X′=Conv3×3(X)
wherein, Concat () represents a splicing operator, Input represents the Input of the network model, and X represents the intermediate feature of convolution;
step S312: the FCN module changes the last layer of the classification network into a convolution layer and performs pooling on the 5 th layer of the classification network to obtain the characteristic Feat5Performing upsampling, performing convolution to obtain features with fewer channels, and performing activation function to obtain a final significance prediction graph, wherein the final significance prediction graph is expressed by a formula:
out=FCN(X′)
S=Sigmoid(out)
wherein, FCN (-) represents FCN module, out represents output of network model, Sigmoid (-) represents Sigmoid activating function, and S represents saliency map predicted by network model.
Further, the step S32 specifically includes the following steps:
step S321: reconstructing an original cross entropy loss function to obtain a weighted cross entropy loss function, and reducing the influence of noise in a label during network model training, wherein the formula expression is as follows:
w=|Y[i,j]-0.5|
wherein w denotes acting on a certain pixelThe weight of the loss of (a) is,representing a weighted cross-entropy loss function, YNoisyIndicates the pseudo tag generated in step S23,representing an original cross entropy loss function, Y representing a real label, i and j representing indexes of rows and columns where pixels are located, log (-) representing a logarithmic function, and | represents an absolute value operator;
step S322: constructing a conditional random field inference loss function, so that the network model can infer uncertain regions in the pseudo labels through the determined labels, and the formula expression is as follows:
Scrf=CRF(S,Irgb)
wherein CRF (. cndot.) represents conditional random field optimization, ScrfRepresenting the saliency map after conditional random field optimization, in this step as saliency map S for label supervised prediction,representing a conditional random field inference loss function;
step S323: constructing an edge loss function to optimize the edges of the prediction saliency map;
firstly, a color image IrgbConverted into a grey-scale map IgrayAnd obtaining a global edge map I through an edge detection operatoredgeIt is formulated as:
Iedge=ΔIgray
wherein Δ represents a gradient operation in edge detection;
secondly, the predicted saliency map S is subjected to expansion and erosion operations to generate a mask map ImaskActing on the edge map to filter out redundant edges to obtain edge lossA missing label, which is formulated as:
Sdil=Dilate(S)
Sero=Erode(S)
Imask=Sdil-Sero
wherein, the die (-) represents the dilation operation, the Erode (-) represents the erosion operation,indicating a pixel-by-pixel dot multiplication, YedgeA label representing an effect on edge loss;
wherein Δ S represents an edge map of the predicted saliency map;
step S324: the losses in steps S321-S323 are summed to obtain the final mixed loss function:
Further, the hybrid loss function is optimized with an Adam optimizer to obtain the optimal parameters of the network model, which are used when testing the network model.
The invention also provides a weakly supervised RGBD image saliency detection system based on image classification, comprising a memory, a processor and a computer program stored on the memory and executable on the processor; when the computer program is executed by the processor, the above method steps are implemented.
Compared with the prior art, the invention has the following beneficial effects: the invention provides a weakly supervised RGBD image saliency detection scheme, designs a depth optimization strategy to optimize the pseudo label, takes into account both the noise in the pseudo label and incompletely labeled objects, and constructs a hybrid loss that enables the model to effectively infer the full extent of the object.
Drawings
Fig. 1 is a schematic flow chart of a method implementation of the embodiment of the present invention.
FIG. 2 is a network model architecture diagram of weakly supervised RGBD image saliency detection in an embodiment of the present invention.
FIG. 3 is a schematic diagram of a feature fusion module according to an embodiment of the invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides a weakly supervised RGBD image saliency detection method based on image classification, including the following steps:
step S1: for the images in the training data set, respectively utilizing a class response mechanism based on gradient and a traditional RGBD image salient object detection algorithm to generate a class response image IcamAnd an initial saliency map Scdcp。
Step S2: carrying out depth optimization on the class response graph and the initial saliency map, and fusing the class response graph and the initial saliency map to generate an initial saliency map pseudo label Ynoisy。
Step S3: and constructing a network model and a mixing loss function for the RGBD image saliency detection. And training the network model, and learning the optimal parameters of the network model by minimizing the hybrid loss to obtain the trained network model.
Step S4: and predicting a saliency map of the RGBD image by using the trained network model.
In fig. 2, the color map is denoted RGB and the depth map is denoted Depth; the gradient-based class response mechanism corresponds to the upper network branch shown in fig. 2.
In this embodiment, the step S1 specifically includes the following steps:
step S11: scaling each color map in the training data set and its corresponding depth map together to make all the RGBD images in the training data set have the same size, so that the saliency map pseudo label Y generated in step S2noisyHave the same size.
Step S12: color map I after zoomingrgbInputting a pre-trained classification network model ResNet50 for image classification to obtain a final layer generation characteristic diagram set of ResNet50 convolutional layer, and defining the final layer generation characteristic diagram set as a matrix A belonging to RH×W×NWhere H, W denotes the height and width of the feature map and N denotes the number of channels. In the gradient-based class response mechanism, the feature map set a is linearly combined into a class response map, and the weight of the linear combination is determined by the partial derivative of the classification probability on the feature map. The method specifically comprises the following steps: first, the classification result y of the last layer is divided intocAnd the kth feature map A in the feature map setkPartial derivatives are calculated and passed through global averagingPooling results in linear combination weights acting on the profileIt is formulated as:
wherein GAP (-) represents a global average pooling operator,representing partial derivative operations.
Secondly, linearly combining the feature graphs and generating a preliminary class response graph through Relu function filteringIt is formulated as:
where Relu (-) denotes the Relu activation function and Σ denotes the summing operation.
Finally, normalizing the preliminary class response graph to obtain a final class response graph Icam(e.g., the category response graph in FIG. 2), which is formulated as:
where MaxPool represents the maximum pooling.
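For illustration only, the gradient-based class response computation of step S12 might be sketched in PyTorch as follows. This is a minimal sketch, not the patented implementation: the hook-based gradient capture, the torchvision ResNet-50 weights and the use of the predicted class when no class index is supplied are assumptions of the example.

```python
import torch
import torch.nn.functional as F
from torchvision import models

def class_response_map(rgb, class_idx=None):
    """Sketch of the gradient-based class response map (step S12).

    rgb: float tensor of shape (1, 3, H, W), already scaled as in step S11.
    Returns a normalized class response map I_cam of shape (1, 1, h, w).
    """
    net = models.resnet50(pretrained=True).eval()

    feats = {}
    # Capture the feature map set A from the last convolutional stage (layer4).
    handle = net.layer4.register_forward_hook(
        lambda m, i, o: feats.__setitem__("A", o))

    logits = net(rgb)                          # classification result
    if class_idx is None:
        class_idx = int(logits.argmax(dim=1))  # assumed: use the predicted class
    y_c = logits[0, class_idx]

    # Partial derivative of y_c with respect to the feature maps A.
    grads = torch.autograd.grad(y_c, feats["A"])[0]      # (1, N, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)       # GAP -> (1, N, 1, 1)

    # Linear combination of feature maps, filtered by Relu.
    cam = F.relu((weights * feats["A"]).sum(dim=1, keepdim=True))

    # Normalize by the global maximum (the "MaxPool" normalization).
    cam = cam / (cam.amax(dim=(2, 3), keepdim=True) + 1e-8)
    handle.remove()
    return cam
```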
Step S13: color drawing IrgbAnd depth map IdepthMeanwhile, an initial saliency map S is generated through an RGBD image saliency detection algorithm based on central dark channel priorcdcpIt is formulated as:
Scdcp=functioncdcp(Irgb,Idepth)
wherein the functioncdcp(. to) shows an RGBD image saliency detection algorithm based on a central dark channel prior.
In this embodiment, the step S2 specifically includes the following steps:
step S21: firstly, carrying out depth enhancement on the category response map Icam through a depth map Idepth to obtain a depth-enhanced category response mapThen carrying out deep optimization through a conditional random field to obtain an optimized class response mapIt is formulated as:
wherein,representing a pixel-by-pixel dot product, CRF (-) represents a conditional random field optimization, and α represents a hyperparameter greater than 1.
Step S22: by depth map IdepthFor the initial saliency map ScdcpCarrying out depth enhancement to obtain a depth-enhanced saliency mapThen carrying out depth optimization through a conditional random field to obtain an optimized saliency mapIt is formulated as:
wherein,representing a pixel-by-pixel dot product, CRF (·) represents conditional random field optimization, and β represents a hyperparameter greater than 1.
Step S23: the optimized class response graphAnd saliency mapFusing to a pseudo tag Y with lower noiseNoisy(noise label in fig. 2) for training of the network model, which is formulated as:
where x denotes a multiplier and δ denotes a parameter greater than 0 and less than 1.
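The exact depth-enhancement and fusion formulas of steps S21–S23 are given as figures in the original disclosure and are not reproduced here, so the sketch below only illustrates one plausible reading: depth enhancement as a pixel-wise product with the depth map raised to the hyperparameter power, CRF refinement delegated to a placeholder dense_crf helper, and fusion as a δ-weighted combination. All of these specific forms are assumptions made for illustration.

```python
import numpy as np

def dense_crf(prob_map, rgb_image):
    """Placeholder for the CRF(.) refinement of steps S21/S22.

    A real implementation would wrap a fully connected CRF library; the
    identity mapping here only keeps the sketch runnable."""
    return prob_map

def make_pseudo_label(i_cam, s_cdcp, depth, rgb, alpha=2.0, beta=2.0, delta=0.5):
    """Sketch of steps S21-S23: depth enhancement, CRF optimization, fusion.

    i_cam, s_cdcp, depth: float arrays in [0, 1] with shape (H, W).
    alpha, beta > 1 and 0 < delta < 1 are the hyperparameters named in the
    text; the exact formulas are assumptions of this example.
    """
    # Step S21: depth-enhance the class response map, then refine with a CRF.
    cam_dep = i_cam * np.power(depth, alpha)        # assumed enhancement form
    cam_opt = dense_crf(cam_dep, rgb)

    # Step S22: depth-enhance the initial saliency map, then refine with a CRF.
    sal_dep = s_cdcp * np.power(depth, beta)        # assumed enhancement form
    sal_opt = dense_crf(sal_dep, rgb)

    # Step S23: fuse the two optimized maps into a lower-noise pseudo label.
    y_noisy = delta * cam_opt + (1.0 - delta) * sal_opt   # assumed fusion form
    return y_noisy
```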
In this embodiment, the step S3 specifically includes the following steps:
step S31: and constructing a network model (as shown in figure 2) for the RGBD image saliency detection, wherein the network model consists of a feature fusion module (as shown in figure 3) and a full convolution neural network (FCN) module. The step S31 specifically includes the following steps:
step S311: constructing a characteristic fusion module which is formed by two 3 multiplied by 3 convolutions and used for inputting a color image I of the network modelrgbAnd depth map IdepthAnd performing feature fusion. Firstly, channel splicing is carried out on the input color image and the input depth image to generate a network model input with the size of (b, 4, h, w). The input is then convolved by two layers 3 x 3 to get largeA characteristic X' as small as (b, 3, h, w), which is expressed by the formula:
Input=Concat(Irgb,Idepth)
X=Conv3×3(Input)
X′=Conv3×3(X)
where Concat (·) represents the concatenation operator, Input represents the Input to the network model, and X represents the intermediate features of the convolution.
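A direct PyTorch rendering of the feature fusion module of step S311 (fig. 3) might look as follows; the number of channels of the intermediate feature X is not stated in the text, so the value used here is an assumption.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Feature fusion module of step S311: concatenate RGB and depth,
    then apply two 3x3 convolutions to obtain a 3-channel feature X'."""

    def __init__(self, mid_channels=16):  # mid_channels is an assumption
        super().__init__()
        self.conv1 = nn.Conv2d(4, mid_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(mid_channels, 3, kernel_size=3, padding=1)

    def forward(self, rgb, depth):
        # Input = Concat(I_rgb, I_depth): shape (b, 4, h, w)
        x = torch.cat([rgb, depth], dim=1)
        x = self.conv1(x)          # X  : (b, mid_channels, h, w)
        x = self.conv2(x)          # X' : (b, 3, h, w)
        return x
```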
Step S312: the FCN module changes the last layer of the classification network into a convolution layer and performs pooling on the 5 th layer of the classification network to obtain the characteristic Feat5Performing upsampling, performing convolution to obtain features with fewer channels, and performing activation function to obtain a final significance prediction graph, wherein the final significance prediction graph is expressed by a formula:
out=FCN(X′)
S=Sigmoid(out)
wherein, FCN (-) represents FCN module, out represents output of network model, Sigmoid (-) represents Sigmoid activating function, and S represents saliency map predicted by network model.
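The FCN module of step S312 can likewise be sketched as below; the backbone choice (ResNet-50, matching the classification network of step S12), the 1×1 prediction convolution and the upsampling factor are assumptions of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class SaliencyFCN(nn.Module):
    """FCN module of step S312: backbone features -> upsample -> conv -> sigmoid."""

    def __init__(self):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        # Keep everything up to layer4; drop avgpool and fc, i.e. replace the
        # final fully connected layer by convolutional prediction.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.predict = nn.Conv2d(2048, 1, kernel_size=1)   # channel-reducing conv

    def forward(self, x_fused):
        feat5 = self.encoder(x_fused)                 # Feat_5 from the 5th stage
        feat5 = F.interpolate(feat5, scale_factor=4,  # upsampling (factor assumed)
                              mode="bilinear", align_corners=False)
        out = self.predict(feat5)                     # out = FCN(X')
        s = torch.sigmoid(out)                        # S = Sigmoid(out)
        return s, out
```

In use, the output of the FeatureFusion sketch above would feed this module, e.g. s, out = SaliencyFCN()(FeatureFusion()(rgb, depth)).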
Step S32: and constructing a mixed loss function comprising weighted cross entropy loss, conditional random field inference loss and edge loss, and training the network model by using the mixed loss function to obtain the network model with good robustness. The step S32 specifically includes the following steps:
step S321: reconstructing an original cross entropy loss function to obtain a weighted cross entropy loss function, and reducing the influence of noise in a label during network model training, wherein the formula expression is as follows:
w=|Y[i,j]-0.5|
wherein w is represented asWith the loss weight on a certain pixel,represents a weighted cross-entropy loss function, YNoisy represents the pseudo label generated in step S23,representing the original cross-entropy loss function, Y the true label, i and j the indices of the row and column where the pixel is located, log (-) represents the logarithmic function, and | represents the absolute operator.
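Read literally, step S321 scales a per-pixel binary cross entropy by w = |Y[i, j] - 0.5|, so that pixels whose pseudo-label value is close to 0.5 (uncertain) contribute little to the loss. The sketch below assumes the pseudo label Y_noisy serves as both the weighting map and the cross entropy target.

```python
import torch

def weighted_bce_loss(pred, y_noisy, eps=1e-6):
    """Weighted cross entropy loss of step S321 (sketch).

    pred    : predicted saliency map S, values in (0, 1), shape (b, 1, h, w)
    y_noisy : pseudo label Y_noisy, values in [0, 1], same shape
    """
    w = torch.abs(y_noisy - 0.5)                        # w = |Y[i, j] - 0.5|
    bce = -(y_noisy * torch.log(pred + eps)
            + (1.0 - y_noisy) * torch.log(1.0 - pred + eps))
    return (w * bce).mean()
```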
Step S322: constructing a conditional random field inference loss function, so that the network model can infer uncertain regions in the pseudo labels through the determined labels, and the formula expression is as follows:
Scrf=CRF(S,Irgb)
wherein CRF (. cndot.) represents conditional random field optimization, ScrfRepresenting the saliency map after conditional random field optimization, in this step as saliency map S for label supervised prediction,representing a conditional random field inference loss function.
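A minimal sketch of the CRF inference loss of step S322, assuming the CRF-refined map is detached from the gradient graph and used as a soft binary cross entropy target; dense_crf is the same placeholder assumed in the pseudo-label sketch above.

```python
import torch

def crf_inference_loss(pred, rgb, eps=1e-6):
    """Conditional random field inference loss of step S322 (sketch).

    pred: predicted saliency map S, shape (b, 1, h, w), values in (0, 1)
    rgb : color images I_rgb, shape (b, 3, h, w), used by the CRF
    """
    with torch.no_grad():
        refined = []
        for p, im in zip(pred, rgb):
            # S_crf = CRF(S, I_rgb); dense_crf is the placeholder assumed earlier.
            s_crf = dense_crf(p.detach().squeeze(0).cpu().numpy(),
                              im.detach().permute(1, 2, 0).cpu().numpy())
            refined.append(torch.as_tensor(s_crf, dtype=pred.dtype))
        s_crf = torch.stack(refined).unsqueeze(1).to(pred.device)

    # Binary cross entropy with the CRF-refined map as the (fixed) label.
    bce = -(s_crf * torch.log(pred + eps)
            + (1.0 - s_crf) * torch.log(1.0 - pred + eps))
    return bce.mean()
```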
Step S323: constructing an edge loss function optimizes predicting the edges of the saliency map.
Firstly, a color image IrgbConverted into a grey-scale map IgrayAnd obtaining a global edge map I through an edge detection operatoredgeIt is formulated as:
Iedge=ΔIgray
where Δ represents the gradient operation in edge detection.
Secondly, the predicted saliency map S is subjected to expansion and erosion operations to generate a mask map ImaskActing on the edge map to filter out redundant edges to obtainA label for edge loss, formulated as:
Sdil=Dilate(S)
Sero=Erode(S)
Imask=Sdil-Sero
wherein, the die (-) represents the dilation operation, the Erode (-) represents the erosion operation,indicating a pixel-by-pixel dot multiplication, YedgeIndicating the label acting on the edge loss.
here, AS represents an edge graph of the predicted saliency map.
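Step S323 can be sketched with morphological max pooling for dilation and erosion and a Sobel-style gradient for the Δ operator; the structuring-element size and the L1 comparison between ΔS and Y_edge are assumptions, since the exact edge-loss formula is given as a figure in the original disclosure.

```python
import torch
import torch.nn.functional as F

def _gradient_magnitude(x):
    """Simple Sobel gradient magnitude, used as the Δ edge operator (assumed)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(x, kx, padding=1)
    gy = F.conv2d(x, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def edge_loss(pred, gray, k=5):
    """Edge loss of step S323 (sketch).

    pred: predicted saliency map S, shape (b, 1, h, w)
    gray: grayscale image I_gray, shape (b, 1, h, w)
    k   : size of the structuring element for dilation/erosion (assumed)
    """
    i_edge = _gradient_magnitude(gray)                         # I_edge = ΔI_gray
    s_dil = F.max_pool2d(pred, k, stride=1, padding=k // 2)    # Dilate(S)
    s_ero = -F.max_pool2d(-pred, k, stride=1, padding=k // 2)  # Erode(S)
    i_mask = s_dil - s_ero                                     # I_mask
    y_edge = i_mask * i_edge                                   # label on the edge loss
    s_edge = _gradient_magnitude(pred)                         # ΔS
    return torch.abs(s_edge - y_edge).mean()                   # assumed comparison
```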
Step S324: the losses in steps S321-S323 are summed to obtain the final mixed loss function:
And then, optimizing the mixing loss function through an Adam optimizer to obtain the optimal parameters of the network model for testing the network model.
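Putting the pieces together, a training loop over the pseudo labels might look like the sketch below. The Adam optimizer and the plain sum of the three losses follow the text; the batch composition, learning rate and number of epochs are assumptions.

```python
import torch

def train(model, fusion, loader, epochs=30, lr=1e-4):
    """Sketch of training with the hybrid loss (steps S3 and S324).

    loader yields (rgb, depth, gray, y_noisy) batches; model and fusion are
    the SaliencyFCN and FeatureFusion sketches above.
    """
    params = list(model.parameters()) + list(fusion.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)   # Adam optimizer, as in the text

    for epoch in range(epochs):
        for rgb, depth, gray, y_noisy in loader:
            s, _ = model(fusion(rgb, depth))      # predicted saliency map S

            # Hybrid loss: sum of the three losses of steps S321-S323.
            loss = (weighted_bce_loss(s, y_noisy)
                    + crf_inference_loss(s, rgb)
                    + edge_loss(s, gray))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model, fusion
```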
The invention also provides a weakly supervised RGBD image saliency detection system based on image classification, comprising a memory, a processor and a computer program stored on the memory and executable on the processor; when the computer program is executed by the processor, the above method steps are implemented.
The depth map expresses the spatial distance between the object and the camera and can provide sufficient position information; a depth map with a small noise amplitude can also provide complete object structure information. The depth map is therefore regarded as additional auxiliary information for weakly supervised image saliency detection. The invention provides a weakly supervised RGBD image saliency detection framework, designs a depth optimization strategy to optimize the pseudo label, takes into account both the noise in the pseudo label and incompletely labeled objects, and designs a hybrid loss that enables the model to effectively infer the full extent of the object, thereby significantly improving the detection accuracy for weakly supervised RGBD salient objects.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.
Claims (8)
1. A weakly supervised RGBD image saliency detection method based on image classification, characterized by comprising the following steps:
step S1: for the images in the training data set, generating a class response map I_cam and an initial saliency map S_cdcp using a gradient-based class response mechanism and an RGBD salient object detection algorithm, respectively;
step S2: performing depth optimization on the class response map and the initial saliency map, and fusing them to generate an initial saliency pseudo label Y_noisy;
step S3: constructing a network model and a hybrid loss function for RGBD image saliency detection; training the network model, and learning the optimal parameters of the network model by minimizing the hybrid loss to obtain the trained network model;
step S4: predicting a saliency map of the RGBD image by using the trained network model.
2. The weakly supervised RGBD image saliency detection method based on image classification as claimed in claim 1, wherein the step S1 specifically includes the following steps:
step S11: scaling each color image and the corresponding depth image in the training data set together to ensure that the sizes of all RGBD images in the training data set are the same;
step S12: color map I after zoomingrgbInputting a pre-trained classification network model ResNet50 for image classification to obtain a final layer generation characteristic diagram set of ResNet50 convolutional layer, and defining the final layer generation characteristic diagram set as a matrix A belonging to RH×W×NWherein H, W represents the height and width of the feature map and N represents the number of channels; in the gradient-based class response mechanism, a feature map set A is linearly combined into a class response map, and the weight of the linear combination is determined by the partial derivative of the classification probability on the feature map; the method specifically comprises the following steps: first, the classification result y of the last layer is divided intocAnd the kth feature map A in the feature map setkPartial derivatives are calculated and linear combination weights acting on the feature map are obtained through global average poolingIt is formulated as:
wherein GAP (-) represents a global average pooling operator,represents a partial derivative operation;
secondly, the feature maps are linearly groupedCombined and filtered by Relu function to generate preliminary class response graphIt is formulated as:
wherein Relu (-) denotes a Relu activation function, and Σ denotes a summing operation;
finally, normalizing the preliminary class response graph to obtain a final class response graph IcamIt is formulated as:
wherein MaxPool represents maximum pooling;
step S13: color drawing IrgbAnd depth map IdepthMeanwhile, an initial saliency map S is generated through an RGBD image saliency detection algorithm based on central dark channel priorcdcpIt is formulated as:
Scdcp=functioncdcp(Irgb,Idepth)
wherein the functioncdcp(. to) shows an RGBD image saliency detection algorithm based on a central dark channel prior.
3. The weakly supervised RGBD image saliency detection method based on image classification as claimed in claim 2, wherein the step S2 specifically includes the following steps:
step S21: first by a depth map IdepthResponse to class diagram IcamCarrying out depth enhancement to obtain a class response map with the depth enhancementThen subject to the following conditionsCarrying out deep optimization on the airport to obtain an optimized class response diagramIt is formulated as:
wherein,expressing pixel-by-pixel dot multiplication, CRF (-) expressing conditional random field optimization, and alpha expressing a hyperparameter larger than 1;
step S22: by depth map IdepthFor the initial saliency map ScdcpCarrying out depth enhancement to obtain a depth-enhanced saliency mapThen carrying out depth optimization through a conditional random field to obtain an optimized saliency mapIt is formulated as:
wherein,expressing pixel-by-pixel dot multiplication, CRF (-) expressing conditional random field optimization, and beta expressing a hyperparameter larger than 1;
step S23: the optimized class response graphAnd saliency mapFusing to a pseudo tag Y with lower noiseNoisyThe method is used for training the network model and is formulated as follows:
where x denotes a multiplier and δ denotes a parameter greater than 0 and less than 1.
4. The weakly supervised RGBD image saliency detection method based on image classification as claimed in claim 3, wherein the step S3 specifically includes the following steps:
step S31: constructing a network model for RGBD image significance detection, wherein the network model consists of a feature fusion module and a full convolution neural network (FCN) module;
step S32: and constructing a mixed loss function comprising weighted cross entropy loss, conditional random field inference loss and edge loss, and training the network model by using the mixed loss function to obtain the network model with good robustness.
5. The weakly supervised RGBD image saliency detection method based on image classification as claimed in claim 4, wherein the step S31 specifically includes the following steps:
step S311: constructing a characteristic fusion module which is formed by two 3 multiplied by 3 convolutions and used for inputting a color image I of the network modelrgbAnd depth map IdepthCarrying out feature fusion; firstly, carrying out channel splicing on an input color image and a depth image to generate a network model input with the size of (b, 4, h, w); this input is then convolved by two layers 3X 3 to obtain a feature X' of size (b, 3, h, w), which is expressed by the formula:
Input=Concat(Irgb,Idepth)
X=Conv3×3(Input)
X′=Conv3×3(X)
wherein, Concat () represents a splicing operator, Input represents the Input of the network model, and X represents the intermediate feature of convolution;
step S312: the FCN module changes the last layer of the classification network into a convolution layer and performs pooling on the 5 th layer of the classification network to obtain the characteristic Feat5Performing upsampling, performing convolution to obtain features with fewer channels, and performing activation function to obtain a final significance prediction graph, wherein the final significance prediction graph is expressed by a formula:
out=FCN(X′)
S=Sigmoid(out)
wherein, FCN (-) represents FCN module, out represents output of network model, Sigmoid (-) represents Sigmoid activating function, and S represents saliency map predicted by network model.
6. The weakly supervised RGBD image saliency detection method based on image classification as claimed in claim 5, wherein the step S32 specifically includes the following steps:
step S321: reconstructing an original cross entropy loss function to obtain a weighted cross entropy loss function, and reducing the influence of noise in a label during network model training, wherein the formula expression is as follows:
w=|Y[i,j]-0.5|
where w represents the loss weight applied to a pixel,representing a weighted cross-entropy loss function, YNoisyIndicates the pseudo tag generated in step S23,representing an original cross entropy loss function, Y representing a real label, i and j representing indexes of rows and columns where pixels are located, log (-) representing a logarithmic function, and | represents an absolute value operator;
step S322: constructing a conditional random field inference loss function, so that the network model can infer uncertain regions in the pseudo labels through the determined labels, and the formula expression is as follows:
Scrf=CRF(S,Irgb)
wherein CRF (. cndot.) represents conditional random field optimization, ScrfRepresenting the saliency map after conditional random field optimization, in this step as saliency map S for label supervised prediction,representing a conditional random field inference loss function;
step S323: constructing an edge loss function to optimize the edges of the prediction saliency map;
firstly, a color image IrgbConverted into a grey-scale map IgrayAnd obtaining a global edge map I through an edge detection operatoredgeIt is formulated as:
Iedge=ΔIgray
wherein Δ represents a gradient operation in edge detection;
secondly, the predicted saliency map S is subjected to expansion and erosion operations to generate a mask map ImaskAnd filtering redundant edges on the edge graph to obtain a label of edge loss, wherein the label is expressed by the formula:
Sdil=Dilate(S)
Sero=Erode(S)
Imask=Sdil-Sero
wherein, the die (-) represents the dilation operation, the Erode (-) represents the erosion operation,indicating a pixel-by-pixel dot multiplication, YedgeA label representing an effect on edge loss;
wherein Δ S represents an edge map of the predicted saliency map;
step S324: the losses in steps S321-S323 are summed to obtain the final mixed loss function:
7. The image classification-based weakly supervised RGBD image saliency detection method according to claim 6, characterized in that the hybrid loss function is optimized by an Adam optimizer to obtain the optimal parameters of the network model, which are used for testing the network model.
8. A weakly supervised RGBD image saliency detection system based on image classification, comprising a memory, a processor and a computer program stored on the memory and being executable on the processor, the computer program, when executed by the processor, implementing the method steps of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110245920.XA CN112861880B (en) | 2021-03-05 | 2021-03-05 | Weak supervision RGBD image saliency detection method and system based on image classification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110245920.XA CN112861880B (en) | 2021-03-05 | 2021-03-05 | Weak supervision RGBD image saliency detection method and system based on image classification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112861880A true CN112861880A (en) | 2021-05-28 |
CN112861880B CN112861880B (en) | 2021-12-07 |
Family
ID=75994082
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110245920.XA Active CN112861880B (en) | 2021-03-05 | 2021-03-05 | Weak supervision RGBD image saliency detection method and system based on image classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112861880B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113436115A (en) * | 2021-07-30 | 2021-09-24 | 西安热工研究院有限公司 | Image shadow detection method based on depth unsupervised learning |
CN115080748A (en) * | 2022-08-16 | 2022-09-20 | 之江实验室 | Weak supervision text classification method and device based on noisy label learning |
CN116978008A (en) * | 2023-07-12 | 2023-10-31 | 睿尔曼智能科技(北京)有限公司 | RGBD-fused semi-supervised target detection method and system |
- 2021-03-05: CN application CN202110245920.XA filed; patent CN112861880B (en), status Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102364560A (en) * | 2011-10-19 | 2012-02-29 | 华南理工大学 | Traffic sign convenient for electronic identification and method for identifying traffic sign |
CN105791660A (en) * | 2014-12-22 | 2016-07-20 | 中兴通讯股份有限公司 | Method and device for correcting photographing inclination of photographed object and mobile terminal |
CN107292318A (en) * | 2017-07-21 | 2017-10-24 | 北京大学深圳研究生院 | Image significance object detection method based on center dark channel prior information |
CN107452030A (en) * | 2017-08-04 | 2017-12-08 | 南京理工大学 | Method for registering images based on contour detecting and characteristic matching |
CN108399406A (en) * | 2018-01-15 | 2018-08-14 | 中山大学 | The method and system of Weakly supervised conspicuousness object detection based on deep learning |
CN109410171A (en) * | 2018-09-14 | 2019-03-01 | 安徽三联学院 | A kind of target conspicuousness detection method for rainy day image |
CN110598609A (en) * | 2019-09-02 | 2019-12-20 | 北京航空航天大学 | Weak supervision target detection method based on significance guidance |
Non-Patent Citations (2)
Title |
---|
CHUNBIAO ZHU ET AL.: "Exploiting the Value of the Center-dark Channel Prior for Salient Object Detection", 《ARXIV:1805.05132V1》 * |
RAMPRASAATH R. SELVARAJU ET AL.: "Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113436115A (en) * | 2021-07-30 | 2021-09-24 | 西安热工研究院有限公司 | Image shadow detection method based on depth unsupervised learning |
CN113436115B (en) * | 2021-07-30 | 2023-09-19 | 西安热工研究院有限公司 | Image shadow detection method based on depth unsupervised learning |
CN115080748A (en) * | 2022-08-16 | 2022-09-20 | 之江实验室 | Weak supervision text classification method and device based on noisy label learning |
CN115080748B (en) * | 2022-08-16 | 2022-11-11 | 之江实验室 | Weak supervision text classification method and device based on learning with noise label |
CN116978008A (en) * | 2023-07-12 | 2023-10-31 | 睿尔曼智能科技(北京)有限公司 | RGBD-fused semi-supervised target detection method and system |
CN116978008B (en) * | 2023-07-12 | 2024-04-26 | 睿尔曼智能科技(北京)有限公司 | RGBD-fused semi-supervised target detection method and system |
Also Published As
Publication number | Publication date |
---|---|
CN112861880B (en) | 2021-12-07 |
Similar Documents
Publication | Title |
---|---|
CN112861880B (en) | Weak supervision RGBD image saliency detection method and system based on image classification | |
CN104268594B (en) | A kind of video accident detection method and device | |
CN110532900B (en) | Facial expression recognition method based on U-Net and LS-CNN | |
CN113657560B (en) | Weak supervision image semantic segmentation method and system based on node classification | |
CN113158862B (en) | Multitasking-based lightweight real-time face detection method | |
CN112906485B (en) | Visual impairment person auxiliary obstacle perception method based on improved YOLO model | |
CN111861925B (en) | Image rain removing method based on attention mechanism and door control circulation unit | |
CN113807355A (en) | Image semantic segmentation method based on coding and decoding structure | |
CN111696110B (en) | Scene segmentation method and system | |
CN110298387A (en) | Incorporate the deep neural network object detection method of Pixel-level attention mechanism | |
CN113065546B (en) | Target pose estimation method and system based on attention mechanism and Hough voting | |
CN111882002A (en) | MSF-AM-based low-illumination target detection method | |
CN108062756A (en) | Image, semantic dividing method based on the full convolutional network of depth and condition random field | |
CN108256562A (en) | Well-marked target detection method and system based on Weakly supervised space-time cascade neural network | |
CN111104903A (en) | Depth perception traffic scene multi-target detection method and system | |
CN111428664B (en) | Computer vision real-time multi-person gesture estimation method based on deep learning technology | |
CN111368637B (en) | Transfer robot target identification method based on multi-mask convolutional neural network | |
CN112801104B (en) | Image pixel level pseudo label determination method and system based on semantic segmentation | |
CN114693930B (en) | Instance segmentation method and system based on multi-scale features and contextual attention | |
CN114821050B (en) | Method for dividing reference image based on transformer | |
CN114724155A (en) | Scene text detection method, system and equipment based on deep convolutional neural network | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN116740362A (en) | Attention-based lightweight asymmetric scene semantic segmentation method and system | |
CN114581789A (en) | Hyperspectral image classification method and system | |
CN117253184B (en) | Foggy day image crowd counting method guided by foggy priori frequency domain attention characterization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||