CN111563418A - Asymmetric multi-modal fusion saliency detection method based on an attention mechanism
- Publication number
- CN111563418A (application number CN202010291052.4A)
- Authority
- CN
- China
- Prior art keywords
- layer
- output
- neural network
- block
- input
- Prior art date
- Legal status
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an asymmetric multi-modal fusion saliency detection method based on an attention mechanism. The RGB image and the depth image of an original stereo image are input into a convolutional neural network for training to obtain a corresponding saliency detection map. The optimal weight vector and bias term of the convolutional neural network training model are obtained by computing a loss function between the set of saliency detection maps generated by the model and the set of corresponding ground-truth human eye fixation maps. The stereo images in the selected data set are then input into the trained convolutional neural network model to obtain their saliency detection maps. The invention adopts an asymmetric encoding structure to fully extract features from the RGB and depth maps, effectively exploits the rich image information of the RGB map through an interior perception module, and adds channel and spatial attention mechanisms, which strengthens the expression of salient regions and salient features and improves the accuracy of visual saliency detection.
Description
Technical Field
The invention relates to a deep-learning-based visual saliency detection method, and in particular to an asymmetric multi-modal fusion saliency detection method based on an attention mechanism.
Background
When looking for an object of interest in an image, a person automatically captures semantic information between the object and its context, gives high attention to salient objects, and selectively suppresses unimportant factors. This precise visual attention mechanism has been explained in various biologically inspired models. The purpose of saliency detection is to automatically detect the most informative and attractive parts of an image. In many image applications, such as image quality assessment, semantic segmentation and image recognition, determining salient objects can not only reduce computational cost but also improve the performance of the model. Early saliency detection methods relied on hand-crafted features: human eye gaze was approximated empirically, mainly from image color, texture, contrast and the like. As saliency research progressed, it was found that such hand-crafted features are not sufficient to capture image features well, because they cannot extract the high-level semantics of object features and their surroundings in the image. Image features can therefore be better extracted with deep learning methods, yielding better saliency detection results. Most existing saliency detection methods adopt deep learning and extract image features with a combination of convolution and pooling layers; however, the features obtained by simply using convolution and pooling operations are not sufficiently representative, and pooling in particular loses feature information of the image, so the resulting saliency prediction maps are of poor quality and low prediction accuracy.
Disclosure of Invention
In order to solve the problems in the background art, the technical problem addressed by the invention is to provide an asymmetric multi-modal fusion saliency detection method based on an attention mechanism with high detection accuracy.
The technical scheme adopted by the invention to solve this technical problem is as follows: an asymmetric multi-modal fusion saliency detection method based on an attention mechanism, characterized by comprising a training stage and a testing stage;
the training stage is as follows: a convolutional neural network is constructed whose input layer receives the RGB image (i.e. the RGB color image) and the corresponding depth image of an original stereo image; the RGB image and the depth image of the original stereo image are input into the convolutional neural network for training to obtain a corresponding saliency detection map; the optimal weight vector and bias term of the convolutional neural network classification training model are obtained by computing a loss function between the set formed by the saliency detection maps generated by the model and the set formed by the corresponding ground-truth human eye fixation maps; and the stereo images in the selected data set are input into the trained convolutional neural network model to obtain their saliency detection images.
The specific steps of the training phase process are as follows:

Step 1.1): collect and select the RGB images and depth images of n original stereo images containing target objects, together with the ground-truth human eye fixation maps obtained by annotation, to form a training set; the HHA method (horizontal disparity, height above ground, and angle between the local surface normal and the inferred gravity direction) is used to process every depth map in the training set into a three-channel set Hi, so that it has three channels like the original stereo image;
The original stereo image is specifically an image of a static scene used for object recognition, for example vehicle or pedestrian detection from a road surveillance camera.
In the training set, the i-th (1 ≤ i ≤ n) original stereo image has an RGB map and a corresponding depth map, and the ground-truth human eye fixation map corresponding to them is denoted {Gi(x, y)}, where (x, y) represents the coordinate position of a pixel point, W represents the width of the original stereo image, H represents its height, 1 ≤ x ≤ W, and 1 ≤ y ≤ H.

Step 1.2): construct a convolutional neural network;

Step 1.3): input the RGB maps and depth maps of the original stereo images in the training set into the constructed convolutional neural network for training to obtain the saliency detection map corresponding to each original stereo image; the saliency detection maps obtained after training form a set of predicted maps.

Step 1.4): compute the value of the loss function between the set of saliency detection maps obtained by training and the set formed by the corresponding ground-truth human eye fixation maps {Gi(x, y)}.

Step 1.5): repeat step 1.3) and step 1.4) for m iterations to obtain the convolutional neural network classification training model, giving n × m loss function values in total; the minimum of these loss function values is then found, and the weight vector and bias term of the convolutional neural network corresponding to this minimum loss value are retained as the optimal weight vector WBest and the optimal bias term BBest of the trained convolutional neural network;
The test stage process comprises the following specific steps:

Step 2.1): the RGB map of the stereo image to be detected and its corresponding depth map are taken; the R channel component, the G channel component and the B channel component are input into the trained convolutional neural network, and a prediction is made using the optimal weight vector WBest and the optimal bias term BBest to obtain the corresponding saliency detection image, where (x', y') denotes the coordinate position of a pixel point, W' denotes the width of the image to be detected, H' denotes its height, 1 ≤ x' ≤ W', and 1 ≤ y' ≤ H'.
As shown in fig. 1, the convolutional neural network in step 1.2) comprises an input layer and a hidden layer, and the output of the hidden layer is the output of the convolutional neural network:

The input end of the input layer receives the RGB map and the depth map of the original stereo image; the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the RGB map together with the encoded map of the depth map, and the output of the input layer is the input of the hidden layer. In the input layer the depth map is processed by HHA encoding so that, like the RGB map, it has three channels, i.e. the depth map is turned into three components after passing through the input layer; the RGB map and the depth map of the original stereo image have the same width W and height H;

the hidden layer comprises the following components: ten neural network blocks, channel attention modules (CAM), an interior perception module (IPM), a spatial attention module (SAM) and four decoding blocks; specifically: the 1st to 10th neural network blocks, the channel attention modules CAM, the interior perception module IPM, the spatial attention module SAM, and the 1st, 2nd, 3rd and 4th decoding blocks;

for the processing of the depth map:

The 1st neural network block is composed of a first convolution layer, a first activation layer, a second convolution layer, a second activation layer and a first maximum pooling layer connected in sequence; its input is the encoded depth map output by the input layer, and its output is a first depth feature map set D1 formed by 64 processed feature maps, each of width W/2 and height H/2.

The 2nd neural network block consists of a third convolution layer, a third activation layer, a fourth convolution layer, a fourth activation layer and a second maximum pooling layer; its input is the 64 feature maps output by the 1st neural network block, and its output is 128 feature maps forming a second depth feature map set D2, each of width W/4 and height H/4.

The input of the 3rd neural network block is the 128 feature maps output by the 2nd neural network block, and its output is 256 feature maps forming a third depth feature map set D3, each of width W/8 and height H/8.

The input of the 4th neural network block is the 256 feature maps output by the 3rd neural network block, and its output is 512 feature maps forming a fourth depth feature map set D4, each of width W/16 and height H/16.

The input of the 5th neural network block is the 512 feature maps output by the 4th neural network block, and its output is 512 feature maps forming a fifth depth feature map set D5, each of width W/32 and height H/32.

The depth map is thus processed by the 1st to 5th neural network blocks to obtain five depth feature map sets D1, D2, D3, D4, D5;
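As a point of reference for the depth branch described above, the following minimal PyTorch sketch (class and layer names are ours, not the patent's) stacks two 3 × 3 convolution/ReLU pairs and a stride-2 max pooling per block, which reproduces the channel counts D1 to D5 listed above:

```python
import torch.nn as nn

def depth_block(in_ch, out_ch):
    # Two 3x3 conv + ReLU pairs followed by a stride-2 max pooling,
    # matching the per-block layout described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class DepthEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = depth_block(3, 64)     # D1: 64 maps
        self.block2 = depth_block(64, 128)   # D2: 128 maps
        self.block3 = depth_block(128, 256)  # D3: 256 maps
        self.block4 = depth_block(256, 512)  # D4: 512 maps
        self.block5 = depth_block(512, 512)  # D5: 512 maps

    def forward(self, hha):                  # hha: 3-channel HHA-encoded depth map
        d1 = self.block1(hha)
        d2 = self.block2(d1)
        d3 = self.block3(d2)
        d4 = self.block4(d3)
        d5 = self.block5(d4)
        return d1, d2, d3, d4, d5
```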
For the processing of the RGB map:

The 6th neural network block consists of an eleventh convolution layer, a first normalization layer, an eleventh activation layer and a sixth maximum pooling layer; its input is the three-channel original RGB map, and its output is a first RGB feature map set R1 formed by 64 processed feature maps whose width and height are reduced by the stride-2 convolution and the pooling layer.

The input of the 7th neural network block is the 64 feature maps output by the 6th neural network block, and its output is 256 feature maps forming a second RGB feature map set R2. The 7th neural network block consists of three consecutive convolution blocks; each convolution block is formed by four consecutive convolution layers, where the input of the fourth convolution layer is the output of the previous convolution block (or the 64 feature maps output by the sixth maximum pooling layer of the 6th neural network block), its output is added to the output of the third convolution layer, and the result of the addition is 256 feature maps.

The 8th neural network block consists of four consecutive convolution blocks; its input is the 256 feature maps output by the 7th neural network block, and its output is 512 feature maps forming a third RGB feature map set R3.

The 9th neural network block consists of six consecutive convolution blocks; its input is the 512 feature maps output by the 8th neural network block, and its output is 1024 feature maps forming a fourth RGB feature map set R4.

The 10th neural network block consists of three consecutive convolution blocks; its input is the 1024 feature maps output by the 9th neural network block, and its output is 2048 feature maps forming a fifth RGB feature map set R5.

The RGB map is thus processed by the 6th to 10th neural network blocks to obtain five RGB feature map sets R1, R2, R3, R4, R5;
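The channel counts of the RGB branch (64, 256, 512, 1024 and 2048 feature maps) match the five stages of a standard ResNet-50, so one way to sketch this branch is with a torchvision backbone; this is an assumption consistent with, but not explicitly named in, the text:

```python
import torch.nn as nn
from torchvision.models import resnet50

class RGBEncoder(nn.Module):
    # Extracts five RGB feature sets R1..R5 from a ResNet-50-style backbone,
    # whose stage widths (64, 256, 512, 1024, 2048) match the counts given above.
    def __init__(self, pretrained=True):
        super().__init__()
        net = resnet50(pretrained=pretrained)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # 6th block
        self.layer1 = net.layer1   # 7th block,  256 maps
        self.layer2 = net.layer2   # 8th block,  512 maps
        self.layer3 = net.layer3   # 9th block, 1024 maps
        self.layer4 = net.layer4   # 10th block, 2048 maps

    def forward(self, rgb):
        r1 = self.stem(rgb)        # 64 maps
        r2 = self.layer1(r1)
        r3 = self.layer2(r2)
        r4 = self.layer3(r3)
        r5 = self.layer4(r4)
        return r1, r2, r3, r4, r5
```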
Then the first depth feature map set D1 and the first RGB feature map set R1 are each processed by their respective channel attention modules (CAM) and, after a channel-number stacking operation, 128 feature maps are output as a first feature map set a; the second depth feature map set D2 and the second RGB feature map set R2 are each processed by their respective CAMs and, after channel-number stacking, 384 feature maps are output as a second feature map set b; the third depth feature map set D3 and the third RGB feature map set R3 are each processed by their respective CAMs and, after channel-number stacking, 768 feature maps are output as a third feature map set c; the fourth depth feature map set D4 and the fourth RGB feature map set R4 are each processed by their respective CAMs and, after channel-number stacking, 1536 feature maps are output as a fourth feature map set d.
the channel number stacking operation specifically refers to merging the feature maps of the output RGB or depth maps by means of channel number addition under the condition that the feature maps have the same size.
The fifth RGB feature map set R5 is processed by the IPM to obtain a perception feature map set F; the perception feature map set F and the fifth depth feature map set D5 are processed by the spatial attention module SAM, and the output of the SAM together with the fourth feature map set d is input to the 1st decoding block. The output of the 1st decoding block and the third feature map set c are channel-stacked and input to the 2nd decoding block, the output of the 2nd decoding block and the second feature map set b are channel-stacked and input to the 3rd decoding block, and the output of the 3rd decoding block and the first feature map set a are channel-stacked and input to the 4th decoding block; the output of the 4th decoding block is taken as the output of the hidden layer, i.e. the final saliency prediction map.
The channel attention module CAM works as follows: its input is a feature map set Xi, Xi ∈ (D1, D2, D3, D4, R1, R2, R3, R4). First, after a matrix shape adjustment (reshape) operation, a first adjustment map RE(Xi) is obtained; then RE(Xi) is matrix-transposed to obtain a second adjustment map RE^T(Xi); then RE^T(Xi) and RE(Xi) are matrix-multiplied to obtain a third adjustment map M(Xi), which is processed by the softmax function to obtain the attention feature map S(Xi); then RE(Xi) and S(Xi) are matrix-multiplied and the result is reshaped to obtain a fourth adjustment map SR(Xi); finally SR(Xi) is multiplied by a range parameter α and added to the input feature map set Xi, and the resulting fifth adjustment map O(Xi) is output as the output of the channel attention module CAM.
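A sketch of such a channel attention module in PyTorch is given below; it follows the common reshape/transpose/softmax formulation, the exact multiplication order being our reading of the text, with the range parameter α initialised to 0 as stated in the detailed description:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # CAM sketch: reshape, transpose, matrix multiply, softmax,
    # then rescale by a learnable parameter alpha initialised to 0.
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))   # range parameter, learned from 0

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.size()
        re = x.view(b, c, h * w)                    # RE(X):   (B, C, N)
        re_t = re.permute(0, 2, 1)                  # RE^T(X): (B, N, C)
        attn = torch.softmax(torch.bmm(re, re_t), dim=-1)     # S(X): (B, C, C)
        sr = torch.bmm(attn, re).view(b, c, h, w)             # SR(X)
        return self.alpha * sr + x                  # O(X)
```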
As shown in FIG. 2, the input of the interior perception module IPM is the fifth RGB feature map set R5 output by the 5th neural network block of the RGB branch, and its output is 1024 feature maps forming the perception feature map set F. The IPM comprises a 1st dilated convolution block, a 2nd dilated convolution block, a 3rd dilated convolution block, a 4th dilated convolution block and a first up-sampling layer. The 1st and 2nd dilated convolution blocks are connected in sequence, the output of the 1st block being the input of the 2nd block; the outputs of the 1st and 2nd blocks are channel-stacked, stacked again with the output of the 1st block, and input to the 3rd dilated convolution block. The output of the 3rd block is channel-stacked with its input, stacked again with the output of the 3rd block, and input to the 4th dilated convolution block; the output of the 4th block is fed directly to the first up-sampling layer, and the output of the first up-sampling layer is the output of the interior perception module IPM.
The method specifically comprises the following steps:
the 1 st expansion convolution block is formed by sequentially connecting a twelfth convolution layer, a second merging layer and a twelfth activation layer, and a fifth RGB feature map set R output by the 5 th neural network block of the RGB map is input5The output forms a first expansion feature map set F for 1024 feature maps1;
The 2 nd expansion volume block is formed by sequentially connecting a thirteenth volume layer, a third returning layer and a thirteenth activation layer, and is input as a first expansion feature map set F1And outputting 512 feature maps to form a second expansion feature map set F2(ii) a The first expansion feature map set F1And a second set of expansion profiles F2Performing channel number superposition to obtain 1536 feature maps as a third expansion feature map set F3And then a third expansion feature map set F3And a first set of expansion profiles F1Performing channel number superposition to obtain 2560 feature maps as a fourth expansion feature map set F4;
The 3 rd expansion volume block is formed by sequentially connecting a fourteenth volume layer, a fourth merging layer and a fourteenth active layer, and the input is F fourth expansion feature map set F4And the 1024 feature maps are output as a fifth expansion feature map set F5;
Set F of fifth expansion feature map5And a fourth set of expansion profiles F4Performing channel number superposition to obtain 3584 characteristic graphs as a sixth expansion characteristic graph set F6Then, the sixth expansion feature map set F6And a fifth set of expansion feature maps F5Performing channel overlapping to obtain 4608 characteristic maps to form a seventh expansion characteristic map set F7;
4 thEach expansion volume block is formed by sequentially connecting a fifteenth volume layer, a fifth returning layer and a fifteenth activation layer, and is input into a seventh expansion feature map set F72048 feature maps are output as an eighth expansion feature map set F8;
The input of the first upsampling layer is an eighth expansion feature map set F8The output is 1024 characteristic graphs, and the width of each graph isHas a height of
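The following sketch assembles the IPM under the channel counts listed above; because a plain up-sampling layer cannot change the number of channels, a 1 × 1 convolution (our assumption, not stated in the text) is added to produce the 1024-map output F:

```python
import torch
import torch.nn as nn

def dilated_block(in_ch, out_ch):
    # 3x3 dilated conv (dilation 2, padding 2) + batch norm + ReLU, as described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=2, dilation=2),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class InteriorPerceptionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = dilated_block(2048, 1024)    # F1
        self.block2 = dilated_block(1024, 512)     # F2
        self.block3 = dilated_block(2560, 1024)    # F4 -> F5
        self.block4 = dilated_block(4608, 2048)    # F7 -> F8
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        # nn.Upsample keeps the channel count, so a 1x1 conv (our assumption)
        # maps 2048 channels down to the 1024-map output F.
        self.reduce = nn.Conv2d(2048, 1024, kernel_size=1)

    def forward(self, r5):                         # r5: fifth RGB feature set, 2048 maps
        f1 = self.block1(r5)
        f2 = self.block2(f1)
        f3 = torch.cat([f1, f2], dim=1)            # 1536 maps
        f4 = torch.cat([f3, f1], dim=1)            # 2560 maps
        f5 = self.block3(f4)                       # 1024 maps
        f6 = torch.cat([f5, f4], dim=1)            # 3584 maps
        f7 = torch.cat([f6, f5], dim=1)            # 4608 maps
        f8 = self.block4(f7)                       # 2048 maps
        return self.reduce(self.up(f8))            # F: 1024 maps
```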
As shown in fig. 3, the spatial attention module SAM is mainly composed of a sixteenth convolution layer, a sixth normalization layer, a sixteenth activation layer and a second up-sampling layer. The input of the sixteenth convolution layer is the fifth depth feature map set D5 output by the 5th neural network block of the depth branch; the processed and up-sampled depth features are matrix-multiplied with the perception feature map set F, passed through the softmax activation function, matrix-multiplied with F again and multiplied by a range parameter β to obtain a feature map set S4; S4 is finally added to the up-sampled depth features to output the attention feature set S5 as the output of the spatial attention module SAM.
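A sketch of one possible reading of the SAM follows; the channel counts and multiplication order are interpretations of the text, and the module output is assumed to have 1024 maps so that the 1st decoding block receives 1536 + 1024 = 2560 maps, as stated in the detailed description:

```python
import torch
import torch.nn as nn

class SpatialAttentionModule(nn.Module):
    # SAM sketch: depth features attend over the refined RGB features F from the IPM.
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(512, 1024, kernel_size=3, padding=1),
            nn.BatchNorm2d(1024), nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.beta = nn.Parameter(torch.zeros(1))    # range parameter, learned from 0

    def forward(self, d5, f):                       # d5: (B, 512, h, w); f: (B, 1024, 2h, 2w)
        s = self.up(self.conv(d5))                  # (B, 1024, 2h, 2w), aligned with F
        b, c, h, w = s.size()
        s_flat = s.view(b, c, h * w)
        f_flat = f.view(b, c, h * w)
        attn = torch.softmax(torch.bmm(s_flat, f_flat.permute(0, 2, 1)), dim=-1)  # (B, C, C)
        s4 = self.beta * torch.bmm(attn, f_flat).view(b, c, h, w)
        return s + s4                               # SAM output S5, 1024 maps
```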
The 1st decoding block is mainly formed by sequentially connecting a first fusion layer, a seventeenth convolution layer, a seventh normalization layer, a seventeenth activation layer, an eighteenth convolution layer, an eighth normalization layer, an eighteenth activation layer and a third up-sampling layer. The 2nd decoding block is mainly formed by sequentially connecting a second fusion layer, a nineteenth convolution layer, a ninth normalization layer, a nineteenth activation layer, a twentieth convolution layer, a tenth normalization layer, a twentieth activation layer and a fourth up-sampling layer. The 3rd decoding block is mainly formed by sequentially connecting a third fusion layer, a twenty-first convolution layer, an eleventh normalization layer, a twenty-first activation layer, a twenty-second convolution layer, a twelfth normalization layer, a twenty-second activation layer and a fifth up-sampling layer. The 4th decoding block is mainly formed by sequentially connecting a fourth fusion layer, a twenty-third convolution layer, a thirteenth normalization layer, a twenty-third activation layer, a twenty-fourth convolution layer, a fourteenth normalization layer, a twenty-fourth activation layer, a twenty-fifth convolution layer, a fifteenth normalization layer, a twenty-fifth activation layer and a sixth up-sampling layer.
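A generic decoding-block sketch matching this fuse / convolve / up-sample layout is shown below; the example channel configuration follows the counts given for the 1st decoding block in the detailed description, and the 4th decoding block would add a third convolution stage:

```python
import torch
import torch.nn as nn

class DecodingBlock(nn.Module):
    # Generic decoding block: fuse by channel concatenation, refine with two
    # conv-BN-ReLU stages, then up-sample by 2.
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, skip, x):
        fused = torch.cat([skip, x], dim=1)   # e.g. feature set d (1536) + SAM output (1024)
        return self.up(self.refine(fused))

# Example configuration for the 1st decoding block: 2560 -> 1024 -> 512 maps.
decoder1 = DecodingBlock(in_ch=2560, mid_ch=1024, out_ch=512)
```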
The invention has the advantage that an asymmetric encoding structure is adopted to fully extract features from the RGB and depth maps; the rich image information of the RGB map is effectively exploited after the interior perception module is added; channel and spatial attention mechanisms are added to strengthen the expression of salient regions and salient features; and finally multi-scale, multi-level feature fusion is performed in the decoding stage, improving the accuracy of visual saliency detection.
Compared with the prior art, the invention has the advantages that:
When constructing the convolutional neural network, the method adopts an asymmetric encoding structure: the depth map is used as complementary information to the RGB map, and the RGB information and the depth information are extracted by different backbone networks, so that the information of the original stereo image and of the depth map can be fully extracted and multi-level feature maps are obtained;

the method adopts the IPM (Interior Perception Module), which takes the output of the RGB encoding network as input and performs adaptive feature refinement on the input feature maps so as to capture richer RGB feature information, thereby improving the final visual saliency detection accuracy;

the method adopts the SAM (Spatial Attention Module), which takes the output of the depth-map encoding structure as input and can effectively combine the multi-scale depth information with the refined RGB information, preserving the spatial details of the features, enhancing the expression of salient regions and improving the accuracy of saliency detection.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a diagram of the interior perception module (IPM) implementation of the method of the present invention;
FIG. 3 is a diagram of a spatial attention mechanism module (SAM) implementation of the method of the present invention;
fig. 4(a) is a true eye gaze view corresponding to the 1 st original stereo image of the same scene;
fig. 4(b) is a saliency detection map obtained by detecting the original stereo image shown in fig. 4(a) by using the method of the present invention;
fig. 5(a) is a true eye gaze view corresponding to the 2 nd original stereo image of the same scene;
fig. 5(b) is a saliency detection map obtained by detecting the original stereo image shown in fig. 5(a) by using the method of the present invention;
fig. 6(a) is a true eye gaze view corresponding to the 3 rd original stereo image of the same scene;
FIG. 6(b) is a saliency detection map obtained by detecting the original stereo image shown in FIG. 6(a) by the method of the present invention;
fig. 7(a) is a true eye gaze view corresponding to the 4 th original stereo image of the same scene;
fig. 7(b) is a saliency detection map obtained by detecting the original stereo image shown in fig. 7(a) by the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The process of the embodiment of the invention is shown in fig. 1, and comprises a training stage and a testing stage:
the specific steps of the training phase process are as follows:
step ① _1, selecting RGB images and depth images of N original stereo images and corresponding real human eye annotation images to form a training set, N ∈ { N }+| n is more than or equal to 200}, and the RGB graph of the ith (i is more than or equal to n and less than or equal to n) original stereo image in the training set is recorded asThe depth map corresponding to the original stereo image is recorded asThe real eye annotation view corresponding to the original stereo image and the depth map is marked as { Gi(x, y) }, wherein (x, y) represents the coordinate position of a pixel point, W represents the width of the original stereo image, H represents the height of the original stereo image, x is more than or equal to 1 and less than or equal to W, and y is more than or equal to 1 and less than or equal to H; and using the existing HHA method (Horizontal disparity, height above ground, and the pixel's local masks with the updated depth direction, i.e. the one-hot encoding technique) to train the concentrated depth mapProcessed as a set H with three channels as the original stereo image (RGB map)i(ii) a In the data set in the experiment, 420 images in a visual saliency detection data set NUS and 332 images in NCTU are selected as training sets, 60 NUS images and 48 NCTU images are selected as verification sets, and the remaining 95 NUS images and 120 NCTU images are selected as test sets;
Step ①_2: the constructed convolutional neural network comprises an input layer, a hidden layer and an output layer;
The input end of the input layer receives the RGB map and the corresponding depth map of the original stereo image; the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original input image, and the output of the input layer is the input of the hidden layer. The depth map is processed by HHA encoding so that it has three channels like the RGB map, i.e. the depth map is turned into three components after the input layer; the width of the input original stereo image is W and its height is H;
The components of the hidden layer are as follows: the 1st to 10th neural network blocks, the channel attention modules CAM, the interior perception module IPM, the spatial attention module SAM, and the 1st, 2nd, 3rd and 4th decoding blocks;
For the processing of the depth map, the input of the 1st neural network block is the three-channel HHA-encoded image and the output is 64 processed feature maps, each of width W/2 and height H/2. The first and second convolution layers each have 64 convolution kernels (filters) of size (kernel_size) 3 × 3 with zero-padding (padding) 1; the first and second activation layers use the ReLU function; the first maximum pooling layer has pooling size (pool_size) 2 and stride 2.

The 2nd neural network block consists of a third convolution layer, a third activation layer, a fourth convolution layer, a fourth activation layer and a second maximum pooling layer. Its input is the 64 feature maps output by the 1st neural network block and its output is 128 feature maps, each of width W/4 and height H/4. The third and fourth convolution layers each have 128 convolution kernels of size 3 × 3 with zero-padding 1; the third and fourth activation layers use the ReLU function; the second maximum pooling layer has pooling size 2 and stride 2.

The input of the 3rd neural network block is the 128 feature maps output by the 2nd neural network block and its output is 256 feature maps, each of width W/8 and height H/8. The fifth and sixth convolution layers each have 256 convolution kernels of size 3 × 3 with zero-padding 1; the fifth and sixth activation layers use the ReLU function; the third maximum pooling layer has pooling size 2 and stride 2.

The input of the 4th neural network block is the 256 feature maps output by the 3rd neural network block and its output is 512 feature maps, each of width W/16 and height H/16. The seventh and eighth convolution layers each have 512 convolution kernels of size 3 × 3 with zero-padding 1; the seventh and eighth activation layers use the ReLU function; the fourth maximum pooling layer has pooling size 2 and stride 2.

The input of the 5th neural network block is the 512 feature maps output by the 4th neural network block and its output is 512 feature maps, each of width W/32 and height H/32. The ninth and tenth convolution layers each have 512 convolution kernels of size 3 × 3 with zero-padding 1; the ninth and tenth activation layers use the ReLU function; the fifth maximum pooling layer has pooling size 2 and stride 2. The 5 feature map sets obtained by processing the depth map are denoted D1, D2, D3, D4, D5.
For the processing of the RGB map, the input of the 6th neural network block is the three-channel original RGB map and the output is 64 processed feature maps of correspondingly reduced width and height. The 6th neural network block consists of an eleventh convolution layer, a first normalization layer, an eleventh activation layer and a sixth maximum pooling layer; the eleventh convolution layer has 64 convolution kernels (filters) of size (kernel_size) 7 × 7 with zero-padding (padding) 3 and stride 2.

The input of the 7th neural network block is the 64 feature maps output by the 6th neural network block and the output is 256 feature maps. The 7th neural network block consists of 3 convolution blocks, each containing 4 convolution layers: the first convolution layer takes the 64 feature maps from the previous stage and outputs 64 feature maps (64 kernels, size 1 × 1, stride 1); the second convolution layer takes these 64 maps and outputs 64 feature maps (64 kernels, size 3 × 3, padding 1, stride 1); the third convolution layer takes these 64 maps and outputs 256 feature maps (256 kernels, size 1 × 1, stride 1); the fourth convolution layer takes the 64 feature maps output by the previous convolution block (or by the sixth maximum pooling layer) and outputs 256 feature maps (256 kernels, size 1 × 1, stride 1), which are added to the output of the third convolution layer.

The 8th neural network block consists of 4 convolution blocks; its input is the 256 feature maps output by the 7th neural network block and its output is 512 feature maps. Each convolution block contains 4 convolution layers: the first convolution layer takes the input feature maps and outputs 128 feature maps (128 kernels, size 1 × 1, stride 1); the second convolution layer outputs 128 feature maps (128 kernels, size 3 × 3, padding 1, stride 1); the third convolution layer outputs 512 feature maps (512 kernels, size 1 × 1, stride 1); the fourth convolution layer takes the feature maps output by the previous convolution block (or the previous neural network block) and outputs 512 feature maps (512 kernels, size 1 × 1, stride 2), which are added to the output of the third convolution layer.

The 9th neural network block consists of 6 convolution blocks; its input is the 512 feature maps output by the 8th neural network block and its output is 1024 feature maps. Each convolution block contains 4 convolution layers: the first convolution layer outputs 256 feature maps (256 kernels, size 1 × 1, stride 1); the second convolution layer outputs 256 feature maps (256 kernels, size 3 × 3, padding 1, stride 1); the third convolution layer outputs 1024 feature maps (1024 kernels, size 1 × 1, stride 1); the fourth convolution layer takes the feature maps output by the previous convolution block (or the previous neural network block) and outputs 1024 feature maps (1024 kernels, size 1 × 1, stride 2), which are added to the output of the third convolution layer.

The 10th neural network block consists of 3 convolution blocks; its input is the 1024 feature maps output by the 9th neural network block and its output is 2048 feature maps. Each convolution block contains 4 convolution layers: the first convolution layer outputs 512 feature maps (512 kernels, size 1 × 1, stride 1); the second convolution layer outputs 512 feature maps (512 kernels, size 3 × 3, padding 1, stride 1); the third convolution layer outputs 2048 feature maps (2048 kernels, size 1 × 1, stride 1); the fourth convolution layer takes the feature maps output by the previous convolution block (or the previous neural network block) and outputs 2048 feature maps (2048 kernels, size 1 × 1, stride 2), which are added to the output of the third convolution layer. The 5 feature map sets obtained by processing the RGB map are denoted R1, R2, R3, R4, R5.
For the channel attention module CAM, the input is Xi, Xi ∈ (D1, D2, D3, D4, R1, R2, R3, R4), with C channels and maps of height H and width W. After a matrix shape adjustment (reshape) it is denoted RE(Xi), a C × (H × W) matrix; RE(Xi) is then matrix-transposed and denoted RE^T(Xi). RE^T(Xi) and RE(Xi) are matrix-multiplied to obtain M(Xi), which is processed by the softmax function to obtain the attention feature map S(Xi). RE(Xi) and S(Xi) are matrix-multiplied and the result is reshaped to obtain SR(Xi). SR(Xi) is multiplied by the range parameter α, which is learned by the neural network gradually starting from 0, and Xi and α × SR(Xi) are added to give the final output O(Xi).

The fusion at this stage is as follows: D1 is processed by a CAM and 64 processed feature maps D1' are output; likewise D2, D3, D4 pass through CAMs and output the processed feature map sets D2', D3', D4', containing 128, 256 and 512 feature maps respectively. R1 is processed by a CAM and 64 processed feature maps R1' are output; likewise R2, R3, R4 pass through CAMs and output the processed feature map sets R2', R3', R4', containing 256, 512 and 1024 feature maps respectively. Then D1' and R1' are channel-stacked to output 128 feature maps, denoted feature map set a; similarly, D2' and R2' are channel-stacked to output 384 feature maps, denoted feature map set b; D3' and R3' are channel-stacked to output 768 feature maps, denoted feature map set c; and D4' and R4' are channel-stacked to output 1536 feature maps, denoted feature map set d.
The interior perception module IPM consists of the 1st dilated convolution block, the 2nd dilated convolution block, the 3rd dilated convolution block, the 4th dilated convolution block and the first up-sampling layer. The input of the IPM is the 2048 feature maps output by the 5th neural network block of the RGB branch, and the output is 1024 feature maps, denoted F. The 1st dilated convolution block consists of a twelfth convolution layer, a second normalization layer and a twelfth activation layer: the twelfth convolution layer takes the 2048 feature maps output by the 5th neural network block of the RGB branch and outputs 1024 feature maps, denoted F1 (1024 kernels, dilation 2, size 3 × 3, zero-padding 2, stride 1); the second normalization layer uses batch normalization and the twelfth activation layer uses the ReLU function. The 2nd dilated convolution block consists of a thirteenth convolution layer, a third normalization layer and a thirteenth activation layer: the thirteenth convolution layer takes the 1024 feature maps output by the 1st dilated convolution block and outputs 512 feature maps, denoted F2 (512 kernels, dilation 2, size 3 × 3, zero-padding 2, stride 1); the third normalization layer uses batch normalization and the thirteenth activation layer uses the ReLU function. The fusion at this stage: F1 and F2 are channel-stacked to obtain 1536 feature maps, denoted F3, and F3 and F1 are channel-stacked to obtain 2560 feature maps, denoted F4.

The 3rd dilated convolution block consists of a fourteenth convolution layer, a fourth normalization layer and a fourteenth activation layer: the fourteenth convolution layer takes the 2560 feature maps of F4 and outputs 1024 feature maps, denoted F5 (1024 kernels, dilation 2, size 3 × 3, zero-padding 2, stride 1); the fourth normalization layer uses batch normalization and the fourteenth activation layer uses the ReLU function. The fusion at this stage: F5 and F4 are channel-stacked to obtain 3584 feature maps, denoted F6; F6 and F5 are channel-stacked to obtain 4608 feature maps, denoted F7.

The 4th dilated convolution block consists of a fifteenth convolution layer, a fifth normalization layer and a fifteenth activation layer: the fifteenth convolution layer takes the 4608 feature maps of F7 and outputs 2048 feature maps, denoted F8 (2048 kernels, dilation 2, size 3 × 3, zero-padding 2, stride 1); the fifth normalization layer uses batch normalization and the fifteenth activation layer uses the ReLU function.

The input of the first up-sampling layer is the 2048 feature maps output by the 4th dilated convolution block; its scale factor (scale_factor) is set to 2 and its output is 1024 feature maps, each with twice the width and height of the input maps.
For the spatial attention module SAM, it consists of a sixteenth convolution layer, a sixth normalization layer, a sixteenth activation layer and a second up-sampling layer. The sixteenth convolution layer takes the 512 feature maps of the fifth depth feature map set D5 output by the 5th neural network block of the depth branch and outputs 1024 feature maps (1024 kernels, size 3 × 3, zero-padding 1, stride 1); the sixth normalization layer uses batch normalization and the sixteenth activation layer uses the ReLU function; the resulting feature map set is denoted S1. The second up-sampling layer has scale factor (scale_factor) 2; its input is S1 and its output is 512 feature maps, denoted S2, each with twice the width and height. S2 is matrix-multiplied with the output F of the IPM and passed through the softmax function to obtain S3; S3 is matrix-multiplied with F and multiplied by β, which is learned by the neural network gradually starting from 0, to obtain S4; finally S2 and S4 are added to obtain the final SAM output S5.
For the 1st decoding block, it consists of a first fusion layer, a seventeenth convolution layer, a seventh normalization layer, a seventeenth activation layer, an eighteenth convolution layer, an eighth normalization layer, an eighteenth activation layer and a third up-sampling layer. The first fusion layer channel-stacks the 1536 feature maps of feature set d (the channel-stacked output of D4' and R4') with the output S5 of the SAM, outputting 2560 feature maps, denoted J1. The seventeenth convolution layer takes J1 and outputs 1024 feature maps (1024 kernels, size 3 × 3, zero-padding 1, stride 1); the seventh normalization layer uses batch normalization and the seventeenth activation layer uses the ReLU function; the result is denoted J2. The eighteenth convolution layer takes J2 and outputs 512 feature maps (512 kernels, size 3 × 3, zero-padding 1, stride 1); the eighth normalization layer uses batch normalization and the eighteenth activation layer uses the ReLU function; the result is denoted J3. The third up-sampling layer has scale factor 2; its input is J3 and its output is 512 feature maps of twice the width and height.

For the 2nd decoding block, it consists of a second fusion layer, a nineteenth convolution layer, a ninth normalization layer, a nineteenth activation layer, a twentieth convolution layer, a tenth normalization layer, a twentieth activation layer and a fourth up-sampling layer. The second fusion layer channel-stacks the 768 feature maps of feature set c (the channel-stacked output of D3' and R3') with the 512 up-sampled feature maps output by the 1st decoding block, outputting 1280 feature maps, denoted J4. The nineteenth convolution layer takes J4 and outputs 512 feature maps (512 kernels, size 3 × 3, zero-padding 1, stride 1); the ninth normalization layer uses batch normalization and the nineteenth activation layer uses the ReLU function; the result is denoted J5. The twentieth convolution layer takes J5 and outputs 256 feature maps (256 kernels, size 3 × 3, zero-padding 1, stride 1); the tenth normalization layer uses batch normalization and the twentieth activation layer uses the ReLU function; the result is denoted J6. The fourth up-sampling layer has scale factor 2; its input is J6 and its output is 256 feature maps of twice the width and height.

For the 3rd decoding block, it consists of a third fusion layer, a twenty-first convolution layer, an eleventh normalization layer, a twenty-first activation layer, a twenty-second convolution layer, a twelfth normalization layer, a twenty-second activation layer and a fifth up-sampling layer. The third fusion layer channel-stacks the 384 feature maps of feature set b (the channel-stacked output of D2' and R2') with the 256 up-sampled feature maps output by the 2nd decoding block, outputting 640 feature maps, denoted J7. The twenty-first convolution layer takes J7 and outputs 256 feature maps (256 kernels, size 3 × 3, zero-padding 1, stride 1); the eleventh normalization layer uses batch normalization and the twenty-first activation layer uses the ReLU function; the result is denoted J8. The twenty-second convolution layer takes J8 and outputs 128 feature maps (128 kernels, size 3 × 3, zero-padding 1, stride 1); the twelfth normalization layer uses batch normalization and the twenty-second activation layer uses the ReLU function; the result is denoted J9. The fifth up-sampling layer has scale factor 2; its input is J9 and its output is 128 feature maps of twice the width and height.

For the 4th decoding block, it consists of a fourth fusion layer, a twenty-third convolution layer, a thirteenth normalization layer, a twenty-third activation layer, a twenty-fourth convolution layer, a fourteenth normalization layer, a twenty-fourth activation layer, a twenty-fifth convolution layer, a fifteenth normalization layer, a twenty-fifth activation layer and a sixth up-sampling layer. The fourth fusion layer channel-stacks the 128 feature maps of feature set a (the channel-stacked output of D1' and R1') with the 128 up-sampled feature maps output by the 3rd decoding block, outputting 256 feature maps, denoted J10. The twenty-third convolution layer takes J10 and outputs 128 feature maps (128 kernels, size 3 × 3, zero-padding 1, stride 1); the thirteenth normalization layer uses batch normalization and the twenty-third activation layer uses the ReLU function; the result is denoted J11. The twenty-fourth convolution layer takes J11 and outputs 64 feature maps (64 kernels, size 3 × 3, zero-padding 1, stride 1); the fourteenth normalization layer uses batch normalization and the twenty-fourth activation layer uses the ReLU function; the result is denoted J12. The twenty-fifth convolution layer takes J12 and outputs 1 feature map (1 kernel, size 3 × 3, zero-padding 1, stride 1); the fifteenth normalization layer uses batch normalization and the twenty-fifth activation layer uses the ReLU function; the result is denoted J13. The sixth up-sampling layer has scale factor 2; its input is J13 and its output is 1 feature map of size W × H, denoted J14, which is the final saliency prediction map.
Step ①_3: the RGB maps and depth maps of the original stereo images in the training set are input into the constructed convolutional neural network for training to obtain the saliency detection map corresponding to each original stereo image; the saliency detection maps obtained after training form a set of predicted maps.
Step ①_4: compute the value of the loss function between the set of saliency detection maps obtained by training and the set formed by the corresponding ground-truth human eye fixation maps {Gi(x, y)}.
Step ①_5: steps ①_3 and ①_4 are repeated m times to obtain the convolutional neural network classification training model, giving n × m loss function values in total; the minimum of these loss function values is then found, and the corresponding weight vector and bias term are taken as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model, denoted WBest and BBest, where m > 1 (m = 50 in this experiment);
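A training-loop sketch for steps ①_3 to ①_5 is given below; the loss term is a stand-in (the patent's loss formula is not reproduced in this text), with KL divergence between the predicted map and the fixation map used only as an assumption, and all names are placeholders:

```python
import copy
import torch

def train(model, train_loader, epochs=50, lr=1e-4, device='cuda'):
    # Repeatedly run forward/backward passes and keep the weights that give
    # the smallest loss value, analogous to WBest and BBest above.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float('inf'), None
    for epoch in range(epochs):
        for rgb, hha, gt in train_loader:        # RGB map, HHA-encoded depth, fixation map
            rgb, hha, gt = rgb.to(device), hha.to(device), gt.to(device)
            pred = model(rgb, hha)               # saliency prediction map
            # Assumed loss: KL divergence between normalized prediction and ground truth.
            loss = torch.nn.functional.kl_div(
                torch.log(pred.clamp_min(1e-8) / pred.sum(dim=(2, 3), keepdim=True)),
                gt / gt.sum(dim=(2, 3), keepdim=True).clamp_min(1e-8),
                reduction='batchmean')
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:          # keep the weights with the smallest loss
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss
```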
the test stage process comprises the following specific steps:
step ② _ 1: orderRepresenting a saliency stereo RGB image to be detected and a corresponding depth image; a (x ', y') representsThe pixel value of the pixel point with the middle coordinate position (x ', y') is represented by WWidth of (A), H' representsThe height of the glass is that x 'is more than or equal to 1 and less than or equal to W', and y 'is more than or equal to 1 and less than or equal to H';
Step ②_2: the R channel component, the G channel component and the B channel component of the image to be detected are input, together with its depth encoding, into the constructed convolutional neural network training model, and a prediction is made using WBest and BBest to obtain the corresponding saliency detection image, whose value at coordinate position (x', y') is the pixel value of the pixel at that position.
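A small test-stage sketch corresponding to steps ②_1 and ②_2 follows. The 224×224 resize, the file-based loading and the assumption that the HHA-encoded depth is stored as a three-channel image are illustrative choices, not requirements stated in the text.

```python
# Test-stage sketch: load the best weights elsewhere, then predict one saliency map.
import torch
from PIL import Image
import torchvision.transforms as T

@torch.no_grad()
def predict(model, rgb_path, hha_path, device="cuda"):
    to_tensor = T.Compose([T.Resize((224, 224)), T.ToTensor()])      # assumed input resolution
    rgb = to_tensor(Image.open(rgb_path).convert("RGB")).unsqueeze(0).to(device)
    hha = to_tensor(Image.open(hha_path).convert("RGB")).unsqueeze(0).to(device)  # 3-channel HHA encoding
    model = model.to(device).eval()
    sal = model(rgb, hha)                                            # 1 x 1 x H x W prediction
    return sal.squeeze().cpu().numpy()
```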
To further verify the feasibility and effectiveness of the method of the invention, experiments were carried out.
The python-based deep learning library PyTorch 1.1.0 is used to build the convolutional neural network architecture of the attention-mechanism-based asymmetric multi-modal fusion saliency detection method. The data sets NUS and NCTU (containing 600 and 475 stereo images, respectively) are used to analyze the detection effect of the saliency images obtained by the method. In this experiment, 4 common objective parameters for evaluating saliency detection methods are used as evaluation indicators: the linear correlation coefficient (CC), the Kullback-Leibler divergence (KLDiv), the area under the receiver operating characteristic curve (AUC) and the normalized scanpath saliency (NSS).
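For reference, straightforward NumPy/scikit-learn implementations of the four objective measures (CC, KLDiv, NSS, AUC) are sketched below; the epsilon values and the fixation-density formulation of KLDiv follow common saliency-benchmark practice rather than any definition given in the text.

```python
# Common formulations of the four saliency metrics; inputs are 2-D float arrays.
import numpy as np
from sklearn.metrics import roc_auc_score

def cc(sal, gt):
    """Linear correlation coefficient between predicted and ground-truth maps."""
    sal = (sal - sal.mean()) / (sal.std() + 1e-8)
    gt = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float(np.corrcoef(sal.ravel(), gt.ravel())[0, 1])

def kldiv(sal, gt, eps=1e-8):
    """KL divergence of the prediction from the ground-truth fixation density."""
    p = gt / (gt.sum() + eps)
    q = sal / (sal.sum() + eps)
    return float(np.sum(p * np.log(p / (q + eps) + eps)))

def nss(sal, fixations):
    """Normalized scanpath saliency; `fixations` is a binary map of fixation points."""
    sal = (sal - sal.mean()) / (sal.std() + 1e-8)
    return float(sal[fixations > 0].mean())

def auc(sal, fixations):
    """Area under the ROC curve, treating fixated pixels as positives."""
    return float(roc_auc_score((fixations > 0).ravel().astype(int), sal.ravel()))
```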
Each stereo image in the two data sets NUS and NCTU is processed with the method to obtain its corresponding saliency detection image, and the linear correlation coefficient CC, the Kullback-Leibler divergence KLDiv, the AUC parameter and the normalized scanpath saliency NSS, which reflect the detection performance of the method, are listed in Table 1.
TABLE 1 evaluation results obtained by the method of the invention
As is clear from the data listed in Table 1, the detection results of the saliency detection images obtained by the method of the present invention are good, and the objective evaluation results are consistent with subjective human visual perception, which is sufficient to demonstrate the feasibility and effectiveness of the method. Fig. 4(a) shows the human eye gaze image corresponding to the 1st original stereo image of a scene in the NCTU data set, and Fig. 4(b) shows the saliency detection image obtained by the method of the present invention for that original stereo image; Fig. 5(a) shows the human eye gaze image corresponding to the 2nd original stereo image of the same scene in the NCTU data set, and Fig. 5(b) shows the saliency detection image obtained by the method of the present invention for that original stereo image; Fig. 6(a) shows the human eye gaze image corresponding to the 3rd original stereo image of a scene in the NUS data set, and Fig. 6(b) shows the saliency detection image obtained by the method of the present invention for that original stereo image; Fig. 7(a) shows the human eye gaze image corresponding to the 4th original stereo image of the same scene in the NUS data set, and Fig. 7(b) shows the saliency detection image obtained by the method of the present invention for that original stereo image.
Comparing Fig. 4(a) with Fig. 4(b), Fig. 5(a) with Fig. 5(b), Fig. 6(a) with Fig. 6(b), and Fig. 7(a) with Fig. 7(b), it can be seen that the saliency detection images obtained by the method of the present invention are predicted with improved accuracy, which is a prominent technical effect.
Claims (5)
1. An asymmetric multi-mode fusion significance detection method based on an attention mechanism is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1.1): collect and select the RGB (red, green, blue) images and depth images of n original stereo images containing target objects and form a training set together with the real human eye gaze images obtained by annotation; all depth images in the training set are processed with the HHA (horizontal disparity, height above ground, angle with gravity) encoding method into a set Hi having three channels, like the original stereo images;
Step 1.2): constructing a convolutional neural network;
Step 1.3): input the RGB images and the depth images of the original stereo images in the training set into the constructed convolutional neural network for training, obtaining the saliency detection image corresponding to each original stereo image, and record the saliency detection images obtained after training as a set;
Step 1.4): compute the value of the loss function between the set formed by the saliency detection maps obtained by training and the set formed by the corresponding real human eye gaze images {Gi(x, y)};
Step 1.5): repeatedly execute step 1.3) and step 1.4) for m iterations to obtain the convolutional neural network classification training model, giving n×m loss function values in total; then find the loss function value with the minimum value among these n×m values, and retain the weight vector and bias term of the convolutional neural network corresponding to this minimum loss value as the optimal weight vector WBest and the optimal bias term BBest of the trained convolutional neural network;
The test stage process comprises the following specific steps:
Step 2.1): input the R channel component, the G channel component and the B channel component of the RGB map of the target object to be detected, together with its depth map, into the trained convolutional neural network, and use the optimal weight vector WBest and the optimal bias term BBest to make a prediction, obtaining the corresponding saliency detection image, whose value at coordinate position (x', y') is the pixel value of the pixel at that position.
2. The asymmetric multi-modal fusion saliency detection method based on attention mechanism as claimed in claim 1, characterized in that:
the convolutional neural network in step 1.2) comprises an input layer and a hidden layer, and the output of the hidden layer is the output of the convolutional neural network;
The input end of the input layer receives the RGB image and the depth map of an original stereo image; the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the RGB image of the original stereo image together with the encoding map of the depth map, and the output of the input layer is the input of the hidden layer. The depth map is processed in the input layer by HHA encoding so that, like the RGB map, it has an encoding map with three channels; the RGB map and the depth map of the original stereo image have the same width W and height H;
the hidden layer comprises the following components: ten neural network blocks, a channel attention module, an internal perception module, a spatial attention module SAM and four decoding blocks;
for the processing of the depth map:
The 1st neural network block is composed of a first convolution layer, a first activation layer, a second convolution layer, a second activation layer and a first maximum pooling layer connected in sequence; its input is the encoding map of the depth map output by the input layer, and its output is 64 processed feature maps forming a first depth feature map set D1, each map having a correspondingly reduced width and height;
The 2nd neural network block consists of a third convolution layer, a third activation layer, a fourth convolution layer, a fourth activation layer and a second maximum pooling layer; its input is the 64 feature maps output by the 1st neural network block, and its output is 128 feature maps forming a second depth feature map set D2;
The input of the 3rd neural network block is the 128 feature maps output by the 2nd neural network block, and its output is 256 feature maps forming a third depth feature map set D3;
The input of the 4th neural network block is the 256 feature maps output by the 3rd neural network block, and its output is 512 feature maps forming a fourth depth feature map set D4;
The input of the 5th neural network block is the 512 feature maps output by the 4th neural network block, and its output is 512 feature maps forming a fifth depth feature map set D5;
The depth map is thus processed by the 1st to 5th neural network blocks to obtain five depth feature map sets, namely D1, D2, D3, D4 and D5;
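The five depth-branch blocks above follow the VGG-16 layout (64/128/256/512/512 output channels with a max-pooling layer per stage). A hedged sketch that reuses torchvision's VGG-16 feature extractor as the concrete backbone is shown below; treating the depth branch as exactly VGG-16 is an assumption consistent with, but not stated by, the channel counts.

```python
# Sketch of the depth branch (blocks 1-5) under a VGG-16 backbone assumption.
import torch.nn as nn
from torchvision.models import vgg16

class DepthEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        feats = vgg16().features          # load ImageNet weights here if desired
        # split at the five max-pooling layers -> stages producing D1..D5
        self.block1 = feats[:5]           # conv, relu, conv, relu, pool -> 64 maps
        self.block2 = feats[5:10]         # -> 128 maps
        self.block3 = feats[10:17]        # -> 256 maps
        self.block4 = feats[17:24]        # -> 512 maps
        self.block5 = feats[24:31]        # -> 512 maps

    def forward(self, hha):
        d1 = self.block1(hha)
        d2 = self.block2(d1)
        d3 = self.block3(d2)
        d4 = self.block4(d3)
        d5 = self.block5(d4)
        return d1, d2, d3, d4, d5
```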
For the processing of the RGB map:
The 6th neural network block consists of an eleventh convolution layer, a first normalization layer, an eleventh activation layer and a sixth maximum pooling layer; its input is the three-channel original RGB map, and its output is 64 processed feature maps forming a first RGB feature map set R1;
The input of the 7th neural network block is the 64 feature maps output by the 6th neural network block, and its output is 256 feature maps forming a second RGB feature map set R2. The 7th neural network block consists of three consecutive convolution blocks; each convolution block is formed by connecting four consecutive convolution layers, the input of the fourth convolution layer being the output of the third convolution layer together with the output of the previous convolution block, and after this addition the output is 256 feature maps;
The 8th neural network block consists of four consecutive convolution blocks; its input is the 256 feature maps output by the 7th neural network block, and its output is 512 feature maps forming a third RGB feature map set R3;
The 9th neural network block consists of six consecutive convolution blocks; its input is the 512 feature maps output by the 8th neural network block, and its output is 1024 feature maps forming a fourth RGB feature map set R4;
The 10th neural network block consists of three consecutive convolution blocks; its input is the 1024 feature maps output by the 9th neural network block, and its output is 2048 feature maps forming a fifth RGB feature map set R5;
The RGB map is thus processed by the 6th to 10th neural network blocks to obtain five RGB feature map sets, namely R1, R2, R3, R4 and R5;
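The five RGB-branch blocks above match the stage structure of ResNet-50 (64/256/512/1024/2048 output channels with 3/4/6/3 bottleneck blocks). The sketch below therefore slices torchvision's ResNet-50 into the 6th to 10th neural network blocks; using this backbone, with its default strides, is an assumption rather than something the text specifies.

```python
# Sketch of the RGB branch (blocks 6-10) under a ResNet-50 backbone assumption.
import torch.nn as nn
from torchvision.models import resnet50

class RGBEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50()                  # load ImageNet weights here if desired
        self.block6 = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # 64 maps
        self.block7 = net.layer1          # 3 bottleneck blocks -> 256 maps
        self.block8 = net.layer2          # 4 bottleneck blocks -> 512 maps
        self.block9 = net.layer3          # 6 bottleneck blocks -> 1024 maps
        self.block10 = net.layer4         # 3 bottleneck blocks -> 2048 maps

    def forward(self, rgb):
        r1 = self.block6(rgb)
        r2 = self.block7(r1)
        r3 = self.block8(r2)
        r4 = self.block9(r3)
        r5 = self.block10(r4)
        return r1, r2, r3, r4, r5
```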
Then the first depth feature map set D1 and the first RGB feature map set R1 are each processed by their respective channel attention module CAM and, after a channel stacking operation, 128 feature maps are output as a first feature map set a; the second depth feature map set D2 and the second RGB feature map set R2 are each processed by their respective channel attention module CAM and, after channel stacking, 384 feature maps are output as a second feature map set b; the third depth feature map set D3 and the third RGB feature map set R3 are each processed by their respective channel attention module CAM and, after channel stacking, 768 feature maps are output as a third feature map set c; the fourth depth feature map set D4 and the fourth RGB feature map set R4 are each processed by their respective channel attention module CAM and, after channel stacking, 1536 feature maps are output as a fourth feature map set d;
The fifth RGB feature map set R5 yields a perception feature map set F after processing by the internal perception module IPM; the perception feature map set F and the fifth depth feature map set D5 are fed to the spatial attention module SAM, whose output is channel-stacked with the fourth feature map set d and input to the 1st decoding block; the output of the 1st decoding block is channel-stacked with the third feature map set c and input to the 2nd decoding block; the output of the 2nd decoding block is channel-stacked with the second feature map set b and input to the 3rd decoding block; the output of the 3rd decoding block is channel-stacked with the first feature map set a and input to the 4th decoding block; and the output of the 4th decoding block is the output of the hidden layer, i.e. the final saliency prediction map.
3. The asymmetric multi-modal fusion saliency detection method based on attention mechanism as claimed in claim 2, characterized in that:
the channel attention module CAM is specifically as follows: the input is a feature map set Xi, Xi ∈ (D1, D2, D3, D4, R1, R2, R3, R4); first, a matrix shape adjustment operation (reshape) gives a first adjustment map RE(Xi); then the first adjustment map RE(Xi) is matrix-transposed (transpose) to obtain a second adjustment map RE^T(Xi); then the second adjustment map RE^T(Xi) and the first adjustment map RE(Xi) are matrix-multiplied to obtain a third adjustment map M(Xi), which is processed by the softmax function to obtain the attention feature map S(Xi); then the first adjustment map RE(Xi) and the attention feature map S(Xi) are matrix-multiplied and the matrix shape is adjusted to obtain a fourth adjustment map SR(Xi); finally, the fourth adjustment map SR(Xi) is multiplied by the range parameter α and added to the input feature map set Xi, and the resulting fifth adjustment map O(Xi) is output as the output of the channel attention module CAM;
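A compact PyTorch sketch of the channel attention module CAM follows. It reads the claim in the usual Gram-matrix form (reshape, transpose, channel-by-channel similarity, softmax, re-weighting, scaling by a learned α and a residual addition); where the wording of the multiplication order is ambiguous, this interpretation is an assumption.

```python
# Channel attention module sketch, following the standard channel-attention formulation.
import torch
import torch.nn as nn

class CAM(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))    # range parameter alpha, learned

    def forward(self, x):                            # x: B x C x H x W
        b, c, h, w = x.shape
        re = x.view(b, c, h * w)                     # first adjustment map RE(X)
        re_t = re.permute(0, 2, 1)                   # second adjustment map RE^T(X)
        m = torch.bmm(re, re_t)                      # third adjustment map M(X): B x C x C
        s = torch.softmax(m, dim=-1)                 # attention feature map S(X)
        sr = torch.bmm(s, re).view(b, c, h, w)       # fourth adjustment map SR(X)
        return self.alpha * sr + x                   # fifth adjustment map O(X)
```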
the input of the internal perception module IPM is the fifth RGB feature map set R5 output by the 5th neural network block of the RGB branch, and its output is 1024 feature maps forming the perception feature map set F; the internal perception module IPM comprises a 1st dilated convolution block, a 2nd dilated convolution block, a 3rd dilated convolution block, a 4th dilated convolution block and a first up-sampling layer: the output of the 1st dilated convolution block is input to the 2nd dilated convolution block; the outputs of the 1st and 2nd dilated convolution blocks are channel-stacked and input to the 3rd dilated convolution block; the output of the 3rd dilated convolution block and its input are channel-stacked and input to the 4th dilated convolution block; the output of the 4th dilated convolution block is fed directly to the first up-sampling layer, whose output is the output of the internal perception module IPM.
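The sketch below illustrates the internal perception module IPM: four dilated ("expansion") convolution blocks wired with the channel-stacked connections described above, followed by a 2× up-sampling layer. The dilation rates (2/4/8/16), the 1024-channel intermediate width and the bilinear up-sampling mode are illustrative assumptions; the claim fixes only the connectivity and the 2048-in/1024-out interface.

```python
# Internal perception module sketch with assumed dilation rates and channel widths.
import torch
import torch.nn as nn

def dilated_block(in_ch, out_ch, dilation):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class IPM(nn.Module):
    def __init__(self, in_ch=2048, mid_ch=1024):
        super().__init__()
        self.d1 = dilated_block(in_ch, mid_ch, dilation=2)
        self.d2 = dilated_block(mid_ch, mid_ch, dilation=4)
        self.d3 = dilated_block(2 * mid_ch, mid_ch, dilation=8)    # input: cat(out1, out2)
        self.d4 = dilated_block(3 * mid_ch, mid_ch, dilation=16)   # input: cat(out3, input of d3)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, r5):                   # r5: B x 2048 x h x w
        o1 = self.d1(r5)
        o2 = self.d2(o1)
        i3 = torch.cat([o1, o2], dim=1)
        o3 = self.d3(i3)
        o4 = self.d4(torch.cat([o3, i3], dim=1))
        return self.up(o4)                   # perception feature map set F, 1024 maps
```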
4. The asymmetric multi-modal fusion saliency detection method based on attention mechanism as claimed in claim 2, characterized in that: the spatial attention module SAM is mainly composed of a sixteenth convolution layer, a sixth normalization layer, a sixteenth activation layer and a second up-sampling layer; the input of the sixteenth convolution layer is the fifth depth feature map set D5 output by the 5th neural network block of the depth branch; the output, after a softmax activation function, is matrix-multiplied with the perception feature map set F and then multiplied by the range parameter β to obtain a feature map set S4; the feature map set S4 is finally added to the fifth depth feature map set D5 output by the 5th neural network block of the depth branch, and the resulting attention feature set S5 is the output of the spatial attention module SAM.
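The spatial attention module SAM leaves several shape details open (the output width of the sixteenth convolution layer and how F and D5 are aligned before the final addition), so the sketch below makes explicit assumptions: D5 is projected to the channel width of F, the softmax is taken over spatial positions, and the attended features are added back to the projected depth features rather than to raw D5. It should be read as one possible interpretation, not the definitive construction.

```python
# Spatial attention module sketch under explicitly assumed shape conventions.
import torch
import torch.nn as nn

class SAM(nn.Module):
    def __init__(self, depth_ch=512, feat_ch=1024):
        super().__init__()
        self.proj = nn.Sequential(                   # 16th conv + 6th normalization + 16th activation
            nn.Conv2d(depth_ch, feat_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_ch),
            nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)  # 2nd up-sampling
        self.beta = nn.Parameter(torch.zeros(1))     # range parameter beta, learned

    def forward(self, d5, f):                        # d5: B x 512 x h x w, f: B x 1024 x 2h x 2w
        d = self.up(self.proj(d5))                   # project and align depth features with F
        b, c, h, w = d.shape
        attn = torch.softmax(d.view(b, c, h * w), dim=-1).view(b, c, h, w)  # spatial softmax
        s4 = self.beta * (attn * f)                  # attention-weighted perception features
        return s4 + d                                # attention feature set S5
```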
5. The asymmetric multi-modal fusion saliency detection method based on attention mechanism as claimed in claim 2, characterized in that: the 1st decoding block is mainly formed by sequentially connecting a first fusion layer, a seventeenth convolution layer, a seventh normalization layer, a seventeenth activation layer, an eighteenth convolution layer, an eighth normalization layer, an eighteenth activation layer and a third up-sampling layer; the 2nd decoding block is mainly formed by sequentially connecting a second fusion layer, a nineteenth convolution layer, a ninth normalization layer, a nineteenth activation layer, a twentieth convolution layer, a tenth normalization layer, a twentieth activation layer and a fourth up-sampling layer; the 3rd decoding block is mainly formed by sequentially connecting a third fusion layer, a twenty-first convolution layer, an eleventh normalization layer, a twenty-first activation layer, a twenty-second convolution layer, a twelfth normalization layer, a twenty-second activation layer and a fifth up-sampling layer; the 4th decoding block is mainly formed by sequentially connecting a fourth fusion layer, a twenty-third convolution layer, a thirteenth normalization layer, a twenty-third activation layer, a twenty-fourth convolution layer, a fourteenth normalization layer, a twenty-fourth activation layer, a twenty-fifth convolution layer, a fifteenth normalization layer, a twenty-fifth activation layer and a sixth up-sampling layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010291052.4A CN111563418A (en) | 2020-04-14 | 2020-04-14 | Asymmetric multi-mode fusion significance detection method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111563418A true CN111563418A (en) | 2020-08-21 |
Family
ID=72067830
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010291052.4A Withdrawn CN111563418A (en) | 2020-04-14 | 2020-04-14 | Asymmetric multi-mode fusion significance detection method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111563418A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111768375B (en) * | 2020-06-24 | 2022-07-26 | 海南大学 | Asymmetric GM multi-mode fusion significance detection method and system based on CWAM |
CN111768375A (en) * | 2020-06-24 | 2020-10-13 | 海南大学 | Asymmetric GM multi-mode fusion significance detection method and system based on CWAM |
CN112465746A (en) * | 2020-11-02 | 2021-03-09 | 新疆天维无损检测有限公司 | Method for detecting small defects in radiographic film |
CN112465746B (en) * | 2020-11-02 | 2024-03-05 | 新疆天维无损检测有限公司 | Method for detecting small defects in ray film |
CN112837262A (en) * | 2020-12-04 | 2021-05-25 | 国网宁夏电力有限公司检修公司 | Method, medium and system for detecting opening and closing states of disconnecting link |
CN112509046A (en) * | 2020-12-10 | 2021-03-16 | 电子科技大学 | Weak supervision convolutional neural network image target positioning method |
CN112597996A (en) * | 2020-12-28 | 2021-04-02 | 山西云时代研发创新中心有限公司 | Task-driven natural scene-based traffic sign significance detection method |
CN112597996B (en) * | 2020-12-28 | 2024-03-29 | 山西云时代研发创新中心有限公司 | Method for detecting traffic sign significance in natural scene based on task driving |
CN112861733A (en) * | 2021-02-08 | 2021-05-28 | 电子科技大学 | Night traffic video significance detection method based on space-time double coding |
CN112861733B (en) * | 2021-02-08 | 2022-09-02 | 电子科技大学 | Night traffic video significance detection method based on space-time double coding |
CN113033630A (en) * | 2021-03-09 | 2021-06-25 | 太原科技大学 | Infrared and visible light image deep learning fusion method based on double non-local attention models |
CN113222003B (en) * | 2021-05-08 | 2023-08-01 | 北方工业大学 | Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D |
CN113222003A (en) * | 2021-05-08 | 2021-08-06 | 北方工业大学 | RGB-D-based indoor scene pixel-by-pixel semantic classifier construction method and system |
CN113283435B (en) * | 2021-05-14 | 2023-08-22 | 陕西科技大学 | Remote sensing image semantic segmentation method based on multi-scale attention fusion |
CN113283435A (en) * | 2021-05-14 | 2021-08-20 | 陕西科技大学 | Remote sensing image semantic segmentation method based on multi-scale attention fusion |
CN113657534A (en) * | 2021-08-24 | 2021-11-16 | 北京经纬恒润科技股份有限公司 | Classification method and device based on attention mechanism |
CN114445442A (en) * | 2022-01-28 | 2022-05-06 | 杭州电子科技大学 | Multispectral image semantic segmentation method based on asymmetric cross fusion |
CN114445442B (en) * | 2022-01-28 | 2022-12-02 | 杭州电子科技大学 | Multispectral image semantic segmentation method based on asymmetric cross fusion |
CN115222629A (en) * | 2022-08-08 | 2022-10-21 | 西南交通大学 | Single remote sensing image cloud removing method based on cloud thickness estimation and deep learning |
CN118297950A (en) * | 2024-06-06 | 2024-07-05 | 北斗数字信息产业发展(辽宁)有限公司 | Stereoscopic image quality evaluation method and device based on stereoscopic vision perception mechanism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111563418A (en) | Asymmetric multi-mode fusion significance detection method based on attention mechanism | |
CN110555434B (en) | Method for detecting visual saliency of three-dimensional image through local contrast and global guidance | |
CN111080629B (en) | Method for detecting image splicing tampering | |
CN110175986B (en) | Stereo image visual saliency detection method based on convolutional neural network | |
CN110619638A (en) | Multi-mode fusion significance detection method based on convolution block attention module | |
CN107944442B (en) | Based on the object test equipment and method for improving convolutional neural networks | |
CN110059728B (en) | RGB-D image visual saliency detection method based on attention model | |
CN113449727A (en) | Camouflage target detection and identification method based on deep neural network | |
CN106462771A (en) | 3D image significance detection method | |
CN110929736A (en) | Multi-feature cascade RGB-D significance target detection method | |
CN110705566B (en) | Multi-mode fusion significance detection method based on spatial pyramid pool | |
CN110210492B (en) | Stereo image visual saliency detection method based on deep learning | |
CN113112416B (en) | Semantic-guided face image restoration method | |
CN110827312B (en) | Learning method based on cooperative visual attention neural network | |
CN112149662A (en) | Multi-mode fusion significance detection method based on expansion volume block | |
CN113449691A (en) | Human shape recognition system and method based on non-local attention mechanism | |
CN110458178A (en) | The multi-modal RGB-D conspicuousness object detection method spliced more | |
CN115588190A (en) | Mature fruit identification and picking point positioning method and device | |
CN114463492A (en) | Adaptive channel attention three-dimensional reconstruction method based on deep learning | |
CN113610905B (en) | Deep learning remote sensing image registration method based on sub-image matching and application | |
CN116883679B (en) | Ground object target extraction method and device based on deep learning | |
CN107909565A (en) | Stereo-picture Comfort Evaluation method based on convolutional neural networks | |
CN115202477A (en) | AR (augmented reality) view interaction method and system based on heterogeneous twin network | |
CN111539434B (en) | Infrared weak and small target detection method based on similarity | |
CN117495718A (en) | Multi-scale self-adaptive remote sensing image defogging method |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | WW01 | Invention patent application withdrawn after publication | Application publication date: 20200821