CN111563418A - Asymmetric multi-modal fusion saliency detection method based on an attention mechanism
- Publication number
- CN111563418A (application number CN202010291052.4A)
- Authority
- CN
- China
- Prior art keywords
- layer
- output
- neural network
- block
- input
- Prior art date
- Legal status
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an asymmetric multi-modal fusion saliency detection method based on an attention mechanism. The RGB image and the depth image of an original stereo image are input into a convolutional neural network for training to obtain a corresponding saliency detection map. The optimal weight vector and bias term of the convolutional neural network training model are obtained by computing a loss function between the set of saliency detection maps generated by the model and the set of corresponding ground-truth human eye fixation maps. The stereo images in the selected data set are then input into the trained convolutional neural network model to obtain their saliency detection maps. The invention adopts an asymmetric encoding structure to fully extract features from the RGB and depth maps, effectively exploits the rich image information of the RGB map through an interior perception module, and adds channel and spatial attention mechanisms, which strengthens the expression of salient regions and salient features and improves the accuracy of visual saliency detection.
Description
Technical Field
The invention relates to a deep-learning-based visual saliency detection method, and in particular to an asymmetric multi-modal fusion saliency detection method based on an attention mechanism.
Background
When looking for an object of interest in an image, a person automatically captures semantic information between the object and its context, gives high attention to salient objects, and selectively suppresses unimportant factors. This precise visual attention mechanism has been explained in various biologically inspired models. The purpose of saliency detection is to automatically detect the most informative and attractive parts of an image. In many image applications, such as image quality assessment, semantic segmentation and image recognition, determining salient objects can not only reduce computational cost but also improve the performance of the model. Early saliency detection methods relied on hand-crafted features: human eye gaze was approximated empirically, mainly from image color, texture, contrast and the like. As saliency research progressed, it was found that such hand-crafted features are not sufficient to capture image features well, because they cannot extract the high-level semantics of object features and their surroundings in the image. Image features can therefore be better extracted with deep learning methods, yielding better saliency detection results. Most existing saliency detection methods adopt deep learning and extract image features with a combination of convolution and pooling layers; however, the features obtained by simply using convolution and pooling operations are not sufficiently representative, and pooling in particular loses feature information of the image, so the resulting saliency prediction maps are of poor quality and low prediction accuracy.
Disclosure of Invention
In order to solve the problems in the background art, the technical problem addressed by the invention is to provide an asymmetric multi-modal fusion saliency detection method based on an attention mechanism with high detection accuracy.
The technical scheme adopted by the invention to solve this technical problem is as follows: an asymmetric multi-modal fusion saliency detection method based on an attention mechanism, characterized by comprising a training stage and a testing stage;
the training stage is as follows: a convolutional neural network is constructed whose input layer receives the RGB image (i.e. the RGB color image) and the corresponding depth image of an original stereo image; the RGB image and the depth image of the original stereo image are input into the convolutional neural network for training to obtain a corresponding saliency detection map; the optimal weight vector and bias term of the convolutional neural network classification training model are obtained by computing a loss function between the set formed by the saliency detection maps generated by the model and the set formed by the corresponding ground-truth human eye fixation maps; and the stereo images in the selected data set are input into the trained convolutional neural network model to obtain their saliency detection images.
The specific steps of the training phase process are as follows:

Step 1.1): collect and select the RGB images and depth images of n original stereo images containing target objects, together with the ground-truth human eye fixation maps obtained by annotation, to form a training set; the HHA method (horizontal disparity, height above ground, and angle between the local surface normal and the inferred gravity direction) is used to process every depth map in the training set into a three-channel set Hi, so that it has three channels like the original stereo image;
The original stereo image is specifically an image of a static scene used for object recognition, for example vehicle or pedestrian detection from a road surveillance camera.
In the training set, the i-th (1 ≤ i ≤ n) original stereo image has an RGB map and a corresponding depth map, and the ground-truth human eye fixation map corresponding to them is denoted {Gi(x, y)}, where (x, y) represents the coordinate position of a pixel point, W represents the width of the original stereo image, H represents its height, 1 ≤ x ≤ W, and 1 ≤ y ≤ H.

Step 1.2): construct a convolutional neural network;

Step 1.3): input the RGB maps and depth maps of the original stereo images in the training set into the constructed convolutional neural network for training to obtain the saliency detection map corresponding to each original stereo image; the saliency detection maps obtained after training form a set of predicted maps.

Step 1.4): compute the value of the loss function between the set of saliency detection maps obtained by training and the set formed by the corresponding ground-truth human eye fixation maps {Gi(x, y)}.

Step 1.5): repeat step 1.3) and step 1.4) for m iterations to obtain the convolutional neural network classification training model, giving n × m loss function values in total; the minimum of these loss function values is then found, and the weight vector and bias term of the convolutional neural network corresponding to this minimum loss value are retained as the optimal weight vector WBest and the optimal bias term BBest of the trained convolutional neural network;
The test stage process comprises the following specific steps:

Step 2.1): the RGB map of the stereo image to be detected and its corresponding depth map are taken; the R channel component, the G channel component and the B channel component are input into the trained convolutional neural network, and a prediction is made using the optimal weight vector WBest and the optimal bias term BBest to obtain the corresponding saliency detection image, where (x', y') denotes the coordinate position of a pixel point, W' denotes the width of the image to be detected, H' denotes its height, 1 ≤ x' ≤ W', and 1 ≤ y' ≤ H'.
As shown in fig. 1, the convolutional neural network in step 1.2) comprises an input layer and a hidden layer, and the output of the hidden layer is the output of the convolutional neural network:

The input end of the input layer receives the RGB map and the depth map of the original stereo image; the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the RGB map together with the encoded map of the depth map, and the output of the input layer is the input of the hidden layer. In the input layer the depth map is processed by HHA encoding so that, like the RGB map, it has three channels, i.e. the depth map is turned into three components after passing through the input layer; the RGB map and the depth map of the original stereo image have the same width W and height H;

the hidden layer comprises the following components: ten neural network blocks, channel attention modules (CAM), an interior perception module (IPM), a spatial attention module (SAM) and four decoding blocks; specifically: the 1st to 10th neural network blocks, the channel attention modules CAM, the interior perception module IPM, the spatial attention module SAM, and the 1st, 2nd, 3rd and 4th decoding blocks;

for the processing of the depth map:

The 1st neural network block is composed of a first convolution layer, a first activation layer, a second convolution layer, a second activation layer and a first maximum pooling layer connected in sequence; its input is the encoded depth map output by the input layer, and its output is a first depth feature map set D1 formed by 64 processed feature maps, each of width W/2 and height H/2.

The 2nd neural network block consists of a third convolution layer, a third activation layer, a fourth convolution layer, a fourth activation layer and a second maximum pooling layer; its input is the 64 feature maps output by the 1st neural network block, and its output is 128 feature maps forming a second depth feature map set D2, each of width W/4 and height H/4.

The input of the 3rd neural network block is the 128 feature maps output by the 2nd neural network block, and its output is 256 feature maps forming a third depth feature map set D3, each of width W/8 and height H/8.

The input of the 4th neural network block is the 256 feature maps output by the 3rd neural network block, and its output is 512 feature maps forming a fourth depth feature map set D4, each of width W/16 and height H/16.

The input of the 5th neural network block is the 512 feature maps output by the 4th neural network block, and its output is 512 feature maps forming a fifth depth feature map set D5, each of width W/32 and height H/32.

The depth map is thus processed by the 1st to 5th neural network blocks to obtain five depth feature map sets D1, D2, D3, D4, D5;
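As a point of reference for the depth branch described above, the following minimal PyTorch sketch (class and layer names are ours, not the patent's) stacks two 3 × 3 convolution/ReLU pairs and a stride-2 max pooling per block, which reproduces the channel counts D1 to D5 listed above:

```python
import torch.nn as nn

def depth_block(in_ch, out_ch):
    # Two 3x3 conv + ReLU pairs followed by a stride-2 max pooling,
    # matching the per-block layout described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class DepthEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = depth_block(3, 64)     # D1: 64 maps
        self.block2 = depth_block(64, 128)   # D2: 128 maps
        self.block3 = depth_block(128, 256)  # D3: 256 maps
        self.block4 = depth_block(256, 512)  # D4: 512 maps
        self.block5 = depth_block(512, 512)  # D5: 512 maps

    def forward(self, hha):                  # hha: 3-channel HHA-encoded depth map
        d1 = self.block1(hha)
        d2 = self.block2(d1)
        d3 = self.block3(d2)
        d4 = self.block4(d3)
        d5 = self.block5(d4)
        return d1, d2, d3, d4, d5
```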
For the processing of the RGB map:

The 6th neural network block consists of an eleventh convolution layer, a first normalization layer, an eleventh activation layer and a sixth maximum pooling layer; its input is the three-channel original RGB map, and its output is a first RGB feature map set R1 formed by 64 processed feature maps whose width and height are reduced by the stride-2 convolution and the pooling layer.

The input of the 7th neural network block is the 64 feature maps output by the 6th neural network block, and its output is 256 feature maps forming a second RGB feature map set R2. The 7th neural network block consists of three consecutive convolution blocks; each convolution block is formed by four consecutive convolution layers, where the input of the fourth convolution layer is the output of the previous convolution block (or the 64 feature maps output by the sixth maximum pooling layer of the 6th neural network block), its output is added to the output of the third convolution layer, and the result of the addition is 256 feature maps.

The 8th neural network block consists of four consecutive convolution blocks; its input is the 256 feature maps output by the 7th neural network block, and its output is 512 feature maps forming a third RGB feature map set R3.

The 9th neural network block consists of six consecutive convolution blocks; its input is the 512 feature maps output by the 8th neural network block, and its output is 1024 feature maps forming a fourth RGB feature map set R4.

The 10th neural network block consists of three consecutive convolution blocks; its input is the 1024 feature maps output by the 9th neural network block, and its output is 2048 feature maps forming a fifth RGB feature map set R5.

The RGB map is thus processed by the 6th to 10th neural network blocks to obtain five RGB feature map sets R1, R2, R3, R4, R5;
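The channel counts of the RGB branch (64, 256, 512, 1024 and 2048 feature maps) match the five stages of a standard ResNet-50, so one way to sketch this branch is with a torchvision backbone; this is an assumption consistent with, but not explicitly named in, the text:

```python
import torch.nn as nn
from torchvision.models import resnet50

class RGBEncoder(nn.Module):
    # Extracts five RGB feature sets R1..R5 from a ResNet-50-style backbone,
    # whose stage widths (64, 256, 512, 1024, 2048) match the counts given above.
    def __init__(self, pretrained=True):
        super().__init__()
        net = resnet50(pretrained=pretrained)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # 6th block
        self.layer1 = net.layer1   # 7th block,  256 maps
        self.layer2 = net.layer2   # 8th block,  512 maps
        self.layer3 = net.layer3   # 9th block, 1024 maps
        self.layer4 = net.layer4   # 10th block, 2048 maps

    def forward(self, rgb):
        r1 = self.stem(rgb)        # 64 maps
        r2 = self.layer1(r1)
        r3 = self.layer2(r2)
        r4 = self.layer3(r3)
        r5 = self.layer4(r4)
        return r1, r2, r3, r4, r5
```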
Then the first depth feature map set D1 and the first RGB feature map set R1 are each processed by their respective channel attention modules (CAM) and, after a channel-number stacking operation, 128 feature maps are output as a first feature map set a; the second depth feature map set D2 and the second RGB feature map set R2 are each processed by their respective CAMs and, after channel-number stacking, 384 feature maps are output as a second feature map set b; the third depth feature map set D3 and the third RGB feature map set R3 are each processed by their respective CAMs and, after channel-number stacking, 768 feature maps are output as a third feature map set c; the fourth depth feature map set D4 and the fourth RGB feature map set R4 are each processed by their respective CAMs and, after channel-number stacking, 1536 feature maps are output as a fourth feature map set d.
the channel number stacking operation specifically refers to merging the feature maps of the output RGB or depth maps by means of channel number addition under the condition that the feature maps have the same size.
The fifth RGB feature map set R5 is processed by the IPM to obtain a perception feature map set F; the perception feature map set F and the fifth depth feature map set D5 are processed by the spatial attention module SAM, and the output of the SAM together with the fourth feature map set d is input to the 1st decoding block. The output of the 1st decoding block and the third feature map set c are channel-stacked and input to the 2nd decoding block, the output of the 2nd decoding block and the second feature map set b are channel-stacked and input to the 3rd decoding block, and the output of the 3rd decoding block and the first feature map set a are channel-stacked and input to the 4th decoding block; the output of the 4th decoding block is taken as the output of the hidden layer, i.e. the final saliency prediction map.
The channel attention module CAM works as follows: its input is a feature map set Xi, Xi ∈ (D1, D2, D3, D4, R1, R2, R3, R4). First, after a matrix shape adjustment (reshape) operation, a first adjustment map RE(Xi) is obtained; then RE(Xi) is matrix-transposed to obtain a second adjustment map RE^T(Xi); then RE^T(Xi) and RE(Xi) are matrix-multiplied to obtain a third adjustment map M(Xi), which is processed by the softmax function to obtain the attention feature map S(Xi); then RE(Xi) and S(Xi) are matrix-multiplied and the result is reshaped to obtain a fourth adjustment map SR(Xi); finally SR(Xi) is multiplied by a range parameter α and added to the input feature map set Xi, and the resulting fifth adjustment map O(Xi) is output as the output of the channel attention module CAM.
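A sketch of such a channel attention module in PyTorch is given below; it follows the common reshape/transpose/softmax formulation, the exact multiplication order being our reading of the text, with the range parameter α initialised to 0 as stated in the detailed description:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # CAM sketch: reshape, transpose, matrix multiply, softmax,
    # then rescale by a learnable parameter alpha initialised to 0.
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))   # range parameter, learned from 0

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.size()
        re = x.view(b, c, h * w)                    # RE(X):   (B, C, N)
        re_t = re.permute(0, 2, 1)                  # RE^T(X): (B, N, C)
        attn = torch.softmax(torch.bmm(re, re_t), dim=-1)     # S(X): (B, C, C)
        sr = torch.bmm(attn, re).view(b, c, h, w)             # SR(X)
        return self.alpha * sr + x                  # O(X)
```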
As shown in FIG. 2, the input of the interior perception module IPM is the fifth RGB feature map set R5 output by the 5th neural network block of the RGB branch, and its output is 1024 feature maps forming the perception feature map set F. The IPM comprises a 1st dilated convolution block, a 2nd dilated convolution block, a 3rd dilated convolution block, a 4th dilated convolution block and a first up-sampling layer. The 1st and 2nd dilated convolution blocks are connected in sequence, the output of the 1st block being the input of the 2nd block; the outputs of the 1st and 2nd blocks are channel-stacked, stacked again with the output of the 1st block, and input to the 3rd dilated convolution block. The output of the 3rd block is channel-stacked with its input, stacked again with the output of the 3rd block, and input to the 4th dilated convolution block; the output of the 4th block is fed directly to the first up-sampling layer, and the output of the first up-sampling layer is the output of the interior perception module IPM.
The method specifically comprises the following steps:
the 1 st expansion convolution block is formed by sequentially connecting a twelfth convolution layer, a second merging layer and a twelfth activation layer, and a fifth RGB feature map set R output by the 5 th neural network block of the RGB map is input5The output forms a first expansion feature map set F for 1024 feature maps1;
The 2 nd expansion volume block is formed by sequentially connecting a thirteenth volume layer, a third returning layer and a thirteenth activation layer, and is input as a first expansion feature map set F1And outputting 512 feature maps to form a second expansion feature map set F2(ii) a The first expansion feature map set F1And a second set of expansion profiles F2Performing channel number superposition to obtain 1536 feature maps as a third expansion feature map set F3And then a third expansion feature map set F3And a first set of expansion profiles F1Performing channel number superposition to obtain 2560 feature maps as a fourth expansion feature map set F4;
The 3 rd expansion volume block is formed by sequentially connecting a fourteenth volume layer, a fourth merging layer and a fourteenth active layer, and the input is F fourth expansion feature map set F4And the 1024 feature maps are output as a fifth expansion feature map set F5;
Set F of fifth expansion feature map5And a fourth set of expansion profiles F4Performing channel number superposition to obtain 3584 characteristic graphs as a sixth expansion characteristic graph set F6Then, the sixth expansion feature map set F6And a fifth set of expansion feature maps F5Performing channel overlapping to obtain 4608 characteristic maps to form a seventh expansion characteristic map set F7;
4 thEach expansion volume block is formed by sequentially connecting a fifteenth volume layer, a fifth returning layer and a fifteenth activation layer, and is input into a seventh expansion feature map set F72048 feature maps are output as an eighth expansion feature map set F8;
The input of the first upsampling layer is an eighth expansion feature map set F8The output is 1024 characteristic graphs, and the width of each graph isHas a height of
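The following sketch assembles the IPM under the channel counts listed above; because a plain up-sampling layer cannot change the number of channels, a 1 × 1 convolution (our assumption, not stated in the text) is added to produce the 1024-map output F:

```python
import torch
import torch.nn as nn

def dilated_block(in_ch, out_ch):
    # 3x3 dilated conv (dilation 2, padding 2) + batch norm + ReLU, as described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=2, dilation=2),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class InteriorPerceptionModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = dilated_block(2048, 1024)    # F1
        self.block2 = dilated_block(1024, 512)     # F2
        self.block3 = dilated_block(2560, 1024)    # F4 -> F5
        self.block4 = dilated_block(4608, 2048)    # F7 -> F8
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        # nn.Upsample keeps the channel count, so a 1x1 conv (our assumption)
        # maps 2048 channels down to the 1024-map output F.
        self.reduce = nn.Conv2d(2048, 1024, kernel_size=1)

    def forward(self, r5):                         # r5: fifth RGB feature set, 2048 maps
        f1 = self.block1(r5)
        f2 = self.block2(f1)
        f3 = torch.cat([f1, f2], dim=1)            # 1536 maps
        f4 = torch.cat([f3, f1], dim=1)            # 2560 maps
        f5 = self.block3(f4)                       # 1024 maps
        f6 = torch.cat([f5, f4], dim=1)            # 3584 maps
        f7 = torch.cat([f6, f5], dim=1)            # 4608 maps
        f8 = self.block4(f7)                       # 2048 maps
        return self.reduce(self.up(f8))            # F: 1024 maps
```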
As shown in fig. 3, the spatial attention module SAM is mainly composed of a sixteenth convolution layer, a sixth normalization layer, a sixteenth activation layer and a second up-sampling layer. The input of the sixteenth convolution layer is the fifth depth feature map set D5 output by the 5th neural network block of the depth branch; the processed and up-sampled depth features are matrix-multiplied with the perception feature map set F, passed through the softmax activation function, matrix-multiplied with F again and multiplied by a range parameter β to obtain a feature map set S4; S4 is finally added to the up-sampled depth features to output the attention feature set S5 as the output of the spatial attention module SAM.
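A sketch of one possible reading of the SAM follows; the channel counts and multiplication order are interpretations of the text, and the module output is assumed to have 1024 maps so that the 1st decoding block receives 1536 + 1024 = 2560 maps, as stated in the detailed description:

```python
import torch
import torch.nn as nn

class SpatialAttentionModule(nn.Module):
    # SAM sketch: depth features attend over the refined RGB features F from the IPM.
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(512, 1024, kernel_size=3, padding=1),
            nn.BatchNorm2d(1024), nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.beta = nn.Parameter(torch.zeros(1))    # range parameter, learned from 0

    def forward(self, d5, f):                       # d5: (B, 512, h, w); f: (B, 1024, 2h, 2w)
        s = self.up(self.conv(d5))                  # (B, 1024, 2h, 2w), aligned with F
        b, c, h, w = s.size()
        s_flat = s.view(b, c, h * w)
        f_flat = f.view(b, c, h * w)
        attn = torch.softmax(torch.bmm(s_flat, f_flat.permute(0, 2, 1)), dim=-1)  # (B, C, C)
        s4 = self.beta * torch.bmm(attn, f_flat).view(b, c, h, w)
        return s + s4                               # SAM output S5, 1024 maps
```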
The 1st decoding block is mainly formed by sequentially connecting a first fusion layer, a seventeenth convolution layer, a seventh normalization layer, a seventeenth activation layer, an eighteenth convolution layer, an eighth normalization layer, an eighteenth activation layer and a third up-sampling layer. The 2nd decoding block is mainly formed by sequentially connecting a second fusion layer, a nineteenth convolution layer, a ninth normalization layer, a nineteenth activation layer, a twentieth convolution layer, a tenth normalization layer, a twentieth activation layer and a fourth up-sampling layer. The 3rd decoding block is mainly formed by sequentially connecting a third fusion layer, a twenty-first convolution layer, an eleventh normalization layer, a twenty-first activation layer, a twenty-second convolution layer, a twelfth normalization layer, a twenty-second activation layer and a fifth up-sampling layer. The 4th decoding block is mainly formed by sequentially connecting a fourth fusion layer, a twenty-third convolution layer, a thirteenth normalization layer, a twenty-third activation layer, a twenty-fourth convolution layer, a fourteenth normalization layer, a twenty-fourth activation layer, a twenty-fifth convolution layer, a fifteenth normalization layer, a twenty-fifth activation layer and a sixth up-sampling layer.
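A generic decoding-block sketch matching this fuse / convolve / up-sample layout is shown below; the example channel configuration follows the counts given for the 1st decoding block in the detailed description, and the 4th decoding block would add a third convolution stage:

```python
import torch
import torch.nn as nn

class DecodingBlock(nn.Module):
    # Generic decoding block: fuse by channel concatenation, refine with two
    # conv-BN-ReLU stages, then up-sample by 2.
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, skip, x):
        fused = torch.cat([skip, x], dim=1)   # e.g. feature set d (1536) + SAM output (1024)
        return self.up(self.refine(fused))

# Example configuration for the 1st decoding block: 2560 -> 1024 -> 512 maps.
decoder1 = DecodingBlock(in_ch=2560, mid_ch=1024, out_ch=512)
```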
The invention has the advantage that an asymmetric encoding structure is adopted to fully extract features from the RGB and depth maps; the rich image information of the RGB map is effectively exploited after the interior perception module is added; channel and spatial attention mechanisms are added to strengthen the expression of salient regions and salient features; and finally multi-scale, multi-level feature fusion is performed in the decoding stage, improving the accuracy of visual saliency detection.
Compared with the prior art, the invention has the advantages that:
When constructing the convolutional neural network, the method adopts an asymmetric encoding structure: the depth map is used as complementary information to the RGB map, and the RGB information and the depth information are extracted by different backbone networks, so that the information of the original stereo image and of the depth map can be fully extracted and multi-level feature maps are obtained;

the method adopts the IPM (Interior Perception Module), which takes the output of the RGB encoding network as input and performs adaptive feature refinement on the input feature maps so as to capture richer RGB feature information, thereby improving the final visual saliency detection accuracy;

the method adopts the SAM (Spatial Attention Module), which takes the output of the depth-map encoding structure as input and can effectively combine the multi-scale depth information with the refined RGB information, preserving the spatial details of the features, enhancing the expression of salient regions and improving the accuracy of saliency detection.
Drawings
FIG. 1 is a block diagram of an overall implementation of the method of the present invention;
FIG. 2 is a diagram of the interior perception module (IPM) implementation of the method of the present invention;
FIG. 3 is a diagram of a spatial attention mechanism module (SAM) implementation of the method of the present invention;
fig. 4(a) is a true eye gaze view corresponding to the 1 st original stereo image of the same scene;
fig. 4(b) is a saliency detection map obtained by detecting the original stereo image shown in fig. 4(a) by using the method of the present invention;
fig. 5(a) is a true eye gaze view corresponding to the 2 nd original stereo image of the same scene;
fig. 5(b) is a saliency detection map obtained by detecting the original stereo image shown in fig. 5(a) by using the method of the present invention;
fig. 6(a) is a true eye gaze view corresponding to the 3 rd original stereo image of the same scene;
FIG. 6(b) is a saliency detection map obtained by detecting the original stereo image shown in FIG. 6(a) by the method of the present invention;
fig. 7(a) is a true eye gaze view corresponding to the 4 th original stereo image of the same scene;
fig. 7(b) is a saliency detection map obtained by detecting the original stereo image shown in fig. 7(a) by the method of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and embodiments.
The process of the embodiment of the invention is shown in fig. 1, and comprises a training stage and a testing stage:
the specific steps of the training phase process are as follows:
step ① _1, selecting RGB images and depth images of N original stereo images and corresponding real human eye annotation images to form a training set, N ∈ { N }+| n is more than or equal to 200}, and the RGB graph of the ith (i is more than or equal to n and less than or equal to n) original stereo image in the training set is recorded asThe depth map corresponding to the original stereo image is recorded asThe real eye annotation view corresponding to the original stereo image and the depth map is marked as { Gi(x, y) }, wherein (x, y) represents the coordinate position of a pixel point, W represents the width of the original stereo image, H represents the height of the original stereo image, x is more than or equal to 1 and less than or equal to W, and y is more than or equal to 1 and less than or equal to H; and using the existing HHA method (Horizontal disparity, height above ground, and the pixel's local masks with the updated depth direction, i.e. the one-hot encoding technique) to train the concentrated depth mapProcessed as a set H with three channels as the original stereo image (RGB map)i(ii) a In the data set in the experiment, 420 images in a visual saliency detection data set NUS and 332 images in NCTU are selected as training sets, 60 NUS images and 48 NCTU images are selected as verification sets, and the remaining 95 NUS images and 120 NCTU images are selected as test sets;
Step ①_2: the constructed convolutional neural network comprises an input layer, a hidden layer and an output layer;
The input end of the input layer receives the RGB map and the corresponding depth map of the original stereo image; the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the original input image, and the output of the input layer is the input of the hidden layer. The depth map is processed by HHA encoding so that it has three channels like the RGB map, i.e. the depth map is turned into three components after the input layer; the width of the input original stereo image is W and its height is H;
The components of the hidden layer are as follows: the 1st to 10th neural network blocks, the channel attention modules CAM, the interior perception module IPM, the spatial attention module SAM, and the 1st, 2nd, 3rd and 4th decoding blocks;
For the processing of the depth map, the input of the 1st neural network block is the three-channel HHA-encoded image and the output is 64 processed feature maps, each of width W/2 and height H/2. The first and second convolution layers each have 64 convolution kernels (filters) of size (kernel_size) 3 × 3 with zero-padding (padding) 1; the first and second activation layers use the ReLU function; the first maximum pooling layer has pooling size (pool_size) 2 and stride 2.

The 2nd neural network block consists of a third convolution layer, a third activation layer, a fourth convolution layer, a fourth activation layer and a second maximum pooling layer. Its input is the 64 feature maps output by the 1st neural network block and its output is 128 feature maps, each of width W/4 and height H/4. The third and fourth convolution layers each have 128 convolution kernels of size 3 × 3 with zero-padding 1; the third and fourth activation layers use the ReLU function; the second maximum pooling layer has pooling size 2 and stride 2.

The input of the 3rd neural network block is the 128 feature maps output by the 2nd neural network block and its output is 256 feature maps, each of width W/8 and height H/8. The fifth and sixth convolution layers each have 256 convolution kernels of size 3 × 3 with zero-padding 1; the fifth and sixth activation layers use the ReLU function; the third maximum pooling layer has pooling size 2 and stride 2.

The input of the 4th neural network block is the 256 feature maps output by the 3rd neural network block and its output is 512 feature maps, each of width W/16 and height H/16. The seventh and eighth convolution layers each have 512 convolution kernels of size 3 × 3 with zero-padding 1; the seventh and eighth activation layers use the ReLU function; the fourth maximum pooling layer has pooling size 2 and stride 2.

The input of the 5th neural network block is the 512 feature maps output by the 4th neural network block and its output is 512 feature maps, each of width W/32 and height H/32. The ninth and tenth convolution layers each have 512 convolution kernels of size 3 × 3 with zero-padding 1; the ninth and tenth activation layers use the ReLU function; the fifth maximum pooling layer has pooling size 2 and stride 2. The 5 feature map sets obtained by processing the depth map are denoted D1, D2, D3, D4, D5.
For the processing of the RGB map, the input of the 6th neural network block is the three-channel original RGB map and the output is 64 processed feature maps of correspondingly reduced width and height. The 6th neural network block consists of an eleventh convolution layer, a first normalization layer, an eleventh activation layer and a sixth maximum pooling layer; the eleventh convolution layer has 64 convolution kernels (filters) of size (kernel_size) 7 × 7 with zero-padding (padding) 3 and stride 2.

The input of the 7th neural network block is the 64 feature maps output by the 6th neural network block and the output is 256 feature maps. The 7th neural network block consists of 3 convolution blocks, each containing 4 convolution layers: the first convolution layer takes the 64 feature maps from the previous stage and outputs 64 feature maps (64 kernels, size 1 × 1, stride 1); the second convolution layer takes these 64 maps and outputs 64 feature maps (64 kernels, size 3 × 3, padding 1, stride 1); the third convolution layer takes these 64 maps and outputs 256 feature maps (256 kernels, size 1 × 1, stride 1); the fourth convolution layer takes the 64 feature maps output by the previous convolution block (or by the sixth maximum pooling layer) and outputs 256 feature maps (256 kernels, size 1 × 1, stride 1), which are added to the output of the third convolution layer.

The 8th neural network block consists of 4 convolution blocks; its input is the 256 feature maps output by the 7th neural network block and its output is 512 feature maps. Each convolution block contains 4 convolution layers: the first convolution layer takes the input feature maps and outputs 128 feature maps (128 kernels, size 1 × 1, stride 1); the second convolution layer outputs 128 feature maps (128 kernels, size 3 × 3, padding 1, stride 1); the third convolution layer outputs 512 feature maps (512 kernels, size 1 × 1, stride 1); the fourth convolution layer takes the feature maps output by the previous convolution block (or the previous neural network block) and outputs 512 feature maps (512 kernels, size 1 × 1, stride 2), which are added to the output of the third convolution layer.

The 9th neural network block consists of 6 convolution blocks; its input is the 512 feature maps output by the 8th neural network block and its output is 1024 feature maps. Each convolution block contains 4 convolution layers: the first convolution layer outputs 256 feature maps (256 kernels, size 1 × 1, stride 1); the second convolution layer outputs 256 feature maps (256 kernels, size 3 × 3, padding 1, stride 1); the third convolution layer outputs 1024 feature maps (1024 kernels, size 1 × 1, stride 1); the fourth convolution layer takes the feature maps output by the previous convolution block (or the previous neural network block) and outputs 1024 feature maps (1024 kernels, size 1 × 1, stride 2), which are added to the output of the third convolution layer.

The 10th neural network block consists of 3 convolution blocks; its input is the 1024 feature maps output by the 9th neural network block and its output is 2048 feature maps. Each convolution block contains 4 convolution layers: the first convolution layer outputs 512 feature maps (512 kernels, size 1 × 1, stride 1); the second convolution layer outputs 512 feature maps (512 kernels, size 3 × 3, padding 1, stride 1); the third convolution layer outputs 2048 feature maps (2048 kernels, size 1 × 1, stride 1); the fourth convolution layer takes the feature maps output by the previous convolution block (or the previous neural network block) and outputs 2048 feature maps (2048 kernels, size 1 × 1, stride 2), which are added to the output of the third convolution layer. The 5 feature map sets obtained by processing the RGB map are denoted R1, R2, R3, R4, R5.
For the channel attention module CAM, the input is Xi, Xi ∈ (D1, D2, D3, D4, R1, R2, R3, R4), with C channels and maps of height H and width W. After a matrix shape adjustment (reshape) it is denoted RE(Xi), a C × (H × W) matrix; RE(Xi) is then matrix-transposed and denoted RE^T(Xi). RE^T(Xi) and RE(Xi) are matrix-multiplied to obtain M(Xi), which is processed by the softmax function to obtain the attention feature map S(Xi). RE(Xi) and S(Xi) are matrix-multiplied and the result is reshaped to obtain SR(Xi). SR(Xi) is multiplied by the range parameter α, which is learned by the neural network gradually starting from 0, and Xi and α × SR(Xi) are added to give the final output O(Xi).

The fusion at this stage is as follows: D1 is processed by a CAM and 64 processed feature maps D1' are output; likewise D2, D3, D4 pass through CAMs and output the processed feature map sets D2', D3', D4', containing 128, 256 and 512 feature maps respectively. R1 is processed by a CAM and 64 processed feature maps R1' are output; likewise R2, R3, R4 pass through CAMs and output the processed feature map sets R2', R3', R4', containing 256, 512 and 1024 feature maps respectively. Then D1' and R1' are channel-stacked to output 128 feature maps, denoted feature map set a; similarly, D2' and R2' are channel-stacked to output 384 feature maps, denoted feature map set b; D3' and R3' are channel-stacked to output 768 feature maps, denoted feature map set c; and D4' and R4' are channel-stacked to output 1536 feature maps, denoted feature map set d.
The interior perception module IPM consists of the 1st dilated convolution block, the 2nd dilated convolution block, the 3rd dilated convolution block, the 4th dilated convolution block and the first up-sampling layer. The input of the IPM is the 2048 feature maps output by the 5th neural network block of the RGB branch, and the output is 1024 feature maps, denoted F. The 1st dilated convolution block consists of a twelfth convolution layer, a second normalization layer and a twelfth activation layer: the twelfth convolution layer takes the 2048 feature maps output by the 5th neural network block of the RGB branch and outputs 1024 feature maps, denoted F1 (1024 kernels, dilation 2, size 3 × 3, zero-padding 2, stride 1); the second normalization layer uses batch normalization and the twelfth activation layer uses the ReLU function. The 2nd dilated convolution block consists of a thirteenth convolution layer, a third normalization layer and a thirteenth activation layer: the thirteenth convolution layer takes the 1024 feature maps output by the 1st dilated convolution block and outputs 512 feature maps, denoted F2 (512 kernels, dilation 2, size 3 × 3, zero-padding 2, stride 1); the third normalization layer uses batch normalization and the thirteenth activation layer uses the ReLU function. The fusion at this stage: F1 and F2 are channel-stacked to obtain 1536 feature maps, denoted F3, and F3 and F1 are channel-stacked to obtain 2560 feature maps, denoted F4.

The 3rd dilated convolution block consists of a fourteenth convolution layer, a fourth normalization layer and a fourteenth activation layer: the fourteenth convolution layer takes the 2560 feature maps of F4 and outputs 1024 feature maps, denoted F5 (1024 kernels, dilation 2, size 3 × 3, zero-padding 2, stride 1); the fourth normalization layer uses batch normalization and the fourteenth activation layer uses the ReLU function. The fusion at this stage: F5 and F4 are channel-stacked to obtain 3584 feature maps, denoted F6; F6 and F5 are channel-stacked to obtain 4608 feature maps, denoted F7.

The 4th dilated convolution block consists of a fifteenth convolution layer, a fifth normalization layer and a fifteenth activation layer: the fifteenth convolution layer takes the 4608 feature maps of F7 and outputs 2048 feature maps, denoted F8 (2048 kernels, dilation 2, size 3 × 3, zero-padding 2, stride 1); the fifth normalization layer uses batch normalization and the fifteenth activation layer uses the ReLU function.

The input of the first up-sampling layer is the 2048 feature maps output by the 4th dilated convolution block; its scale factor (scale_factor) is set to 2 and its output is 1024 feature maps, each with twice the width and height of the input maps.
For the spatial attention module SAM, it consists of a sixteenth convolution layer, a sixth normalization layer, a sixteenth activation layer and a second up-sampling layer. The sixteenth convolution layer takes the 512 feature maps of the fifth depth feature map set D5 output by the 5th neural network block of the depth branch and outputs 1024 feature maps (1024 kernels, size 3 × 3, zero-padding 1, stride 1); the sixth normalization layer uses batch normalization and the sixteenth activation layer uses the ReLU function; the resulting feature map set is denoted S1. The second up-sampling layer has scale factor (scale_factor) 2; its input is S1 and its output is 512 feature maps, denoted S2, each with twice the width and height. S2 is matrix-multiplied with the output F of the IPM and passed through the softmax function to obtain S3; S3 is matrix-multiplied with F and multiplied by β, which is learned by the neural network gradually starting from 0, to obtain S4; finally S2 and S4 are added to obtain the final SAM output S5.
For the 1st decoding block, it consists of a first fusion layer, a seventeenth convolution layer, a seventh normalization layer, a seventeenth activation layer, an eighteenth convolution layer, an eighth normalization layer, an eighteenth activation layer and a third up-sampling layer. The first fusion layer channel-stacks the 1536 feature maps of feature set d (the channel-stacked output of D4' and R4') with the output S5 of the SAM, outputting 2560 feature maps, denoted J1. The seventeenth convolution layer takes J1 and outputs 1024 feature maps (1024 kernels, size 3 × 3, zero-padding 1, stride 1); the seventh normalization layer uses batch normalization and the seventeenth activation layer uses the ReLU function; the result is denoted J2. The eighteenth convolution layer takes J2 and outputs 512 feature maps (512 kernels, size 3 × 3, zero-padding 1, stride 1); the eighth normalization layer uses batch normalization and the eighteenth activation layer uses the ReLU function; the result is denoted J3. The third up-sampling layer has scale factor 2; its input is J3 and its output is 512 feature maps of twice the width and height.

For the 2nd decoding block, it consists of a second fusion layer, a nineteenth convolution layer, a ninth normalization layer, a nineteenth activation layer, a twentieth convolution layer, a tenth normalization layer, a twentieth activation layer and a fourth up-sampling layer. The second fusion layer channel-stacks the 768 feature maps of feature set c (the channel-stacked output of D3' and R3') with the 512 up-sampled feature maps output by the 1st decoding block, outputting 1280 feature maps, denoted J4. The nineteenth convolution layer takes J4 and outputs 512 feature maps (512 kernels, size 3 × 3, zero-padding 1, stride 1); the ninth normalization layer uses batch normalization and the nineteenth activation layer uses the ReLU function; the result is denoted J5. The twentieth convolution layer takes J5 and outputs 256 feature maps (256 kernels, size 3 × 3, zero-padding 1, stride 1); the tenth normalization layer uses batch normalization and the twentieth activation layer uses the ReLU function; the result is denoted J6. The fourth up-sampling layer has scale factor 2; its input is J6 and its output is 256 feature maps of twice the width and height.

For the 3rd decoding block, it consists of a third fusion layer, a twenty-first convolution layer, an eleventh normalization layer, a twenty-first activation layer, a twenty-second convolution layer, a twelfth normalization layer, a twenty-second activation layer and a fifth up-sampling layer. The third fusion layer channel-stacks the 384 feature maps of feature set b (the channel-stacked output of D2' and R2') with the 256 up-sampled feature maps output by the 2nd decoding block, outputting 640 feature maps, denoted J7. The twenty-first convolution layer takes J7 and outputs 256 feature maps (256 kernels, size 3 × 3, zero-padding 1, stride 1); the eleventh normalization layer uses batch normalization and the twenty-first activation layer uses the ReLU function; the result is denoted J8. The twenty-second convolution layer takes J8 and outputs 128 feature maps (128 kernels, size 3 × 3, zero-padding 1, stride 1); the twelfth normalization layer uses batch normalization and the twenty-second activation layer uses the ReLU function; the result is denoted J9. The fifth up-sampling layer has scale factor 2; its input is J9 and its output is 128 feature maps of twice the width and height.

For the 4th decoding block, it consists of a fourth fusion layer, a twenty-third convolution layer, a thirteenth normalization layer, a twenty-third activation layer, a twenty-fourth convolution layer, a fourteenth normalization layer, a twenty-fourth activation layer, a twenty-fifth convolution layer, a fifteenth normalization layer, a twenty-fifth activation layer and a sixth up-sampling layer. The fourth fusion layer channel-stacks the 128 feature maps of feature set a (the channel-stacked output of D1' and R1') with the 128 up-sampled feature maps output by the 3rd decoding block, outputting 256 feature maps, denoted J10. The twenty-third convolution layer takes J10 and outputs 128 feature maps (128 kernels, size 3 × 3, zero-padding 1, stride 1); the thirteenth normalization layer uses batch normalization and the twenty-third activation layer uses the ReLU function; the result is denoted J11. The twenty-fourth convolution layer takes J11 and outputs 64 feature maps (64 kernels, size 3 × 3, zero-padding 1, stride 1); the fourteenth normalization layer uses batch normalization and the twenty-fourth activation layer uses the ReLU function; the result is denoted J12. The twenty-fifth convolution layer takes J12 and outputs 1 feature map (1 kernel, size 3 × 3, zero-padding 1, stride 1); the fifteenth normalization layer uses batch normalization and the twenty-fifth activation layer uses the ReLU function; the result is denoted J13. The sixth up-sampling layer has scale factor 2; its input is J13 and its output is 1 feature map of size W × H, denoted J14, which is the final saliency prediction map.
Step ①_3: the RGB maps and depth maps of the original stereo images in the training set are input into the constructed convolutional neural network for training to obtain the saliency detection map corresponding to each original stereo image; the saliency detection maps obtained after training form a set of predicted maps.
Step ①_4: compute the value of the loss function between the set of saliency detection maps obtained by training and the set formed by the corresponding ground-truth human eye fixation maps {Gi(x, y)}.
Step ①_5: steps ①_3 and ①_4 are repeated m times to obtain the convolutional neural network classification training model, giving n × m loss function values in total; the minimum of these loss function values is then found, and the corresponding weight vector and bias term are taken as the optimal weight vector and the optimal bias term of the convolutional neural network classification training model, denoted WBest and BBest, where m > 1 (m = 50 in this experiment);
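A training-loop sketch for steps ①_3 to ①_5 is given below; the loss term is a stand-in (the patent's loss formula is not reproduced in this text), with KL divergence between the predicted map and the fixation map used only as an assumption, and all names are placeholders:

```python
import copy
import torch

def train(model, train_loader, epochs=50, lr=1e-4, device='cuda'):
    # Repeatedly run forward/backward passes and keep the weights that give
    # the smallest loss value, analogous to WBest and BBest above.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state = float('inf'), None
    for epoch in range(epochs):
        for rgb, hha, gt in train_loader:        # RGB map, HHA-encoded depth, fixation map
            rgb, hha, gt = rgb.to(device), hha.to(device), gt.to(device)
            pred = model(rgb, hha)               # saliency prediction map
            # Assumed loss: KL divergence between normalized prediction and ground truth.
            loss = torch.nn.functional.kl_div(
                torch.log(pred.clamp_min(1e-8) / pred.sum(dim=(2, 3), keepdim=True)),
                gt / gt.sum(dim=(2, 3), keepdim=True).clamp_min(1e-8),
                reduction='batchmean')
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if loss.item() < best_loss:          # keep the weights with the smallest loss
                best_loss = loss.item()
                best_state = copy.deepcopy(model.state_dict())
    return best_state, best_loss
```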
the test stage process comprises the following specific steps:
step ② _ 1: orderRepresenting a saliency stereo RGB image to be detected and a corresponding depth image; a (x ', y') representsThe pixel value of the pixel point with the middle coordinate position (x ', y') is represented by WWidth of (A), H' representsThe height of the glass is that x 'is more than or equal to 1 and less than or equal to W', and y 'is more than or equal to 1 and less than or equal to H';
Step ②_2: the R channel component, the G channel component and the B channel component of the image to be detected are input, together with its depth encoding, into the constructed convolutional neural network training model, and a prediction is made using WBest and BBest to obtain the corresponding saliency detection image, whose value at coordinate position (x', y') is the pixel value of the pixel at that position.
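A small test-stage sketch corresponding to steps ②_1 and ②_2 follows. The 224×224 resize, the file-based loading and the assumption that the HHA-encoded depth is stored as a three-channel image are illustrative choices, not requirements stated in the text.

```python
# Test-stage sketch: load the best weights elsewhere, then predict one saliency map.
import torch
from PIL import Image
import torchvision.transforms as T

@torch.no_grad()
def predict(model, rgb_path, hha_path, device="cuda"):
    to_tensor = T.Compose([T.Resize((224, 224)), T.ToTensor()])      # assumed input resolution
    rgb = to_tensor(Image.open(rgb_path).convert("RGB")).unsqueeze(0).to(device)
    hha = to_tensor(Image.open(hha_path).convert("RGB")).unsqueeze(0).to(device)  # 3-channel HHA encoding
    model = model.to(device).eval()
    sal = model(rgb, hha)                                            # 1 x 1 x H x W prediction
    return sal.squeeze().cpu().numpy()
```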
To further verify the feasibility and effectiveness of the method of the invention, experiments were carried out.
The python-based deep learning library PyTorch 1.1.0 is used to build the convolutional neural network architecture of the attention-mechanism-based asymmetric multi-modal fusion saliency detection method. The data sets NUS and NCTU (containing 600 and 475 stereo images, respectively) are used to analyze the detection effect of the saliency images obtained by the method. In this experiment, 4 common objective parameters for evaluating saliency detection methods are used as evaluation indicators: the linear correlation coefficient (CC), the Kullback-Leibler divergence (KLDiv), the area under the receiver operating characteristic curve (AUC) and the normalized scanpath saliency (NSS).
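For reference, straightforward NumPy/scikit-learn implementations of the four objective measures (CC, KLDiv, NSS, AUC) are sketched below; the epsilon values and the fixation-density formulation of KLDiv follow common saliency-benchmark practice rather than any definition given in the text.

```python
# Common formulations of the four saliency metrics; inputs are 2-D float arrays.
import numpy as np
from sklearn.metrics import roc_auc_score

def cc(sal, gt):
    """Linear correlation coefficient between predicted and ground-truth maps."""
    sal = (sal - sal.mean()) / (sal.std() + 1e-8)
    gt = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float(np.corrcoef(sal.ravel(), gt.ravel())[0, 1])

def kldiv(sal, gt, eps=1e-8):
    """KL divergence of the prediction from the ground-truth fixation density."""
    p = gt / (gt.sum() + eps)
    q = sal / (sal.sum() + eps)
    return float(np.sum(p * np.log(p / (q + eps) + eps)))

def nss(sal, fixations):
    """Normalized scanpath saliency; `fixations` is a binary map of fixation points."""
    sal = (sal - sal.mean()) / (sal.std() + 1e-8)
    return float(sal[fixations > 0].mean())

def auc(sal, fixations):
    """Area under the ROC curve, treating fixated pixels as positives."""
    return float(roc_auc_score((fixations > 0).ravel().astype(int), sal.ravel()))
```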
Each stereo image in the two data sets NUS and NCTU is processed with the method to obtain its corresponding saliency detection image, and the linear correlation coefficient CC, the Kullback-Leibler divergence KLDiv, the AUC parameter and the normalized scanpath saliency NSS, which reflect the detection performance of the method, are listed in Table 1.
TABLE 1 evaluation results obtained by the method of the invention
As is clear from the data listed in Table 1, the detection results of the saliency detection images obtained by the method of the present invention are good, and the objective evaluation results are consistent with subjective human visual perception, which is sufficient to demonstrate the feasibility and effectiveness of the method. Fig. 4(a) shows the human eye gaze image corresponding to the 1st original stereo image of a scene in the NCTU data set, and Fig. 4(b) shows the saliency detection image obtained by the method of the present invention for that original stereo image; Fig. 5(a) shows the human eye gaze image corresponding to the 2nd original stereo image of the same scene in the NCTU data set, and Fig. 5(b) shows the saliency detection image obtained by the method of the present invention for that original stereo image; Fig. 6(a) shows the human eye gaze image corresponding to the 3rd original stereo image of a scene in the NUS data set, and Fig. 6(b) shows the saliency detection image obtained by the method of the present invention for that original stereo image; Fig. 7(a) shows the human eye gaze image corresponding to the 4th original stereo image of the same scene in the NUS data set, and Fig. 7(b) shows the saliency detection image obtained by the method of the present invention for that original stereo image.
Comparing Fig. 4(a) with Fig. 4(b), Fig. 5(a) with Fig. 5(b), Fig. 6(a) with Fig. 6(b), and Fig. 7(a) with Fig. 7(b), it can be seen that the saliency detection images obtained by the method of the present invention are predicted with improved accuracy, which is a prominent technical effect.
Claims (5)
1. An asymmetric multi-mode fusion significance detection method based on an attention mechanism is characterized by comprising a training stage and a testing stage;
the specific steps of the training phase process are as follows:
Step 1.1): collect and select the RGB (red, green, blue) images and depth images of n original stereo images containing target objects and form a training set together with the real human eye gaze images obtained by annotation; all depth images in the training set are processed with the HHA (horizontal disparity, height above ground, angle with gravity) encoding method into a set Hi having three channels, like the original stereo images;
Step 1.2): constructing a convolutional neural network;
Step 1.3): input the RGB images and the depth images of the original stereo images in the training set into the constructed convolutional neural network for training, obtaining the saliency detection image corresponding to each original stereo image, and record the saliency detection images obtained after training as a set;
Step 1.4): compute the value of the loss function between the set formed by the saliency detection maps obtained by training and the set formed by the corresponding real human eye gaze images {Gi(x, y)};
Step 1.5): repeatedly execute step 1.3) and step 1.4) for m iterations to obtain the convolutional neural network classification training model, giving n×m loss function values in total; then find the loss function value with the minimum value among these n×m values, and retain the weight vector and bias term of the convolutional neural network corresponding to this minimum loss value as the optimal weight vector WBest and the optimal bias term BBest of the trained convolutional neural network;
The test stage process comprises the following specific steps:
Step 2.1): input the R channel component, the G channel component and the B channel component of the RGB map of the target object to be detected, together with its depth map, into the trained convolutional neural network, and use the optimal weight vector WBest and the optimal bias term BBest to make a prediction, obtaining the corresponding saliency detection image, whose value at coordinate position (x', y') is the pixel value of the pixel at that position.
2. The asymmetric multi-modal fusion saliency detection method based on attention mechanism as claimed in claim 1, characterized in that:
the convolutional neural network in step 1.2) comprises an input layer and a hidden layer, and the output of the hidden layer is the output of the convolutional neural network;
The input end of the input layer receives the RGB image and the depth map of an original stereo image; the output end of the input layer outputs the R channel component, the G channel component and the B channel component of the RGB image of the original stereo image together with the encoding map of the depth map, and the output of the input layer is the input of the hidden layer. The depth map is processed in the input layer by HHA encoding so that, like the RGB map, it has an encoding map with three channels; the RGB map and the depth map of the original stereo image have the same width W and height H;
the hidden layer comprises the following components: ten neural network blocks, a channel attention module, an internal perception module, a spatial attention module SAM and four decoding blocks;
for the processing of the depth map:
The 1st neural network block is composed of a first convolution layer, a first activation layer, a second convolution layer, a second activation layer and a first maximum pooling layer connected in sequence; its input is the encoding map of the depth map output by the input layer, and its output is 64 processed feature maps forming a first depth feature map set D1, each map having a correspondingly reduced width and height;
The 2nd neural network block consists of a third convolution layer, a third activation layer, a fourth convolution layer, a fourth activation layer and a second maximum pooling layer; its input is the 64 feature maps output by the 1st neural network block, and its output is 128 feature maps forming a second depth feature map set D2;
The input of the 3rd neural network block is the 128 feature maps output by the 2nd neural network block, and its output is 256 feature maps forming a third depth feature map set D3;
The input of the 4th neural network block is the 256 feature maps output by the 3rd neural network block, and its output is 512 feature maps forming a fourth depth feature map set D4;
The input of the 5th neural network block is the 512 feature maps output by the 4th neural network block, and its output is 512 feature maps forming a fifth depth feature map set D5;
The depth map is thus processed by the 1st to 5th neural network blocks to obtain five depth feature map sets, namely D1, D2, D3, D4 and D5;
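The five depth-branch blocks above follow the VGG-16 layout (64/128/256/512/512 output channels with a max-pooling layer per stage). A hedged sketch that reuses torchvision's VGG-16 feature extractor as the concrete backbone is shown below; treating the depth branch as exactly VGG-16 is an assumption consistent with, but not stated by, the channel counts.

```python
# Sketch of the depth branch (blocks 1-5) under a VGG-16 backbone assumption.
import torch.nn as nn
from torchvision.models import vgg16

class DepthEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        feats = vgg16().features          # load ImageNet weights here if desired
        # split at the five max-pooling layers -> stages producing D1..D5
        self.block1 = feats[:5]           # conv, relu, conv, relu, pool -> 64 maps
        self.block2 = feats[5:10]         # -> 128 maps
        self.block3 = feats[10:17]        # -> 256 maps
        self.block4 = feats[17:24]        # -> 512 maps
        self.block5 = feats[24:31]        # -> 512 maps

    def forward(self, hha):
        d1 = self.block1(hha)
        d2 = self.block2(d1)
        d3 = self.block3(d2)
        d4 = self.block4(d3)
        d5 = self.block5(d4)
        return d1, d2, d3, d4, d5
```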
For the processing of the RGB map:
The 6th neural network block consists of an eleventh convolution layer, a first normalization layer, an eleventh activation layer and a sixth maximum pooling layer; its input is the three-channel original RGB map, and its output is 64 processed feature maps forming a first RGB feature map set R1;
The input of the 7th neural network block is the 64 feature maps output by the 6th neural network block, and its output is 256 feature maps forming a second RGB feature map set R2. The 7th neural network block consists of three consecutive convolution blocks; each convolution block is formed by connecting four consecutive convolution layers, the input of the fourth convolution layer being the output of the third convolution layer together with the output of the previous convolution block, and after this addition the output is 256 feature maps;
The 8th neural network block consists of four consecutive convolution blocks; its input is the 256 feature maps output by the 7th neural network block, and its output is 512 feature maps forming a third RGB feature map set R3;
The 9th neural network block consists of six consecutive convolution blocks; its input is the 512 feature maps output by the 8th neural network block, and its output is 1024 feature maps forming a fourth RGB feature map set R4;
The 10th neural network block consists of three consecutive convolution blocks; its input is the 1024 feature maps output by the 9th neural network block, and its output is 2048 feature maps forming a fifth RGB feature map set R5;
The RGB map is thus processed by the 6th to 10th neural network blocks to obtain five RGB feature map sets, namely R1, R2, R3, R4 and R5;
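The five RGB-branch blocks above match the stage structure of ResNet-50 (64/256/512/1024/2048 output channels with 3/4/6/3 bottleneck blocks). The sketch below therefore slices torchvision's ResNet-50 into the 6th to 10th neural network blocks; using this backbone, with its default strides, is an assumption rather than something the text specifies.

```python
# Sketch of the RGB branch (blocks 6-10) under a ResNet-50 backbone assumption.
import torch.nn as nn
from torchvision.models import resnet50

class RGBEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50()                  # load ImageNet weights here if desired
        self.block6 = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # 64 maps
        self.block7 = net.layer1          # 3 bottleneck blocks -> 256 maps
        self.block8 = net.layer2          # 4 bottleneck blocks -> 512 maps
        self.block9 = net.layer3          # 6 bottleneck blocks -> 1024 maps
        self.block10 = net.layer4         # 3 bottleneck blocks -> 2048 maps

    def forward(self, rgb):
        r1 = self.block6(rgb)
        r2 = self.block7(r1)
        r3 = self.block8(r2)
        r4 = self.block9(r3)
        r5 = self.block10(r4)
        return r1, r2, r3, r4, r5
```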
Then the first depth feature map set D1 and the first RGB feature map set R1 are each processed by their respective channel attention module CAM and, after a channel stacking operation, 128 feature maps are output as a first feature map set a; the second depth feature map set D2 and the second RGB feature map set R2 are each processed by their respective channel attention module CAM and, after channel stacking, 384 feature maps are output as a second feature map set b; the third depth feature map set D3 and the third RGB feature map set R3 are each processed by their respective channel attention module CAM and, after channel stacking, 768 feature maps are output as a third feature map set c; the fourth depth feature map set D4 and the fourth RGB feature map set R4 are each processed by their respective channel attention module CAM and, after channel stacking, 1536 feature maps are output as a fourth feature map set d;
The fifth RGB feature map set R5 yields a perception feature map set F after processing by the internal perception module IPM; the perception feature map set F and the fifth depth feature map set D5 are fed to the spatial attention module SAM, whose output is channel-stacked with the fourth feature map set d and input to the 1st decoding block; the output of the 1st decoding block is channel-stacked with the third feature map set c and input to the 2nd decoding block; the output of the 2nd decoding block is channel-stacked with the second feature map set b and input to the 3rd decoding block; the output of the 3rd decoding block is channel-stacked with the first feature map set a and input to the 4th decoding block; and the output of the 4th decoding block is the output of the hidden layer, i.e. the final saliency prediction map.
3. The asymmetric multi-modal fusion saliency detection method based on attention mechanism as claimed in claim 2, characterized in that:
the channel attention module CAM is specifically as follows: the input is a feature map set Xi, Xi ∈ (D1, D2, D3, D4, R1, R2, R3, R4); first, a matrix shape adjustment operation (reshape) gives a first adjustment map RE(Xi); then the first adjustment map RE(Xi) is matrix-transposed (transpose) to obtain a second adjustment map RE^T(Xi); then the second adjustment map RE^T(Xi) and the first adjustment map RE(Xi) are matrix-multiplied to obtain a third adjustment map M(Xi), which is processed by the softmax function to obtain the attention feature map S(Xi); then the first adjustment map RE(Xi) and the attention feature map S(Xi) are matrix-multiplied and the matrix shape is adjusted to obtain a fourth adjustment map SR(Xi); finally, the fourth adjustment map SR(Xi) is multiplied by the range parameter α and added to the input feature map set Xi, and the resulting fifth adjustment map O(Xi) is output as the output of the channel attention module CAM;
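A compact PyTorch sketch of the channel attention module CAM follows. It reads the claim in the usual Gram-matrix form (reshape, transpose, channel-by-channel similarity, softmax, re-weighting, scaling by a learned α and a residual addition); where the wording of the multiplication order is ambiguous, this interpretation is an assumption.

```python
# Channel attention module sketch, following the standard channel-attention formulation.
import torch
import torch.nn as nn

class CAM(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))    # range parameter alpha, learned

    def forward(self, x):                            # x: B x C x H x W
        b, c, h, w = x.shape
        re = x.view(b, c, h * w)                     # first adjustment map RE(X)
        re_t = re.permute(0, 2, 1)                   # second adjustment map RE^T(X)
        m = torch.bmm(re, re_t)                      # third adjustment map M(X): B x C x C
        s = torch.softmax(m, dim=-1)                 # attention feature map S(X)
        sr = torch.bmm(s, re).view(b, c, h, w)       # fourth adjustment map SR(X)
        return self.alpha * sr + x                   # fifth adjustment map O(X)
```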
the input of the internal perception module IPM is the fifth RGB feature map set R5 output by the 5th neural network block of the RGB branch, and its output is 1024 feature maps forming the perception feature map set F; the internal perception module IPM comprises a 1st dilated convolution block, a 2nd dilated convolution block, a 3rd dilated convolution block, a 4th dilated convolution block and a first up-sampling layer: the output of the 1st dilated convolution block is input to the 2nd dilated convolution block; the outputs of the 1st and 2nd dilated convolution blocks are channel-stacked and input to the 3rd dilated convolution block; the output of the 3rd dilated convolution block and its input are channel-stacked and input to the 4th dilated convolution block; the output of the 4th dilated convolution block is fed directly to the first up-sampling layer, whose output is the output of the internal perception module IPM.
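The sketch below illustrates the internal perception module IPM: four dilated ("expansion") convolution blocks wired with the channel-stacked connections described above, followed by a 2× up-sampling layer. The dilation rates (2/4/8/16), the 1024-channel intermediate width and the bilinear up-sampling mode are illustrative assumptions; the claim fixes only the connectivity and the 2048-in/1024-out interface.

```python
# Internal perception module sketch with assumed dilation rates and channel widths.
import torch
import torch.nn as nn

def dilated_block(in_ch, out_ch, dilation):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=dilation, dilation=dilation),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class IPM(nn.Module):
    def __init__(self, in_ch=2048, mid_ch=1024):
        super().__init__()
        self.d1 = dilated_block(in_ch, mid_ch, dilation=2)
        self.d2 = dilated_block(mid_ch, mid_ch, dilation=4)
        self.d3 = dilated_block(2 * mid_ch, mid_ch, dilation=8)    # input: cat(out1, out2)
        self.d4 = dilated_block(3 * mid_ch, mid_ch, dilation=16)   # input: cat(out3, input of d3)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, r5):                   # r5: B x 2048 x h x w
        o1 = self.d1(r5)
        o2 = self.d2(o1)
        i3 = torch.cat([o1, o2], dim=1)
        o3 = self.d3(i3)
        o4 = self.d4(torch.cat([o3, i3], dim=1))
        return self.up(o4)                   # perception feature map set F, 1024 maps
```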
4. The asymmetric multi-modal fusion saliency detection method based on attention mechanism as claimed in claim 2, characterized in that: the spatial attention module SAM is mainly composed of a sixteenth convolution layer, a sixth normalization layer, a sixteenth activation layer and a second up-sampling layer; the input of the sixteenth convolution layer is the fifth depth feature map set D5 output by the 5th neural network block of the depth branch; the output, after a softmax activation function, is matrix-multiplied with the perception feature map set F and then multiplied by the range parameter β to obtain a feature map set S4; the feature map set S4 is finally added to the fifth depth feature map set D5 output by the 5th neural network block of the depth branch, and the resulting attention feature set S5 is the output of the spatial attention module SAM.
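The spatial attention module SAM leaves several shape details open (the output width of the sixteenth convolution layer and how F and D5 are aligned before the final addition), so the sketch below makes explicit assumptions: D5 is projected to the channel width of F, the softmax is taken over spatial positions, and the attended features are added back to the projected depth features rather than to raw D5. It should be read as one possible interpretation, not the definitive construction.

```python
# Spatial attention module sketch under explicitly assumed shape conventions.
import torch
import torch.nn as nn

class SAM(nn.Module):
    def __init__(self, depth_ch=512, feat_ch=1024):
        super().__init__()
        self.proj = nn.Sequential(                   # 16th conv + 6th normalization + 16th activation
            nn.Conv2d(depth_ch, feat_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_ch),
            nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)  # 2nd up-sampling
        self.beta = nn.Parameter(torch.zeros(1))     # range parameter beta, learned

    def forward(self, d5, f):                        # d5: B x 512 x h x w, f: B x 1024 x 2h x 2w
        d = self.up(self.proj(d5))                   # project and align depth features with F
        b, c, h, w = d.shape
        attn = torch.softmax(d.view(b, c, h * w), dim=-1).view(b, c, h, w)  # spatial softmax
        s4 = self.beta * (attn * f)                  # attention-weighted perception features
        return s4 + d                                # attention feature set S5
```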
5. The asymmetric multi-modal fusion saliency detection method based on attention mechanism as claimed in claim 2, characterized in that: the 1st decoding block is mainly formed by sequentially connecting a first fusion layer, a seventeenth convolution layer, a seventh normalization layer, a seventeenth activation layer, an eighteenth convolution layer, an eighth normalization layer, an eighteenth activation layer and a third up-sampling layer; the 2nd decoding block is mainly formed by sequentially connecting a second fusion layer, a nineteenth convolution layer, a ninth normalization layer, a nineteenth activation layer, a twentieth convolution layer, a tenth normalization layer, a twentieth activation layer and a fourth up-sampling layer; the 3rd decoding block is mainly formed by sequentially connecting a third fusion layer, a twenty-first convolution layer, an eleventh normalization layer, a twenty-first activation layer, a twenty-second convolution layer, a twelfth normalization layer, a twenty-second activation layer and a fifth up-sampling layer; the 4th decoding block is mainly formed by sequentially connecting a fourth fusion layer, a twenty-third convolution layer, a thirteenth normalization layer, a twenty-third activation layer, a twenty-fourth convolution layer, a fourteenth normalization layer, a twenty-fourth activation layer, a twenty-fifth convolution layer, a fifteenth normalization layer, a twenty-fifth activation layer and a sixth up-sampling layer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010291052.4A CN111563418A (en) | 2020-04-14 | 2020-04-14 | Asymmetric multi-mode fusion significance detection method based on attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111563418A true CN111563418A (en) | 2020-08-21 |
Family
ID=72067830
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010291052.4A Withdrawn CN111563418A (en) | 2020-04-14 | 2020-04-14 | Asymmetric multi-mode fusion significance detection method based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111563418A (en) |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111768375B (en) * | 2020-06-24 | 2022-07-26 | 海南大学 | Asymmetric GM multi-mode fusion significance detection method and system based on CWAM |
CN111768375A (en) * | 2020-06-24 | 2020-10-13 | 海南大学 | Asymmetric GM multi-mode fusion significance detection method and system based on CWAM |
CN112465746A (en) * | 2020-11-02 | 2021-03-09 | 新疆天维无损检测有限公司 | Method for detecting small defects in radiographic film |
CN112465746B (en) * | 2020-11-02 | 2024-03-05 | 新疆天维无损检测有限公司 | Method for detecting small defects in ray film |
CN112837262A (en) * | 2020-12-04 | 2021-05-25 | 国网宁夏电力有限公司检修公司 | Method, medium and system for detecting opening and closing states of disconnecting link |
CN112509046A (en) * | 2020-12-10 | 2021-03-16 | 电子科技大学 | Weak supervision convolutional neural network image target positioning method |
CN112597996A (en) * | 2020-12-28 | 2021-04-02 | 山西云时代研发创新中心有限公司 | Task-driven natural scene-based traffic sign significance detection method |
CN112597996B (en) * | 2020-12-28 | 2024-03-29 | 山西云时代研发创新中心有限公司 | Method for detecting traffic sign significance in natural scene based on task driving |
CN112861733A (en) * | 2021-02-08 | 2021-05-28 | 电子科技大学 | Night traffic video significance detection method based on space-time double coding |
CN112861733B (en) * | 2021-02-08 | 2022-09-02 | 电子科技大学 | Night traffic video significance detection method based on space-time double coding |
CN113033630A (en) * | 2021-03-09 | 2021-06-25 | 太原科技大学 | Infrared and visible light image deep learning fusion method based on double non-local attention models |
CN113222003B (en) * | 2021-05-08 | 2023-08-01 | 北方工业大学 | Construction method and system of indoor scene pixel-by-pixel semantic classifier based on RGB-D |
CN113222003A (en) * | 2021-05-08 | 2021-08-06 | 北方工业大学 | RGB-D-based indoor scene pixel-by-pixel semantic classifier construction method and system |
CN113283435B (en) * | 2021-05-14 | 2023-08-22 | 陕西科技大学 | Remote sensing image semantic segmentation method based on multi-scale attention fusion |
CN113283435A (en) * | 2021-05-14 | 2021-08-20 | 陕西科技大学 | Remote sensing image semantic segmentation method based on multi-scale attention fusion |
CN113657534A (en) * | 2021-08-24 | 2021-11-16 | 北京经纬恒润科技股份有限公司 | Classification method and device based on attention mechanism |
CN114445442A (en) * | 2022-01-28 | 2022-05-06 | 杭州电子科技大学 | Multispectral image semantic segmentation method based on asymmetric cross fusion |
CN114445442B (en) * | 2022-01-28 | 2022-12-02 | 杭州电子科技大学 | Multispectral image semantic segmentation method based on asymmetric cross fusion |
CN115222629A (en) * | 2022-08-08 | 2022-10-21 | 西南交通大学 | Single remote sensing image cloud removing method based on cloud thickness estimation and deep learning |
CN118297950A (en) * | 2024-06-06 | 2024-07-05 | 北斗数字信息产业发展(辽宁)有限公司 | Stereoscopic image quality evaluation method and device based on stereoscopic vision perception mechanism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111563418A (en) | Asymmetric multi-mode fusion significance detection method based on attention mechanism | |
CN110555434B (en) | Method for detecting visual saliency of three-dimensional image through local contrast and global guidance | |
CN111080629B (en) | Method for detecting image splicing tampering | |
CN110175986B (en) | Stereo image visual saliency detection method based on convolutional neural network | |
CN110619638A (en) | Multi-mode fusion significance detection method based on convolution block attention module | |
CN107944442B (en) | Based on the object test equipment and method for improving convolutional neural networks | |
CN110059728B (en) | RGB-D image visual saliency detection method based on attention model | |
CN113449727A (en) | Camouflage target detection and identification method based on deep neural network | |
CN106462771A (en) | 3D image significance detection method | |
CN110929736A (en) | Multi-feature cascade RGB-D significance target detection method | |
CN110705566B (en) | Multi-mode fusion significance detection method based on spatial pyramid pool | |
CN110210492B (en) | Stereo image visual saliency detection method based on deep learning | |
CN113112416B (en) | Semantic-guided face image restoration method | |
CN110827312B (en) | Learning method based on cooperative visual attention neural network | |
CN112149662A (en) | Multi-mode fusion significance detection method based on expansion volume block | |
CN113449691A (en) | Human shape recognition system and method based on non-local attention mechanism | |
CN110458178A (en) | The multi-modal RGB-D conspicuousness object detection method spliced more | |
CN115588190A (en) | Mature fruit identification and picking point positioning method and device | |
CN114463492A (en) | Adaptive channel attention three-dimensional reconstruction method based on deep learning | |
CN113610905B (en) | Deep learning remote sensing image registration method based on sub-image matching and application | |
CN116883679B (en) | Ground object target extraction method and device based on deep learning | |
CN107909565A (en) | Stereo-picture Comfort Evaluation method based on convolutional neural networks | |
CN115202477A (en) | AR (augmented reality) view interaction method and system based on heterogeneous twin network | |
CN111539434B (en) | Infrared weak and small target detection method based on similarity | |
CN117495718A (en) | Multi-scale self-adaptive remote sensing image defogging method |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | WW01 | Invention patent application withdrawn after publication | Application publication date: 20200821