CN115115860A - Image feature point detection matching network based on deep learning - Google Patents
Image feature point detection matching network based on deep learning
- Publication number: CN115115860A
- Application number: CN202210856359.3A
- Authority
- CN
- China
- Prior art keywords
- matrix
- matching
- fitting
- pixel
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/757: Matching configurations of points or features (hierarchy: G Physics; G06 Computing, calculating or counting; G06V Image or video recognition or understanding; G06V10/70 using pattern recognition or machine learning; G06V10/74 Image or video pattern matching, proximity measures in feature spaces; G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons, coarse-fine approaches, context analysis, selection of dictionaries)
- G06V10/40: Extraction of image or video features
- G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
Abstract
The invention designs a fusion network based on an improved SuperPoint network and an improved SuperGlue network. The network extracts image feature points with a fully convolutional network and uses a sub-pixelization module that exploits neighborhood window information to improve the coordinate accuracy of the feature points. After the feature point coordinates and feature vectors are jointly encoded, an attention mechanism simulates the way humans match feature points, and the Sinkhorn algorithm solves for the matching relationship. The invention also designs an adaptive spatial constraint layer that screens the coarse matching point pairs with several methods computed in parallel under spatial constraints, adaptively determines the spatial relationship between the images, and extracts matched feature point pairs from the input images.
Description
Technical Field
The invention belongs to the field of computer image processing and relates to a deep-learning-based method for detecting and matching image feature points, which outputs matched feature point pairs and a spatial relationship matrix between two images.
Background
A feature point consists of a key point carrying texture and structure information in the image and a descriptor associated with that key point, so feature point detection comprises two steps: key point detection and descriptor computation. Commonly used key point detectors include the Laplacian method, the Harris corner detector, the difference-of-Gaussians detector, and the FAST corner detector. The SIFT operator detects key points in a DoG scale pyramid and computes a 128-dimensional descriptor containing gradient orientation information; the descriptor is scale-, rotation-, and affine-invariant and has excellent localization accuracy, but its computation is complex and therefore slow. Rublee proposed the ORB operator, which uses an improved FAST detector combined with an improved BRIEF descriptor; because the binary descriptor is encoded from neighborhood pixels, ORB is very fast while preserving scale invariance and is used in many real-time tasks. With the rapid development of deep learning, researchers have made continuous attempts and explorations in feature point detection. SuperPoint is a self-supervised deep learning feature point extraction algorithm with good image understanding and feature description capability, and it runs far faster on a GPU than traditional feature extraction algorithms. SuperGlue is an attentional GNN-based matching network adapted to SuperPoint that uses an attention mechanism to simulate how humans match images. SuperPoint has the drawback that its feature point coordinates are integers rather than floating-point values, which limits it in tasks with higher precision requirements, and SuperGlue lacks explicit spatial constraints when using a GNN to simulate human vision. At present, mainstream feature point detection for downstream tasks such as three-dimensional reconstruction still relies on hand-crafted feature detection methods represented by SIFT.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention discloses a feature point detection and matching method based on deep learning. The invention designs a fusion network based on an improved SuperPoint network and an improved SuperGlue network. The network extracts image feature points with a fully convolutional network and uses a sub-pixelization module that exploits neighborhood window information to improve the coordinate accuracy of the feature points. After the feature point coordinates and feature vectors are jointly encoded, an attention mechanism simulates the way humans match feature points, and the Sinkhorn algorithm solves for the matching relationship. The invention also designs an adaptive spatial constraint layer that screens the coarse matching point pairs with several methods computed in parallel under spatial constraints, adaptively determines the spatial relationship between the images, and extracts matched feature point pairs from the input images.
The invention adopts the following technical steps:
Step 1: extracting sub-pixel-accurate feature point coordinates of the image to be matched by using the SuperPoint network improved with sub-pixel precision.
Step 2: extracting the feature point and descriptor joint encoding vectors according to the sub-pixel feature point coordinates.
Step 3: performing steps 1 and 2 on each of the two pictures to be matched to obtain two groups of feature point joint vectors, and inputting the two groups of vectors into the SuperGlue attentional graph neural network and the optimal matching layer to obtain a matching relationship matrix, which describes the coarse matching relationships between feature points and the corresponding confidences. The matches are sorted by confidence from high to low.
Step 4: inputting the coordinates of all feature points in the two pictures and the matching relationship matrix into the spatial constraint layer designed by the invention, which comprises parallel fundamental matrix and homography matrix constraints and selects an H-F spatial constraint model according to the bidirectional reprojection errors.
Step 4.1: taking the fundamental matrix as the fitting model, fitting all matching point pairs by the least median of squares method to obtain a fitted matrix F1 and the set of matching point pairs consistent with F1 under a projection error threshold of 2 pixels.
Step 4.2: taking the fundamental matrix as the fitting model, fitting all matching point pairs by progressive sample consensus, with the number of iterations dynamically allocated between 150 and 300 according to the average confidence of the top 30% of matches by confidence, to obtain a fitted matrix F2 and the set of matching point pairs consistent with F2 under a projection error threshold of 2 pixels.
Step 4.3: taking the homography matrix as the fitting model, fitting all matching point pairs by the least median of squares method to obtain a fitted matrix H1 and the set of matching point pairs consistent with H1 under a projection error threshold of 2 pixels.
Step 4.4: taking the homography matrix as the fitting model, fitting all matching point pairs by progressive sample consensus, with the number of iterations dynamically allocated between 150 and 300 according to the average confidence of the top 30% of matches by confidence, to obtain a fitted matrix H2 and the set of matching point pairs consistent with H2 under a projection error threshold of 2 pixels. Steps 4.1, 4.2, 4.3, and 4.4 are run in parallel to save time.
Step 4.5: the homography matrix relates a point to its projected point, and the distance between the projected point and the matched point is taken as the projection error. The average bidirectional reprojection error of H1 is computed over the set of matching point pairs consistent with H1, the average bidirectional reprojection error of H2 is computed in the same way, the smaller of the two is taken, the corresponding matrix is kept as H, and the corresponding average bidirectional reprojection error is kept as SH.
Step 4.6: the fundamental matrix relates a point to its projected epipolar line, and the point-to-line distance between the projected epipolar line and the matched point is taken as the projection error. The average bidirectional reprojection error of F1 is computed over the set of matching point pairs consistent with F1, the average bidirectional reprojection error of F2 is computed in the same way, the smaller of the two is taken, the corresponding matrix is kept as F, and the corresponding average bidirectional reprojection error is kept as SF.
Step 4.7: when SH/(SF + SH) > 0.4, the matrix H is kept and the set of matching point pairs consistent with H is output; otherwise, the matrix F is kept and the set of matching point pairs consistent with F is output.
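For concreteness, the following sketch illustrates how steps 4.1-4.7 could be realized with OpenCV and NumPy. The helper names, the exact mapping from the top-30% average confidence to the 150-300 iteration budget, and the use of OpenCV's USAC_PROSAC flag (available in OpenCV 4.5 and later) as the progressive sample consensus fitter are illustrative assumptions, not the invention's reference implementation.

```python
import cv2
import numpy as np

def dynamic_iters(conf, low=150, high=300):
    # Assumed mapping: lower average confidence of the top-30% matches -> more iterations.
    top = np.sort(conf)[::-1][: max(1, int(0.3 * len(conf)))]
    return int(round(low + (high - low) * (1.0 - float(top.mean()))))

def mean_err_h(H, p1, p2):
    # Average bidirectional reprojection error under a homography (step 4.5).
    fwd = cv2.perspectiveTransform(p1.reshape(-1, 1, 2), H).reshape(-1, 2)
    bwd = cv2.perspectiveTransform(p2.reshape(-1, 1, 2), np.linalg.inv(H)).reshape(-1, 2)
    return 0.5 * (np.linalg.norm(fwd - p2, axis=1).mean() +
                  np.linalg.norm(bwd - p1, axis=1).mean())

def mean_err_f(Fm, p1, p2):
    # Average bidirectional point-to-epipolar-line distance under a fundamental matrix (step 4.6).
    l2 = cv2.computeCorrespondEpilines(p1.reshape(-1, 1, 2), 1, Fm).reshape(-1, 3)
    l1 = cv2.computeCorrespondEpilines(p2.reshape(-1, 1, 2), 2, Fm).reshape(-1, 3)
    d2 = np.abs((l2[:, :2] * p2).sum(1) + l2[:, 2]) / np.linalg.norm(l2[:, :2], axis=1)
    d1 = np.abs((l1[:, :2] * p1).sum(1) + l1[:, 2]) / np.linalg.norm(l1[:, :2], axis=1)
    return 0.5 * (d1.mean() + d2.mean())

def hf_constraint_layer(p1, p2, conf, thresh_px=2.0):
    # p1, p2: (N, 2) matched keypoints sorted by descending confidence (as PROSAC expects);
    # conf: (N,) matching confidences from step 3.
    p1 = np.asarray(p1, np.float64)
    p2 = np.asarray(p2, np.float64)
    iters = dynamic_iters(np.asarray(conf, np.float64))

    # Steps 4.1-4.4 (run sequentially here; the invention runs the four fits in parallel).
    F1, mF1 = cv2.findFundamentalMat(p1, p2, cv2.FM_LMEDS)
    F2, mF2 = cv2.findFundamentalMat(p1, p2, cv2.USAC_PROSAC, thresh_px, 0.99, iters)
    H1, mH1 = cv2.findHomography(p1, p2, cv2.LMEDS)
    H2, mH2 = cv2.findHomography(p1, p2, cv2.USAC_PROSAC, thresh_px, maxIters=iters)

    def best(candidates, err_fn):
        # Keep, for each model type, the candidate with the smaller bidirectional error
        # evaluated on its own consistent (inlier) matching point pairs.
        scored = []
        for M, mask in candidates:
            if M is None or mask is None:
                continue
            inl = mask.ravel().astype(bool)
            scored.append((err_fn(M, p1[inl], p2[inl]), M, inl))
        return min(scored, key=lambda t: t[0])

    SH, H, inH = best([(H1, mH1), (H2, mH2)], mean_err_h)
    SF, Fm, inF = best([(F1, mF1), (F2, mF2)], mean_err_f)

    # Step 4.7: adaptive choice between the homography and the fundamental matrix.
    if SH / (SF + SH) > 0.4:
        return "H", H, p1[inH], p2[inH]
    return "F", Fm, p1[inF], p2[inF]
```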
Compared with the prior art, the invention has the following beneficial effects:
(1) In feature detection and matching tasks oriented toward three-dimensional reconstruction, the network achieves higher accuracy.
(2) Compared with traditional feature point detection and matching methods such as SIFT and ORB, the method is more robust in scenes with strong illumination changes, large viewpoint changes, and the like.
Drawings
FIG. 1 is a diagram of a network architecture of the present invention
FIG. 2 is a block diagram of a sub-pixelization module according to the present invention
FIG. 3 is a block diagram of a descriptor decoder according to the present invention
FIG. 4 is a diagram of the spatial constraint layer structure according to the present invention
Detailed description of the preferred embodiment
The invention is further described below with reference to the accompanying drawings.
The invention designs a fast feature point detection and matching network framework: for feature point detection, the SuperPoint network is improved to sub-pixel precision, and for feature point matching, an adaptive spatially constrained dynamic progressive sample consensus module is added to the SuperGlue network. Fig. 1 shows the network structure of the invention.
First, sub-pixel-accurate feature point coordinates of the image to be matched are extracted using the SuperPoint network improved with sub-pixel precision. This mainly comprises the following three steps:
1. The image to be matched is input into the VGG-style encoder of SuperPoint to extract a feature map. The encoder comprises 8 convolutional layers, 3 pooling layers, and several BN and activation layers. The convolutional layers, with 64, 128, and 128 convolution kernels in sequence, extract features, and the three 2×2 max-pooling layers downsample an H×W picture to height H/8 and width W/8. An H×W image is thus encoded into an H/8×W/8×128 feature map;
2. The feature map obtained in step 1 is input into the feature point decoder of SuperPoint, which comprises 2 convolutional layers with 256 and 65 channels and several BN and activation layers, and outputs a W/8×H/8×65 tensor. Each 65-dimensional vector covers 65 cases for one non-overlapping 8×8 pixel window of the original image: the i-th pixel of the window is a feature point, or the window contains no feature point. Softmax classification yields a normalized probability distribution over the 65 cases, and a Reshape layer restores the size to H×W×1, giving an H×W score map whose pixel values lie between 0 and 1 and represent the probability that each pixel of the input image I is a feature point;
3. In the original SuperPoint, feature points are extracted by applying non-maximum suppression (NMS) within each N×N window of the output score map, keeping only one maximum per window, and then thresholding the whole map; points above the threshold are taken as feature points. This output has only integer-level coordinate accuracy and is not differentiable. The invention adds to SuperPoint a sub-pixel correction module that combines feature point neighborhood information; the main flow is shown in Fig. 2. The score map obtained in step 2 is input into the coordinate sub-pixelization module designed by the invention, which proceeds as follows: non-maximum suppression is applied to each non-overlapping 4×4 pixel window, and all pixel values except the maximum in each window are set to 0, yielding a coarse feature point map. Points in the coarse feature point map above a certain threshold are taken as coarse feature points. For each of the M coarse feature points, a 5×5 pixel window centered on its coordinates is taken in the score map, the expected offset of the window pixels relative to the center is computed separately in the x and y directions with the Softargmax method, and the expected offsets are added to the coarse coordinates to obtain M sub-pixel fine feature point coordinates.
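A minimal PyTorch sketch of this coordinate sub-pixelization is given below. The function name, the score threshold value, and the clipping of windows at image borders are illustrative assumptions; it assumes H and W are multiples of 4, as is the case for images processed by the SuperPoint encoder.

```python
import torch
import torch.nn.functional as F

def subpixel_refine(score_map: torch.Tensor, score_thresh: float = 0.015, win: int = 5):
    """score_map: (H, W) per-pixel feature point probabilities in [0, 1]; H, W multiples of 4."""
    H, W = score_map.shape
    # Non-maximum suppression over non-overlapping 4x4 windows:
    # keep only the maximum of each window and set every other pixel to 0.
    pooled = F.max_pool2d(score_map[None, None], kernel_size=4, stride=4)
    win_max = F.interpolate(pooled, scale_factor=4, mode="nearest")[0, 0]
    coarse = torch.where(score_map == win_max, score_map, torch.zeros_like(score_map))

    ys, xs = torch.nonzero(coarse > score_thresh, as_tuple=True)  # M coarse feature points
    r = win // 2
    refined = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        # 5x5 window of the score map centered on the coarse feature point (clipped at borders).
        y0, y1 = max(y - r, 0), min(y + r + 1, H)
        x0, x1 = max(x - r, 0), min(x + r + 1, W)
        patch = score_map[y0:y1, x0:x1]
        w = F.softmax(patch.flatten(), dim=0).reshape(patch.shape)  # Softargmax weights
        dy = (w.sum(dim=1) * torch.arange(y0 - y, y1 - y, dtype=w.dtype)).sum()
        dx = (w.sum(dim=0) * torch.arange(x0 - x, x1 - x, dtype=w.dtype)).sum()
        refined.append((x + dx.item(), y + dy.item()))              # sub-pixel (x, y)
    return refined
```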
Then, the feature point and descriptor joint encoding vectors are extracted from the sub-pixel feature point coordinates, as shown in Fig. 3. The descriptor decoder receives the H/8×W/8×128 feature map and, after several convolutions, outputs an H/8×W/8×256 initial descriptor matrix. The original SuperPoint network interpolates this matrix to H×W×256 with bicubic interpolation and then L2-normalizes the 256 channels, computing a 256-dimensional descriptor for every pixel of the original image I; in practice, descriptors of non-feature-point pixels need not be computed. The improved descriptor decoder instead performs M bilinear interpolations in the initial descriptor matrix at the M sub-pixel coordinates output by the sub-pixelization module, obtaining M 256-dimensional vectors, which are then L2-normalized to give the final 256-dimensional descriptors.
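The following sketch, under the same illustrative assumptions as above, shows how descriptors could be sampled at the M sub-pixel coordinates by bilinear interpolation (via grid_sample) followed by L2 normalization; the coordinate normalization is a simplified assumption and differs slightly from SuperPoint's released code.

```python
import torch
import torch.nn.functional as F

def sample_descriptors(desc_map: torch.Tensor, keypoints: torch.Tensor, image_hw):
    """desc_map: (1, 256, H/8, W/8) initial descriptor matrix;
    keypoints: (M, 2) sub-pixel (x, y) coordinates in the original image; image_hw: (H, W)."""
    H, W = image_hw
    # Map image coordinates to the [-1, 1] grid expected by grid_sample.
    grid = keypoints.clone().float()
    grid[:, 0] = grid[:, 0] / (W - 1) * 2 - 1
    grid[:, 1] = grid[:, 1] / (H - 1) * 2 - 1
    grid = grid.view(1, 1, -1, 2)                                       # (1, 1, M, 2)
    # Bilinear interpolation of the 256-channel descriptor map at each sub-pixel location.
    desc = F.grid_sample(desc_map, grid, mode="bilinear", align_corners=True)
    desc = desc.view(256, -1).t()                                       # (M, 256)
    return F.normalize(desc, p=2, dim=1)                                # L2-normalized descriptors
```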
Second, the core idea of SuperGlue is to cast the feature point matching problem as an optimal transport problem over the feature point and descriptor joint encoding vectors and to solve it iteratively with the Sinkhorn algorithm. In this process, an attentional GNN simulates the repeated back-and-forth viewing that human eyes perform during matching, and cross-attention and self-attention mechanisms strengthen the joint matching of position coordinates and visual descriptors. SuperGlue, however, imposes no explicit, strict spatial relationship constraint during matching. The invention improves on this by adding a spatial constraint layer to SuperGlue and raising its matching accuracy with a dynamic progressive sample consensus module under an adaptive H-F spatial constraint. Fig. 4 shows the structure of the spatial constraint layer. The two stages above are performed on the two pictures to be matched to obtain two groups of feature point joint vectors, which are input into the SuperGlue attentional graph neural network and the optimal matching layer to obtain a matching relationship matrix describing the coarse matching relationships between feature points and the corresponding confidences; the matches are sorted by confidence from high to low. Finally, the coordinates of all feature points in the two pictures and the matching relationship matrix are input into the spatial constraint layer designed by the invention, which applies fundamental matrix and homography matrix constraints in parallel and selects the H-F spatial constraint model according to the bidirectional reprojection errors.
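To make the optimal transport step concrete, here is a minimal log-domain Sinkhorn sketch operating on a score matrix between the two groups of joint encoding vectors (for example, scaled inner products of the GNN-refined descriptors). The dustbin score, iteration count, and marginal choices are illustrative assumptions and differ in detail from SuperGlue's released implementation.

```python
import torch

def sinkhorn_matching(scores: torch.Tensor, dustbin: float = 1.0, iters: int = 100):
    """scores: (M, N) similarity matrix between the two sets of joint encoding vectors."""
    M, N = scores.shape
    # Augment with a dustbin row and column so that unmatched feature points can be absorbed.
    S = torch.full((M + 1, N + 1), dustbin, dtype=scores.dtype)
    S[:M, :N] = scores
    # Each real keypoint carries unit mass; each dustbin absorbs the other image's total mass.
    log_mu = torch.cat([torch.zeros(M), torch.tensor([float(N)]).log()])
    log_nu = torch.cat([torch.zeros(N), torch.tensor([float(M)]).log()])
    u = torch.zeros(M + 1)
    v = torch.zeros(N + 1)
    for _ in range(iters):  # alternating row/column normalization in the log domain
        u = log_mu - torch.logsumexp(S + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(S + u[:, None], dim=0)
    P = torch.exp(S + u[:, None] + v[None, :])
    return P[:M, :N]        # coarse matching relationship matrix with confidences
```

The returned matrix plays the role of the matching relationship matrix: its entries can be thresholded and mutually-best-matched to obtain the coarse matches that are then sorted by confidence and passed to the spatial constraint layer.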
Claims (5)
1. An image feature point detection and matching network based on deep learning, characterized by comprising the following steps:
Step 1: extracting sub-pixel-accurate feature point coordinates of the image to be matched by using the SuperPoint network improved with sub-pixel precision.
Step 1.1: inputting the image to be matched into the VGG-style encoder of SuperPoint and extracting a feature map, wherein the encoder comprises 8 convolutional layers, 3 pooling layers, and several BN and activation layers; the convolutional layers, with 64, 128, and 128 convolution kernels in sequence, extract features, and three 2×2 max-pooling layers perform the downsampling, so that an H×W image is encoded into an H/8×W/8×128 feature map.
Step 1.2: inputting the feature map obtained in step 1.1 into the feature point decoder of SuperPoint, which comprises 2 convolutional layers with 256 and 65 channels respectively and several BN and activation layers, and outputting a W/8×H/8×65 tensor; Softmax classification yields a normalized probability distribution over the 65 cases, and a Reshape layer restores the size to H×W×1, giving an H×W score map.
Step 1.3: inputting the score map obtained in step 1.2 into the coordinate sub-pixelization module designed by the invention, which proceeds as follows: non-maximum suppression is applied to each non-overlapping 4×4 pixel window, and all pixel values except the maximum in each window are set to 0 to obtain a coarse feature point map; for each of the M coarse feature points, a 5×5 pixel window centered on its coordinates is taken in the score map, the expected offset of the window pixels relative to the center is computed separately in the x and y directions with the Softargmax method, and the expected offsets are added to the coarse coordinates to obtain M sub-pixel fine feature point coordinates.
Step 2: extracting the feature point and descriptor joint encoding vectors according to the sub-pixel feature point coordinates.
Step 2.1: inputting the feature map obtained in step 1.1 into the descriptor decoder of SuperPoint and, after several convolutions, outputting an H/8×W/8×256 descriptor matrix.
Step 2.2: performing M bilinear interpolations in the descriptor matrix obtained in step 2.1 at the M sub-pixel coordinates output in step 1.3 to obtain M 256-dimensional vectors, and applying L2 regularization to these vectors to obtain the 256-dimensional descriptors corresponding to the sub-pixel feature points.
Step 2.3: combining the fine feature point coordinates obtained in step 1.3 with the descriptor vectors obtained in step 2.2 to obtain the feature point joint vectors.
Step 3: performing steps 1 and 2 on each of the two pictures to be matched to obtain two groups of feature point joint vectors, and inputting the two groups of vectors into the SuperGlue attentional graph neural network and the optimal matching layer to obtain a matching relationship matrix; the matches are sorted by confidence from high to low.
Step 4: inputting the coordinates of all feature points in the two pictures and the matching relationship matrix into the spatial constraint layer designed by the invention, and selecting an H-F spatial constraint model according to the bidirectional reprojection errors.
Step 4.1: taking the fundamental matrix as the fitting model, fitting all matching point pairs by the least median of squares method to obtain a fitted matrix F1 and the set of matching point pairs consistent with F1 under a projection error threshold of 2 pixels.
Step 4.2: taking the fundamental matrix as the fitting model, fitting all matching point pairs by progressive sample consensus, with the number of iterations dynamically allocated between 150 and 300 according to the average confidence of the top 30% of matches by confidence, to obtain a fitted matrix F2 and the set of matching point pairs consistent with F2 under a projection error threshold of 2 pixels.
Step 4.3: taking the homography matrix as the fitting model, fitting all matching point pairs by the least median of squares method to obtain a fitted matrix H1 and the set of matching point pairs consistent with H1 under a projection error threshold of 2 pixels.
Step 4.4: taking the homography matrix as the fitting model, fitting all matching point pairs by progressive sample consensus, with the number of iterations dynamically allocated between 150 and 300 according to the average confidence of the top 30% of matches by confidence, to obtain a fitted matrix H2 and the set of matching point pairs consistent with H2 under a projection error threshold of 2 pixels; steps 4.1, 4.2, 4.3, and 4.4 are run in parallel to save time.
Step 4.5: computing the average bidirectional reprojection error of H1 over the set of matching point pairs consistent with H1, computing the average bidirectional reprojection error of H2 in the same way, taking the smaller of the two, keeping the corresponding matrix as H, and keeping the corresponding average bidirectional reprojection error as SH.
Step 4.6: computing the average bidirectional reprojection error of F1 over the set of matching point pairs consistent with F1, computing the average bidirectional reprojection error of F2 in the same way, taking the smaller of the two, keeping the corresponding matrix as F, and keeping the corresponding average bidirectional reprojection error as SF.
Step 4.7: when SH/(SF + SH) > 0.4, keeping the matrix H and outputting the set of matching point pairs consistent with H; otherwise, keeping the matrix F and outputting the set of matching point pairs consistent with F.
2. The method of claim 1, characterized in that step 1 uses the coordinate sub-pixelization module to extract refined feature point coordinates, giving the network more accurate feature extraction capability.
3. The method of claim 1, characterized in that after the descriptor matrix is obtained in step 2, the refined coordinates obtained in step 1 are substituted into it, the corresponding sub-pixel descriptors are obtained by bilinear interpolation and L2 regularization, and the coordinates and descriptors are combined into the joint encoding vector.
4. The method of claim 1, characterized in that after the matching relationship matrix is obtained in step 3, the matching point pairs are sorted according to confidence.
5. The method of claim 1, characterized in that the spatial constraint layer designed in step 4 takes the fundamental matrix and the homography matrix as fitting models in parallel; it performs fitting with a progressive sample consensus module whose number of iterations is set dynamically according to the average confidence of the top 30% of matches, performs fitting in parallel with the least median of squares method, retains the best fitting result according to the bidirectional reprojection error of the inliers, and applies an adaptive criterion to select the correct fitting model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210856359.3A CN115115860A (en) | 2022-07-20 | 2022-07-20 | Image feature point detection matching network based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210856359.3A CN115115860A (en) | 2022-07-20 | 2022-07-20 | Image feature point detection matching network based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---
CN115115860A (en) | 2022-09-27
Family
ID=83333695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210856359.3A Pending CN115115860A (en) | 2022-07-20 | 2022-07-20 | Image feature point detection matching network based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115115860A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116664643A (en) * | 2023-06-28 | 2023-08-29 | 哈尔滨市科佳通用机电股份有限公司 | Railway train image registration method and equipment based on SuperPoint algorithm |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116664643A (en) * | 2023-06-28 | 2023-08-29 | 哈尔滨市科佳通用机电股份有限公司 | Railway train image registration method and equipment based on SuperPoint algorithm |
CN116664643B (en) * | 2023-06-28 | 2024-08-13 | 哈尔滨市科佳通用机电股份有限公司 | Railway train image registration method and equipment based on SuperPoint algorithm |
Similar Documents
Publication | Title
---|---
CN110738697B (en) | Monocular depth estimation method based on deep learning
Jiang et al. | Edge-enhanced GAN for remote sensing image superresolution
CN110335290B (en) | Twin candidate region generation network target tracking method based on attention mechanism
CN110533721B (en) | Indoor target object 6D attitude estimation method based on enhanced self-encoder
CN111612807B (en) | Small target image segmentation method based on scale and edge information
CN108416266B (en) | Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN108921926B (en) | End-to-end three-dimensional face reconstruction method based on single image
CN110136062B (en) | Super-resolution reconstruction method combining semantic segmentation
CN110070091B (en) | Semantic segmentation method and system based on dynamic interpolation reconstruction and used for street view understanding
CN114187450A (en) | Remote sensing image semantic segmentation method based on deep learning
CN112560865B (en) | Semantic segmentation method for point cloud under outdoor large scene
CN114463492B (en) | Self-adaptive channel attention three-dimensional reconstruction method based on deep learning
CN111899295B (en) | Monocular scene depth prediction method based on deep learning
CN115082293A (en) | Image registration method based on Swin transducer and CNN double-branch coupling
CN115359372A (en) | Unmanned aerial vehicle video moving object detection method based on optical flow network
CN113538527B (en) | Efficient lightweight optical flow estimation method, storage medium and device
CN114724155A (en) | Scene text detection method, system and equipment based on deep convolutional neural network
CN116563682A (en) | Attention scheme and strip convolution semantic line detection method based on depth Hough network
CN114677479A (en) | Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN113436237A (en) | High-efficient measurement system of complicated curved surface based on gaussian process migration learning
CN112581423A (en) | Neural network-based rapid detection method for automobile surface defects
CN115546273A (en) | Scene structure depth estimation method for indoor fisheye image
CN115115860A (en) | Image feature point detection matching network based on deep learning
CN117975469A (en) | Document image shape correction method and system based on deep learning
CN117593187A (en) | Remote sensing image super-resolution reconstruction method based on meta-learning and transducer
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination