Disclosure of Invention
The invention aims to provide a single-view image camera calibration method with high reliability, good accuracy and good practicability.
The second purpose of the invention is to provide a system for realizing the single-view image camera calibration method.
The single-view image camera calibration method provided by the invention comprises the following steps:
S1, acquiring an existing image data set;
S2, extracting feature vectors corresponding to the images based on the geometric features of the images from the image data set obtained in the step S1, so as to construct a training data set;
S3, constructing a single-view image camera calibration preliminary model comprising an encoder network and a decoder network;
The encoder network is used for extracting geometric features and context information in the input image and adaptively focusing on key areas in the input image;
The decoder network is used for mapping the output characteristics of the encoder network and the characteristics of the input image to the parameter space of the single-view image camera so as to predict the internal parameters and the external parameters of the single-view image camera;
S4, training the single-view image camera calibration preliminary model constructed in the step S3 by adopting the training data set constructed in the step S2 to obtain a single-view image camera calibration model;
S5, adopting the single-view image camera calibration model constructed in the step S4 to finish parameter calibration of the target single-view image camera.
The step S2 specifically comprises the following steps:
for the image dataset acquired in step S1, the set of line segments present in each image is extracted by using the LSD (Line Segment Detector) algorithm and expressed as $L_i = \{l_1, l_2, \ldots, l_{N_i}\}$, where $I_i$ is the $i$-th image and $N_i$ is the number of detected line segments;
each line segment is represented by its line equation in the image plane, $ax + by + c = 0$, where $(x, y)$ are the image coordinates, $a$ is the component of the line's normal vector in the direction of the x-axis, $b$ is the component of the line's normal vector in the direction of the y-axis, and $c$ is the distance from the line to the origin;
the line parameters are converted into an upper triangular matrix $M$, which is encoded as a 6-dimensional vector $v$ and used as the feature vector of the image.
The step S3 comprises the following steps:
constructing an encoder network based on the spatial domain geometry and the frequency domain geometry;
A decoder network is constructed based on a self-attention mechanism.
The construction of the encoder network specifically comprises the following steps:
The constructed encoder network comprises an encoding module, a frequency domain module, an attention module and a reasoning module which are sequentially connected in series; the input of the encoder network is an image in the training dataset;
The encoding module is used for carrying out preliminary feature extraction and representation learning on the input image; the encoding module comprises a first encoder layer, a second encoder layer, a third encoder layer and a fourth encoder layer; the input of the first encoder layer is added to its output to form the input of the second encoder layer; the input of the second encoder layer is added to its output to form the input of the third encoder layer; the input of the third encoder layer is added to its output to form the input of the fourth encoder layer; the input of the fourth encoder layer is added to its output to form the output of the encoding module; the four encoder layers have the same structure, each comprising a first convolution layer, a second convolution layer, a third convolution layer, a batch normalization layer and a ReLU activation function layer;
The frequency domain module is used for extracting the geometric feature distribution of the image at different frequencies; the frequency domain module comprises a first normalization layer, a fast Fourier transform layer, a complex parameter layer, an inverse fast Fourier transform layer and a second normalization layer which are sequentially connected in series; the output of the encoding module is normalized by the first normalization layer and then transformed by the fast Fourier transform layer; the output of the fast Fourier transform layer is processed by the complex parameter layer to control the importance of different frequency components; the output of the complex parameter layer is transformed back by the inverse fast Fourier transform layer and finally normalized by the second normalization layer to obtain the output of the frequency domain module; the processing of the complex parameter layer is expressed as $F' = w \cdot (F_{re} + i \cdot F_{im})$, where $w$ represents a weight parameter, $F_{re}$ represents the real tensor after the Fourier transform, $F_{im}$ represents the imaginary tensor after the Fourier transform, and $i$ represents the imaginary unit;
The attention module is used for extracting and enhancing the geometric features in the image that are relevant to single-view image camera calibration; the attention module comprises a linear transformation layer, a first scaled dot-product layer and a geometric feature perception layer which are sequentially connected in series; the linear transformation layer is composed of a fully connected layer and is used for linearly transforming the input feature data; the first scaled dot-product layer is used to capture the relative importance of and correlation between features, and its processing is expressed as $A = S\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$, where $Q$ and $K$ represent the matrices obtained from the linear transformation layer, $d_k$ represents the dimension of $K$, and $S$ represents the softmax function; the geometric feature perception layer is used for extracting a geometric feature representation of the input features; it comprises a first branch, a second branch and a GeLU activation function sublayer; the input features pass through the first branch and the second branch respectively, the outputs of the two branches are added, processed by the GeLU activation function sublayer, and multiplied by the output of the first scaled dot-product layer to obtain the output of the geometric feature perception layer; the first branch and the second branch each comprise a convolution layer, a batch normalization layer and a sigmoid activation function layer sequentially connected in series;
the reasoning module is used for applying a nonlinear transformation to the output features of the attention module and enhancing the feature extraction capacity of the model; the reasoning module comprises a first fully connected layer, a ReLU activation function layer and a second fully connected layer which are sequentially connected in series.
The construction of the decoder network specifically comprises the following steps:
The built decoder network comprises a convolution layer, a first attention layer, a second attention layer and a feedforward neural network layer which are sequentially connected in series; the input of the decoder network is the output of the encoder network and the feature vector corresponding to the image in the training data set;
The convolution layer performs a convolution on the input of the decoder network;
The first attention layer and the second attention layer have the same structure, each comprising a linear projection sublayer, a second scaled dot-product sublayer, a softmax sublayer and a third scaled dot-product sublayer; the linear projection sublayer is used for linearly projecting the input features to obtain a query matrix $Q$, a key matrix $K$ and a value matrix $V$; the second scaled dot-product sublayer is used for calculating the correlation between $Q$ and $K$, and its processing is expressed as $\frac{QK^{T}}{\sqrt{d_k}}$, where $Q$ and $K$ represent the matrices obtained by the linear projection sublayer and $d_k$ represents the dimension of $K$; the softmax sublayer is used for normalizing the input data; the third scaled dot-product sublayer is used for extracting global context information, and its processing is expressed as $S\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$, where $V$ represents the matrix obtained by the linear projection sublayer;
The feedforward neural network layer transforms the obtained global context information into the internal and external parameter spaces of the single-view image camera, thereby realizing the parameter calibration of the single-view image camera.
The training of step S4 specifically includes the following steps:
the following formula is adopted as the logarithmic space loss function $L_{log}$: $L_{log} = \min\left(\left|\log\frac{\hat{in}}{w} - \log\frac{in}{w}\right|,\ \epsilon\right)$, where $w$ is the width of the image; $in$ is the true value of the internal parameter of the single-view image camera; $\hat{in}$ is the predicted value of the internal parameter of the single-view image camera; $\epsilon$ is a set threshold;
when predicting the vanishing point constraint of the camera external parameters, the following formula is adopted as the first similarity loss function $L_{vp}$: $L_{vp} = 1 - \frac{vp \cdot \hat{vp}}{\|vp\|\,\|\hat{vp}\|}$, where $vp$ is the true value of the vanishing point coordinates; $\hat{vp}$ is the predicted value of the vanishing point coordinates; $\|\cdot\|$ is the norm of the vector;
when predicting the horizon loss of the camera external parameters, the following formula is adopted as the second similarity loss function $L_{hor}$: $L_{hor} = \frac{1}{n}\sum_{k=1}^{n}\left\|f_k(hor) - f_k(\hat{hor})\right\|_1$, where $n$ is the number of selected endpoints; $hor$ is the true value of the horizon; $\hat{hor}$ is the predicted value of the horizon; $f_k(\cdot)$ is the function computing the coordinates of the left and right endpoints of the horizon; $\|\cdot\|_1$ is the Manhattan distance of the vector;
Finally, the total loss function $L$ is constructed as $L = \alpha L_{log} + \beta L_{vp} + \gamma L_{hor}$, where $\alpha$ is the first weight, $\beta$ is the second weight, and $\gamma$ is the third weight.
The invention also provides a system for realizing the single-view image camera calibration method, which comprises a data acquisition module, a training set construction module, a model construction module, a model training module and a camera calibration module that are sequentially connected in series. The data acquisition module is used for acquiring the existing image dataset and uploading the data information to the training set construction module; the training set construction module is used for extracting the feature vectors corresponding to the images from the acquired image dataset according to the received data information, thereby constructing a training dataset, and uploading the data information to the model construction module; the model construction module is used for constructing a single-view image camera calibration preliminary model comprising an encoder network and a decoder network according to the received data information, and uploading the data information to the model training module; the model training module is used for training the constructed preliminary model with the constructed training dataset according to the received data information to obtain a single-view image camera calibration model, and uploading the data information to the camera calibration module; and the camera calibration module is used for completing the parameter calibration of the target single-view image camera with the constructed calibration model according to the received data information.
According to the single-view image camera calibration method and system, feature vectors are extracted based on the geometric features of the images, so the geometric features related to the targets in the images are effectively captured, richer and more targeted feature information is provided for the training of the model, and the constructed calibration network is guided accordingly; the scheme of the invention can therefore obtain more accurate calibration results in scenes without preset markers, and the method has higher reliability, better accuracy and better practicability.
Detailed Description
FIG. 1 is a schematic flow chart of the calibration method of the present invention: the invention discloses a single-view image camera calibration method, which comprises the following steps:
S1, acquiring an existing image data set;
In a specific implementation, the Google Street View dataset and the HLW (Horizon Lines in the Wild) dataset can be used; these datasets comprise pictures together with the parameters of the corresponding single-view image camera;
S2, extracting feature vectors corresponding to the images based on the geometric features of the images from the image data set obtained in the step S1, so as to construct a training data set; the method specifically comprises the following steps:
for the image dataset acquired in step S1, the set of line segments present in each image is extracted by using the LSD (Line Segment Detector) algorithm and expressed as $L_i = \{l_1, l_2, \ldots, l_{N_i}\}$, where $I_i$ is the $i$-th image and $N_i$ is the number of detected line segments;
each line segment is represented by its line equation in the image plane, $ax + by + c = 0$, where $(x, y)$ are the image coordinates, $a$ is the component of the line's normal vector in the direction of the x-axis, $b$ is the component of the line's normal vector in the direction of the y-axis, and $c$ is the distance from the line to the origin;
the line parameters are converted into an upper triangular matrix $M$, which is encoded as a 6-dimensional vector $v$ and used as the feature vector corresponding to the image;
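As a minimal sketch of this feature-extraction step, the following Python code detects line segments with OpenCV's LSD binding and encodes them as a 6-dimensional vector; the Gram-matrix accumulation and Cholesky factorization used here to obtain the upper triangular matrix are an assumption consistent with the description above, not necessarily the exact construction intended in the disclosure.

import cv2
import numpy as np

def image_feature_vector(image_path: str) -> np.ndarray:
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # LSD binding; note it was absent from some OpenCV 4.x builds and restored in 4.5.1+
    lsd = cv2.createLineSegmentDetector()
    segments = lsd.detect(gray)[0]                # shape (N, 1, 4): x1, y1, x2, y2
    if segments is None:
        return np.zeros(6)
    M = np.zeros((3, 3))
    for x1, y1, x2, y2 in segments.reshape(-1, 4):
        # homogeneous line (a, b, c) through the endpoints, normalized so that
        # (a, b) is a unit normal and |c| is the distance from the line to the origin
        l = np.cross([x1, y1, 1.0], [x2, y2, 1.0])
        l /= np.hypot(l[0], l[1]) + 1e-12
        M += np.outer(l, l)                       # 3x3 symmetric line statistics
    # Cholesky factor M = U^T U is upper triangular; its 6 nonzero entries
    # form the feature vector of the image
    U = np.linalg.cholesky(M + 1e-6 * np.eye(3)).T
    return U[np.triu_indices(3)]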
S3, constructing a single-view image camera calibration preliminary model comprising an encoder network and a decoder network;
The encoder network is used for extracting geometric features and context information from the input image and adaptively focusing on key areas of the input image; the key areas are regions containing salient geometric cues;
The decoder network is used for mapping the output characteristics of the encoder network and the characteristics of the input image to the parameter space of the single-view image camera so as to predict the internal parameters and the external parameters of the single-view image camera;
The specific implementation method comprises the following steps:
Constructing an encoder network based on the spatial domain geometry and the frequency domain geometry; the method specifically comprises the following steps:
The constructed encoder network comprises an encoding module, a frequency domain module, an attention module and a reasoning module which are sequentially connected in series; the input of the encoder network is an image in the training dataset;
The encoding module is used for carrying out preliminary feature extraction and representation learning on the input image; the encoding module comprises a first encoder layer, a second encoder layer, a third encoder layer and a fourth encoder layer; the input of the first encoder layer is added to its output to form the input of the second encoder layer; the input of the second encoder layer is added to its output to form the input of the third encoder layer; the input of the third encoder layer is added to its output to form the input of the fourth encoder layer; the input of the fourth encoder layer is added to its output to form the output of the encoding module; the four encoder layers have the same structure, each comprising a first convolution layer, a second convolution layer, a third convolution layer, a batch normalization layer and a ReLU activation function layer;
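A minimal PyTorch sketch of one encoder layer and the four-layer encoding module with the additive skip connections described above; the kernel sizes (1x1, 3x3, 1x1) and the channel count are assumptions, since the original sizes are not preserved in the text.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),             # first convolution layer (assumed 1x1)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),  # second convolution layer (assumed 3x3)
            nn.Conv2d(channels, channels, kernel_size=1),             # third convolution layer (assumed 1x1)
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)   # input + output of the layer feeds the next layer

class EncodingModule(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.layers = nn.ModuleList([EncoderLayer(channels) for _ in range(4)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x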
The frequency domain module is used for extracting the geometric feature distribution of the image at different frequencies, where geometric features can exhibit characteristic distributions in the frequency domain; the frequency domain module comprises a first normalization layer, a fast Fourier transform layer, a complex parameter layer, an inverse fast Fourier transform layer and a second normalization layer which are sequentially connected in series; the output of the encoding module is normalized by the first normalization layer and then transformed by the fast Fourier transform layer; the output of the fast Fourier transform layer is processed by the complex parameter layer to control the importance of different frequency components; the output of the complex parameter layer is transformed back by the inverse fast Fourier transform layer and finally normalized by the second normalization layer to obtain the output of the frequency domain module; the processing of the complex parameter layer is expressed as $F' = w \cdot (F_{re} + i \cdot F_{im})$, where $w$ represents a weight parameter, $F_{re}$ represents the real tensor after the Fourier transform, $F_{im}$ represents the imaginary tensor after the Fourier transform, and $i$ represents the imaginary unit;
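The frequency domain module can be sketched as follows; treating the complex weight as one learnable parameter per channel and frequency bin is an assumption consistent with the formula $F' = w \cdot (F_{re} + i \cdot F_{im})$ above.

import torch
import torch.nn as nn

class FrequencyDomainModule(nn.Module):
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.norm1 = nn.LayerNorm([channels, height, width])   # first normalization layer
        self.norm2 = nn.LayerNorm([channels, height, width])   # second normalization layer
        # complex weight per channel and frequency bin (rfft2 keeps width//2 + 1 bins)
        self.weight = nn.Parameter(
            torch.randn(channels, height, width // 2 + 1, dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x)
        f = torch.fft.rfft2(x, norm="ortho")        # fast Fourier transform layer
        f = self.weight * (f.real + 1j * f.imag)    # w * (F_re + i * F_im)
        x = torch.fft.irfft2(f, s=x.shape[-2:], norm="ortho")  # inverse fast Fourier transform layer
        return self.norm2(x)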
The attention module is used for extracting and enhancing the geometric features in the image that are relevant to single-view image camera calibration; the attention module comprises a linear transformation layer, a first scaled dot-product layer and a geometric feature perception layer which are sequentially connected in series; the linear transformation layer is composed of a fully connected layer and is used for linearly transforming the input feature data; the first scaled dot-product layer consists of matrix multiplication and scaling operations for obtaining a weight matrix that captures the relative importance of and correlation between features, and its processing is expressed as $A = S\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)$, where $Q$ and $K$ represent the matrices obtained from the linear transformation layer, $d_k$ represents the dimension of $K$, and $S$ represents the softmax function; the geometric feature perception layer is used for extracting a geometric feature representation of the input features; it comprises a first branch, a second branch and a GeLU activation function sublayer; the input features pass through the first branch and the second branch respectively, the outputs of the two branches are added, processed by the GeLU activation function sublayer, and multiplied by the output of the first scaled dot-product layer to obtain the output of the geometric feature perception layer; the first branch and the second branch each comprise a convolution layer, a batch normalization layer and a sigmoid activation function layer sequentially connected in series;
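A sketch of the attention module follows; the branch kernel sizes (3x3 and 5x5) are assumptions, and because the text does not preserve how the attention map is combined with the spatial gate, the map is collapsed here to a per-pixel saliency map before the multiplication.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def branch(channels: int, k: int) -> nn.Sequential:
    # convolution -> batch normalization -> sigmoid, as described above
    return nn.Sequential(
        nn.Conv2d(channels, channels, k, padding=k // 2),
        nn.BatchNorm2d(channels),
        nn.Sigmoid(),
    )

class AttentionModule(nn.Module):
    def __init__(self, channels: int, d_k: int = 64):
        super().__init__()
        self.to_q = nn.Linear(channels, d_k)   # linear transformation layer
        self.to_k = nn.Linear(channels, d_k)
        self.branch1 = branch(channels, 3)     # assumed 3x3
        self.branch2 = branch(channels, 5)     # assumed 5x5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                    # (B, HW, C)
        q, k = self.to_q(tokens), self.to_k(tokens)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.shape[-1]), -1)  # A = S(QK^T / sqrt(d_k))
        attn_map = attn.mean(-1).reshape(b, 1, h, w)             # collapse to a spatial map (assumption)
        gate = F.gelu(self.branch1(x) + self.branch2(x))         # GeLU(branch1 + branch2)
        return gate * attn_map                                   # multiply with the scaled dot-product output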
The reasoning module is used for applying a nonlinear transformation to the output features of the attention module and enhancing the feature extraction capacity of the model; the reasoning module comprises a first fully connected layer, a ReLU activation function layer and a second fully connected layer which are sequentially connected in series;
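Reusing the EncodingModule, FrequencyDomainModule and AttentionModule classes sketched above, the encoder chains the four modules, with the reasoning module as the fully-connected / ReLU / fully-connected head; all dimensions are placeholders.

import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, channels: int = 64, h: int = 32, w: int = 32, d_out: int = 256):
        super().__init__()
        self.encode = EncodingModule(channels)
        self.freq = FrequencyDomainModule(channels, h, w)
        self.attend = AttentionModule(channels)
        self.reason = nn.Sequential(                 # reasoning module
            nn.Linear(channels * h * w, d_out),
            nn.ReLU(inplace=True),
            nn.Linear(d_out, d_out),
        )

    def forward(self, x):
        x = self.attend(self.freq(self.encode(x)))
        return self.reason(x.flatten(1))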
constructing a decoder network based on a self-attention mechanism; the method specifically comprises the following steps:
The built decoder network comprises a convolution layer, a first attention layer, a second attention layer and a feedforward neural network layer which are sequentially connected in series; the input of the decoder network is the output of the encoder network and the feature vector corresponding to the image in the training data set;
The convolution layer performs a convolution on the input of the decoder network;
The first attention layer and the second attention layer have the same structure, each comprising a linear projection sublayer, a second scaled dot-product sublayer, a softmax sublayer and a third scaled dot-product sublayer; the linear projection sublayer is used for linearly projecting the input features to obtain a query matrix $Q$, a key matrix $K$ and a value matrix $V$; the second scaled dot-product sublayer is used for calculating the correlation between $Q$ and $K$, and its processing is expressed as $\frac{QK^{T}}{\sqrt{d_k}}$, where $Q$ and $K$ represent the matrices obtained by the linear projection sublayer and $d_k$ represents the dimension of $K$; the softmax sublayer is used for normalizing the input data; the third scaled dot-product sublayer is used for extracting global context information, and its processing is expressed as $S\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V$, where $V$ represents the matrix obtained by the linear projection sublayer;
the feedforward neural network layer transforms the obtained global context information into the internal and external parameter spaces of the single-view image camera, thereby realizing the parameter calibration of the single-view image camera;
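A corresponding decoder sketch is given below; treating the encoder output combined with the image feature vector as a token sequence, the 1x1 convolution kernel, and the number of regressed parameters are all assumptions.

import math
import torch
import torch.nn as nn

class DecoderAttentionLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 3 * dim)   # linear projection sublayer -> Q, K, V

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, dim)
        q, k, v = self.proj(x).chunk(3, dim=-1)
        scores = q @ k.transpose(1, 2) / math.sqrt(k.shape[-1])  # second scaled dot product
        return torch.softmax(scores, dim=-1) @ v                 # softmax, then third scaled dot product

class Decoder(nn.Module):
    def __init__(self, dim: int = 256, n_params: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=1)           # assumed 1x1 convolution
        self.attn1 = DecoderAttentionLayer(dim)
        self.attn2 = DecoderAttentionLayer(dim)
        self.ffn = nn.Sequential(                                # feedforward neural network layer
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, n_params))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:    # tokens: (B, N, dim)
        x = self.conv(tokens.transpose(1, 2)).transpose(1, 2)
        x = self.attn2(self.attn1(x))
        return self.ffn(x.mean(dim=1))    # pooled -> internal and external parameters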
S4, training the single-view image camera calibration preliminary model constructed in the step S3 by adopting the training data set constructed in the step S2 to obtain a single-view image camera calibration model; the training specifically comprises the following steps:
the following formula is adopted as the logarithmic space loss function $L_{log}$: $L_{log} = \min\left(\left|\log\frac{\hat{in}}{w} - \log\frac{in}{w}\right|,\ \epsilon\right)$, where $w$ is the width of the image; $in$ is the true value of the internal parameter of the single-view image camera; $\hat{in}$ is the predicted value of the internal parameter of the single-view image camera; $\epsilon$ is a set threshold;
when predicting the vanishing point constraint of the camera external parameters, the following formula is adopted as the first similarity loss function $L_{vp}$: $L_{vp} = 1 - \frac{vp \cdot \hat{vp}}{\|vp\|\,\|\hat{vp}\|}$, where $vp$ is the true value of the vanishing point coordinates; $\hat{vp}$ is the predicted value of the vanishing point coordinates; $\|\cdot\|$ is the norm of the vector;
when predicting the horizon loss of the camera external parameters, the following formula is adopted as the second similarity loss function $L_{hor}$: $L_{hor} = \frac{1}{n}\sum_{k=1}^{n}\left\|f_k(hor) - f_k(\hat{hor})\right\|_1$, where $n$ is the number of selected endpoints; $hor$ is the true value of the horizon; $\hat{hor}$ is the predicted value of the horizon; $f_k(\cdot)$ is the function computing the coordinates of the left and right endpoints of the horizon; $\|\cdot\|_1$ is the Manhattan distance of the vector;
finally, the total loss function $L$ is constructed as $L = \alpha L_{log} + \beta L_{vp} + \gamma L_{hor}$, where $\alpha$ is the first weight, $\beta$ is the second weight, and $\gamma$ is the third weight;
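The three losses and their weighted sum can be sketched as follows; the exact functional forms are reconstructions from the definitions above (a thresholded log-space intrinsics loss, a cosine-style vanishing point loss, and a Manhattan-distance horizon endpoint loss), not a verbatim formulation from the disclosure.

import torch
import torch.nn.functional as F

def log_space_loss(in_pred, in_true, w, eps=2.0):
    # |log(in_pred / w) - log(in_true / w)| clipped at the set threshold eps
    return torch.clamp((torch.log(in_pred / w) - torch.log(in_true / w)).abs(), max=eps).mean()

def vanishing_point_loss(vp_pred, vp_true):
    # 1 - cosine similarity between predicted and true vanishing point vectors
    return (1.0 - F.cosine_similarity(vp_pred, vp_true, dim=-1)).mean()

def horizon_loss(hor_pred_pts, hor_true_pts):
    # hor_*_pts: (n, 2, 2) left/right endpoint coordinates; Manhattan distance per endpoint
    return (hor_pred_pts - hor_true_pts).abs().sum(dim=-1).mean()

def total_loss(l_log, l_vp, l_hor, alpha=1.0, beta=1.0, gamma=1.0):
    # L = alpha * L_log + beta * L_vp + gamma * L_hor
    return alpha * l_log + beta * l_vp + gamma * l_hor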
S5, adopting the single-view image camera calibration model constructed in the step S4 to finish parameter calibration of the target single-view image camera.
By perceiving the geometric cues in the image, the scheme of the invention can predict the internal and external parameters of the camera without requiring markers, so the method can be used in a wider range of scenes and has better practicability and higher accuracy.
FIG. 2 is a visual schematic diagram of the calibration method of the present invention:
Fig. 2 (a) corresponds to the feature representation learned for vanishing points; the attention of the network is concentrated on the parallel lines on both sides of the street, and the network extracts geometric information closely related to the vanishing points by capturing these parallel line segments and their directions of extension. By exploiting the convergence trend of the parallel lines, the network effectively learns a feature representation related to vanishing points;
FIG. 2 (b) corresponds to the feature representation learned for the horizon; the heat map highlights the contours of building bottoms and the boundary between the sky and the ground; by capturing the distinction between building bottoms and ground objects, the network is able to locate the horizon in the image using this representation of geometric boundaries;
FIG. 2 (c) corresponds to the feature representation learned for the internal parameters, with the heat map concentrated near the center of the image in the form of concentric circles; this feature distribution reflects geometric properties related to perspective projection, and by capturing features around the image center the network obtains important constraints for computing the internal parameters.
As can be seen from fig. 2, the network of the present invention is able to perceive the geometric cues corresponding to different prediction targets and adaptively focuses on the key geometric cues in the image for each specific prediction target, so that more accurate calibration results can be obtained while retaining a degree of interpretability.
Fig. 3 is a schematic diagram of the calibration results of the calibration method of the present invention, showing the comparison of the experimental results with the true values as well as the prediction results for the horizon and the vanishing point; the green line represents the predicted horizon, the red line in the first row represents the true horizon, and the second row shows the vanishing point predictions; as can be seen from FIG. 3, the results of the calibration method of the present invention are generally close to the true values and show high accuracy and consistency, which again illustrates the effectiveness and superiority of the calibration method of the present invention.
The effect of the calibration method of the present invention is further described below with reference to the examples:
The method provided by the invention is compared with existing calibration methods (a traditional method and the methods of 2018, 2020 and 2021); the traditional method is the method proposed by Hyunjoon Lee in "Automatic Upright Adjustment of Photographs With Robust Camera Calibration" (2014); the 2018 method is the method proposed by Yannick Hold-Geoffroy in "A Perceptual Measure for Deep Single Image Camera Calibration" (2018); the 2020 method is the method proposed by Jinwoo Lee in "Neural Geometric Parser for Single Image Camera Calibration" (2020); the 2021 method is the method proposed by Jinwoo Lee in "CTRL-C: Camera Calibration TRansformer with Line-Classification" (2021).
For this comparison, experiments were performed on the public Google Street View dataset, which contains 13,214 training images and 1,333 test images. The mean error of internal parameter calibration, the mean error of vanishing point prediction and the AUC value of horizon prediction were taken as evaluation criteria. All experimental results were obtained on the test set.
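The AUC criterion for horizon prediction is not defined in the text; a common definition from the horizon-estimation literature (area under the fraction-of-images curve for normalized horizon errors up to 0.25) is sketched below as an assumption.

import numpy as np

def horizon_auc(errors: np.ndarray, max_err: float = 0.25) -> float:
    # errors: per-image maximum horizon deviation, normalized by image height
    thresholds = np.linspace(0.0, max_err, 100)
    fraction = [(errors <= t).mean() for t in thresholds]
    return np.trapz(fraction, thresholds) / max_err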
Specific comparative data are shown in Table 1.
Table 1: Comparative experimental data
As can be seen from the experimental results in Table 1, the calibration method provided by the invention obtains optimal results on various indexes, is superior to the traditional method and other deep learning methods, and proves the effectiveness and superiority of the method provided by the invention.
FIG. 4 is a schematic diagram of the functional modules of the system of the present invention: the system for realizing the single-view image camera calibration method comprises a data acquisition module, a training set construction module, a model construction module, a model training module and a camera calibration module that are sequentially connected in series. The data acquisition module is used for acquiring the existing image dataset and uploading the data information to the training set construction module; the training set construction module is used for extracting the feature vectors corresponding to the images from the acquired image dataset according to the received data information, thereby constructing a training dataset, and uploading the data information to the model construction module; the model construction module is used for constructing a single-view image camera calibration preliminary model comprising an encoder network and a decoder network according to the received data information, and uploading the data information to the model training module; the model training module is used for training the constructed preliminary model with the constructed training dataset according to the received data information to obtain a single-view image camera calibration model, and uploading the data information to the camera calibration module; and the camera calibration module is used for completing the parameter calibration of the target single-view image camera with the constructed calibration model according to the received data information.
In addition, the single-view image camera calibration method and system can be applied to a single-view image camera, the camera being calibrated by means of the single-view image camera calibration method. In a specific application, the single-view image camera calibration model obtained after training is integrated into the single-view image camera; the camera takes an arbitrary photograph, the photograph is input into the calibration model, the model calibrates the camera parameters from the input photograph, and the calibrated parameters are fed back to the camera to complete the calibration process.
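A minimal usage sketch of this application follows; the file names and the loaded model object are placeholders referring to the sketches above, not actual artifacts of the disclosure.

import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

# load the trained calibration model (placeholder path; full-module load)
model = torch.load("single_view_calibration_model.pt", weights_only=False)
model.eval()

# an arbitrary photograph taken by the camera
image = to_tensor(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    params = model(image)    # predicted internal and external parameters
print("calibrated parameters:", params)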