
CN113763470B - RGBD visual-inertial simultaneous localization and mapping with point-line feature fusion - Google Patents

RGBD visual-inertial simultaneous localization and mapping with point-line feature fusion

Info

Publication number
CN113763470B
CN113763470B (application CN202110914560.8A)
Authority
CN
China
Prior art keywords
line
point
line segment
segments
image
Prior art date
Legal status
Active
Application number
CN202110914560.8A
Other languages
Chinese (zh)
Other versions
CN113763470A (en)
Inventor
赵良玉
朱叶青
金瑞
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority claimed from CN202110914560.8A
Publication of CN113763470A
Application granted
Publication of CN113763470B
Legal status: Active


Classifications

    • G06T 7/74 — Image analysis; determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G06F 18/253 — Pattern recognition; fusion techniques of extracted features
    • G06T 17/05 — Three-dimensional [3D] modelling; geographic models
    • G06T 7/248 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06T 7/269 — Analysis of motion using gradient-based methods
    • G06T 2207/10024 — Image acquisition modality: colour image
    • G06T 2207/10028 — Image acquisition modality: range image; depth image; 3D point clouds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Remote Sensing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, apparatus, device, and medium for RGBD visual-inertial simultaneous localization and mapping (SLAM) based on point-line feature fusion. The method solves the problems of efficient state estimation and high-precision three-dimensional map construction for autonomous robots in indoor scenes with illumination changes and low texture, and offers advantages such as high accuracy and high efficiency.

Description

RGBD visual-inertial simultaneous localization and mapping with point-line feature fusion
Technical Field
The invention relates to a system and method for RGBD visual-inertial simultaneous localization and mapping based on point-line feature fusion, and belongs to the technical field of simultaneous localization and mapping (SLAM) for autonomous robots.
Background
One of the goals of robotics is autonomous operation in the real world, and simultaneous localization and mapping (SLAM) is a key technology for autonomous robots.
Efficient state estimation and accurate three-dimensional map construction for autonomous robots in indoor scenes with illumination changes and weak texture face significant challenges. In such scenes, visual-inertial SLAM methods based on point features suffer a large drop in localization accuracy because few effective feature points can be extracted, and the system may even fail completely under poor texture, image degradation, or motion blur.
Most traditional line-feature detection methods adopt the EDLines algorithm as the line detection tool, but with complex backgrounds or noisy images this algorithm tends to detect an excessive number of short line segments, which wastes computational resources on segment detection, description, and matching and noticeably degrades localization accuracy. In addition, the algorithm often produces too many adjacent similar segments and splits long segments into several short ones, which complicates subsequent segment matching and increases the uncertainty of the SLAM system.
Three-dimensional dense point-cloud map construction depends heavily on the accuracy of pose estimation; if the pose estimate is not accurate enough, the constructed dense point-cloud map is prone to overlap and distortion.
For these reasons, the inventors conducted in-depth research on existing methods for autonomous-robot state estimation and dense point-cloud map construction, and propose an RGBD visual-inertial simultaneous localization and mapping system and method based on point-line feature fusion.
Disclosure of Invention
To overcome the above problems, the inventors designed an RGBD visual-inertial simultaneous localization and mapping method based on point-line feature fusion, which comprises a front-end visual odometry process, a back-end optimization process, and a three-dimensional mapping process. Further, in the front-end visual odometry process, the input to the visual-inertial simultaneous localization and mapping system is the images acquired by an RGBD camera and the measurements of an IMU.
Further, the front-end visual odometry is based on point and line features, and includes feature detection and matching, IMU pre-integration, and visual-inertial alignment.
The feature detection and matching comprises the extraction and tracking of point features and the extraction and tracking of line features.
Further, the extraction and tracking of line features comprises the sub-steps of: S11, line segment length suppression; S12, line feature extraction through near-line merging and short-line splicing; S13, line feature tracking by an optical flow method. Preferably, step S13 comprises the sub-steps of: S131, matching anchor points; S132, matching points to lines; S133, matching lines to lines. Preferably, the front-end visual odometry further includes step S14, point and line feature parameterization, in which error constraints are imposed on the line features.
Preferably, the IMU pre-integration value is taken as a constraint between two consecutive camera frame images.
Preferably, in the back-end optimization process, all state variables in the sliding window are optimized by minimizing the sum of all measurement residuals.
The invention also provides an RGBD visual-inertial simultaneous localization and mapping apparatus based on point-line feature fusion, which comprises a front-end visual odometry module, a back-end optimization module, and a three-dimensional mapping module.
The front-end visual-inertial odometry module comprises a feature detection and matching sub-module, an IMU pre-integration sub-module, and a visual-inertial alignment sub-module; the IMU pre-integration sub-module takes the IMU pre-integration value as a constraint between two frames, and the back-end optimization module optimizes all state variables in a sliding window by minimizing the sum of all measurement residuals.
The invention also provides an electronic device, comprising: at least one processor; and
A memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the methods described above.
The invention also provides a computer readable storage medium storing computer instructions for causing the computer to perform any one of the methods described above.
The invention has the beneficial effects that:
(1) The RGBD visual-inertial simultaneous localization and mapping system based on point-line feature fusion solves the problem of large front-end data-association errors of a point-feature-based visual-inertial odometry in weak-texture scenes by introducing line features, effectively improving the accuracy of motion estimation between adjacent frames in such scenes; (2) by using the length suppression, near-line merging, and broken-line splicing strategies, the system corrects the low quality of line segment extraction in traditional algorithms while keeping extraction fast, and reduces the line-feature mismatching rate; (3) tracking line features with an optical flow method significantly improves the efficiency of the line-feature tracking algorithm; (4) by constructing a high-precision dense point-cloud map, the system achieves efficient state estimation and accurate three-dimensional point-cloud reconstruction in indoor scenes with illumination changes and weak texture.
Drawings
FIG. 1 is a flow chart of a method for RGBD visual inertia simultaneous localization and mapping in accordance with a preferred embodiment of the present invention;
FIG. 2 shows a comparison of line segment detection before and after length suppression in the RGBD visual-inertial simultaneous localization and mapping method according to a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram of merging multiple nearby line segments into one line segment in the RGBD visual-inertial simultaneous localization and mapping method according to a preferred embodiment of the present invention;
FIG. 4 is a schematic diagram of splicing multiple shorter line segments into one long line segment in the RGBD visual-inertial simultaneous localization and mapping method according to a preferred embodiment of the present invention;
FIG. 5 is a schematic diagram of the process of matching anchor points of the I_0 frame image into the I_1 frame image in the RGBD visual-inertial simultaneous localization and mapping method according to a preferred embodiment of the present invention;
FIG. 6 is a schematic diagram of the process of matching points to lines in the RGBD visual-inertial simultaneous localization and mapping method according to a preferred embodiment of the present invention;
FIG. 7 is a schematic diagram of the process of matching lines to lines in the RGBD visual-inertial simultaneous localization and mapping method with point-line feature fusion according to a preferred embodiment of the present invention;
FIG. 8 is a schematic diagram of the line feature reprojection error in the RGBD visual-inertial simultaneous localization and mapping method according to a preferred embodiment of the present invention;
FIG. 9-a shows the cafe line feature detection result in Example 1; FIG. 9-b shows the cafe point feature detection result in Example 1; FIG. 9-c shows the cafe straight-line detection and tracking result in Example 1;
FIG. 10-a shows the home 1 line feature detection result in example 1; FIG. 10-b shows the results of home 1-point feature detection in example 1; FIG. 10-c shows the results of home 1 straight line detection and tracking in example 1;
FIG. 11-a shows the home 2 line feature detection result in example 1; fig. 11-b shows the home 2-point feature detection result in example 1; FIG. 11-c shows the home 2 straight line detection and tracking results in example 1;
FIG. 12-a shows corridor line feature detection results in example 1; FIG. 12-b shows corridor point feature detection results in example 1; FIG. 12-c shows corridor line detection and tracking results in example 1;
FIG. 13-a shows a comparison of the results in the corridor scenario in example 1 with the true trajectory (Ground truth) and errors; FIG. 13-b shows the results of the corridor scenario in example 1 compared to the true trajectory (Ground truth) and error in top view; FIG. 13-c shows the results of the corridor scenario in example 1 compared with the true trajectory (Ground truth) and error under side view;
FIG. 14 shows the results of Example 2 and Comparative Examples 3 to 5 in scene Lab14; FIG. 15 shows the translational error box plot on scene Lab1; FIG. 16 shows the rotational error box plot on scene Lab1;
FIG. 17 shows the position drift in scene Lab3 for example 2; fig. 18 shows the position drift in the scene Lab6 of example 2; fig. 19 shows three-dimensional maps obtained in comparative example 3 and example 2; fig. 20 shows three-dimensional maps obtained in comparative example 3 and example 2; fig. 21 shows three-dimensional maps obtained in comparative example 3 and example 2.
Detailed Description
The invention is further described in detail below by means of the figures and examples. The features and advantages of the present invention will become more apparent from the description.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
Conventional visual-inertial simultaneous localization and mapping (SLAM) methods are mostly based on point feature detection and description; however, in indoor scenes with illumination changes and weak texture, localization accuracy drops sharply because few effective feature points can be extracted, and the system may even fail completely under poor texture, image degradation, or motion blur.
In view of this, the inventors introduce line features and inertial measurement unit (IMU) data into the traditional SLAM pipeline. Under low-texture conditions, line feature detection is more robust than point feature detection, and inertial measurements continue to arrive at a good rate even when few features are present in the image sequence, so the SLAM result of the robot remains stable under significant viewpoint changes.
The RGBD visual-inertial simultaneous localization and mapping method based on point-line feature fusion comprises a front-end visual odometry process, a back-end optimization process, and a three-dimensional mapping process.
In the front-end visual odometry process, the input to the visual-inertial simultaneous localization and mapping system is the images acquired by an RGBD camera and the measurements of the IMU.
An RGBD camera is a camera that can simultaneously acquire and output RGB and depth information and is one of the cameras commonly used for image recognition; an IMU is a sensing unit commonly mounted on robots to measure the acceleration and angular velocity of the robot's motion.
Further, the front-end visual odometer is based on point and line features, including feature detection and matching, IMU pre-integration, visual inertial alignment, as shown in fig. 1.
The feature detection and matching comprises the extraction and tracking of point features and the extraction and tracking of line features.
In the invention, the method for extracting and tracking point features is not particularly limited; the methods used in traditional SLAM can be adopted. For example, Shi-Tomasi corners are extracted as feature points, the KLT optical flow method is used to track them, and tracks with large discrepancies are rejected based on a reverse (backward) optical flow check.
The Shi-Tomasi corner extraction follows J. Shi and C. Tomasi, "Good features to track," in: 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 1994, pp. 593-600; the KLT optical flow method follows B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in: Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), IEEE, 1981, pp. 24-28; the reverse optical flow method follows S. Baker and I. Matthews, "Lucas-Kanade 20 years on: A unifying framework," International Journal of Computer Vision, 2004, 56(3): 221-255.
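By way of illustration, the following sketch shows how such a point-feature front end can be assembled from standard OpenCV primitives (Shi-Tomasi detection, pyramidal KLT tracking, and a reverse-optical-flow consistency check). It is a minimal sketch, not the implementation of the invention; the parameter values (corner count, window size, the 1-pixel consistency threshold) are illustrative assumptions.

```python
import cv2
import numpy as np

def track_point_features(prev_gray, cur_gray, max_corners=150, fb_thresh=1.0):
    """Detect Shi-Tomasi corners in the previous frame, track them into the
    current frame with pyramidal KLT optical flow, and reject tracks that fail
    a forward-backward (reverse optical flow) consistency check."""
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                       qualityLevel=0.01, minDistance=20)
    if prev_pts is None:
        return np.empty((0, 2)), np.empty((0, 2))

    lk = dict(winSize=(21, 21), maxLevel=3)
    cur_pts, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray, prev_pts, None, **lk)
    back_pts, st_b, _ = cv2.calcOpticalFlowPyrLK(cur_gray, prev_gray, cur_pts, None, **lk)

    # Keep only tracks that return close to their starting position.
    fb_err = np.linalg.norm(prev_pts - back_pts, axis=2).ravel()
    good = (st_f.ravel() == 1) & (st_b.ravel() == 1) & (fb_err < fb_thresh)
    return prev_pts[good].reshape(-1, 2), cur_pts[good].reshape(-1, 2)
```

The returned point pairs would then feed data association and, together with the line matches, the pose estimation described below.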
Compared with a traditional SLAM system, the invention adds line feature extraction and tracking to the front-end visual odometry; the additional constraints provided by line features improve the accuracy of state estimation.
In the prior art, line feature extraction methods include the LSD, FLD, and EDLines methods. The detection speed of LSD and FLD is low and cannot meet the real-time requirements of SLAM; EDLines is faster and more efficient, but in low-texture scenes or noisy images it tends to detect a large number of short line features, wasting computational resources on line detection, description, and matching and noticeably reducing localization accuracy.
In a SLAM system that fuses point and line features, line features only provide additional constraints to improve state-estimation accuracy. The invention therefore improves the EDLines method so that only salient line features in the scene are detected while the original detection efficiency is retained, yielding a more effective line feature detection result.
Specifically, in the present invention, the extraction and tracking of line features includes the sub-steps of:
S11, line segment length suppression; S12, line feature extraction through near-line merging and short-line splicing; S13, line feature tracking by an optical flow method; preferably, after step S12 there may also be a step S14, point and line feature parameterization.
Because longer segments are more stable and can be repeatedly detected across multiple frames, localization accuracy increases with the number of long segments. In step S11, segments whose length is below a threshold are deleted through length suppression, which effectively improves the quality of the extracted segments. A comparison of the detection results before and after length suppression is shown in FIG. 2.
Specifically, the line segment length suppression is expressed as:
len_i = √((x_ep − x_sp)² + (y_ep − y_sp)²) ≥ len_min = ⌈η · min(W_I, H_I)⌉    (1)
where len_i denotes the length of the i-th line segment in the image, len_min denotes the threshold on the shortest line feature length, W_I is the image width, H_I is the image height, the symbol ⌈·⌉ denotes rounding up, η is a scaling factor, (x_sp, y_sp) denotes the start coordinates of the segment, and (x_ep, y_ep) denotes its end coordinates.
In the invention, the minimum of the image width and height is used as one of the key quantities for setting the threshold, so that length suppression adapts to different image sizes. Preferably, the scaling factor η is 0.01 to 0.2, more preferably 0.1; through extensive experiments, the inventors found that this scaling factor best preserves the quality of line segment detection.
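A minimal sketch of the length-suppression rule of Formula (1) follows; it assumes segments are given as (x_sp, y_sp, x_ep, y_ep) tuples, e.g. as returned by a line detector, and uses the preferred η = 0.1.

```python
import math

def suppress_short_segments(segments, img_w, img_h, eta=0.1):
    """Keep only segments whose length reaches len_min = ceil(eta * min(W_I, H_I)).

    segments: iterable of (x_sp, y_sp, x_ep, y_ep) tuples.
    """
    len_min = math.ceil(eta * min(img_w, img_h))
    kept = []
    for x_sp, y_sp, x_ep, y_ep in segments:
        if math.hypot(x_ep - x_sp, y_ep - y_sp) >= len_min:
            kept.append((x_sp, y_sp, x_ep, y_ep))
    return kept
```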
Considering that longer line features tend to come from continuous regions with large gradients, where detection is more reliable, adjacent segments should be merged so that the resulting segments are longer.
In step S12, several nearby segments are combined into one segment (near-line merging), as shown in FIG. 3, and
several shorter segments are spliced into one long segment (short-line splicing), as shown in FIG. 4.
As shown in FIG. 3, in the present invention L_i, L_j, and L_k denote the nearby segments to be merged, each defined by its two endpoints.
As shown in FIG. 4, in the present invention L_u and L_v denote the two segments to be spliced, each defined by its two endpoints.
Preferably, step S12 comprises the following sub-steps for near-line merging and short-line splicing:
S121, sorting the segments; S122, screening the segments; S123, filtering the screened segment groups to obtain a merging candidate segment set and a splicing candidate segment set; S124, merging or splicing the segments.
In step S121, the segments are sorted by length, preferably in descending order from long to short, to obtain the sequence {L_1, L_2, ..., L_n}, where L_1 denotes the longest segment.
Since longer segments tend to come from image regions with continuously strong gradients, it is more reliable to start from the longest segment L_1.
Further, the segments other than L_1 form the remaining segment set
L = {L_2, L_3, ..., L_n}    (2)
where n is the total number of segments.
In step S122, the screening includes an angle screening, which may be expressed as:
L_α = { L_m ∈ L : |θ_1 − θ_m| < θ_min }    (3)
where L_α denotes the candidate segment group obtained by angle screening, L_m denotes a segment in the remaining line feature set, θ_m is the angle of segment L_m with respect to the horizontal direction, θ_1 is the angle of L_1, and θ_min is the angle threshold measuring segment similarity.
The inventors determined through extensive experiments that θ_min = π/90 is optimal.
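As an illustration of the angle screening of Formula (3), the sketch below groups the remaining segments by orientation similarity to the longest segment L_1; the wrap-around handling of orientations near 0/π and the helper names are assumptions of this sketch.

```python
import math

def orientation(seg):
    """Orientation of a segment (x_sp, y_sp, x_ep, y_ep), folded into [0, pi)."""
    x_sp, y_sp, x_ep, y_ep = seg
    return math.atan2(y_ep - y_sp, x_ep - x_sp) % math.pi

def angle_screen(l1, remaining, theta_min=math.pi / 90):
    """Candidate group L_alpha: segments whose orientation differs from that of
    the reference segment L_1 by less than theta_min."""
    theta_1 = orientation(l1)
    candidates = []
    for seg in remaining:
        diff = abs(orientation(seg) - theta_1)
        diff = min(diff, math.pi - diff)   # orientations wrap around at pi
        if diff < theta_min:
            candidates.append(seg)
    return candidates
```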
In step S123, the candidate group L_α is filtered. The filtering for the near-line merging process (Formula 4) retains the segments of L_α whose distance to L_1 is below the threshold d_min, yielding the merging candidate segment set; the filtering for the short-line splicing process (Formula 5) likewise retains the segments of L_α that are sufficiently close to L_1 for end-to-end splicing, yielding the splicing candidate segment set.
In Formulas (4) and (5), d_min denotes the threshold measuring segment proximity; extensive experimental analysis shows it is preferably 5 to 15 pixels, more preferably 9 pixels.
Through this filtering, segments whose angle and spatial distance are close to those of L_1 are collected into one group, giving the candidate segment set L_β = {L_β1, L_β2, ...}.
In step S124, the segment L_1 is added to the filtered candidate set L_β to form a new segment set {L_1, L_β}; the two endpoints in this set that are farthest from each other are selected as the start and end points of a new segment, which is synthesized as L_M.
The angle θ_M of the resulting segment L_M is then recomputed.
If θ_M < θ_min is satisfied, the segment set {L_1, L_β} is merged or spliced and replaced by the synthesized segment L_M;
if θ_M < θ_min is not satisfied, the merge or splice is abandoned, because the angle difference before and after synthesis is too large and the synthesized segment would deviate from the original one.
This process is repeated until all segments in the image have been merged or spliced.
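The synthesis step S124 can be sketched as below: the candidate set {L_1} ∪ L_β is collapsed to the pair of endpoints that are farthest apart, and the result is accepted only if the orientation change stays below θ_min. Interpreting the acceptance test as a bound on the orientation change before and after synthesis is an assumption of this sketch, as is the representation of a segment as an endpoint 4-tuple.

```python
import itertools
import math

def _orientation(seg):
    x_sp, y_sp, x_ep, y_ep = seg
    return math.atan2(y_ep - y_sp, x_ep - x_sp) % math.pi

def synthesize_segment(l1, l_beta, theta_min=math.pi / 90):
    """Merge/splice the group {L_1} + L_beta into a single segment L_M whose
    endpoints are the two mutually farthest endpoints of the group; return None
    when the orientation of L_M deviates too much from that of L_1."""
    endpoints = []
    for x_sp, y_sp, x_ep, y_ep in [l1] + list(l_beta):
        endpoints += [(x_sp, y_sp), (x_ep, y_ep)]

    p, q = max(itertools.combinations(endpoints, 2),
               key=lambda pq: math.dist(pq[0], pq[1]))
    l_m = (p[0], p[1], q[0], q[1])

    diff = abs(_orientation(l_m) - _orientation(l1))
    diff = min(diff, math.pi - diff)
    return l_m if diff < theta_min else None
```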
Step S12 yields a set of segments in each frame. How to determine the correspondence between segments in two consecutive frames, and thereby track them, is one of the problems the invention addresses.
In step S13, the points on the line segment are matched by an optical flow method, and then the matched points are counted to realize the tracking of the line segment.
Optical flow methods are commonly used for tracking point features, and in the present invention, they are modified to enable tracking of line segments.
Specifically, the method comprises the following substeps: s131, matching anchor points; s132, matching points with lines; s133, matching the lines.
In step S131, the anchor points are interval points in the line segments, and one line segment is characterized by a plurality of anchor points.
In the invention, two consecutive frames are denoted I_0 and I_1. The segment set in the I_0 frame is denoted
L_0 = { ^0l_i | ^0l_i = (^0p_i, ^0q_i), 0 ≤ i < M_0 },
where M_0 is the total number of segments in the I_0 frame, ^0l_i is the i-th segment, and ^0p_i, ^0q_i are its two endpoints.
Likewise, the segment set in the I_1 frame is denoted
L_1 = { ^1l_k | ^1l_k = (^1p_k, ^1q_k), 0 ≤ k < M_1 },
where M_1 is the total number of segments in the I_1 frame, ^1l_k is the k-th segment, and ^1p_k, ^1q_k are its two endpoints.
For each segment ^0l_i = (^0p_i, ^0q_i) in the I_0 frame, the start point, the end point, and several points along the direction from start to end are selected as anchor points, as shown in FIG. 5. The anchor set is ^0Q_i = { ^0λ_{i,j} | 0 ≤ j < N_i }, where ^0λ_{i,j} is the j-th anchor on the i-th segment and N_i is the total number of anchors on that segment, preferably 12.
Further, the unit vector along the start-to-end direction is
^0a_i = [^0a_{i,0}, ^0a_{i,1}]^T = (^0q_i − ^0p_i) / ||^0q_i − ^0p_i||.
Further, in the I_1 frame, a matching point ^1λ_{i,j} can be obtained for each anchor ^0λ_{i,j} of the I_0 frame along the normal direction ^0n_i, where the normal direction is
^0n_i = [−^0a_{i,1}, ^0a_{i,0}]^T.
In step S132, within the I_1 frame, the shortest distance from each matching point ^1λ_{i,j} to each segment ^1l_k is computed, giving a distance set; the smallest distance is denoted d_min and the corresponding segment ^1l_{closest(i,j)}, as shown in FIG. 6.
Further, if d_min < d_threshold, the matching point ^1λ_{i,j} is considered to belong to segment ^1l_{closest(i,j)};
otherwise it is not, and the next anchor point is examined;
where d_threshold is an empirical threshold, preferably 1 pixel.
In step S133, as shown in FIG. 7, each anchor point of any segment ^0l_i (0 ≤ i < M_0) in the I_0 frame has a unique matching point ^1λ_{i,j} in the I_1 frame, which falls on segment ^1l_{closest(i,j)}. For segment ^0l_i in the I_0 frame, the number of its anchors matched in the I_1 frame is counted; let Z_i be the number of anchors matched to segment ^1l_{match(i)}. If Z_i / N_i > R, then ^0l_i and ^1l_{match(i)} (0 ≤ match(i) < M_1) are declared to match each other, where R is an empirical threshold, preferably 0.6.
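The anchor-based tracking of steps S131-S133 can be sketched as follows. For simplicity the sketch tracks the sampled anchors with generic pyramidal KLT optical flow rather than the normal-direction search described above, and it uses the point-to-segment distance for the assignment; the thresholds and helper names are illustrative assumptions.

```python
import cv2
import numpy as np

def point_to_segment_distance(pt, seg):
    """Euclidean distance from pt=(x, y) to the segment seg=((x1, y1), (x2, y2))."""
    a, b = np.asarray(seg[0], float), np.asarray(seg[1], float)
    ab, ap = b - a, np.asarray(pt, float) - a
    t = np.clip(np.dot(ap, ab) / max(np.dot(ab, ab), 1e-12), 0.0, 1.0)
    return float(np.linalg.norm(ap - t * ab))

def track_lines(img0, img1, lines0, lines1, n_anchors=12, d_thresh=1.0, ratio=0.6):
    """Match segments of frame I0 to segments of frame I1: sample anchors along
    each I0 segment, track them into I1, assign every tracked anchor to its
    nearest I1 segment (if closer than d_thresh), and declare a segment match
    when one I1 segment collects more than `ratio` of the anchors."""
    matches = {}
    if not lines1:
        return matches
    for i, (p, q) in enumerate(lines0):                     # p, q: (x, y) endpoints
        t = np.linspace(0.0, 1.0, n_anchors)[:, None]
        anchors = (1.0 - t) * np.asarray(p, float) + t * np.asarray(q, float)
        anchors = anchors.astype(np.float32).reshape(-1, 1, 2)

        tracked, st, _ = cv2.calcOpticalFlowPyrLK(img0, img1, anchors, None,
                                                  winSize=(21, 21), maxLevel=3)
        votes = np.zeros(len(lines1), dtype=int)
        for pt, ok in zip(tracked.reshape(-1, 2), st.ravel()):
            if ok:
                d = [point_to_segment_distance(pt, seg) for seg in lines1]
                k = int(np.argmin(d))
                if d[k] < d_thresh:
                    votes[k] += 1
        if votes.max() / n_anchors > ratio:
            matches[i] = int(votes.argmax())
    return matches
```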
The inventors found that although a line-feature-based solution is a good choice in low-texture scenes lacking point features, line occlusion or false detection tends to occur in repetitive or low-texture scenes. In the invention, this problem is addressed through error constraints on the line features.
In step S14, the point-line characteristics are parameterized, and in the parameterization process, error constraints are applied to the line characteristics.
Specifically, in the invention, the errors of the point and line features are constrained as follows. As shown in FIG. 8, the point projection error term is the difference between the observed point and the projection of the corresponding 3D point, and the line projection error term is built from point-to-line distances:
e_point = x_{i,j} − π(T_{iw} X_{w,j}),  e_line = [ d(p_{i,k}, l̂_{i,k}), d(q_{i,k}, l̂_{i,k}) ]^T,
where T_{iw} denotes the pose of the RGBD camera for the i-th frame in the world coordinate system; X_{w,j} denotes the j-th point observed in the i-th frame and x_{i,j} is its projection on the image plane; P_{w,k}, Q_{w,k} denote the two endpoints of the k-th line segment observed in the i-th frame and p_{i,k}, q_{i,k} are the corresponding points on the image plane; l̂_{i,k} is the projected line segment on the image plane through the projections of P_{w,k} and Q_{w,k}; d(p_{i,k}, l̂_{i,k}) is the distance from p_{i,k} to the projected segment and d(q_{i,k}, l̂_{i,k}) is the distance from q_{i,k} to it; π(·) is the camera projection model and π_h(·) its homogeneous-coordinate form. Further:
O = R_w P_w + t_w,  π(P_w) = [ f_x X/Z + c_x,  f_y Y/Z + c_y ]^T with O = [X, Y, Z]^T,
where O is the point in the RGBD camera coordinate system, P_w is the point in the world coordinate system, R_w denotes the coordinate rotation, t_w the coordinate translation, f_x, f_y the focal lengths of the RGBD camera, and c_x, c_y its principal point.
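For reference, a minimal sketch of this projection model (the standard pinhole model, consistent with the symbols defined above) is:

```python
import numpy as np

def project_point(P_w, R_w, t_w, fx, fy, cx, cy):
    """Transform a world point into the camera frame, O = R_w * P_w + t_w,
    then project it with the pinhole model to pixel coordinates."""
    X, Y, Z = np.asarray(R_w) @ np.asarray(P_w, float) + np.asarray(t_w, float)
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])
```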
Further, the camera-frame coordinates O_p corresponding to P_{w,k} and O_q corresponding to Q_{w,k} are obtained by the transformation above.
The Jacobian of the reprojection error of segment l̂_{i,k} with respect to the endpoints P_{w,k}, Q_{w,k} is then derived,
where (O_p^x, O_p^y, O_p^z) are the components of O_p along the x, y, z axes of the camera coordinate system and (O_q^x, O_q^y, O_q^z) are those of O_q;
further,
l_p and l_q denote the vectors from the camera optical center toward the two endpoints, respectively.
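A minimal sketch of the line reprojection error as described above (the distances from the observed endpoints to the line through the projected 3D endpoints) is given below; it is a generic point-line-fusion formulation under the stated camera model, not necessarily the exact parameterization of the invention.

```python
import numpy as np

def line_reprojection_error(p_obs, q_obs, P_w, Q_w, R_w, t_w, fx, fy, cx, cy):
    """Return [d_p, d_q]: signed distances from the observed endpoints p_obs,
    q_obs to the image line through the projected 3D endpoints P_w, Q_w."""
    def project(P):
        X, Y, Z = np.asarray(R_w) @ np.asarray(P, float) + np.asarray(t_w, float)
        return np.array([fx * X / Z + cx, fy * Y / Z + cy])

    p_hat, q_hat = project(P_w), project(Q_w)

    # Homogeneous line through the projected endpoints, normalized so that a dot
    # product with a homogeneous pixel gives a signed point-to-line distance.
    line = np.cross(np.append(p_hat, 1.0), np.append(q_hat, 1.0))
    line = line / np.linalg.norm(line[:2])

    d_p = float(line @ np.append(np.asarray(p_obs, float), 1.0))
    d_q = float(line @ np.append(np.asarray(q_obs, float), 1.0))
    return np.array([d_p, d_q])
```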
In conventional SLAM, most systems adopt BA-based optimization, in which the IMU propagation between frames must update frame velocity and rotation every time the state is adjusted and must be re-integrated after each iteration, making this propagation strategy very time-consuming.
An IMU typically comprises a three-axis accelerometer and a three-axis gyroscope measuring acceleration and angular velocity relative to the inertial frame. Affected by noise, the IMU measurements contain an additive white noise n and a time-varying bias b, and are modeled as
â_t = a_t + b_{a_t} + R_w^t g^w + n_a,  ω̂_t = ω_t + b_{ω_t} + n_ω,
where a_t is the ideal, noise-free value of the three-axis accelerometer, b_{a_t} is the accelerometer bias, R_w^t is the rotation from the world frame to the body frame, g^w is the gravitational acceleration in the world frame, n_a is the accelerometer noise, ω_t is the ideal value of the three-axis gyroscope without bias, b_{ω_t} is the gyroscope bias, and n_ω is the gyroscope noise.
Further, the accelerometer noise n_a and gyroscope noise n_ω are modeled as zero-mean Gaussian white noise: n_a ~ N(0, σ_a²), n_ω ~ N(0, σ_ω²).
Further, the derivatives of the accelerometer bias and the gyroscope bias are modeled as Gaussian white noise (a random walk): ḃ_{a_t} = n_{b_a} ~ N(0, σ_{b_a}²), ḃ_{ω_t} = n_{b_ω} ~ N(0, σ_{b_ω}²).
in the invention, the IMU pre-integration value is used as a constraint between two frames of images and is used together with the camera re-projection residual error in the process of local window optimization.
Specifically, let b_k and b_{k+1} be two consecutive image frames corresponding to times t_k and t_{k+1}. The IMU pre-integration over [t_k, t_{k+1}] provides the constraints
α^{b_k}_{b_{k+1}} = ∬_{t∈[t_k,t_{k+1}]} R^{b_k}_t ( â_t − b_{a_t} ) dt²,
β^{b_k}_{b_{k+1}} = ∫_{t∈[t_k,t_{k+1}]} R^{b_k}_t ( â_t − b_{a_t} ) dt,
γ^{b_k}_{b_{k+1}} = ∫_{t∈[t_k,t_{k+1}]} ½ Ω( ω̂_t − b_{ω_t} ) γ^{b_k}_t dt,
where α^{b_k}_{b_{k+1}} denotes the pre-integrated translation from frame b_k to frame b_{k+1}, β^{b_k}_{b_{k+1}} the pre-integrated velocity, γ^{b_k}_{b_{k+1}} the pre-integrated rotation, R^{b_k}_τ the rotation from time τ to frame b_k, γ^{b_k}_t the rotation from time t to frame b_k, and τ an intermediate time variable in the double integration.
Further, Ω(·) can be represented as
Ω(ω) = [[ −⌊ω⌋_×, ω ], [ −ω^T, 0 ]],
where ⌊ω⌋_× denotes the skew-symmetric (antisymmetric) matrix of ω.
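To make the pre-integration concrete, the sketch below accumulates the α, β and rotation terms from raw IMU samples between two camera frames with simple Euler steps on SO(3); noise-covariance propagation and the bias Jacobians discussed next are deliberately omitted, so this is an illustrative simplification rather than the full scheme.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def preintegrate(acc, gyr, dt, ba, bg):
    """Pre-integrate IMU samples between frames b_k and b_{k+1}.

    acc, gyr: sequences of 3-vectors (accelerometer / gyroscope readings),
    dt: per-sample time steps, ba, bg: current accelerometer / gyroscope biases.
    Returns (alpha, beta, R): pre-integrated position, velocity and rotation
    deltas expressed in the b_k frame."""
    R = np.eye(3)                         # rotation from current sample frame to b_k
    alpha = np.zeros(3)
    beta = np.zeros(3)
    for a_m, w_m, h in zip(acc, gyr, dt):
        a = R @ (np.asarray(a_m) - ba)    # bias-corrected acceleration in b_k frame
        alpha += beta * h + 0.5 * a * h * h
        beta += a * h
        R = R @ Rotation.from_rotvec((np.asarray(w_m) - bg) * h).as_matrix()
    return alpha, beta, R
```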
In the invention, the Jacobians of each pre-integration term with respect to the biases are computed when pre-integration is performed between two consecutive camera frames, so that the effect of bias changes can be corrected. Further, if the estimated bias change is large, the pre-integration is re-propagated with the new bias values. This saves a large amount of computation, improves efficiency, and speeds up localization and map construction.
The visual-inertial alignment adopts a loosely coupled vision-IMU scheme: first, structure from motion (SfM) is used to estimate the camera poses and the spatial positions of the 3D points, and these are then aligned with the IMU pre-integration data to obtain the gravity direction, the scale factor, the gyroscope bias, and the velocity of each frame.
According to the invention, unlike conventional SLAM methods, the back-end optimization optimizes all state variables in the sliding window by minimizing the sum of all measurement residuals, which can be expressed as
min_χ { e_prior + Σ_{i∈χ_b} || r_B( ẑ^{b_i}_{b_{i+1}}, χ ) ||²  +  Σ_{(i,j)∈χ_p} ρ( || r_p( x_{i,j}, χ ) ||² )  +  Σ_{(i,k)∈χ_l} ρ( || r_l( l_{i,k}, χ ) ||² ) },
where r_B(ẑ^{b_i}_{b_{i+1}}, χ) is the IMU measurement residual between frame i and frame i+1 in the body coordinate system, χ_b is the set of all IMU pre-integrations in the sliding window, r_p is the reprojection error of a point feature and r_l that of a line feature, χ_p is the set of point features observed in the frames and χ_l the set of observed line features, e_prior is the prior information, and ρ(·) is the Cauchy function, which suppresses outliers.
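Conceptually, the objective accumulates a prior term, un-robustified IMU residuals, and Cauchy-robustified point and line reprojection residuals, as in the sketch below (covariance weighting and the actual nonlinear least-squares solver are omitted; names are illustrative):

```python
import numpy as np

def cauchy(s, c=1.0):
    """Cauchy robust kernel rho(s) applied to a squared residual s."""
    return c * c * np.log1p(s / (c * c))

def sliding_window_cost(prior_err, imu_residuals, point_residuals, line_residuals):
    """Total cost minimized over the sliding-window states: prior + IMU terms
    plus robustified point and line reprojection terms."""
    cost = float(prior_err)
    cost += sum(float(r @ r) for r in imu_residuals)
    cost += sum(cauchy(float(r @ r)) for r in point_residuals)
    cost += sum(cauchy(float(r @ r)) for r in line_residuals)
    return cost
```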
In the three-dimensional mapping process, the local point cloud is spliced into a final point cloud map by means of the estimated camera pose, which is the same as the traditional SLAM method.
Accurate pose estimation is vital to the accuracy of the point-cloud map; if the pose estimate is not accurate enough, the final point cloud may overlap or warp. In the invention, the feature-tracking node tracks features on the RGB image and obtains range information from the depth image, so the depth image provides an additional source of scale information, improving the robustness and accuracy of the system and making the reconstructed three-dimensional cloud more reliable.
Further, in the invention, three-dimensional points are projected from each depth image and synchronized with the keyframes, so that the corresponding poses are estimated and the cloud map is built. This keeps the system compact and well suited to integration on small robots. Moreover, such a point-cloud map can be used directly for semantic segmentation or recognition, and further for obstacle avoidance and path planning.
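A minimal sketch of this back-projection and stitching step is shown below; the depth scale factor and the 4x4 camera-to-world pose convention T_wc are assumptions of the sketch.

```python
import numpy as np

def depth_to_world_cloud(depth, rgb, T_wc, fx, fy, cx, cy, depth_scale=0.001):
    """Back-project a depth image into 3D camera-frame points, attach RGB
    colors, and transform the points into the world frame with the estimated
    keyframe pose T_wc (4x4). Pixels with zero depth are discarded."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64) * depth_scale
    valid = z > 0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts_c = np.stack([x[valid], y[valid], z[valid], np.ones(valid.sum())], axis=0)
    pts_w = (np.asarray(T_wc) @ pts_c)[:3].T          # N x 3 world coordinates
    colors = rgb[valid]                               # N x 3 colors
    return pts_w, colors
```

Accumulating pts_w over all keyframes with their estimated poses yields the stitched dense point-cloud map.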
On the other hand, the invention also discloses a device which comprises a front-end visual odometer module, a rear-end optimization module and a three-dimensional graph building module.
The front-end visual inertial odometer module comprises a feature detection and matching sub-module, an IMU pre-integration sub-module and a visual inertial alignment sub-module.
The feature detection and matching submodule is used for extracting and tracking point features and extracting and tracking line features.
Further, in the feature detection and matching sub-module, the line features are extracted and tracked through steps S11 to S13.
The IMU pre-integration sub-module takes the IMU pre-integration value as a constraint between two frames and uses it together with the camera reprojection residuals during local-window optimization, the constraint being expressed by the pre-integration terms α, β, and γ given above.
The visual-inertial alignment sub-module estimates the camera poses and the spatial positions of the 3D points using structure from motion (SfM) and then aligns them with the IMU pre-integration data to obtain the gravity direction, the scale factor, the gyroscope bias, and the velocity of each frame.
The back-end optimization module optimizes all state variables in the sliding window by minimizing the sum of all measurement residuals, as expressed by the objective given above.
The three-dimensional map building module splices the local point clouds into a final point cloud map by means of the estimated camera gestures.
The various embodiments of the methods and apparatus described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may be implemented in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the methods and apparatus described herein may be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The methods and apparatus described herein may be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the desired result of the technical solution of the present disclosure is achieved, and the present disclosure is not limited herein.
Examples
Example 1
Simulation experiments for visual-inertial simultaneous localization and mapping are carried out using the public OpenLORIS-Scene dataset, which was collected in five scenes (office, corridor, home, cafe, and market) that are very typical application scenarios for autonomous robots.
The visual-inertial simultaneous localization and mapping comprises the front-end visual odometry, back-end optimization, and three-dimensional mapping processes.
In the front-end visual odometry process, point and line features are extracted and tracked. For the point features, Shi-Tomasi corners are extracted as feature points, the KLT optical flow method is used to track them, and tracks with large discrepancies are rejected based on the reverse optical flow check;
The line feature extraction tracking includes the sub-steps of:
S11, suppressing the length of the line segment; s12, extracting line characteristics through near line combination and short line splicing; s13, tracking line features through an optical flow method; s14, parameterizing the point and line characteristics.
In step S11, the line segment length suppression is expressed as in Formula (1), with η = 0.1.
Step S12 comprises the following sub-steps: s121, sequencing the wire segments; s122, screening the wire segments; s123, filtering the screened line segment groups to obtain a merging candidate line segment set and a splicing candidate line segment set; s124, merging or splicing the line segments.
In step S121, the segments are sorted from long to short; in step S122, the screening is the angle screening of Formula (3), with θ_min = π/90.
In step S123, the obtained candidate group L_α is filtered with the near-line merging filter of Formula (4) and the short-line splicing filter of Formula (5), where d_min, the threshold measuring segment proximity, is 9 pixels.
The step S13 includes the following sub-steps: s131, matching anchor points; s132, matching points with lines; s133, matching the lines.
Wherein the total number of anchor points in sub-step S131 is 12.
In step S14, the point and line features are parameterized, and during parameterization the errors of the point and line features are constrained as described above.
In the front-end visual odometry process, the IMU pre-integration value is used as a constraint between two frames, together with the camera reprojection residuals in the local-window optimization.
In the front-end visual odometry process, the visual-inertial alignment adopts a loosely coupled vision-IMU scheme: structure from motion (SfM) is first used to estimate the camera poses and the 3D point positions, which are then aligned with the IMU pre-integration data to obtain the gravity direction, scale factor, gyroscope bias, and velocity of each frame.
In the back-end optimization process, all state variables in the sliding window are optimized by minimizing the sum of all measurement residuals, as described above.
In the three-dimensional mapping process, the local point clouds are stitched into a final point-cloud map using the estimated camera poses.
FIGS. 9-12 show point and line feature detection in several low-texture and illumination-changing scenes of the OpenLORIS-Scene dataset: FIG. 9 for the cafe, FIG. 10 for home 1, FIG. 11 for home 2, and FIG. 12 for the corridor. FIGS. 9-a, 10-a, 11-a, and 12-a show the line feature detection results, and FIGS. 9-b, 10-b, 11-b, and 12-b show the point feature detection results.
It can be seen that indoor scenes are mostly man-made structures rich in straight-line elements such as edges and right angles. In the low-texture scenes with strong illumination changes (cafe and corridor) and the feature-poor scenes (home 1 and home 2), point feature detection is insufficient, but the line feature detection algorithm still obtains abundant line features. Compared with point detection alone, the complementary point-line features provide richer and more robust information for subsequent motion estimation.
FIGS. 9-c, 10-c, 11-c, and 12-c show straight-line detection and tracking in the different scenes. The line feature matching algorithm obtains more effective information and describes the geometric structure more intuitively, providing richer and more robust features for the subsequent visual odometry to estimate camera motion.
FIG. 13-a compares the result in the corridor scene with the ground-truth trajectory and shows the error, with the color graded from blue to red as the error increases; FIGS. 13-b and 13-c show the same comparison in top view and side view.
As the motion speed increases, the viewpoint changes significantly, more motion blur is introduced, and the illumination conditions make feature tracking challenging, so performance may degrade. As seen in the figure, this embodiment aligns accurately with the real trajectory, and in particular achieves drift-free tracking during fast camera rotation, showing that adding depth information to the visual-inertial system considerably improves pose estimation.
Example 2
And (3) performing a positioning mapping experiment in an indoor environment, and verifying positioning accuracy according to the data of the motion capture system OptiTrack.
The experimental platform is an unmanned aerial vehicle equipped with an Intel RealSense D435i depth camera, an Intel NUC onboard computer, and a Pixhawk V5+ flight controller. The D435i captures color images at 640 x 800 resolution and has a built-in six-axis IMU. In the experiments, the sampling frequencies of the camera and IMU are set to 30 Hz and 200 Hz, respectively. The onboard computer processes the images and IMU data from the depth-inertial camera in real time.
14 Different scenes (Lab 1-Lab 14) are built indoors, and the 14 scenes have the characteristics of low texture, illumination change, rapid acceleration and angle rotation.
The onboard computer performs localization and map construction on the images acquired by the depth camera as in Example 1.
Comparative example 1
The same experiment as in Example 1 was performed, except that the ORB-SLAM2 method of R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics 33(5) (2017) 1255-1262, was used. ORB-SLAM2 is a point-based monocular SLAM system that can operate in large-scale, small-scale, indoor, and outdoor environments, with an added RGBD interface.
Comparative example 2
The same experiment as in Example 1 was performed, except that the PL-VINS method of Q. Fu, J. Wang, H. Yu, I. Ali, F. Guo, Y. He, and H. Zhang, "PL-VINS: Real-time monocular visual-inertial SLAM with point and line features" (2020), arXiv:2009.07462, was used. PL-VINS is a monocular visual-inertial optimization method with point and line features, developed on the basis of the point-based VINS-Mono.
Comparative example 3
The same experiment as in Example 2 was performed, except that the VINS-RGBD method of Z. Shan, R. Li, and S. Schwertfeger, "RGBD-inertial trajectory estimation and mapping for ground robots," Sensors 19(10) (2019) 2251, was used.
Comparative example 4
The same experiment as in Example 2 was performed, except that the ORB-SLAM2 method of R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics 33(5) (2017) 1255-1262, was used.
Comparative example 5
The same experiment as in Example 2 was performed, except that the PL-VINS method of Q. Fu, J. Wang, H. Yu, I. Ali, F. Guo, Y. He, and H. Zhang, "PL-VINS: Real-time monocular visual-inertial SLAM with point and line features" (2020), arXiv:2009.07462, was used.
Experimental example
Experimental example 1
Example 1 and Comparative Examples 1 and 2 were evaluated using root mean square error (RMSE), maximum error, and mean error; the results are shown in Table 1.
Table 1
As can be seen from Table 1, the RMSE of Example 1 is clearly lower than that of Comparative Example 1; in the Market1-1 and Corridor1-4 sequences, Comparative Example 2 suffers tracking failures and severe drift and cannot produce a final trajectory. The simulation results show that Example 1 has better robustness.
In addition, in the case of rapid motion, i.e., in the sparse feature sequences Office 1-7, example 1 still maintains a high accuracy of positioning of about 0.18m, which is about 44% higher than comparative example 1.
In addition, the positioning error fluctuation of the embodiment 1 is small, the consistency of positioning estimation is good, and the robust accurate positioning is realized in a low-texture environment.
Experimental example 2
The RMSE results (in meters) of Example 2 and Comparative Examples 3 to 5 are shown in Table 2.
Table 2
As can be seen from Table 2, Example 2 maintains its advantage in accuracy under low indoor texture and illumination changes; it shows the smallest overall fluctuation and high stability, with the root mean square error kept within 0.44 m.
The results of example 2 and comparative examples 3-5 in scene Lab14 are shown in FIG. 14, and it can be seen from the graph that the positioning trace of example 2 is closer to OptiTrack, and the positioning error is significantly smaller than that of comparative examples 3-5.
FIG. 15 shows the translational error box plot on scene Lab1, and FIG. 16 shows the rotational error box plot on scene Lab1. As can be seen from the figures, the error distributions of Example 2 are narrower than those of Comparative Examples 3 to 5.
Fig. 17 shows the position drift of embodiment 2 in the scene Lab3, from which it can be seen that the triaxial error is concentrated within 300mm, and fig. 18 shows the position drift of embodiment 2 in the scene Lab6, from which it can be seen that the triaxial error is concentrated within 200mm, illustrating that the variation of embodiment 2 is smooth and the positioning accuracy and stability are good.
Fig. 19 to 21 show three-dimensional maps obtained in comparative example 3 and example 2, in which the left column shows the results obtained in comparative example 3 and the right column shows the results obtained in example 2, and it can be seen from the figures that comparative example 3 has a large error in the horizontal and vertical directions, and double shadows and bending phenomena occur during the splicing process, resulting in lower map construction accuracy. In embodiment 2, the map stitching effect is greatly improved, and meanwhile, the problems of double shadows and bending are solved, which also shows that the accumulated error in the map construction process is reduced, and embodiment 2 can well complete the construction of the indoor map.
The invention has been described above in connection with preferred embodiments, which are, however, exemplary only and for illustrative purposes. On this basis, the invention can be subjected to various substitutions and improvements, and all fall within the protection scope of the invention.

Claims (5)

1. An RGBD visual-inertial simultaneous localization and mapping method based on point-line feature fusion, comprising a front-end visual odometry process, a back-end optimization process, and a three-dimensional mapping process, characterized in that:
in the front-end visual odometry process, the input to the visual-inertial simultaneous localization and mapping system is the images acquired by an RGBD camera and the measurements of an IMU;
the front-end visual odometry is based on point and line features and includes feature detection and matching, IMU pre-integration, and visual-inertial alignment;
the feature detection and matching comprises the extraction and tracking of point features and the extraction and tracking of line features;
the extraction and tracking of the line features comprises the following sub-steps:
S11, performing line segment length suppression;
S12, extracting line features through near-line merging and short-line splicing;
S13, tracking the line features through an optical flow method;
In S11, the line segment length suppression is expressed as:
Wherein len_i represents the length of the i-th line segment in the image, len_min represents the threshold on the shortest line feature length, W_I is the width of the image, H_I is the height of the image, the symbol ⌈·⌉ represents rounding up, η is a scaling factor, (x_sp, y_sp) represents the start-point coordinates of the line segment, and (x_ep, y_ep) represents the end-point coordinates of the line segment;
in step S12, the near-line merging and short-line splicing comprise the following sub-steps:
S121, sorting the line segments; S122, screening the line segments; S123, filtering the screened line segment groups to obtain a merging candidate line segment set and a splicing candidate line segment set; S124, merging or splicing the line segments;
In step S121, the line segments are sorted by length to obtain a line segment sequence {L_1, L_2, ..., L_n}, where L_1 represents the longest line segment;
In step S122, the screening includes an angle screening, which may be expressed as:
Wherein L_α represents the candidate line segment group obtained by angle screening, L_m represents a line segment in the remaining line feature set, θ_m is the angle of line segment L_m with respect to the horizontal direction, θ_min is an angle threshold for measuring the similarity of line segments, and L is the set of line segments other than L_1;
In S123, the filtering of the near-line merging process is expressed as:
the filtering of the short-line splicing process is expressed as:
Where d_min denotes a threshold for measuring line segment proximity, L_β1 represents the set of merging candidate line segments, L_β2 represents the set of splicing candidate line segments, L_i and L_k represent line segments to be merged, with their respective end points, and L_u and L_v represent two line segments to be spliced, with their respective end points;
Combining the line segments whose angles and spatial distances are close to L_1 into one group yields the candidate line segment set L_β = {L_β1, L_β2};
In S124, line segment L_1 is added to the screened candidate line segment set L_β to form a new line segment set {L_1, L_β}; the two end points that are farthest apart within this set are selected as the start point and end point of a new line segment, and a new line segment L_M is synthesized;
repeating the above process until all line segments in the image have been merged or spliced;
step S13 comprises the following sub-steps:
S131, matching anchor points;
S132, matching points with lines;
S133, matching the lines;
In step S131, the anchor points are points sampled at intervals along a line segment, so that one line segment is represented by a plurality of anchor points; two consecutive frame images are denoted I_0 and I_1; for each line segment 0l_i = (0p_i, 0q_i) in the I_0 frame image, the start point and end point of the segment, together with a plurality of points along the direction from the start point to the end point, are selected as anchor points; in the I_1 frame image, the matching point 1λ_{i,j} of each anchor point 0λ_{i,j} of the I_0 frame image is obtained along the normal direction 0n_i, where the normal direction is:
0n_i = [-0a_{i,1}, 0a_{i,0}]^T
and the unit direction vector from the start point to the end point is:
0a_i = [0a_{i,0}, 0a_{i,1}]^T = (0q_i - 0p_i) / ||0q_i - 0p_i||
In step S132, in frame I_1, the distance from each matching point 1λ_{i,j} to each line segment 1l_k is computed, giving a distance set {d_k}; the shortest distance in this set is denoted d_min, and the corresponding line segment is denoted 1l_closest(i,j); if d_min < d_threshold, the matching point 1λ_{i,j} is considered to belong to the line segment 1l_closest(i,j);
Otherwise, the matching point 1λ_{i,j} is not considered to belong to the line segment 1l_closest(i,j), and the next anchor point is examined;
Wherein d_threshold is an empirical threshold;
In step S133, for a line segment 0l_i in frame I_0, the number of its anchor points matched in frame I_1 is counted; let Z_i be the number of anchor points matched to line segment 1l_match(i); if Z_i / N_i > R, then 0l_i and 1l_match(i), with 0 ≤ match(i) < M_1, are considered to match each other, where N_i is the number of anchor points of 0l_i, R is an empirical threshold, and M_1 represents the total number of line segments in the I_1 frame image;
In the front-end visual odometry process, the following step is further provided:
S14, parameterizing the point and line features;
In the parameterization process of the point and line features, error constraints are imposed on the line features, and the error constraints of the point features and line features are as follows:
Wherein the former denotes the point projection error term and the latter denotes the line projection error term; T_Iw represents the pose of the RGBD camera of the I-th frame image in the world coordinate system, and X_{w,j} represents the j-th point observed in the I-th frame; P_{w,k} and Q_{w,k} denote the two end points of the k-th line segment observed in the I-th frame, and p_{i,k} and q_{i,k} are the points on the image plane corresponding to P_{w,k} and Q_{w,k}; the projections of P_{w,k} and Q_{w,k} onto the image plane are the end points of the projected line segment; the line error consists of the distance from p_{i,k} to the line segment projected from the world coordinate system and the distance from q_{i,k} to the line segment projected from the world coordinate system; π(·) is the camera projection model, and π_h(·) is the camera projection model in homogeneous coordinates (illustrative sketches of steps S11 to S14 are provided after the claims).
2. The RGBD visual inertia simultaneous localization and mapping method based on point-line feature fusion of claim 1, wherein,
The IMU pre-integration value is taken as a constraint between two consecutive camera frame images.
3. The RGBD visual inertia simultaneous localization and mapping method based on point-line feature fusion of claim 1, wherein,
In the back-end optimization process, all state variables in the sliding window are optimized by minimizing the sum of all measurement residuals.
4. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-3.
5. A computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-3.
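Since the closed-form expressions of steps S11 and S12 are not reproduced in this text, the following Python sketch illustrates one way to realize the line-segment length suppression, sorting, angle screening, and merging/splicing described in claim 1. The threshold rule len_min = ⌈η·min(W_I, H_I)⌉, the simplified endpoint-distance test, and all function names are assumptions consistent with the variable definitions above, not the patent's exact procedure.

```python
import math
import numpy as np

def segment_length(seg):
    """seg = (x_sp, y_sp, x_ep, y_ep); returns its Euclidean length."""
    x_sp, y_sp, x_ep, y_ep = seg
    return math.hypot(x_ep - x_sp, y_ep - y_sp)

def length_suppression(segments, image_width, image_height, eta=0.1):
    """S11 (assumed form): keep segments no shorter than ceil(eta * min(W_I, H_I))."""
    len_min = math.ceil(eta * min(image_width, image_height))
    return [s for s in segments if segment_length(s) >= len_min]

def segment_angle(seg):
    x_sp, y_sp, x_ep, y_ep = seg
    return math.atan2(y_ep - y_sp, x_ep - x_sp)

def merge_group(group):
    """S124: replace a group of segments by the segment spanned by its two
    most distant endpoints."""
    points = np.array([p for s in group for p in [(s[0], s[1]), (s[2], s[3])]])
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    return (*points[i], *points[j])

def merge_and_splice(segments, theta_min=math.radians(5.0), d_min=5.0):
    """S12: greedy near-line merging / short-line splicing.

    Segments are sorted by length (S121); for the longest remaining segment L1,
    segments with similar orientation (S122) and a small endpoint gap to L1
    (a simplification of the S123 filters) are grouped and merged (S124).
    """
    remaining = sorted(segments, key=segment_length, reverse=True)
    merged = []
    while remaining:
        l1 = remaining.pop(0)
        group, keep = [l1], []
        for lm in remaining:
            # direction ambiguity (antiparallel segments) is ignored for brevity
            angle_close = abs(segment_angle(lm) - segment_angle(l1)) < theta_min
            endpoint_close = min(
                math.hypot(lm[0] - l1[2], lm[1] - l1[3]),
                math.hypot(lm[2] - l1[0], lm[3] - l1[1]),
            ) < d_min
            (group if angle_close and endpoint_close else keep).append(lm)
        merged.append(merge_group(group) if len(group) > 1 else l1)
        remaining = keep
    return merged
```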
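The anchor-point line tracking of step S13 can likewise be sketched as follows. The use of cv2.calcOpticalFlowPyrLK for propagating anchors, the uniform anchor sampling, and all names are illustrative assumptions; the claim specifies searching along the anchor normal direction, which is simplified here to generic pyramidal LK tracking.

```python
import cv2
import numpy as np

def sample_anchors(seg, num_anchors=5):
    """Sample anchor points evenly between the start and end point of a segment."""
    p, q = np.array(seg[:2], np.float32), np.array(seg[2:], np.float32)
    return np.stack([p + t * (q - p) for t in np.linspace(0.0, 1.0, num_anchors)])

def point_to_segment_distance(pt, seg):
    p, q = np.array(seg[:2]), np.array(seg[2:])
    d = q - p
    t = np.clip(np.dot(pt - p, d) / max(np.dot(d, d), 1e-9), 0.0, 1.0)
    return float(np.linalg.norm(pt - (p + t * d)))

def track_lines(img0, img1, segments0, segments1,
                d_threshold=3.0, ratio_threshold=0.6):
    """S131-S133: track segments from frame I_0 to frame I_1.

    img0 and img1 are grayscale uint8 images. Anchors of each I_0 segment are
    propagated to I_1 by LK optical flow (S131), each propagated anchor is
    assigned to its closest I_1 segment if that distance is below d_threshold
    (S132), and a segment pair is accepted when the ratio of consistently
    assigned anchors exceeds ratio_threshold, i.e. Z_i / N_i > R (S133).
    """
    if not segments1:
        return {}
    matches = {}
    for i, seg0 in enumerate(segments0):
        anchors0 = sample_anchors(seg0)
        anchors1, status, _ = cv2.calcOpticalFlowPyrLK(
            img0, img1, anchors0.reshape(-1, 1, 2), None)
        votes = np.zeros(len(segments1), dtype=int)
        tracked = 0
        for a1, ok in zip(anchors1.reshape(-1, 2), status.reshape(-1)):
            if not ok:
                continue
            tracked += 1
            dists = [point_to_segment_distance(a1, s1) for s1 in segments1]
            k = int(np.argmin(dists))
            if dists[k] < d_threshold:
                votes[k] += 1
        if tracked and votes.max() / len(anchors0) > ratio_threshold:
            matches[i] = int(np.argmax(votes))
    return matches
```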
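Finally, the point and line reprojection errors referenced in step S14 are only described verbally above. A plausible formulation, following the form commonly used in point-line visual-inertial SLAM and therefore an assumption rather than the patent's exact expressions (the symbols r_p, r_l, z_{I,j}, and l_{i,k} are introduced here for illustration; image points are in homogeneous pixel coordinates), is:

```latex
% Point reprojection error for point j observed in frame I,
% with z_{I,j} the observed pixel (assumed name):
r_{p}^{I,j} = z_{I,j} - \pi\!\left( T_{Iw}\, X_{w,j} \right)

% Line reprojection error for line k: project the 3D end points,
% form the image line, and stack the two point-to-line distances.
\hat{p}_{i,k} = \pi_h\!\left( T_{Iw}\, P_{w,k} \right), \qquad
\hat{q}_{i,k} = \pi_h\!\left( T_{Iw}\, Q_{w,k} \right), \qquad
l_{i,k} = \hat{p}_{i,k} \times \hat{q}_{i,k}

d\!\left(s,\, l_{i,k}\right) = \frac{s^{\top} l_{i,k}}{\sqrt{l_{i,k,1}^{2} + l_{i,k,2}^{2}}},
\qquad
r_{l}^{I,k} =
\begin{bmatrix}
d\!\left( p_{i,k},\, l_{i,k} \right) \\[2pt]
d\!\left( q_{i,k},\, l_{i,k} \right)
\end{bmatrix}
```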

