CN116342800B

CN116342800B - Semantic three-dimensional reconstruction method and system for multi-mode pose optimization

Info

Publication number: CN116342800B
Application number: CN202310181777.1A
Authority: CN
Inventors: 孙庆伟; 晁建刚; 陈炜; 林万洪; 许振瑛; 胡福超
Original assignee: Peoples Liberation Army Strategic Support Force Aerospace Engineering University; China Astronaut Research and Training Center
Current assignee: Peoples Liberation Army Strategic Support Force Aerospace Engineering University; China Astronaut Research and Training Center
Priority date: 2023-02-21
Filing date: 2023-02-21
Publication date: 2023-10-24
Anticipated expiration: 2043-02-21
Also published as: CN116342800A

Abstract

The application belongs to the technical field of semantic three-dimensional reconstruction, and in particular relates to a semantic three-dimensional reconstruction method and system with multi-modal pose optimization. The method comprises: collecting RGB images and depth images of a target area; performing semantic segmentation on the RGB image to obtain a final mask map of the RGB image; acquiring the global pose of the RGB image by using the RGB image, the depth image and the final mask map; and acquiring a global semantic three-dimensional point cloud according to the depth image and the global pose of the RGB image. The technical scheme provided by the application has low computing-power requirements and wide applicability, and improves the real-time performance and efficiency of semantic three-dimensional reconstruction.

Description

Semantic three-dimensional reconstruction method and system for multi-mode pose optimization

Technical Field

The application belongs to the technical field of semantic three-dimensional reconstruction, and particularly relates to a semantic three-dimensional reconstruction method and system with multi-mode pose optimization.

Background

Semantic three-dimensional reconstruction performs semantic annotation on a three-dimensional reconstruction model and separates structures with different category attributes from the integral model, which makes the model convenient for downstream tasks. One of the important factors determining the quality of semantic three-dimensional reconstruction is the calculation of the camera pose, which affects both the quality of the reconstructed model and the distribution of semantic information. Positioning in current algorithms relies on purely geometric features, so a complex screening mechanism has to be designed to determine matching points. Semantic three-dimensional reconstruction can be divided into three parts: semantic segmentation, three-dimensional reconstruction and semantic mapping, where three-dimensional reconstruction comprises pose calculation and mapping.

In existing algorithms these parts are relatively independent and exchange data in a loosely coupled manner. For example, semantic three-dimensional reconstruction mostly uses a GPU-based dense ICP algorithm to calculate the camera pose, while semantic segmentation also needs GPU resources, so running efficiency has to be compromised, the whole framework is too heavy, and the overall optimization scale is large. As a result, the semantic segmentation and reconstruction methods used by existing semantic three-dimensional reconstruction algorithms have poor real-time performance and high computing-power requirements, place a heavy burden on the overall operation of the algorithm, and run smoothly only on a high-performance computer.

Disclosure of Invention

In order to overcome the problems existing in the related art at least to a certain extent, the application provides a semantic three-dimensional reconstruction method and a semantic three-dimensional reconstruction system for multi-mode pose optimization.

According to a first aspect of an embodiment of the present application, there is provided a semantic three-dimensional reconstruction method for multi-modal pose optimization, the method comprising:

collecting RGB images and depth images of a target area;

performing semantic segmentation on the RGB image to obtain a final mask map of the RGB image;

acquiring the global pose of the RGB image by using the RGB image, the depth image and the final mask map;

and acquiring a global semantic three-dimensional point cloud according to the depth image and the global pose of the RGB image.

Preferably, the semantic segmentation of the RGB image to obtain a final mask map of the RGB image includes:

performing semantic segmentation on the RGB image by using an STDC network based on deep learning to obtain semantic class probability of each pixel point in the RGB image;

taking the category code of the category with the highest probability among the semantic class probabilities of each pixel point as the mask of that pixel point, to obtain an initial mask map of the RGB image;

and performing edge optimization on the initial mask map of the RGB image by using a super-pixel segmentation method gSLICr to obtain a final mask map of the RGB image.

Preferably, the obtaining the global pose of the RGB image by using the RGB image, the depth image and the final mask map includes:

extracting ORB feature points of the RGB image;

obtaining the depth value of each ORB feature point in the depth image, and performing projection coordinate conversion on the two-dimensional coordinates of the ORB feature point by using that depth value to obtain the coordinates of the ORB feature point in three-dimensional space;

and acquiring the global pose of the RGB image by using a sparse SLAM method based on the coordinates of the ORB feature points in the three-dimensional space and the final mask map.

Preferably, the obtaining the global pose of the RGB image by using the sparse SLAM method based on the coordinates of the ORB feature points in the three-dimensional space and the final mask map includes:

extracting ORB feature points with the same BRIEF descriptors in two adjacent frames of RGB images;

removing outliers among the ORB feature points with the same BRIEF descriptors in the two adjacent frames of RGB images by using the final mask map, and taking the ORB feature points with the same BRIEF descriptors that remain after outlier removal as matching points;

acquiring a first frame pose of the two adjacent frames of RGB images by using a PnP algorithm, based on the three-dimensional coordinates of the matching points in the previous frame and the two-dimensional coordinates of the matching points in the next frame;

and acquiring the global pose of the RGB image by using bundle adjustment based on the first frame pose.

Preferably, the removing outliers in the ORB feature points with the same BRIEF descriptor in two adjacent frames of RGB images by using the final mask map includes:

determining, according to the final mask map, the mask corresponding to each ORB feature point in each group of ORB feature points with the same BRIEF descriptor in the two adjacent frames of RGB images;

when the masks corresponding to the two ORB feature points in a group with the same BRIEF descriptor are different, the ORB feature points in that group are outliers, and the outliers are removed.

Preferably, the acquiring the global pose of the RGB image by using bundle adjustment based on the first frame pose includes:

grouping all RGB images according to a preset group number;

optimizing the first frame pose by using bundle adjustment to obtain a second frame pose, and obtaining the relative pose between two adjacent second frame poses by using the second frame poses;

taking the second frame pose corresponding to the first pair of adjacent frames in each group of RGB images as a key pose, and optimizing all the key poses by using bundle adjustment to obtain the optimized key poses;

based on the relative pose, updating the second frame poses by using the optimized key poses, wherein all updated second frame poses and all optimized key poses form the global pose.

Preferably, the acquiring a global semantic three-dimensional point cloud according to the depth image and the global pose of the RGB image includes:

generating a point cloud of the depth image;

based on the voxblox framework, acquiring a global three-dimensional model represented by TSDF by using a weighted average algorithm according to the global pose of the RGB image and the point cloud of the depth image;

and acquiring three-dimensional point clouds in the global three-dimensional model by using the marching cubes algorithm, wherein all three-dimensional point clouds in the global three-dimensional model form the global semantic three-dimensional point cloud.

Preferably, the method further comprises:

when the global three-dimensional model represented by the TSDF is obtained by using a weighted average algorithm, the semantic class probability of the TSDF in the global three-dimensional model is updated by using a Bayesian method.

According to a second aspect of an embodiment of the present application, there is provided a multi-modal pose optimized semantic three-dimensional reconstruction system, the system comprising:

the acquisition module is used for acquiring RGB images and depth images of the target area;

the semantic segmentation module is used for performing semantic segmentation on the RGB image to obtain a final mask map of the RGB image;

the pose module is used for acquiring the global pose of the RGB image by using the RGB image, the depth image and the final mask map;

and the semantic three-dimensional reconstruction module is used for acquiring a global semantic three-dimensional point cloud according to the depth image and the global pose of the RGB image.

According to a third aspect of an embodiment of the present application, there is provided a computer apparatus comprising: one or more processors;

a memory, used for storing one or more programs;

when the one or more programs are executed by the one or more processors, the semantic three-dimensional reconstruction method for multi-modal pose optimization is achieved.

According to a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed, implements the above-described multi-modal pose-optimized semantic three-dimensional reconstruction method.

The technical scheme provided by the application has at least one or more of the following beneficial effects:

the application provides a semantic three-dimensional reconstruction method and system with multi-modal pose optimization, the method comprising: collecting RGB images and depth images of a target area; performing semantic segmentation on the RGB image to obtain a final mask map of the RGB image; acquiring the global pose of the RGB image by using the RGB image, the depth image and the final mask map; and acquiring a global semantic three-dimensional point cloud according to the depth image and the global pose of the RGB image. The technical scheme provided by the application has low computing-power requirements and wide applicability, and improves the real-time performance and efficiency of semantic three-dimensional reconstruction.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart illustrating a multi-modal pose optimized semantic three-dimensional reconstruction method according to an exemplary embodiment;

FIG. 2 is a schematic diagram illustrating a global pose acquisition process according to an exemplary embodiment;

FIG. 3 is a schematic diagram illustrating the use of a Bayesian approach to update the semantic category probabilities of TSDF in a global three-dimensional model according to an exemplary embodiment;

FIG. 4 is a block diagram illustrating the primary structure of a multi-modal pose optimized semantic three-dimensional reconstruction system according to an exemplary embodiment.

Detailed Description

The following describes the embodiments of the present invention in further detail with reference to the drawings.

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

As described in the Background, semantic three-dimensional reconstruction performs semantic annotation on a three-dimensional reconstruction model and separates structures with different category attributes from the integral model, which makes the model convenient for downstream tasks. One of the important factors determining the quality of semantic three-dimensional reconstruction is the calculation of the camera pose, which affects both the quality of the reconstructed model and the distribution of semantic information. Positioning in current algorithms relies on purely geometric features, so a complex screening mechanism has to be designed to determine matching points, and high-level semantic features are not used. Semantic three-dimensional reconstruction can be divided into three parts: semantic segmentation, three-dimensional reconstruction and semantic mapping, where three-dimensional reconstruction comprises pose calculation and mapping.

To address the above problems, the invention improves the real-time performance and efficiency of semantic three-dimensional reconstruction and reduces its computing-power requirements.

The above-described scheme is explained in detail below.

Example 1

The invention provides a semantic three-dimensional reconstruction method with multi-modal pose optimization, which may be used in, but is not limited to, a terminal. As shown in fig. 1, the method comprises the following steps:

step 101: collecting RGB images and depth images of a target area;

step 102: performing semantic segmentation on the RGB image to obtain a final mask map of the RGB image;

step 103: acquiring the global pose of the RGB image by using the RGB image, the depth image and the final mask map;

step 104: acquiring a global semantic three-dimensional point cloud according to the depth image and the global pose of the RGB image.

In some embodiments, the RGB image may be acquired by, but is not limited to, an RGB camera, with the depth image acquired by an RGB-D depth camera; alternatively, the RGB image and the depth image may be obtained by a binocular camera; or the RGB image may be acquired by an RGB camera and its depth image then estimated by deep learning.

Further, step 102 includes:

step 1021: performing semantic segmentation on the RGB image by using an STDC network based on deep learning to obtain semantic class probability of each pixel point in the RGB image;

step 1022: taking the category code of the category with the highest probability among the semantic class probabilities of each pixel point as the mask of that pixel point, to obtain an initial mask map of the RGB image;

for example, assume that a certain pixel point of the RGB image has three possible semantic categories A, B and C, whose category codes are 1, 2 and 3 respectively; if the probability of category A is 0.7, that of category B is 0.2 and that of category C is 0.1, the mask of the pixel point is category code 1 of category A;

step 1023: and performing edge optimization on the initial mask map of the RGB image by using a super-pixel segmentation method gSLICr to obtain a final mask map of the RGB image.

It can be understood that, after semantic segmentation of the RGB image by the deep-learning-based STDC network, the resulting initial mask map may be incomplete or have jagged, rough edges; performing edge optimization on the initial mask map with the superpixel segmentation method gSLICr makes the mask contours clearer. Combining the deep-learning-based STDC network with the fast superpixel segmentation method gSLICr therefore makes the segmentation result more accurate and the contours clearer, improving segmentation quality while preserving speed.
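Step 1022 can be illustrated with a minimal sketch. It assumes the segmentation network's output is already available as per-pixel probability lists; the function name `initial_mask_map` and the `class_codes` list are hypothetical names chosen for illustration, and no STDC code is invoked.

```python
def initial_mask_map(prob_map, class_codes):
    """Turn per-pixel class probabilities into a mask map of category codes.

    prob_map: H x W nested lists; each entry is a list of class probabilities.
    class_codes: category code for each class index (hypothetical encoding).
    """
    mask = []
    for row in prob_map:
        mask_row = []
        for probs in row:
            # Index of the most probable class for this pixel.
            best = max(range(len(probs)), key=probs.__getitem__)
            mask_row.append(class_codes[best])
        mask.append(mask_row)
    return mask


# A 1 x 2 image; the first pixel reproduces the A/B/C example from the text.
probs = [[[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]]
mask = initial_mask_map(probs, class_codes=[1, 2, 3])
print(mask)
```

The first pixel receives category code 1, matching the worked example in step 1022.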

It should be noted that the STDC network embeds a Short-Term Dense Concatenate (STDC) module in U-Net, so that a large receptive field can be obtained with few parameters. To improve the accuracy of detail segmentation, STDC uses more channels in the shallow layers and a detail guidance module to learn detail features, and reduces the channels of the deep layers to reduce computational redundancy. The STDC network is a fast, lightweight semantic segmentation algorithm that can reach a segmentation speed of 250 frames per second on certain data sets, has a small computational cost, and accurately classifies most pixels.

Super-pixel segmentation is to form irregular pixel blocks by clustering pixels with similar textures, brightness, colors and other features on an image. The pixels in the super-pixel have similar properties, which is equivalent to expressing a large number of pixels in the original image with a small number of image blocks, greatly reduces the scale of subsequent image processing, and has obvious boundaries at the edges of the object. gSLICr is the implementation of SLIC algorithm on GPU, which can achieve a segmentation speed of 250 frames per second.

The invention fuses the two segmentation methods, using the superpixels to optimize the edges of the neural network segmentation result and obtain a structured segmentation effect. The segmentation method based on the STDC network and gSLICr superpixels is fast. Experiments show that pose calculation and three-dimensional reconstruction each run at tens of frames per second; since segmentation is faster than both, applying the semantic segmentation result to pose calculation and semantic mapping does not affect the overall speed.
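The description does not spell out how the superpixels are used to optimize the mask edges. One plausible rule, shown here purely as an assumption, is majority voting: every pixel takes the most common mask code inside its superpixel. The sketch takes a precomputed superpixel label map as input and does not call gSLICr; `refine_mask_with_superpixels` is a hypothetical name.

```python
from collections import Counter


def refine_mask_with_superpixels(mask, superpixels):
    """Assign every pixel the most common mask code inside its superpixel.

    mask and superpixels are H x W nested lists; superpixels holds an integer
    segment id per pixel. Majority voting is an assumed refinement rule.
    """
    votes = {}
    for mask_row, sp_row in zip(mask, superpixels):
        for code, sp in zip(mask_row, sp_row):
            votes.setdefault(sp, Counter())[code] += 1
    winner = {sp: counts.most_common(1)[0][0] for sp, counts in votes.items()}
    return [[winner[sp] for sp in sp_row] for sp_row in superpixels]


# The stray code 2 at row 1, column 1 lies inside superpixel 0, whose
# majority code is 1, so it is overwritten.
mask = [[1, 1, 2],
        [1, 2, 2],
        [1, 1, 2]]
superpixels = [[0, 0, 1],
               [0, 0, 1],
               [0, 0, 1]]
refined = refine_mask_with_superpixels(mask, superpixels)
print(refined)
```

Because superpixel boundaries tend to follow object edges, a rule of this kind snaps the mask boundary to them, which is consistent with the clearer contours described above.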

It should be noted that, the "STDC network based on deep learning" and the "super pixel segmentation method gSLICr" modes related to the embodiments of the present invention are well known to those skilled in the art, and therefore, detailed implementation manners thereof are not described too much.

Further, step 103 includes:

step 1031: extracting ORB feature points of the RGB image;

in some embodiments, ORB feature points in RGB images may be extracted using, but are not limited to, the OpenCV library;

it can be understood that the ORB feature points are pixel points in nature, and the ORB feature points are significant pixel points selected from the pixel points of the RGB image, for example, contour points, bright points in darker areas, or dark points in lighter areas, etc.;

step 1032: obtaining a depth value of the ORB feature point in the depth image, and performing projection coordinate conversion on the two-dimensional coordinates of the ORB feature point by utilizing the depth value of the ORB feature point in the depth image to obtain coordinates of the ORB feature point in a three-dimensional space;

in some optional embodiments, the coordinates of the ORB feature points in the three-dimensional space are obtained by performing projection coordinate conversion on their two-dimensional coordinates as follows:

$$P = Z \, K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}$$

in the above formula, P = (X, Y, Z) is the coordinate of the ORB feature point in the three-dimensional space, (u, v) is the two-dimensional coordinate of the ORB feature point, Z is the depth value corresponding to the ORB feature point, and K is the camera intrinsic matrix; the semantic category corresponding to the two-dimensional coordinates of an ORB feature point is the same as the semantic category corresponding to its three-dimensional coordinates, because both refer to the same pixel point;

Step 1033: based on the coordinates of ORB feature points in the three-dimensional space and the final mask map, the global pose of the RGB image is obtained by using a sparse SLAM method.
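The projection coordinate conversion of step 1032 can be sketched as follows, assuming a pinhole camera whose intrinsic matrix K has focal lengths fx, fy, principal point (cx, cy) and zero skew. Under that assumption, multiplying the homogeneous pixel (u, v, 1) by the inverse of K and scaling by the depth Z reduces to the two divisions below. The function names and the dictionary layout are hypothetical.

```python
def back_project(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z into camera coordinates.

    Equivalent to z * K^-1 * [u, v, 1]^T for the pinhole intrinsic matrix
    K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] (zero skew is an assumption).
    """
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


def lift_keypoints(keypoints, depth, intrinsics, mask):
    """Attach a 3D point and a mask code to every keypoint with valid depth."""
    fx, fy, cx, cy = intrinsics
    lifted = []
    for (u, v) in keypoints:
        z = depth[v][u]
        if z <= 0:  # assumed convention: missing depth is stored as 0
            continue
        lifted.append({
            "uv": (u, v),
            "xyz": back_project(u, v, z, fx, fy, cx, cy),
            "label": mask[v][u],  # the 3D point inherits the pixel's mask
        })
    return lifted


depth = [[0.0, 1.0, 1.0],
         [1.0, 1.0, 4.0],
         [1.0, 1.0, 1.0]]
mask = [[1, 1, 1],
        [1, 1, 3],
        [1, 1, 1]]
points = lift_keypoints([(2, 1), (0, 0)], depth, (2.0, 2.0, 1.0, 1.0), mask)
print(points)
```

The keypoint at (0, 0) is dropped because it has no depth reading, and the remaining point carries the mask code of its pixel, which is what allows the semantic check in the matching stage.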

Further, step 1033 includes:

step 1033a: extracting ORB feature points with the same BRIEF descriptors in two adjacent frames of RGB images;

step 1033b: removing outliers among the ORB feature points with the same BRIEF descriptors in the two adjacent frames of RGB images by using the final mask map, and taking the ORB feature points with the same BRIEF descriptors that remain after outlier removal as matching points;

specifically, in step 1033b, the removing outliers among the ORB feature points with the same BRIEF descriptors in the two adjacent frames of RGB images by using the final mask map includes:

determining, according to the final mask map, the mask corresponding to each ORB feature point in each group of ORB feature points with the same BRIEF descriptor in the two adjacent frames of RGB images;

when the masks corresponding to the two ORB feature points in a group with the same BRIEF descriptor are different, the ORB feature points in that group are outliers, and the outliers are removed;

for example, assume that the ORB feature points in the previous RGB frame are D1, D2 and D3, the ORB feature points in the next RGB frame are D4, D5 and D6, and the BRIEF descriptor of D1 is identical to the BRIEF descriptors of D4 and D5; there are then two groups of ORB feature points with identical BRIEF descriptors, namely D1 and D4, and D1 and D5. If the mask of D1 is 1, the mask of D4 is 2 and the mask of D5 is 1, then the group D1 and D4 is an outlier and D4 is eliminated; D1 and D5, which remain after outlier removal, are taken as matching points;

Step 1033c: acquiring a first frame pose of two adjacent RGB images by using a PnP algorithm based on coordinates of a matching point in a previous RGB image in the two adjacent RGB images in a three-dimensional space and two-dimensional coordinates of a matching point in a next RGB image in the two adjacent RGB images;

step 1033d: based on the first frame pose, acquiring the global pose of the RGB image by using bundle adjustment.

For example, assume that there are 12 frames of RGB images {a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12};

extracting the ORB feature points with the same BRIEF descriptors in two adjacent frames of RGB images means extracting them in each of the 11 pairs of adjacent RGB frames, namely a1 and a2, a2 and a3, a3 and a4, a4 and a5, a5 and a6, a6 and a7, a7 and a8, a8 and a9, a9 and a10, a10 and a11, and a11 and a12;

removing outliers among the ORB feature points with the same BRIEF descriptors in the 11 pairs of adjacent RGB frames by using the final mask map, and taking the ORB feature points with the same BRIEF descriptors that remain in each pair after outlier removal as matching points;

acquiring the first frame pose of each of the 11 pairs of adjacent RGB frames by using the PnP algorithm, based on the three-dimensional coordinates of the matching points in the previous frame of the pair and the two-dimensional coordinates of the matching points in the next frame of the pair;

based on the first frame poses, acquiring the global pose of the RGB image by using bundle adjustment.

It can be understood that the influence of the outlier on the pose calculation accuracy can be reduced by eliminating the outlier, so that the pose accuracy and reliability are improved.
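The mask-based outlier removal of step 1033b can be sketched as a simple filter over candidate matches. The sketch assumes the descriptor matching has already produced pixel pairs; the function name and the tuple layout are hypothetical.

```python
def filter_matches_by_mask(matches, mask_prev, mask_next):
    """Keep only matches whose two endpoints share the same mask code.

    matches: list of ((u1, v1), (u2, v2)) pixel pairs, previous frame first.
    mask_prev, mask_next: final mask maps (H x W nested lists of codes).
    """
    kept = []
    for (u1, v1), (u2, v2) in matches:
        if mask_prev[v1][u1] == mask_next[v2][u2]:
            kept.append(((u1, v1), (u2, v2)))
    return kept


# Mirrors the D1/D4/D5 example: D1 (mask 1) is paired with D4 (mask 2)
# and with D5 (mask 1); only the D1-D5 pair survives.
mask_prev = [[1, 1]]
mask_next = [[2, 1]]
d1, d4, d5 = (0, 0), (0, 0), (1, 0)
kept = filter_matches_by_mask([(d1, d4), (d1, d5)], mask_prev, mask_next)
print(kept)
```

The check is a single comparison per candidate pair, so the semantic screening adds little cost on top of descriptor matching.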

Further, step 1033d includes:

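grouping all RGB images according to a preset group number;

optimizing the first frame pose by using bundle adjustment to obtain a second frame pose, and obtaining the relative pose between two adjacent second frame poses by using the second frame poses;

taking the second frame pose corresponding to the first pair of adjacent frames in each group of RGB images as a key pose, and optimizing all the key poses by using bundle adjustment to obtain the optimized key poses;

based on the relative pose, updating the second frame poses by using the optimized key poses, wherein all updated second frame poses and all optimized key poses form the global pose.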

It should be noted that the present invention is not limited to the "preset group number", and may be set by those skilled in the art according to experimental data or expert experience.

In some embodiments, all RGB images may be grouped according to a preset number of frames per group, but not limited to; for example, the preset number of frames per group may be, but is not limited to, 10 frames.
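Grouping by a preset number of frames per group can be sketched as below. The sketch also returns the first frame of each group, which is the frame this embodiment treats as the group's key frame; `group_frames` is a hypothetical helper name.

```python
def group_frames(frame_ids, frames_per_group):
    """Split an ordered frame sequence into consecutive groups.

    Returns the groups and the first frame of each group (its key frame).
    The last group may be shorter than frames_per_group.
    """
    if frames_per_group < 1:
        raise ValueError("frames_per_group must be positive")
    groups = [frame_ids[i:i + frames_per_group]
              for i in range(0, len(frame_ids), frames_per_group)]
    key_frames = [group[0] for group in groups]
    return groups, key_frames


frames = ["a%d" % i for i in range(1, 13)]  # a1 .. a12
groups, key_frames = group_frames(frames, 4)
print(groups)
print(key_frames)
```

With 12 frames and 4 frames per group, the key frames are a1, a5 and a9.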

As shown in fig. 2, the RGB images are grouped according to the preset group number, and the PnP algorithm is applied to each pair of adjacent RGB frames to calculate the first frame pose $T_1$. However, the first frame pose only considers the matching relation between two adjacent RGB frames and therefore carries a large error, so it is used as the initial value for subsequent optimization.

Local optimization and global optimization are then performed on the first frame pose. Specifically, the first frame pose $T_1$ is optimized by bundle adjustment to obtain the second frame pose $T_2$. The first RGB frame in each group is the key frame, and the second frame pose between the key frame and its adjacent RGB frame is the key pose $T_3$. The key poses are optimized by bundle adjustment to obtain the optimized key pose $T_4$ of each group of RGB images.

In the global optimization stage, the invention does not need to project the feature points of every frame in a group onto the first frame. Instead, it directly uses the relative poses and the optimized key pose $T_4$ of each group to obtain the poses other than the optimized key poses, $T_5$, that is, the updated second frame poses. The updated second frame poses $T_5$ and the optimized key poses $T_4$ together form the global pose. This works because the relative pose between two adjacent frame poses remains unchanged while the key poses change under global optimization, so the pose of every frame in the group changes correspondingly.

For example, assume that there are 4 frames of RGB images { a1, a2, a3, a4}, a1-a4 being RGB images;

acquiring the first frame pose of each pair of adjacent RGB frames by using the PnP algorithm, giving 3 first frame poses, namely t1, t2 and t3 for the pairs a1 and a2, a2 and a3, and a3 and a4;

optimizing the first frame poses t1, t2 and t3 by bundle adjustment to obtain the second frame poses t4, t5 and t6, that is, performing local optimization; and acquiring, from the second frame poses, the relative poses between adjacent second frame poses, namely $\Delta t_1$ between t4 and t5 and $\Delta t_2$ between t5 and t6; specifically, $\Delta t_1 = t_5 / t_4$ and $\Delta t_2 = t_6 / t_5$;

if the preset group number is 1, there is one group of RGB images A1 = {a1, a2, a3, a4}; the second frame pose corresponding to the first pair of adjacent frames in each group is taken as the key pose, that is, the second frame pose t4 is the key pose, and all key poses are optimized by bundle adjustment to obtain the optimized key pose t7;

Because the relative pose between two adjacent frame poses remains unchanged while the key pose changes in the global scope, every second frame pose in the group should change correspondingly, which yields the result of global optimization over all image frames. That is, based on the relative poses between adjacent second frame poses, the second frame poses are updated by using the optimized key pose (the second frame poses t5 and t6 are updated to obtain the updated second frame poses t8 and t9 respectively), and all updated second frame poses and all optimized key poses form the global pose (namely, the optimized key pose t7 and the updated second frame poses t8 and t9), so that global optimization is realized.
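The update in this example can be sketched under two stated assumptions: each pose is a 4x4 homogeneous rigid transform, and the quotient notation for the relative pose is read as multiplication by the inverse, so that the relative pose of t4 and t5 is t5 times the inverse of t4. The helper names `matmul`, `rigid_inverse`, `propagate_poses` and `translation` are hypothetical, and the bundle adjustment steps themselves are not shown.

```python
def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rigid_inverse(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def propagate_poses(optimized_key_pose, second_frame_poses):
    """Re-anchor a group's poses on its optimized key pose.

    second_frame_poses[0] is the key pose before global optimization.
    Relative poses between neighbours are kept unchanged.
    """
    updated = [optimized_key_pose]
    for prev, cur in zip(second_frame_poses, second_frame_poses[1:]):
        relative = matmul(cur, rigid_inverse(prev))  # assumed: cur * prev^-1
        updated.append(matmul(relative, updated[-1]))
    return updated


def translation(x):
    """Pure translation along the x axis, used only for the toy example."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


t4, t5, t6 = translation(1.0), translation(3.0), translation(6.0)
t7 = translation(1.5)  # key pose after global optimization
t7_out, t8, t9 = propagate_poses(t7, [t4, t5, t6])
print(t8[0][3], t9[0][3])
```

Only matrix products are needed for this step, so updating the non-key poses is cheap compared with re-optimizing every frame.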

It can be understood that, by using bundle adjustment with a three-layer pose optimization strategy (inter-frame, local and global), the accuracy and real-time performance of pose calculation are improved, thereby further realizing the semantic three-dimensional reconstruction method with multi-modal pose optimization.

It should be noted that, the manner of updating the second frame pose by using the optimized key pose based on the relative pose according to the embodiment of the present invention is well known to those skilled in the art, and therefore, the specific implementation manner thereof is not described too much.

Further, step 104 includes:

step 1041: generating a point cloud of the depth image;

in some embodiments, the point cloud of the depth image may be generated by, but is not limited to, back-projection through the camera model;

step 1042: based on the voxblox framework, acquiring a global three-dimensional model represented by TSDF by using a weighted average algorithm according to the global pose of the RGB image and the point cloud of the depth image;

step 1043: acquiring three-dimensional point clouds in the global three-dimensional model by using the marching cubes algorithm, wherein all the three-dimensional point clouds in the global three-dimensional model form the global semantic three-dimensional point cloud.

It should be noted that voxblox adopts a TSDF representation based on voxel mapping and provides three fusion methods, namely simple, merged and fast, for points at the same spatial position. The simple method performs a fusion calculation on the TSDF value in every voxel, which gives high precision at a large computational cost; the fast method treats several voxels as a group and fuses all TSDF values in the group, which reduces the calculation scale and speeds up the calculation at the cost of precision; merged is a compromise between the two. Because the semantic information of all points is to be fused later, the invention adopts the simple fusion method for reconstruction. Although having every frame participate in reconstruction makes the data richer and the fusion more thorough, the pixels of adjacent images are highly similar, so more redundant calculation occurs and the computational cost increases; the accuracy and the speed of the algorithm therefore need to be weighed together.

To address the problem that semantic three-dimensional reconstruction involves many modules and has high computing power requirements, the invention solves the camera pose with a multi-layer pose calculation method based on sparse SLAM and uses the semantic segmentation result to screen out outliers. Pixel classification is performed by a lightweight real-time semantic segmentation network, three-dimensional scene modeling is performed by a Voxblox-based real-time three-dimensional reconstruction framework, and fast, accurate and lightweight semantic three-dimensional reconstruction is thereby achieved.

Further, the method further comprises:

It should be noted that, in theory, the same TSDF is obtained when it is calculated from the two-dimensional pixels of the same object photographed in several frames of images. However, owing to the measurement errors of the sensor (camera), the TSDF values actually calculated do not coincide, and a weighted average algorithm is required to fuse them. Each pre-fusion TSDF value calculated from a different image corresponds to a semantic category probability, and a Bayesian method is used to obtain the final semantic category after the multi-frame TSDF values are fused.
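The weighted average fusion mentioned above can be sketched, under stated assumptions, as the running weighted average commonly used in TSDF fusion. The class `Voxel`, the function `integrate`, the constant per-observation weight, the truncation distance and the weight cap are hypothetical choices for this sketch and are not confirmed by the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Voxel:
    tsdf: float = 0.0    # fused truncated signed distance
    weight: float = 0.0  # accumulated observation weight


def truncate(sdf: float, trunc: float) -> float:
    """Clamp a signed distance to the interval [-trunc, trunc]."""
    return max(-trunc, min(trunc, sdf))


def integrate(
    voxel: Voxel,
    sdf: float,
    trunc: float,
    obs_weight: float = 1.0,   # assumed constant weight per observation
    max_weight: float = 100.0, # assumed cap so old data does not dominate
) -> Voxel:
    """Fuse one new signed-distance observation into a voxel."""
    d = truncate(sdf, trunc)
    total = voxel.weight + obs_weight
    voxel.tsdf = (voxel.weight * voxel.tsdf + obs_weight * d) / total
    voxel.weight = min(total, max_weight)
    return voxel
```

For instance, observations of 0.10 and 0.06 fused into an empty voxel give a TSDF of 0.08 with weight 2; the noisy per-frame values are thus averaged instead of overwriting one another.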

For example, as shown in fig. 3, the same TSDF corresponds to pixels observed under different poses, and the semantic segmentation results of these pixels are not exactly the same at each pose, as indicated in the figure by $P(u = c_i)$, $i \in \{1, 2, 3, 4, 5\}$. The multiple segmentation results are therefore fused and updated to obtain the most probable semantic class $P(u = c_{Fusion})$. The invention fuses the semantic information with the same Bayesian-update-based method as SemanticFusion, as shown in the formula:

$$P(c_i \mid I_{1,\dots,k}) = \frac{1}{Z}\, P(c_i \mid I_{1,\dots,k-1})\, P(u = c_i \mid I_k)$$

wherein $P$ represents the probability that the pixel belongs to a certain class, $c_i$ is a certain class, $I_k$ is the $k$-th frame image, and $K$ is the total number of frames of the image. $P(u = c_i \mid I_k)$ represents the probability, obtained by semantic segmentation of the $k$-th frame image, that pixel $u$ belongs to category $c_i$; $P(c_i \mid I_{1,\dots,k-1})$ represents the probability that pixel $u$ belongs to $c_i$ after the semantic segmentation results of frames 1 to $k-1$ have been combined; $P(c_i \mid I_{1,\dots,k})$ represents the fused probability that pixel $u$ belongs to $c_i$ after frames 1 to $k$ have been combined; $Z$ is a normalization constant. Through continuous updating, the semantic categories of the same TSDF at all moments can be integrated to obtain the final category attribute.

The semantic category under multiple angles is updated and fused by using a Bayesian method, so that the semantic three-dimensional reconstruction of global consistency is finally realized, and the reliability of the semantic three-dimensional reconstruction is improved.
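The recursive Bayesian update described above can be illustrated by the following minimal sketch. The function names `bayes_update` and `fuse_frames`, the uniform initial prior, and the fallback of keeping the prior when the normalization constant is zero are assumptions introduced for illustration.

```python
from typing import List, Sequence


def bayes_update(prior: Sequence[float], likelihood: Sequence[float]) -> List[float]:
    """One update step: posterior is proportional to prior times per-frame probability.

    prior      -- class probabilities fused over frames 1..k-1
    likelihood -- class probabilities from the segmentation of frame k
    """
    if len(prior) != len(likelihood):
        raise ValueError("prior and likelihood must cover the same classes")
    unnorm = [p * q for p, q in zip(prior, likelihood)]
    z = sum(unnorm)  # normalization constant Z
    if z == 0.0:
        # Assumed fallback: keep the prior if the frames fully contradict.
        return list(prior)
    return [x / z for x in unnorm]


def fuse_frames(frame_probs: Sequence[Sequence[float]], num_classes: int) -> List[float]:
    """Fuse the per-frame class probabilities of one TSDF, frame by frame."""
    fused = [1.0 / num_classes] * num_classes  # assumed uniform starting prior
    for probs in frame_probs:
        fused = bayes_update(fused, probs)
    return fused
```

The final category attribute would then be the index of the largest entry of the fused vector, for example `max(range(len(fused)), key=fused.__getitem__)`.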

According to the semantic three-dimensional reconstruction method for multi-mode pose optimization, provided by the invention, the RGB image and the depth image of the target area are acquired, semantic segmentation is carried out on the RGB image, the final mask image of the RGB image is obtained, the global pose of the RGB image is obtained by utilizing the RGB image, the depth image and the final mask image, and the global semantic three-dimensional point cloud is obtained according to the global poses of the depth image and the RGB image, so that the requirements on calculation force are lower, the applicability is wide, and the instantaneity and the efficiency of semantic three-dimensional reconstruction are improved.

Example two

A multi-modal pose optimized semantic three-dimensional reconstruction system, as shown in fig. 4, comprising:

and the semantic three-dimensional reconstruction module is used for acquiring a global semantic three-dimensional point cloud according to the global pose of the depth image and the RGB image.

Further, the semantic segmentation module includes:

the first acquisition module is used for carrying out semantic segmentation on the RGB image by utilizing the STDC network based on deep learning to obtain semantic class probability of each pixel point in the RGB image;

the second acquisition module is used for encoding, for each pixel point, the category with the highest probability among the semantic category probabilities of that pixel point as the mask of the pixel point, so as to obtain an initial mask map of the RGB image;

and a third acquisition module, configured to perform edge optimization on the initial mask map of the RGB image by using the superpixel segmentation method gSLICr, so as to obtain a final mask map of the RGB image.
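The mask generation performed by the second and third acquisition modules may be sketched as follows. The arg-max encoding follows the text; the majority vote inside each superpixel is only one plausible form of edge optimization and is an assumption, since the rule applied with gSLICr is not spelled out here. The function names `initial_mask` and `refine_with_superpixels` and the `segments` label map are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def initial_mask(probs: Sequence[Sequence[Sequence[float]]]) -> List[List[int]]:
    """probs[v][u] holds the class probabilities of pixel (u, v).

    Returns mask[v][u], the index of the most probable class.
    """
    return [[max(range(len(p)), key=p.__getitem__) for p in row] for row in probs]


def refine_with_superpixels(
    mask: Sequence[Sequence[int]],
    segments: Sequence[Sequence[int]],
) -> List[List[int]]:
    """Assumed refinement: give every pixel the majority label of its superpixel.

    segments[v][u] is the superpixel id of pixel (u, v), e.g. as produced by
    a superpixel segmentation method.
    """
    votes: Dict[int, Counter] = defaultdict(Counter)
    for mask_row, seg_row in zip(mask, segments):
        for label, seg in zip(mask_row, seg_row):
            votes[seg][label] += 1
    winner = {seg: counts.most_common(1)[0][0] for seg, counts in votes.items()}
    return [[winner[seg] for seg in seg_row] for seg_row in segments]
```

Because superpixels tend to follow image edges, voting inside each superpixel pulls ragged mask boundaries onto those edges.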

Further, the pose module includes:

the extraction module is used for extracting ORB characteristic points of the RGB image;

the fourth acquisition module is used for acquiring the depth value of the ORB characteristic point in the depth image, and performing projection coordinate conversion on the two-dimensional coordinates of the ORB characteristic point by utilizing the depth value of the ORB characteristic point in the depth image to obtain the coordinates of the ORB characteristic point in the three-dimensional space;

and a fifth acquisition module, configured to acquire a global pose of the RGB image by using a sparse SLAM method based on coordinates of the ORB feature points in the three-dimensional space and the final mask map.

Further, the fifth obtaining module is specifically configured to:

removing, by using the final mask map, the outliers among the ORB characteristic points with the same BRIEF descriptors in two adjacent frames of RGB images, and taking the ORB characteristic points with the same BRIEF descriptors that remain in the two adjacent frames of RGB images after outlier removal as matching points;

Further, removing the outliers among the ORB feature points with the same BRIEF descriptors in two adjacent frames of RGB images by using the final mask map includes:

when the masks corresponding to the two ORB feature points in a group of ORB feature points with the same BRIEF descriptor are different, the ORB feature points in that group are outliers and are eliminated.
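The mask-based outlier rule above can be sketched as follows, assuming that descriptor matching has already paired pixel coordinates between the two adjacent frames. The function name `filter_matches_by_mask` and the `(u, v)` pixel convention are hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int]      # (u, v) = (column, row)
Match = Tuple[Pixel, Pixel]  # feature in frame A, feature in frame B


def filter_matches_by_mask(
    matches: Sequence[Match],
    mask_a: Sequence[Sequence[int]],
    mask_b: Sequence[Sequence[int]],
) -> List[Match]:
    """Keep only the matches whose two feature points share a semantic mask.

    mask_a and mask_b are the final mask maps of the two adjacent frames,
    indexed as mask[v][u]. A pair with different masks is an outlier.
    """
    kept: List[Match] = []
    for (ua, va), (ub, vb) in matches:
        if mask_a[va][ua] == mask_b[vb][ub]:
            kept.append(((ua, va), (ub, vb)))
    return kept
```

The check costs one table lookup per feature point, which is why it can stand in for a more elaborate purely geometric screening mechanism.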

Further, acquiring the global pose of the RGB image by using the bundle adjustment method based on the first frame pose includes:

grouping all RGB images according to a preset group number;

Further, the semantic three-dimensional reconstruction module is specifically configured to:

generating a point cloud of the depth image;

based on the Voxblox framework, acquiring a global three-dimensional model represented by TSDF by using a weighted average algorithm according to the global pose of the RGB image and the point cloud of the depth image;

and acquiring the three-dimensional point clouds in the global three-dimensional model by using the marching cubes algorithm, wherein all the three-dimensional point clouds in the global three-dimensional model form the global semantic three-dimensional point cloud.

Further, the system further comprises:

and the updating module is used for updating the semantic category probability of the TSDF in the global three-dimensional model by using a Bayesian method when the global three-dimensional model represented by the TSDF is acquired by using a weighted average algorithm.

According to the multi-mode pose-optimized semantic three-dimensional reconstruction system provided by the invention, the RGB image and the depth image of the target area are acquired through the acquisition module, the semantic segmentation module performs semantic segmentation on the RGB image to obtain the final mask image of the RGB image, the pose module acquires the global pose of the RGB image by utilizing the RGB image, the depth image and the final mask image, and the semantic three-dimensional reconstruction module acquires the global semantic three-dimensional point cloud according to the global poses of the depth image and the RGB image, so that the requirement on calculation force is low, the applicability is wide, and the instantaneity and the efficiency of semantic three-dimensional reconstruction are improved.

It can be understood that the system embodiments provided above correspond to the method embodiments described above, and the corresponding specific details may be referred to each other, which is not described herein again.

It is to be understood that the same or similar parts of the above embodiments may be referred to each other, and that content not described in detail in one embodiment may be found in the same or similar content of the other embodiments.

Example three

Based on the same inventive concept, the invention also provides a computer device comprising a processor and a memory, the memory being used for storing a computer program comprising program instructions, and the processor being used for executing the program instructions stored in the computer storage medium. The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The processor is the computational core and control core of the terminal and is adapted to implement one or more instructions, in particular to load and execute one or more instructions in a computer storage medium so as to implement the corresponding method flow or corresponding function, thereby implementing the steps of the multi-modal pose optimized semantic three-dimensional reconstruction method in the above embodiments.

Example four

Based on the same inventive concept, the present invention also provides a storage medium, in particular, a computer readable storage medium (Memory), which is a Memory device in a computer device, for storing programs and data. It is understood that the computer readable storage medium herein may include both built-in storage media in a computer device and extended storage media supported by the computer device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the steps of a multi-modal pose optimized semantic three-dimensional reconstruction method in the above embodiments.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims

1. A semantic three-dimensional reconstruction method for multi-modal pose optimization, the method comprising:

collecting RGB images and depth images of a target area;

acquiring a global semantic three-dimensional point cloud according to the global pose of the depth image and the RGB image;

the obtaining the global pose of the RGB image by using the RGB image, the depth image and the final mask map includes:

extracting ORB characteristic points of the RGB image;

2. The method of claim 1, wherein the semantically segmenting the RGB image to obtain a final mask map of the RGB image comprises:

3. The method of claim 1, wherein the acquiring the global pose of the RGB image using the sparse SLAM method based on the coordinates of the ORB feature points in three-dimensional space and the final mask map comprises:

4. A method according to claim 3, wherein the removing outliers in ORB feature points with identical BRIEF descriptors in two adjacent frames of RGB images using the final mask map comprises:

5. A method according to claim 3, wherein the obtaining the global pose of the RGB image using the bundle adjustment method based on the first frame pose comprises:

grouping all RGB images according to a preset group number;

6. The method of claim 1, wherein the obtaining a global semantic three-dimensional point cloud from the global pose of the depth image and the RGB image comprises:

generating a point cloud of the depth image;

7. The method of claim 6, wherein the method further comprises:

8. A multi-modal pose optimized semantic three-dimensional reconstruction system, the system comprising:

the semantic three-dimensional reconstruction module is used for acquiring a global semantic three-dimensional point cloud according to the depth image and the global pose of the RGB image;

the pose module comprises:

9. A computer device, comprising: one or more processors;

a memory for storing one or more programs;

wherein the multi-modal pose optimized semantic three-dimensional reconstruction method according to any one of claims 1 to 7 is implemented when the one or more programs are executed by the one or more processors.

10. A computer readable storage medium, characterized in that a computer program is stored thereon, which, when executed, implements the multi-modal pose optimized semantic three-dimensional reconstruction method according to any one of claims 1 to 7.