CN116721210A - Real-time efficient three-dimensional reconstruction method and device based on neural signed distance field - Google Patents
Real-time efficient three-dimensional reconstruction method and device based on neural signed distance field
- Publication number
- CN116721210A (application number CN202310681075.XA)
- Authority
- CN
- China
- Prior art keywords
- sdf
- scene
- dimensional
- real
- depth
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/09—Supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/20—Finite element generation, e.g. wire-frame surface description, tesselation
Abstract
The invention discloses a real-time, efficient three-dimensional reconstruction method and device based on a neural signed distance field. The signed distance field of the scene is modeled with a hybrid representation that combines an explicit discrete voxel grid with a shallow implicit multi-layer perceptron network. A rendered depth map and a rendered normal-vector map under a specified pose are obtained by volume rendering, and the model is optimized using the real depth map, the real normal-vector map, and approximate SDF ground-truth values as supervision signals. The invention balances reconstruction speed and reconstruction quality: it performs dense three-dimensional reconstruction in real time, recovers the three-dimensional geometric structure of the space, and plausibly predicts unobserved regions.
Description
Technical Field
The invention belongs to the technical field of three-dimensional reconstruction, and particularly relates to a real-time, efficient three-dimensional reconstruction method and device based on a neural signed distance field (Signed Distance Field, SDF).
Background
Dense three-dimensional reconstruction is an important research topic in computer vision and computer graphics and a key technology for robot localization and navigation. Its purpose is to scan an indoor scene with a visual or depth sensor to acquire data, and finally to recover an accurate and complete three-dimensional model of the scene from the noisy data.
Common three-dimensional scene reconstruction methods can be divided into two types according to the input information: 3D reconstruction based on texture information and 3D reconstruction based on depth information.
3D reconstruction based on RGB texture information generally adopts multi-view stereo: dense correspondences are established across multiple images with known camera poses to generate a three-dimensional point-cloud reconstruction of the scene. The performance of multi-view stereo algorithms depends heavily on the lighting conditions and the richness of textures, and such algorithms may still fail in regions where camera viewpoints are similar. In indoor scenes with highly similar geometric structures and large textureless areas (walls and floors) in particular, reconstruction algorithms that use only RGB information tend to perform poorly. Moreover, dense 3D reconstruction based on RGB texture information often takes a long time (on the order of hours) and cannot achieve real-time dense three-dimensional reconstruction.
In recent years, with the rapid development of depth sensors such as lidar and RGB-D cameras, 3D scene reconstruction has advanced considerably, and many 3D reconstruction algorithms based on depth information have been developed. A depth sensor perceives the scene in three dimensions and provides geometric information independent of visual appearance, which can significantly improve the 3D reconstruction result. Taking RGB-D data with known poses as input, depth frames acquired from different viewpoints are fused by a specific method, and the scene is finally reconstructed along the views obtained from the camera trajectory. However, due to their inherent nature, depth measurements are typically noisy and incomplete, so 3D reconstruction algorithms based directly on depth sensors suffer from rough and incomplete reconstruction results. Meanwhile, achieving real-time dense 3D reconstruction often places high demands on computer hardware and occupies a large amount of GPU memory.
Therefore, existing dense three-dimensional reconstruction algorithms cannot simultaneously achieve real-time reconstruction speed, high reconstruction accuracy, complete reconstruction results, and low computing-platform cost, which makes them difficult to apply directly to robot localization and navigation.
Disclosure of Invention
The invention aims to provide a real-time dense three-dimensional reconstruction technique that reconstructs a scene densely in real time, guarantees high reconstruction accuracy, plausibly predicts and fills regions not observed by the sensor, and balances reconstruction speed, reconstruction quality, memory efficiency, and computational resources, thereby overcoming the shortcomings of existing dense three-dimensional reconstruction techniques such as slow reconstruction, low accuracy, and incomplete results.
The aim of the invention is realized by the following technical scheme:
according to a first aspect of the present invention, there is provided a real-time efficient three-dimensional reconstruction method based on a neural signed distance field, the method comprising the following steps:
s1, acquiring a depth image stream with known poses of the three-dimensional scene to be reconstructed using a depth camera, and screening the depth image stream for keyframes to construct a keyframe set;
s2, modeling the signed distance field of the scene with a hybrid representation combining an explicit discrete voxel grid and a shallow implicit multi-layer perceptron (MLP) network, comprising the following steps:
dividing a scene into discrete voxel grid structures according to a set resolution, wherein the voxel grids encapsulate the characteristics of geometric distribution of the scene; the discrete voxel grid is regarded as a four-dimensional characteristic tensor, and the four-dimensional characteristic tensor corresponding to the scene is decomposed into a plurality of compact low-rank tensor components by using a tensor decomposition technology;
obtaining the geometric feature tensor of each spatial point in the scene at the set resolution through trilinear interpolation, encoding the spatial points and sending them into the MLP network for decoding, and outputting the property of the three-dimensional scene space, namely the signed distance value SDF of each spatial point;
rendering by using a volume rendering technology to obtain a rendering depth map and a rendering normal vector map under a specified pose, and optimizing a model by using a real depth map, a real normal vector map and an approximate SDF true value as supervision signals;
s3, extracting the scene surface by extracting the zero level set of the SDF, thereby realizing the visualization of the three-dimensional reconstruction result.
Further, data of the three-dimensional scene to be reconstructed is recorded with a depth camera to obtain a depth image stream, and the pose of each depth image is simultaneously obtained using SLAM (Simultaneous Localization and Mapping) technology.
Further, performing keyframe screening on the acquired depth image stream comprises: maintaining a keyframe set, and, when a new depth image is input, taking that frame as the current keyframe if its relative pose change is larger than a preset threshold and adding it to the maintained keyframe set for online training.
Further, the discrete voxel grid corresponding to the scene is regarded as a four-dimensional feature tensor, in which three dimensions correspond to the X, Y, and Z axes respectively and the fourth dimension is the number of feature channels stored in the grid. Tensor decomposition of the four-dimensional feature tensor comprises: selecting a vector-matrix decomposition technique to decompose the four-dimensional feature tensor Φ(x) into several groups of combinations of vectors v and matrices M along the X, Y, and Z axes respectively, x being a three-dimensional spatial point in the scene; for the feature-channel dimension, a vector b, rather than a matrix, is combined with the other dimensions.
Further, the MLP network has 4 hidden layers with 128 neurons per layer, and uses the ReLU function as the activation function.
Further, the feature tensors fed into the MLP network are position-encoded; the positional encoding maps the input features to a high-dimensional space.
Further, the loss functions of the model include an SDF loss function, a free-space loss function, a depth loss function, and a normal-vector loss function.
Further, the SDF loss function and the free-space loss function are calculated as follows:
for all sample points on a ray projected from the optical center through a pixel, approximate true SDF supervision is derived from the depth value observed by the depth camera, and a bound b on the SDF is provided by calculating the distance from the sample point to the surface point;
within a truncation distance, the SDF loss function L_sdf is applied, supervising the SDF with the bound b;
in the free space outside the truncation distance, the free-space loss function L_fs is applied, comprising: when the SDF prediction is positive and less than the bound b, the loss is zero; when the SDF prediction is positive and greater than or equal to the bound b, a linear constraint is applied; when the SDF prediction is negative, an exponential penalty is applied.
Further, the model's loss function also includes an Eikonal loss function to encourage the model to learn a valid SDF value.
According to a second aspect of the present invention, there is provided a real-time efficient three-dimensional reconstruction device based on a neural signed distance field, comprising a memory and one or more processors, the memory storing executable code, and the processors, when executing the executable code, implementing the above real-time efficient three-dimensional reconstruction method based on a neural signed distance field.
The beneficial effects of the invention are as follows. The invention models the SDF of a scene by combining a discrete four-dimensional voxel grid with a continuous multi-layer perceptron network, and further provides a lightweight model scheme to achieve real-time operation and save storage: tensor decomposition is applied to the discrete voxel grid, decomposing the four-dimensional scene tensor into several compact low-rank tensor components and thereby reducing the spatial complexity. A rendered depth map and a rendered normal-vector map under a specified pose are obtained by volume rendering, and the model is optimized using the real depth map, the real normal-vector map, and approximate SDF ground-truth values as supervision signals. At the same time, an Eikonal constraint encourages the model to learn valid SDF values. The invention balances reconstruction speed and quality, performs dense three-dimensional reconstruction in real time, recovers the spatial three-dimensional geometric structure, plausibly predicts unobserved regions, and can be used for robot localization and navigation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of an overall framework provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of real-time reconstruction results over time of a scene according to the method of the present invention according to an embodiment of the present invention;
FIG. 3 is a graph comparing the final reconstruction results of the method of the present invention with other reconstruction methods under multiple scenarios provided by an embodiment of the present invention;
fig. 4 is a block diagram of a real-time efficient three-dimensional reconstruction device based on a neural signed distance field according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is evident that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.
The embodiment of the invention provides a real-time efficient three-dimensional reconstruction method based on a neural signed distance field. Fig. 1 is a schematic diagram of the overall framework provided by the embodiment; the method is implemented with a depth camera, and each step of the method is explained in detail below.
Firstly, recording data of a three-dimensional scene to be reconstructed by using a depth camera to obtain a depth image stream, and simultaneously obtaining the pose of each depth image by using a mature SLAM technology.
To achieve real-time incremental three-dimensional reconstruction, the input depth images must be processed sequentially. Since the frame rate at which the sensor acquires images is typically high, the full input image stream cannot be processed in real time; the stream must therefore be filtered, selecting a keyframe set as input. The invention maintains a keyframe set: when a new depth image is input, if its relative pose change is larger than a certain threshold, the frame is taken as the current keyframe and added to the maintained keyframe set for online training. Specifically, the relative pose change is decomposed into a relative translation and a relative rotation: a frame is selected as a keyframe when its relative translation exceeds the threshold t_max or its relative rotation exceeds the threshold R_max. Compared with keyframe maintenance strategies based on information-gain measures, this method is simpler and saves time.
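The keyframe screening rule above can be sketched in a few lines. This is a minimal sketch: the threshold values and the quaternion pose format are illustrative assumptions, not taken from the patent.

```python
import math

# Keyframe screening sketch: a new depth frame becomes a keyframe when
# its pose differs from the last keyframe by more than a translation
# threshold t_max or a rotation threshold R_max. A pose is modeled as
# (translation (x, y, z), unit quaternion (w, x, y, z)); both
# thresholds below are assumed values, not the patent's.

T_MAX = 0.10   # assumed relative translation threshold, meters
R_MAX = 10.0   # assumed relative rotation threshold, degrees

def rotation_angle_deg(q1, q2):
    """Relative rotation angle between two unit quaternions, in degrees."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)                      # guard against rounding
    return math.degrees(2.0 * math.acos(dot))

def is_keyframe(pose, last_kf_pose, t_max=T_MAX, r_max=R_MAX):
    (t1, q1), (t2, q2) = last_kf_pose, pose
    translation = math.dist(t1, t2)
    rotation = rotation_angle_deg(q1, q2)
    return translation > t_max or rotation > r_max

def select_keyframes(poses):
    """Maintain the keyframe set incrementally over an input pose stream."""
    keyframes = [poses[0]]                   # first frame starts the set
    for pose in poses[1:]:
        if is_keyframe(pose, keyframes[-1]):
            keyframes.append(pose)
    return keyframes
```

A frame that moves only 5 cm with no rotation is skipped, while a 20 cm move or a 20-degree turn is admitted to the set.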
The method models the signed distance field of the scene with a hybrid representation combining an explicit discrete voxel grid and a shallow implicit multi-layer perceptron network. First, an appropriate resolution a (in cm) is selected according to the size of the scene, and the three-dimensional space is divided into voxels of size a, each voxel carrying c feature channels. Let the scene size be x × y × z. The discrete voxel grid can then be regarded as a four-dimensional feature tensor in which three dimensions correspond to the X, Y, and Z axes respectively and one dimension is the number of feature channels stored in the grid, so the four-dimensional feature tensor corresponding to the scene has size (x/a) × (y/a) × (z/a) × c. On this basis, a tensor decomposition technique decomposes the four-dimensional feature tensor into several compact low-rank tensor components, yielding a more compact scene model.
Further, the tensor decomposition technique decomposes the four-dimensional feature tensor corresponding to the scene into several compact low-rank tensor components. Specifically, vector-matrix decomposition (VM decomposition for short) is selected, namely, the four-dimensional feature tensor Φ(x) is decomposed into several groups of combinations of vectors v and matrices M along the X, Y, and Z axes respectively:

Φ(x) = Σ_{r=1..R} v_r^X ∘ M_r^{YZ} + v_r^Y ∘ M_r^{XZ} + v_r^Z ∘ M_r^{XY}

where R denotes the number of decomposition groups, x = (x, y, z) is a three-dimensional spatial point in the scene, v_r^X is a vector along the X axis, and M_r^{YZ} is a matrix over the Y-Z plane (and analogously for the other axes).
The four-dimensional feature tensor of the scene has, besides the 3 coordinate-axis dimensions corresponding to the X, Y, and Z axes, 1 dimension corresponding to the feature channels, and the dimension of the feature channels is far lower than that of the coordinate axes; the invention therefore combines a vector b, rather than a matrix, with the other dimensions. The four-dimensional feature tensor of the scene can thus be further written as:

Φ(x) = Σ_{r=1..R} v_r^X ∘ M_r^{YZ} ∘ b_{3r-2} + v_r^Y ∘ M_r^{XZ} ∘ b_{3r-1} + v_r^Z ∘ M_r^{XY} ∘ b_{3r}

where the b_j are vectors along the feature-channel dimension.
as for the MLP network part, the present invention constructs an MLP network with 4 hidden layers, 128 neural units per layer, using the ReLU function as the activation function. In order to learn finer geometries, the present invention performs position coding on the feature tensor Φ (x) fed into the MLP network, and maps the input features to a high-dimensional space through position coding to better fit the geometric distribution containing high-frequency variations.
Based on the above hybrid scene modeling, the entire modeling process can be summarized as an encoding-decoding process: the geometric feature tensor Φ(x) of a three-dimensional spatial point x = (x, y, z) in the scene is obtained by trilinear interpolation, and Φ(x) is then fed into the shallow MLP network f for decoding to obtain the signed distance value SDF(x) of that point:
SDF(x)=f(x,Φ(x))
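The encoding half of this encode-decode step, trilinear interpolation of the features at the eight corners of the voxel containing the query point, can be sketched as follows. Scalar per-corner features are an illustrative simplification, and the decoder MLP f is omitted.

```python
# Trilinear interpolation sketch: blend the features at the 8 corners of
# the enclosing voxel by the fractional position of the query point.

def trilinear(corner_feats, tx, ty, tz):
    """corner_feats[ix][iy][iz] is the feature (a float here) at voxel
    corner (ix, iy, iz) in {0,1}^3; (tx, ty, tz) in [0,1]^3 are the
    fractional coordinates of the query point inside the voxel."""
    out = 0.0
    for ix in (0, 1):
        for iy in (0, 1):
            for iz in (0, 1):
                w = ((tx if ix else 1 - tx)
                     * (ty if iy else 1 - ty)
                     * (tz if iz else 1 - tz))
                out += w * corner_feats[ix][iy][iz]
    return out
```

The eight weights sum to one, so a feature field that varies linearly along one axis interpolates exactly, which is what makes the grid representation continuous between voxel corners.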
the above yields SDF values for sample points in 3D space, while the supervision signal obtained in the embodiment of the invention is a 2D depth map, so the predicted 3D information must be converted into 2D information. The invention adopts an SDF-based neural rendering technique, rendering depth images from the implicit SDF and minimizing the difference between the real and rendered depth images to optimize the model.
The absolute value of the SDF reflects the distance between a spatial point and the nearest surface. The SDF can therefore be converted into weights through a transformation, so that a weighted sum over the spatial points along a ray renders the depth value of the corresponding pixel, completing the projection of the 3D object or scene onto the 2D image.
Specifically, suppose a ray r is emitted from the optical center o of the depth camera, passes through a pixel (u, v) of the depth image, and is back-projected into 3D space. Let there be N sample points on the ray, the i-th at distance d_i from the optical center, so that x_i = o + d_i·r, i ∈ {1, …, N}. The rendered pixel depth D̂(r) of the ray is calculated as:

D̂(r) = Σ_{i=1..N} ω_i·d_i

where ω_i is the weight corresponding to sample point i, which has the unbiased and occlusion-aware properties:

ω_i = T_i·α_i

where T_i = Π_{j<i} (1 − α_j) denotes the discrete accumulated transmittance at sample point i, and α_i, corresponding to the volume density (opacity) of classical volume rendering, can be further expressed as:

α_i = max((Φ_s(SDF(x_i)) − Φ_s(SDF(x_{i+1}))) / Φ_s(SDF(x_i)), 0)

where SDF(x_i) denotes the SDF value at the spatial point x_i, Φ_s(x) = (1 + e^{−s·x})^{−1}, and s is a preset hyperparameter.
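The rendering step can be sketched numerically for a single ray. A NeuS-style alpha construction is assumed here, and the sharpness s and the sample spacing are illustrative choices rather than the patent's settings.

```python
import math

# SDF-based volume rendering sketch for one ray (assumed NeuS-style
# weights): given SDF values at N samples with distances d_i from the
# optical center, form occlusion-aware weights w_i = T_i * alpha_i and
# render depth as the weighted sum of sample distances.

def phi_s(x, s):
    """Sigmoid Phi_s(x) = (1 + exp(-s*x))^-1 with sharpness s."""
    return 1.0 / (1.0 + math.exp(-s * x))

def render_depth(dists, sdfs, s=50.0):
    n = len(dists)
    alphas = []
    for i in range(n - 1):
        num = phi_s(sdfs[i], s) - phi_s(sdfs[i + 1], s)
        alphas.append(max(num / max(phi_s(sdfs[i], s), 1e-8), 0.0))
    depth, transmittance = 0.0, 1.0
    for i, a in enumerate(alphas):
        w = transmittance * a          # w_i = T_i * alpha_i
        depth += w * dists[i]
        transmittance *= (1.0 - a)
    return depth
```

For a ray whose SDF samples cross zero at distance 1.0, the weights concentrate around the zero crossing and the rendered depth lands near 1.0, which is exactly the unbiased, occlusion-aware behavior the text describes.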
Similarly, a 2D rendered normal-vector map can be obtained. Since the SDF gradient at a spatial point on the surface equals the surface normal vector, the SDF is differentiated to obtain its gradient values, and the SDF gradients n_i of all sample points are summed with the same weights to compute the rendered pixel normal N̂(r) of the ray:

N̂(r) = Σ_{i=1..N} ω_i·n_i
in order to enable the model of the invention to learn the SDF distribution of the scene, the following loss function is adopted to optimize the scene expression, and supervision is carried out:
wherein,,representing a loss function loss; θ is a list of parameters that need to be optimized, including four-dimensional feature tensors and MLP network parameters; λ is the weight of each loss function.
The meaning of each loss function is as follows:
(1) SDF loss function L_sdf and free-space loss function L_fs
To accelerate the model's learning of the scene SDF and achieve real-time reconstruction, the invention makes full use of the depth information to provide approximate SDF ground-truth supervision. For every sample point on a ray r projected from the optical center o through a pixel (u, v), the real SDF supervision is approximated from the depth value D(u, v) observed by the depth camera, and a bound b on the signed distance is provided by the distance from the sample point x to the surface point p:

b(x, p) = D(u, v) − d

where d is the distance from the sample point x to the optical center o.
The bound b provides more accurate supervision the closer a spatial point is to the surface, and less reliable supervision the farther away it is. The invention therefore applies supervision with the bound b only within a specific range from the surface, called the truncation distance t; within t, the following SDF loss is applied:

L_sdf = (1/|S_t|) Σ_{y ∈ S_t} |SDF(y) − b(y)|

where L_sdf is the SDF loss function, S_t denotes the set of all sample points with valid SDF predictions within the truncation distance t, SDF(y) is the SDF prediction at sample point y, and b(y) is the bound value at sample point y.
In the space outside the truncation distance t (called free space), the free-space loss function L_fs is applied:
when the SDF prediction SDF(y) is positive and less than the bound b, the loss is zero;
when the SDF prediction SDF(y) is positive and greater than or equal to the bound b, a linear constraint SDF(y) − b(y) is imposed;
when the SDF prediction SDF(y) is negative, an exponential penalty e^{−β·SDF(y)} − 1 is applied, where β is a hyperparameter, typically set to 5.
These three cases can be arranged as:

L_fs(y) = max(0, e^{−β·SDF(y)} − 1, SDF(y) − b(y))
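The truncated SDF supervision and the free-space cases can be sketched together as per-point penalties. The L1 form of the SDF term and the max-form free-space penalty are assumptions consistent with the description, not formulas copied from the patent.

```python
import math

BETA = 5.0    # exponent for the penetration penalty (set to 5 above)
TRUNC = 0.1   # assumed truncation distance t, meters

def sdf_loss(pred, bound):
    """Within the truncation distance: pull the SDF prediction toward
    the approximate ground-truth bound b (an L1 penalty is assumed)."""
    return abs(pred - bound)

def free_space_loss(pred, bound, beta=BETA):
    """Outside the truncation distance, combining the three cases:
    zero when 0 <= pred < b, linear when pred >= b, and exponential
    when pred < 0 (i.e. wrongly predicting inside the surface)."""
    return max(0.0, math.exp(-beta * pred) - 1.0, pred - bound)
```

Note how the exponential branch punishes negative predictions in free space much harder than the linear branch punishes overshooting the bound, which matches the intent of keeping observed free space empty.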
(2) Depth loss function
The invention supervises with the valid depths obtained by the depth camera:

L_D = (1/|R|) Σ_{r ∈ R} |D̂(r) − D(r)|

where R is the set of camera rays from the optical center through the pixels, D̂(r) is the rendered pixel depth of ray r, and D(r) is the pixel depth measured by the depth camera.
(3) Eikonal loss function L_eik
The gradient of the SDF satisfies the Eikonal equation ‖∇SDF(x)‖ = 1. Applying the Eikonal loss function L_eik encourages the model to learn a better SDF:

L_eik = (1/|S|) Σ_{y ∈ S} (‖∇SDF(y)‖ − 1)²

where S denotes the set of all sample points with valid SDF predictions.
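A numerical sketch of the Eikonal penalty follows; finite differences stand in here for the automatic-differentiation gradient that would be used in practice.

```python
# Eikonal penalty sketch: for a valid SDF the gradient norm is 1
# everywhere, so (||grad SDF|| - 1)^2 is driven to zero. Central
# finite differences approximate the gradient (illustrative only).

def numeric_grad(sdf, p, eps=1e-4):
    g = []
    for axis in range(3):
        lo, hi = list(p), list(p)
        lo[axis] -= eps
        hi[axis] += eps
        g.append((sdf(hi) - sdf(lo)) / (2 * eps))
    return g

def eikonal_loss(sdf, points, eps=1e-4):
    total = 0.0
    for p in points:
        g = numeric_grad(sdf, p, eps)
        norm = sum(c * c for c in g) ** 0.5
        total += (norm - 1.0) ** 2
    return total / len(points)
```

A plane SDF such as z - 0.5 has gradient norm exactly 1 and incurs no penalty, while a scaled field like 2z (distances counted double) is penalized, which is precisely the degenerate solution this loss rules out.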
(4) Normal vector loss function
The invention renders a predicted normal vector using all the sample points on a ray and supervises it with the real normal vector:

L_N = (1/|R|) Σ_{r ∈ R} ‖N̂(r) − N(r)‖

where R is the set of camera rays from the optical center through the pixels, N̂(r) is the rendered normal vector, and N(r) is the real normal vector. The normal-vector prior provides an additional geometric constraint, and practice shows that it significantly improves both the quality and the speed of scene reconstruction.
The invention takes a series of depth images with known poses as input and, through online training, outputs in real time a three-dimensional reconstruction of the scene expressed as a signed distance field; the scene surface can be extracted from the zero level set of the SDF, realizing visualization of the three-dimensional reconstruction result.
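Extracting the surface from the zero level set reduces, per voxel edge, to locating the sign change of the SDF; a minimal sketch of that single-edge step follows (marching cubes applies it to every edge of every voxel to build the mesh; this fragment is illustrative only).

```python
# Zero-crossing sketch: linearly interpolate the point where the SDF
# changes sign between two grid samples, i.e. where it crosses zero.

def zero_crossing(p0, p1, sdf0, sdf1):
    """Return the point on segment p0 -> p1 where the linearly
    interpolated SDF is zero, or None if there is no sign change."""
    if sdf0 * sdf1 > 0 or sdf0 == sdf1:
        return None
    t = sdf0 / (sdf0 - sdf1)   # fraction of the way from p0 to p1
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))
```

With sdf0 = 0.25 and sdf1 = -0.75 the crossing sits a quarter of the way along the edge, directly on the reconstructed surface.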
FIG. 2 is a graphical representation of the results of the reconstruction of the method of the present invention over time over a scene in one embodiment; FIG. 3 is a graph comparing the final reconstruction results of the method of the present invention with other reconstruction methods under multiple scenarios in one embodiment.
In summary, the invention proposes a real-time SDF three-dimensional reconstruction method. A hybrid scene representation combining an explicitly compressed discrete voxel grid with an implicit MLP network describes the signed distance field of the scene: the feature voxel grid encodes rich scene geometry and accelerates the learning of the SDF, and tensor decomposition of the feature voxel grid makes the features more compact, further increasing the SDF learning speed. This lightweight scene model achieves real-time reconstruction speed and better reconstruction results, consumes few computational resources, occupies little storage, and can easily be ported to mobile robot platforms. In addition, thanks to the properties of the neural SDF, the method provides a continuous SDF distribution, can reconstruct at arbitrary resolution, and delivers more complete and dense reconstruction results.
Corresponding to the embodiment of the above real-time efficient three-dimensional reconstruction method based on a neural signed distance field, the invention also provides an embodiment of a real-time efficient three-dimensional reconstruction device based on a neural signed distance field.
Referring to fig. 4, the real-time efficient three-dimensional reconstruction device based on a neural signed distance field according to the embodiment of the invention includes a memory and one or more processors, where the memory stores executable code, and the processors, when executing the executable code, implement the real-time efficient three-dimensional reconstruction method based on a neural signed distance field of the above embodiment.
The embodiment of the real-time efficient three-dimensional reconstruction device based on a neural signed distance field can be applied to any device with data-processing capability, such as a computer. The device embodiment may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device in the logical sense is formed by the processor of the data-processing device reading the corresponding computer program instructions from non-volatile memory into memory for execution. In terms of hardware, fig. 4 shows a hardware structure diagram of the data-processing device on which the device of the invention is located; besides the processor, memory, network interface, and non-volatile memory shown in fig. 4, the device generally includes other hardware according to its actual function, which is not described here.
The implementation process of the functions and roles of each unit in the above device is detailed in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, since they essentially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
The embodiment of the invention also provides a computer readable storage medium on which a program is stored; when the program is executed by a processor, the real-time efficient three-dimensional reconstruction method based on the neural signed distance field in the above embodiment is realized.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in any of the previous embodiments. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash memory card (Flash Card). Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. The computer readable storage medium is used for storing the computer program and other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
The foregoing description of the preferred embodiments is merely intended to illustrate the present invention, and is not intended to limit the invention to the particular embodiments described.
Claims (10)
1. The real-time efficient three-dimensional reconstruction method based on the neural signed distance field is characterized by comprising the following steps:
s1, acquiring a depth image stream of a known pose of a three-dimensional scene to be reconstructed by using a depth camera, and screening key frames of the depth image stream to construct a key frame set;
s2, modeling a signed distance field of a scene by adopting a mixed scene modeling mode of an explicit discrete voxel grid and a shallow implicit MLP network, wherein the method comprises the following steps:
dividing the scene into a discrete voxel grid structure according to a set resolution, wherein the voxel grid encodes the geometric distribution of the scene; the discrete voxel grid is regarded as a four-dimensional feature tensor, and the four-dimensional feature tensor corresponding to the scene is decomposed into a plurality of compact low-rank tensor components using a tensor decomposition technique;
obtaining the geometric feature tensor of each spatial point in the scene at the set resolution through trilinear interpolation, encoding the spatial point and then feeding it into an MLP (multi-layer perceptron) network for decoding, and outputting the property of the scene at that point in three-dimensional space, namely the signed distance value SDF of each spatial point;
rendering with a volume rendering technique to obtain a rendered depth map and a rendered normal vector map under a specified pose, and optimizing the model using the real depth map, the real normal vector map and an approximate SDF ground truth as supervision signals;
s3, extracting the scene surface by extracting the zero level set of the SDF, thereby realizing the visualization of the three-dimensional reconstruction result.
2. The method for real-time efficient three-dimensional reconstruction based on the neural signed distance field according to claim 1, wherein the three-dimensional scene to be reconstructed is recorded with a depth camera to obtain a depth image stream, and the pose of each depth image is simultaneously obtained using SLAM technology.
3. The method of claim 1, wherein performing keyframe screening on the acquired depth image stream comprises: maintaining a key frame set; when a new depth image frame is input, if its relative pose change is larger than a preset threshold, taking it as the current key frame and adding it to the maintained key frame set for online training.
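The keyframe rule in claim 3 can be sketched in a few lines. The patent does not specify the pose-change metric, so the combined translation-plus-rotation distance and the threshold below are illustrative assumptions, as are the names `pose_change` and `select_keyframes`.

```python
import math

# Sketch of the keyframe screening in claim 3: a new frame becomes a
# keyframe when its relative pose change w.r.t. the last keyframe exceeds
# a preset threshold. A pose is simplified here to (translation xyz,
# rotation angle in radians).

def pose_change(pose_a, pose_b, rot_weight=1.0):
    """Combined translation + weighted rotation distance between two poses."""
    (ta, ra), (tb, rb) = pose_a, pose_b
    trans = math.dist(ta, tb)
    rot = abs(ra - rb)
    return trans + rot_weight * rot

def select_keyframes(stream, threshold=0.3):
    """Maintain the keyframe set: keep a frame only if it moved enough."""
    keyframes = [stream[0]]              # first frame is always a keyframe
    for pose in stream[1:]:
        if pose_change(keyframes[-1], pose) > threshold:
            keyframes.append(pose)
    return keyframes

stream = [((0.0, 0.0, 0.0), 0.0),
          ((0.1, 0.0, 0.0), 0.0),        # small motion -> skipped
          ((0.5, 0.0, 0.0), 0.0),        # large translation -> keyframe
          ((0.5, 0.0, 0.0), 0.5)]        # large rotation -> keyframe
print(len(select_keyframes(stream)))     # -> 3
```

Thresholding against the last accepted keyframe (rather than the previous raw frame) keeps the set sparse when the camera lingers in one place.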
4. The method of claim 1, wherein the discrete voxel grid corresponding to the scene is regarded as a four-dimensional feature tensor, wherein three dimensions respectively correspond to the X, Y, Z axes and the fourth dimension is the number of feature channels stored in the grid, and tensor decomposition is performed on the four-dimensional feature tensor, comprising: selecting a vector-matrix decomposition technique, and decomposing the four-dimensional feature tensor Φ(x) along the X, Y and Z axes into several groups of combinations of vectors V and matrices M, where x is a three-dimensional spatial point in the scene; vectors b are selected to combine with the feature channel dimension.
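The vector-matrix decomposition in claim 4 factors the 4D feature tensor into low-rank components, each pairing a vector along one spatial axis with a matrix over the other two axes and a channel vector b. The sketch below reconstructs the tensor from X-axis components only; real use sums analogous terms over all three axes, and all names here are illustrative.

```python
# Sketch of the vector-matrix (VM) decomposition in claim 4: the 4D
# feature tensor T is approximated by a sum of low-rank components
#   T[x][y][z][c] = sum_r v_x[r][x] * m_yz[r][y][z] * b[r][c]
# so an N^3 grid with C channels is stored in O(R * (N + N^2 + C)).

def vm_reconstruct(v_x, m_yz, b, shape):
    """Rebuild the dense 4D tensor from R vector/matrix/channel components."""
    X, Y, Z, C = shape
    R = len(v_x)                          # number of low-rank components
    return [[[[sum(v_x[r][x] * m_yz[r][y][z] * b[r][c] for r in range(R))
               for c in range(C)]
              for z in range(Z)]
             for y in range(Y)]
            for x in range(X)]

# One rank-1 component on a 2x2x2 grid with a single feature channel:
v_x  = [[1.0, 2.0]]                       # vector along the X axis
m_yz = [[[1.0, 0.5], [0.5, 0.25]]]        # matrix over the (Y, Z) plane
b    = [[3.0]]                            # vector over feature channels
T = vm_reconstruct(v_x, m_yz, b, (2, 2, 2, 1))
print(T[1][0][1][0])                      # 2.0 * 0.5 * 3.0 -> 3.0
```

The compactness claim follows directly: for the 2x2x2 example above, 8 tensor entries are generated from 2 + 4 + 1 = 7 stored numbers, and the gap widens rapidly with grid resolution.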
5. The method of claim 1, wherein the MLP network has 4 hidden layers with 128 neurons per layer, and uses the ReLU function as the activation function.
6. The method of claim 1, wherein the feature tensors fed into the MLP network are positionally encoded, the positional encoding mapping the input features to a high-dimensional space.
7. The method of claim 1, wherein the model's loss functions include SDF loss functions, free space loss functions, depth loss functions, and normal vector loss functions.
8. The method of claim 7, wherein the method of calculating the SDF and free space loss functions is as follows:
for all sample points on the ray cast from the camera optical center through a pixel, approximate true SDF supervision is derived from the depth value observed by the depth camera: the bound b for the SDF is obtained by computing the distance from the sample point to the observed surface point;
within the cutoff (truncation) distance, an SDF loss function is applied, performing SDF supervision with the bound b;
for free space beyond the cutoff distance, a free-space loss function is applied, comprising: when the predicted SDF value is positive and less than the bound b, the corresponding free-space penalty term is applied; when the predicted SDF value is positive and greater than or equal to the bound b, a linear constraint is applied; and when the predicted SDF value is negative, an exponential penalty is applied.
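The per-sample losses of claim 8 can be sketched as follows. The exact formulas appear as images in the original claim and are not reproduced here, so the specific penalties below (L1 supervision inside the truncation band, zero loss for plausible free-space predictions, and the beta parameter) are common choices assumed for illustration.

```python
import math

# Sketch of the SDF and free-space losses in claim 8, for one sample on a
# camera ray. `b` is the bound: the signed distance from the sample point
# to the observed surface point (camera depth minus sample depth).

def free_space_loss(pred, b, beta=5.0):
    """Free-space penalty for a sample far in front of the surface."""
    if pred < 0:                              # predicted inside geometry
        return math.exp(-beta * pred) - 1.0   # exponential penalty
    if pred >= b:                             # overshoots the bound
        return pred - b                       # linear penalty
    return 0.0          # 0 < pred < b: assumed unpenalized in this sketch

def sdf_loss(pred, b, truncation):
    """Route a sample to near-surface SDF supervision or free-space loss."""
    if abs(b) <= truncation:
        return abs(pred - b)                  # direct supervision with b
    return free_space_loss(pred, b)

print(sdf_loss(0.03, 0.05, truncation=0.1))  # near surface -> ~0.02
print(free_space_loss(-0.1, 0.5))            # negative pred -> ~0.6487
```

The exponential branch punishes negative predictions in observed free space hardest, since they would carve spurious surfaces in front of the camera.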
9. The method of claim 7, wherein the model's loss function further comprises an eikonal loss function for encouraging the model to learn valid SDF values.
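The eikonal loss of claim 9 rests on the property that a valid SDF has unit gradient norm everywhere. The sketch below estimates the gradient with central finite differences for clarity (in practice automatic differentiation is typically used); the function names are illustrative.

```python
# Sketch of the eikonal regularizer in claim 9: penalize the squared
# deviation of |grad f| from 1 at sampled points, since any true signed
# distance field satisfies the eikonal equation |grad f| = 1.

def grad_norm(f, p, eps=1e-4):
    """Central-difference estimate of the gradient norm of f at p=(x,y,z)."""
    sq = 0.0
    for axis in range(3):
        hi = list(p); hi[axis] += eps
        lo = list(p); lo[axis] -= eps
        d = (f(hi) - f(lo)) / (2 * eps)
        sq += d * d
    return sq ** 0.5

def eikonal_loss(f, points):
    """Mean squared deviation of the gradient norm from 1 over the samples."""
    return sum((grad_norm(f, p) - 1.0) ** 2 for p in points) / len(points)

# A sphere SDF f(p) = |p| - r is a true distance field, so its loss ~ 0.
sphere = lambda p: (p[0]**2 + p[1]**2 + p[2]**2) ** 0.5 - 1.0
print(eikonal_loss(sphere, [(0.5, 0.5, 0.5), (1.0, 2.0, 2.0)]))  # ~ 0.0
```

A scaled field such as f(p) = 2x has gradient norm 2 everywhere and incurs a loss of about 1 per sample, which is exactly the behavior the regularizer exists to suppress.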
10. A real-time efficient three-dimensional reconstruction device based on a neural signed distance field, comprising a memory and one or more processors, the memory having executable code stored therein, wherein the processors, when executing the executable code, are configured to implement the real-time efficient three-dimensional reconstruction method based on a neural signed distance field as claimed in any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310681075.XA CN116721210A (en) | 2023-06-09 | 2023-06-09 | Real-time efficient three-dimensional reconstruction method and device based on neurosigned distance field |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116721210A true CN116721210A (en) | 2023-09-08 |
Family
ID=87865480
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310681075.XA Pending CN116721210A (en) | 2023-06-09 | 2023-06-09 | Real-time efficient three-dimensional reconstruction method and device based on neurosigned distance field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116721210A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117333637A (en) * | 2023-12-01 | 2024-01-02 | 北京渲光科技有限公司 | Modeling and rendering method, device and equipment for three-dimensional scene |
CN117333637B (en) * | 2023-12-01 | 2024-03-08 | 北京渲光科技有限公司 | Modeling and rendering method, device and equipment for three-dimensional scene |
CN117893693A (en) * | 2024-03-15 | 2024-04-16 | 南昌航空大学 | Dense SLAM three-dimensional scene reconstruction method and device |
CN117893693B (en) * | 2024-03-15 | 2024-05-28 | 南昌航空大学 | Dense SLAM three-dimensional scene reconstruction method and device |
CN118521718A (en) * | 2024-07-23 | 2024-08-20 | 中国海洋大学 | Fluid reconstruction method based on nerve radiation field |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Rosu et al. | Permutosdf: Fast multi-view reconstruction with implicit surfaces using permutohedral lattices | |
Bozic et al. | Transformerfusion: Monocular rgb scene reconstruction using transformers | |
Oechsle et al. | Texture fields: Learning texture representations in function space | |
CN109410307B (en) | Scene point cloud semantic segmentation method | |
CN116721210A (en) | Real-time efficient three-dimensional reconstruction method and device based on neurosigned distance field | |
Jiang et al. | H2-Mapping: Real-time Dense Mapping Using Hierarchical Hybrid Representation | |
CN113850900B (en) | Method and system for recovering depth map based on image and geometric clues in three-dimensional reconstruction | |
WO2022198684A1 (en) | Methods and systems for training quantized neural radiance field | |
Zhu et al. | I2-sdf: Intrinsic indoor scene reconstruction and editing via raytracing in neural sdfs | |
Liu et al. | High-quality textured 3D shape reconstruction with cascaded fully convolutional networks | |
CN117274515A (en) | Visual SLAM method and system based on ORB and NeRF mapping | |
CN116385667B (en) | Reconstruction method of three-dimensional model, training method and device of texture reconstruction model | |
WO2024148732A1 (en) | Reconstruction method and apparatus for relightable implicit human body model | |
CN118196298A (en) | Three-dimensional reconstruction method without priori pose input | |
CN117274493A (en) | Neural implicit surface reconstruction method and device integrating depth estimation | |
Gu et al. | Ue4-nerf: Neural radiance field for real-time rendering of large-scale scene | |
CN116109757A (en) | Hash coding dynamic three-dimensional human body rendering synthesis method based on inner hidden coordinates | |
CN116452748A (en) | Implicit three-dimensional reconstruction method, system, storage medium and terminal based on differential volume rendering | |
CN113763539B (en) | Implicit function three-dimensional reconstruction method based on image and three-dimensional input | |
CN118351277A (en) | 3D Gaussian Splatting-based three-dimensional scene arbitrary angle model segmentation method | |
CN117893701A (en) | Underwater environment three-dimensional reconstruction method, device and system and storage medium | |
CN118247414A (en) | Small sample image reconstruction method based on combined diffusion texture constraint nerve radiation field | |
CN118071932A (en) | Three-dimensional static scene image reconstruction method and system | |
CN118154770A (en) | Single tree image three-dimensional reconstruction method and device based on nerve radiation field | |
Hao et al. | VT‐NeRF: Neural radiance field with a vertex‐texture latent code for high‐fidelity dynamic human‐body rendering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||