Detailed Description
The inventor's research shows that existing global context information modeling of scene point clouds generally relies on the representation capability of graph models (graphical models); a common approach is to combine a classifier with a Conditional Random Field (CRF) to estimate the semantic label of each data point. However, the classifier recognition phase and the CRF optimization phase are usually operated as independent modules without interaction, which limits the information exchange between them.
Among the candidate classifiers, a three-dimensional voxel convolutional neural network is a good choice. It is extended from the two-dimensional convolutional neural network, performs well in three-dimensional target classification and recognition tasks, and, compared with point-cloud-based deep neural networks, has the advantages of a clear network structure and easy acceleration. However, a voxel neural network requires regularized data input, and its labeling results are coarse labels at the voxel level.
In order to solve the above problems in the prior art, embodiments of the present invention provide a three-dimensional point cloud labeling method and apparatus based on fused voxels, where a multi-scale space is constructed on a regularized voxel model based on a voxel convolution neural network to extract multi-scale voxel features, and then the voxel features are expanded to point features by means of feature interpolation, so as to implement finer classification and identification point by point, and further improve the labeling performance. In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Fig. 1 is a schematic view of an application scenario of a three-dimensional point cloud labeling apparatus 100 based on fused voxels according to an embodiment of the present invention. The electronic terminal 10 includes a three-dimensional point cloud labeling apparatus 100 based on fused voxels, a memory 200, a storage controller 300, and a processor 400. The electronic terminal 10 may be, but is not limited to, an electronic device having a processing function, such as a computer, a Mobile Internet Device (MID), and the like, and may also be a server and the like.
Optionally, the memory 200, the storage controller 300 and the processor 400 are directly or indirectly electrically connected to each other to realize data transmission or interaction. For example, the components are electrically connected to each other through one or more communication buses or signal lines. The fused voxel-based three-dimensional point cloud labeling apparatus 100 includes at least one software functional module that may be stored in the memory 200 in the form of software or firmware or solidified in the operating system of the electronic terminal 10. The processor 400 accesses the memory 200 under the control of the storage controller 300 to execute the executable modules stored in the memory 200, such as the software functional modules and computer programs included in the fused voxel-based three-dimensional point cloud labeling apparatus 100.
It will be appreciated that the configuration shown in FIG. 1 is merely illustrative and that the electronic terminal 10 may include more or fewer components than shown in FIG. 1 or may have a different configuration than shown in FIG. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
Further, referring to fig. 2, an embodiment of the present invention further provides a three-dimensional point cloud labeling method based on a fused voxel, which is described below with reference to fig. 2.
Step S11, carrying out voxelization processing on the three-dimensional point cloud data set, and carrying out voxel characteristic extraction in voxels based on the processing result to form a first voxel characteristic matrix;
step S12, the first voxel characteristic matrix is used as the input of the three-dimensional convolution neural network to calculate and obtain the multi-scale characteristics of the voxels, and the multi-scale characteristics are subjected to characteristic series fusion to obtain a second voxel characteristic matrix;
step S13, expanding the voxel characteristics in the second voxel characteristic matrix to each point in the three-dimensional point cloud data set based on a characteristic interpolation algorithm to obtain a point cloud characteristic matrix;
and step S14, inputting the point cloud feature matrix into a multilayer perceptron to realize attribute marking of the three-dimensional point cloud.
In the embodiment, the three-dimensional point cloud is subjected to voxelization processing, the point cloud in the voxels is subjected to feature extraction, a voxel model with the voxel features as elements is input into a three-dimensional convolution neural network to perform multi-scale feature extraction and fusion, and the voxel features are expanded to the point cloud features by using a feature interpolation algorithm, so that the three-dimensional point cloud is marked, and the point cloud marking precision can be effectively improved.
In detail, referring to fig. 3, the process of performing the voxelization process on the point cloud in step S11 can be implemented by the following steps S111 to S113:
step S111, dividing a point cloud coordinate space into a plurality of voxels according to a preset voxel size;
step S112, classifying each point in the three-dimensional point cloud data set into a corresponding voxel according to the grid parameter of the voxel;
step S113, sampling points in each classified voxel so that the number of points in the voxel reaches a first preset value.
In this embodiment, a point cloud voxelization model is introduced in the above steps S111 to S113 to voxelize the point cloud. Specifically, as shown in fig. 4, point cloud voxelization divides the point cloud coordinate space into a plurality of voxels according to a given voxel size. Assuming that the extents of the input point cloud along the three coordinate axes, i.e. the X, Y, and Z directions, are W, H, and E respectively, and that the size of each voxel is λ_W, λ_H, λ_E, then the size of the voxelized model is W' = W/λ_W, H' = H/λ_H, E' = E/λ_E. In this embodiment, for the convenience of the subsequent convolution operation, W', H', and E' may be integers and powers of 2.
After the point cloud coordinate space has been divided into a voxel grid in step S111, each point in the point cloud can be classified into its corresponding voxel according to the grid parameters of each voxel. However, when three-dimensional point cloud data is acquired, measurement errors, distance, occlusion and other factors often make the acquired point cloud uneven: the point cloud in some areas is dense while in other areas it is sparse. In addition, point cloud acquisition is essentially a sampling of the target surface, so the interior of the target is empty and contains no point cloud data. Consequently, after the point cloud space is voxelized, the point cloud distribution inside each voxel is uneven, as shown in fig. 4, where the voxel at the lower left corner contains no point cloud and the voxel at the upper right corner contains only a small number of points. Therefore, to facilitate subsequent uniform voxel feature extraction, the same number of points, namely a first preset value T (determined according to the resolution of the point cloud and the storage capacity), needs to be sampled from each voxel after point cloud segmentation.
It should be noted that, during sampling, if the number of points contained in a voxel is greater than the first preset value, the first preset value of points are randomly sampled from the current voxel so that the number of points in the voxel equals the first preset value; if the number of points contained in the voxel is smaller than the first preset value, one or more points are randomly selected from the current voxel and copied so that the number of points in the voxel reaches the first preset value. For example, assuming that the first preset value is T, T points may be randomly sampled from voxels in which the number of points exceeds T, and if the number of points in a voxel is less than T, a corresponding number of points are randomly copied to obtain a set of T points. After point cloud blocking and sampling, no more than W' × H' × E' non-empty voxels each containing T points are obtained, and feature learning is then performed using the point cloud data in the voxels to obtain an effective feature expression of each voxel containing point cloud.
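For illustration only, the following is a minimal sketch of the voxelization and per-voxel sampling of steps S111 to S113, assuming the point cloud is given as an (N, 3) NumPy array; the voxel sizes and the per-voxel point budget T used here are illustrative values rather than values mandated by the embodiment.

```python
import numpy as np

def voxelize_and_sample(points, lam_w=0.25, lam_h=0.25, lam_e=0.5, T=32, rng=None):
    """Divide a point cloud into voxels and sample exactly T points per non-empty voxel."""
    rng = np.random.default_rng() if rng is None else rng
    mins = points.min(axis=0)
    # Grid index of each point along X, Y, Z (step S112).
    idx = np.floor((points - mins) / np.array([lam_w, lam_h, lam_e])).astype(np.int64)
    voxels = {}
    for i, key in enumerate(map(tuple, idx)):
        voxels.setdefault(key, []).append(i)
    sampled = {}
    for key, members in voxels.items():
        members = np.asarray(members)
        if len(members) >= T:            # too many points: random sub-sampling
            chosen = rng.choice(members, T, replace=False)
        else:                            # too few points: duplicate randomly chosen points
            extra = rng.choice(members, T - len(members), replace=True)
            chosen = np.concatenate([members, extra])
        sampled[key] = points[chosen]    # (T, 3) block for this non-empty voxel
    return sampled
```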
Further, the step of performing point cloud feature extraction in the voxel based on the processing result in step S11 to form a first voxel feature matrix includes: calculating the center coordinate of the point cloud in each voxel, and performing center normalization processing on the point cloud data in the voxels based on the center coordinate to obtain an initial data matrix; inputting the initial data matrix into an LGAB module to realize point-by-point local feature description, and performing point-by-point pooling operation on a local feature set in a voxel by adopting maximum pooling to obtain a global feature of the voxel and using the global feature as a first voxel feature matrix.
Specifically, in the embodiment of the present invention, local and global feature fusion modules (LGABs) as shown in fig. 5 are stacked to build a feature learning network for voxel feature extraction. As shown in fig. 6, suppose V_x is a non-empty voxel containing T points, i.e. V_x = {p_i = (x_i, y_i, z_i, …), i = 1, 2, 3, …, T}. Center normalization is performed before the point cloud data (initial data matrix) is input into the LGAB module: the center coordinates (c_x, c_y, c_z) of the point cloud within the voxel are first calculated, and the point cloud data is then center-normalized using the center coordinates (c_x, c_y, c_z) to obtain the final initial data matrix of the input data; that is, the input to the voxel feature extraction module is an initial data matrix of dimension T × 6.
Further, a point-by-point local feature description can be obtained using stacked LGAB modules, and the global feature of a voxel can be obtained by applying Max-Pooling (MP) to the point-by-point feature set within the voxel. An example of feature extraction for a non-empty voxel is given in fig. 6; to reduce the number of parameters, the remaining non-empty voxels may share the same network parameters during feature extraction. In practical implementation, since the LGAB module can well fuse the common information and the difference information of the local neighborhoods of the point cloud, a cascade of several LGAB modules is used in this embodiment to effectively extract the point cloud information inside the voxels.
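As a hedged sketch of the per-voxel feature learning described above: the internal structure of the LGAB module is not reproduced here, so a generic stack of shared per-point fully-connected layers stands in for it, and concatenating the raw coordinates with the center-normalized coordinates to form the T × 6 input is an assumption consistent with the stated dimensionality.

```python
import tensorflow as tf

class VoxelFeatureExtractor(tf.keras.Model):
    """Shared per-point layers (stand-in for stacked LGABs) followed by max pooling."""
    def __init__(self, feature_dim=64):
        super().__init__()
        self.mlp = [tf.keras.layers.Dense(w, activation="relu")
                    for w in (32, 64, feature_dim)]

    def call(self, pts):                                       # pts: (batch, T, 3)
        center = tf.reduce_mean(pts, axis=1, keepdims=True)    # voxel centroid (c_x, c_y, c_z)
        x = tf.concat([pts, pts - center], axis=-1)            # T x 6 initial data matrix
        for layer in self.mlp:                                 # shared per-point layers
            x = layer(x)
        return tf.reduce_max(x, axis=1)                        # max pooling -> D-dim voxel feature
```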
Further, as shown in fig. 7, the process of using the first voxel feature matrix as an input of the three-dimensional convolutional neural network to calculate the multi-scale features of the voxels in step S12 may be implemented by the following steps.
Step S120, converting the voxel characteristic matrix into a 4-dimensional tensor, and respectively inputting the 4-dimensional tensor into three-dimensional convolution neural networks with convolution kernels of different sizes to calculate and obtain voxel characteristics under different scales;
step S121, respectively inputting the voxel characteristics under different scales into three-dimensional deconvolution neural networks with convolution kernels of different sizes to obtain a plurality of voxel characteristics of different scales, wherein the convolution kernel size of each three-dimensional convolution neural network during convolution is the same as the convolution kernel size of each three-dimensional deconvolution neural network during deconvolution.
In detail, spatial geometric information is important information of a three-dimensional target, and an effective feature description of the target can be extracted by directly processing the three-dimensional data. Drawing on the great success of the two-dimensional convolutional neural network in image processing, the present invention extends the two-dimensional convolutional neural network to a three-dimensional convolutional neural network. When three-dimensional data is processed with a three-dimensional convolutional neural network, the data must first be regularized, i.e. voxelized, and the two-dimensional convolution operation, pooling operation and the like must also be extended to the three-dimensional voxelized data. The three-dimensional convolution formula is as follows:

F_out(x, y, z) = Σ_i Σ_j Σ_k F_in(x + i, y + j, z + k) · H(i, j, k)    (1)
in the formula (1), the reaction mixture is,for the input three-dimensional voxel data,is a three-dimensional convolution kernel template,for which a response is output. Similar to the two-dimensional convolution operation, the three-dimensional convolution operation simply extends the two-dimensional convolution kernel to the three-dimensional convolution kernel, including three dimensions of length, width and height, and accordingly, as shown in fig. 8, the local perception is also converted from a local neighborhood on the two-dimensional plane to a local neighborhood in the three-dimensional space. In practice, due to three-dimensional volumeThe product operation reduces the spatial size of the three-dimensional data, i.e. the spatial size of the three-dimensional feature map is smaller than the size of the input voxel data. However, for three-dimensional data labeling, it is necessary to acquire feature information of each data point, that is, to acquire the feature of each voxel when performing voxel feature extraction, and therefore, it is necessary to reversely map the feature map after convolution operation back to the original input voxel. To solve this problem, in this embodiment, a deconvolution operation (deconvolution) is used to process the obtained three-dimensional voxel feature map to obtain a feature map with the same size as the input voxel data. The core of the deconvolution operation is still the convolution operation, and only the edge 0-complementing operation is performed on the input feature data before the convolution operation so as to ensure the size requirement of the output feature map. For example, a two-dimensional example of convolution and deconvolution operations is shown in fig. 9, where blue marker data is the input data for both operations, gray marker data is the convolution kernel (both are the same size), and green marker data is the response output for both. It will be appreciated that the input to the deconvolution layer is the output of the convolution layer, which has the same dimensions as the input to the convolution layer.
Further, after all the non-empty voxels are processed based on the foregoing operation, a series of voxel characteristics in D dimension can be obtained. Since each voxel feature uniquely corresponds to a voxel coordinate of the three-dimensional space, the acquired feature can be expressed as a 4-dimensional tensor whose size is W ' × H ' × E ' × D, (for empty voxels, a D-dimensional zero vector is used as its feature description). After the feature representation is converted into the feature representation based on the 4-dimensional tensor, a three-dimensional convolutional neural network (3D CNN) can be adopted for further feature optimization. In consideration of the fact that feature extraction under a fixed scale (convolution kernel size) is not enough to completely express the local neighborhood structure information of the voxel, the method adopts multi-scale feature extraction and fusion to extract richer local neighborhood information.
The specific three-dimensional convolutional neural network structure is shown in fig. 10. The multi-scale feature extraction based on the 3D CNN mainly includes a three-dimensional convolution operation (Conv3D), a three-dimensional deconvolution operation (DeConv3D), and a feature concatenation operation (Concat). For the input W' × H' × E' × D dimensional features, three different convolution kernels are used to perform the convolution and deconvolution operations respectively, denoted as Conv3D(f_in; f_out; ker; st; pad) and DeConv3D(f_in; f_out; ker; st; pad), where f_in and f_out denote the dimensions of the input and output feature matrices, and ker, st, and pad denote the convolution kernel size, the stride of the kernel during convolution, and the amount of data padding during data expansion, all of which are three-dimensional vectors. To obtain features at different scales, the three convolution kernels may be, but are not limited to, (1; 2; 2), (2; 1; 2) and (2; 2; 1); the same convolution kernel is used in the corresponding convolution and deconvolution operations, and the corresponding edge zero-padding operation is performed in the deconvolution operation. In addition, the convolutional layer and the deconvolution layer include convolution and deconvolution operations, and each convolution operation is further followed by a Batch Normalization layer (BN) and a ReLU activation operation.
The multi-scale feature extraction based on the 3D CNN adopts different convolution kernels to extract and fuse voxel features along three mutually perpendicular directions (namely X, Y, Z directions) of a three-dimensional space, so that the learned features can contain more local structure information, and the point cloud can be expressed more completely.
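To make the structure concrete, the following is a hedged Keras sketch of the multi-scale extraction and feature series fusion of step S12; the branch widths, the exact kernel shapes, and the use of 'same' padding are illustrative assumptions rather than the exact configuration of fig. 10.

```python
import tensorflow as tf

def multi_scale_3dcnn(Ep=16, Hp=64, Wp=256, D=64, branch_channels=64):
    """Three Conv3D/DeConv3D branches with anisotropic kernels, then concatenation."""
    x_in = tf.keras.Input(shape=(Ep, Hp, Wp, D))             # W' x H' x E' x D voxel tensor
    branches = []
    for kernel in [(1, 2, 2), (2, 1, 2), (2, 2, 1)]:         # three anisotropic kernels
        x = tf.keras.layers.Conv3D(branch_channels, kernel, padding="same")(x_in)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
        # Deconvolution with the same kernel restores the input spatial size.
        x = tf.keras.layers.Conv3DTranspose(branch_channels, kernel, padding="same")(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
        branches.append(x)
    fused = tf.keras.layers.Concatenate(axis=-1)(branches)   # feature series fusion (Concat)
    return tf.keras.Model(x_in, fused)
```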
Further, in step S13, since the feature description of each point needs to be obtained when labeling the three-dimensional point cloud, and the aforementioned three-dimensional convolutional neural network can only obtain the feature description of each voxel, in the present invention the feature description of each point in the input point cloud is obtained by interpolating the voxel features. As shown in fig. 11, for a given target point p, the nearest preset number (e.g. 8) of neighborhood voxels in the voxel space formed by the second voxel feature matrix are found; denoting the feature descriptions corresponding to these neighborhood voxels as F_j, where j = 1, 2, …, 8, the feature of the target point p is:

F_p = Σ_{j=1}^{8} w_j · F_j    (2)
in the formula (2), the reaction mixture is,representing the center point c in the voxel according to the target point p and the jth neighborhoodjThe euclidean distance between them to obtain a weight parameter,and expressing the voxel characteristics of the jth neighborhood voxel, and repeatedly executing the process on each point in the three-dimensional point cloud, namely expanding the voxel characteristics in the second voxel characteristic matrix to each point in the three-dimensional point cloud data set to obtain the point cloud characteristic matrix.
Further, in step S14, the point cloud feature matrix of each point obtained in step S13 is input into a multilayer perceptron (MLP) to realize the category identification of the point, i.e. the three-dimensional point cloud labeling; please refer again to fig. 4 for its specific network structure.
According to the actual requirement, in order to further optimize the point cloud marking result obtained by the convolutional neural network and improve the marking precision, the method further comprises a step S15.
And step S15, inputting the three-dimensional point cloud with the attribute marking completed into a CRF-RNN network for point cloud attribute marking optimization.
Specifically, in this embodiment, the FC-CRF implemented based on the CNN basic operation is embedded in the point cloud labeling network based on the three-dimensional convolutional neural network, so as to implement an end-to-end three-dimensional point cloud fine labeling network that integrates a coarse label and a back-end optimization, and further improve the accuracy of point cloud labeling, especially the smoothness of a target boundary and a contour.
In the embodiment, the CRF modeling facing the three-dimensional point cloud marker is firstly carried out, then the CRF approximation based on the CNN operation is realized, and finally the CRF optimized three-dimensional point cloud marker is fused.
Traditional semantic labeling is modeled as point-by-point classification and recognition; some methods use local features for classification and recognition, while others use a deep neural network. However, point-by-point classification and recognition usually introduces some obvious, unacceptable labeling errors; for example, some points within a target may be recognized as other categories, because point-by-point classification does not consider the adjacency relations between points and only uses the small-scale neighborhood information of the point to be labeled. If the target structure information can be modeled in advance (for example, all targets are continuous, and adjacent points with similar features are labeled as targets of the same class), and the labeling result is constrained and optimized based on this model, some obvious errors can be effectively eliminated, yielding a high-precision labeling result. Conditional Random Fields (CRFs) are an efficient method for modeling the continuity of a target and its neighborhood structure information, and have been widely used in two-dimensional image labeling. A conditional random field is a model that computes the conditional probability distribution of one set of output random variables given another set of random variables, and its main characteristic is the assumption that the output variables constitute a Markov Random Field (MRF).
In detail, a CRF is a discriminative probabilistic undirected graph model that can model the global context information and cross features in data and handles the segmentation and labeling of sequence data well. Given sets of random variables X = {X_1, X_2, …, X_N} and P = {P_1, P_2, …, P_N}, where X_i ∈ L = {l_1, l_2, …, l_M}; for three-dimensional point cloud labeling, P is the input point cloud containing N points, P_j is the observation vector of the jth point, X is the semantic labeling result of the input point cloud, and X_i is the semantic label of the ith point, being one of the M semantic labels. The corresponding CRF model can then be represented by the Gibbs probability distribution, that is:

Φ(X = x | P) = (1 / Z(P)) · exp(−Σ_{o ∈ O_G} Λ(x_o | P))    (3)
in the formula (3), G is a probability undirected graph constructed on a random variable set X,Oare groups in figure G wherein,οeach pair of nodes in the set is adjacent,OGthen the set of all groups in G, Z (P) is a normalization function, Λ (x)οP) is an energy function on the radical, also known as a potential function.
For any labeling result x ∈ L^N, the overall potential function is:

Λ(x | P) = Σ_{o ∈ O_G} Λ(x_o | P)    (4)
The optimal labeling result is obtained by solving based on the maximum a posteriori probability:

x* = argmax_{x ∈ L^N} Φ(X = x | P) = argmin_{x ∈ L^N} Λ(x | P)    (5)
As can be seen from the solving process of the optimal solution, maximizing the posterior probability of the labeling result is equivalent to minimizing the overall potential function. It is easy to see that the conditional random field first models local context information through the clique potential functions, and then propagates the context information via the graph structure, thereby modeling long-range context information.
For the fully-connected CRF model, each node in graph G is connected to all remaining nodes, as shown in fig. 12, and the corresponding cliques O_G contain either a single node or a pair of nodes. The overall potential function of x can thus be expressed as:

Λ(x) = Σ_i Λ_u(x_i) + Σ_{i<j} Λ_p(x_i, x_j)    (6)
In formula (6), for convenience of description the conditioning on P in the conditional posterior probability is omitted, i.e. Λ(x) = Λ(x | P); Λ_u(x_i) is the unary potential function and Λ_p(x_i, x_j) is the pairwise binary potential function, with i, j = 1, 2, …, N. The unary potential function Λ_u(x_i) gives the cost of labeling the ith node in graph G as x_i; it is usually defined by the probability output of some discriminative classifier, whose estimation result often contains considerable noise, so the segmentation result is often discontinuous at the target edges. The pairwise binary potential function Λ_p(x_i, x_j) gives the cost of simultaneously labeling the ith and jth observation points as x_i and x_j; it serves to keep the labels of adjacent observation points consistent and to reduce inconsistency, and can thus improve the smoothness of the labeling result.
The idea of Gaussian weighting is used to define the pairwise binary potential function as:

Λ_p(x_i, x_j) = ψ(x_i, x_j) · Σ_{m=1}^{M_G} w^{(m)} · k^{(m)}(f_i, f_j)    (7)
in the formula (7), the function ψ (x)i,xj) Calculating a function for consistency between different labels, w(m)As a weight value, the weight value,for a smooth filter function based on Gaussian kernels, there is a total of MGA Gaussian kernel function, fi,fjFeature vectors representing observation points i, j, respectively, and havingEach gaussian kernel functionBy a symmetrical, positive definite matrix ΛmTo be defined. To this end, the method is oriented to three-dimensional point cloud marksAnd (4) after the modeling of the full-connection CRF is completed, the next step is how to solve to obtain the optimal marking result.
The three-dimensional point cloud labeling optimization process based on the fully-connected CRF is the process of maximizing the posterior probability Φ(X) given the input point cloud data. Solving the exact posterior probability is difficult and computationally very expensive; the mean-field approximation converts the posterior probability Φ(X) into a product of independent marginal probabilities, i.e. Φ(X) ≈ Θ(X) = Π_{i=1}^{N} Θ_i(X_i). Combining formulas (3) to (7), the marginal probability Θ_i of an arbitrary label x_i is:

Θ_i(x_i = l) = (1 / Z_i) · exp( U_i(l) − Σ_{l' ∈ L} ψ(l, l') · Σ_{m=1}^{M_G} w^{(m)} · Σ_{j ≠ i} k^{(m)}(f_i, f_j) · Θ_j(l') )    (8)

where U_i(l) denotes the unary term of the ith point for label l, corresponding to the unary potential in formula (6).
an iterative inference algorithm for CRF can be constructed based on equation (8), as shown in algorithm 5.1. The convergence of the iterative algorithm is mainly measured through the difference between the estimated Q and P, the convergence of the iterative algorithm can be obtained through evaluation, the estimation error is very small when the iteration number is 10, and the algorithm is proved to have good convergence.
Algorithm 1: CRF iterative inference algorithm based on mean-field approximation
1. Initialization: initialize all nodes, Θ_i(l) ← (1/Z_i) · exp(U_i(l))
2. while not converged do
3. Information transfer: compute all Gaussian filtering results A_i^{(m)}(l) ← Σ_{j≠i} k^{(m)}(f_i, f_j) · Θ_j(l), m = 1, …, M_G
4. Weighted filtering: B_i(l) ← Σ_{m=1}^{M_G} w^{(m)} · A_i^{(m)}(l)
5. Consistency detection: C_i(l) ← Σ_{l'∈L} ψ(l, l') · B_i(l')
6. Adding the unary potential function: D_i(l) ← U_i(l) − C_i(l)
7. Normalization: Θ_i(l) ← (1/Z_i) · exp(D_i(l))
8. end while
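For illustration, a simplified NumPy sketch of a single iteration of Algorithm 1 follows; the Gaussian filtering is computed densely here for clarity, whereas the embodiment uses the fast lattice-based filtering described later, and the variable names are assumptions.

```python
import numpy as np

def mean_field_step(U, Q, feats, kernel_weights, compat, lambdas):
    """One mean-field iteration (steps 3-7 of Algorithm 1), dense O(N^2) version."""
    # U: (N, M) unary terms U_i(l), Q: (N, M) current marginals Theta_i(l),
    # feats: (N, F) point features f_i, kernel_weights: (M_G,) weights w^(m),
    # compat: (M, M) label consistency matrix psi(l, l'),
    # lambdas: list of M_G symmetric positive-definite (F, F) kernel matrices.
    filtered = np.zeros_like(Q)
    for w_m, Lam in zip(kernel_weights, lambdas):
        diff = feats[:, None, :] - feats[None, :, :]              # (N, N, F) pairwise f_i - f_j
        K = np.exp(-0.5 * np.einsum('ijf,fg,ijg->ij', diff, Lam, diff))
        np.fill_diagonal(K, 0.0)                                  # exclude j == i
        filtered += w_m * (K @ Q)                                 # information transfer + weighting
    pairwise = filtered @ compat.T                                # consistency detection
    logits = U - pairwise                                         # adding the unary potential
    logits -= logits.max(axis=1, keepdims=True)
    Q_new = np.exp(logits)
    return Q_new / Q_new.sum(axis=1, keepdims=True)               # normalization (softmax)
```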
The following describes how the relevant basic operations in a CNN are used to implement the CRF iterative inference algorithm described above. The biggest problem in reconstructing the algorithm with CNN operations is whether the back propagation of errors can be realized, i.e. whether the BP algorithm can be used for parameter learning and training.
(1) Initialization operations
The initialization operation in Algorithm 1 is:

Θ_i(l) ← (1 / Z_i) · exp(U_i(l)), where Z_i = Σ_l exp(U_i(l))

and the normalization sums over all possible label values. It can be seen that this operation is equivalent to applying a Softmax activation over all possible labels to the unary outputs U_i(l) at each scene point. The Softmax function is a commonly used activation function in CNN networks; it contains no parameters and its error derivatives can be propagated backwards, so this step can also be learned and trained with the Back-Propagation (BP) algorithm.
(2) Information transfer
As shown in Algorithm 1, information transfer in the CRF is performed by smoothing Θ_j with M_G Gaussian filters. The kernel functions of the Gaussian filters are obtained from the features of the point cloud, such as the coordinate information or the color and intensity information of the points, and express the association between scene points. In the fully-connected CRF model, each filter needs to cover all points in the point cloud, so the amount of data and computation is large and a direct implementation is infeasible. Fast Gaussian convolution is instead realized with the permutohedral lattice method, whose computational cost is O(N), where N is the number of points being filtered, so it is faster and filters better than traditional Gaussian convolution. The fast Gaussian convolution based on the permutohedral lattice comprises four stages, namely a lattice construction stage, a splat (expansion) mapping stage, a slice mapping stage, and a blur stage.
In back propagation, the input of this layer (the error derivative coming from the subsequent layer) is passed in the opposite direction through the M_G Gaussian filters to produce the output. In the Gaussian convolution based on the permutohedral lattice, this back propagation can be achieved by reversing the order of the filters in the blur stage while keeping the lattice construction, splat mapping and slice mapping the same as in forward propagation. The computational cost of this implementation is still O(N), so the amount of computation is significantly reduced and the computational efficiency is improved.
(3) Weighted filtering
The next computation weights and sums the aforementioned M_G filter outputs for each semantic label l. In point cloud labeling, the semantic labels are independent of each other, so the weighted filtering operation can be implemented with convolution kernels of size 1 × 1, where the input of the convolution operation is the feature matrix with M_G channels and the output is a single-channel feature matrix for each label. In back propagation, since the input and output of this operation are known and the convolution kernels are independent of each other, the error derivatives with respect to the convolution kernel parameters and the input data can be computed, and the convolution kernel parameters can therefore be learned and trained with the BP algorithm.
(4) Consistency detection
In consistency detection, a Potts model is used to perform compatibility computation on the outputs of the different labels from the previous step. The compatibility computation mainly compares whether the labels of two similar observation points are the same: the consistency value is 0 when the semantic labels of the two points are the same, and a penalty term σ is introduced when the semantic labels of the two points differ, computed as:

ψ(x_i, x_j) = 0, if x_i = x_j;  ψ(x_i, x_j) = σ, if x_i ≠ x_j
compared with the fixed penalty term sigma, the method adopts the penalty value learned based on data, because the relevance between different labels is different, so that the influence of different labels marked at adjacent points on the whole marking result is different. Therefore, the consistency detection can also be regarded as a convolution layer, the number of input and output channels of the layer is M (the number of labels), the size of the convolution kernel is l × l, and the learned neuron connection weight parameter is the value of the transfer function. This step is also counter-propagating since it is implemented using the basic convolution operation.
(5) Adding a unary potential function
The unary potential function is combined element by element with the output obtained in the consistency detection step to obtain the complete potential function result. The step of adding the unary potential function contains no parameters, so back propagation is realized by simply copying the error at the output to the input.
(6) Normalization
Similar to the initialization procedure, the normalization step can also be implemented by an activation operation based on the Softmax function, and the backward propagation thereof is consistent with the backward propagation based on the Softmax function in the CNN. So far, the basic operation in the CNN network has been adopted to implement each step of a single iteration in the algorithm 1, and the solution algorithm of multiple iterations can be implemented by stacking the above steps.
Based on the above description, in this embodiment the CRF model is first approximately modeled with the mean-field approximation method, and then each step of the mean-field approximation is equivalently implemented with basic CNN operations, i.e. a single-iteration mean-field approximation algorithm is realized. For the iterative mean-field approximation, the relevant steps only need to be stacked; that is, the iterative mean-field approximation can be realized with a recurrent CNN structure (RNN), as shown in fig. 13. Let the input point cloud data be P, the point-by-point unary potential function be U = U_i(l), the marginal probability obtained in the previous iteration be H_1, and the marginal probability obtained in the current iteration be H_2. The single mean-field approximation is denoted f_Ω(U, P, H_1), where Ω is its parameter set (including all parameters in the weighted filtering and consistency detection), Ω = {w^{(m)}, ψ(l, l')}. H_1 is initialized at the beginning of the iteration to the output of the softmax function with U as input, and in the subsequent iterations it equals the output H_2 of the previous iteration:

H_1(t') = softmax(U), t' = 0;  H_1(t') = H_2(t' − 1), 0 < t' ≤ T'
wherein T' is the number of iterations.
After H_1 is obtained, H_2 is estimated based on the mean-field approximation algorithm as follows:

H_2(t') = f_Ω(U, P, H_1(t')), 0 < t' ≤ T'

For the output Y, the estimation result is output only at the last iteration, i.e. Y = H_2(T').
Based on the above analysis, it can be known that the error derivative of the whole network structure (denoted as CRF-RNN) with respect to the parameter is calculable, so that it can be solved by the standard BP algorithm, and thus can be embedded into other neural networks for learning and training.
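A short sketch of the CRF-RNN recurrence just described follows, assuming a single mean-field step function such as the mean_field_step sketched above; the function and argument names are illustrative.

```python
import numpy as np

def crf_rnn_forward(U, feats, omega, mean_field_step, T_prime=5):
    """Unrolled mean-field iterations: H2(t') = f_Omega(U, P, H1(t')), Y = H2(T')."""
    # H1 is initialized as softmax(U); omega bundles the parameter set Omega.
    H1 = np.exp(U - U.max(axis=1, keepdims=True))
    H1 /= H1.sum(axis=1, keepdims=True)
    for _ in range(T_prime):
        H2 = mean_field_step(U, H1, feats, *omega)   # one mean-field approximation step
        H1 = H2                                      # next iteration starts from H2
    return H2                                        # output Y = H2(T')
```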
Further, a three-dimensional point cloud labeling network (Point-VoxelNet + CRF-RNN, PVCRF) fusing the three-dimensional voxel convolutional neural network and the CRF back-end optimization is constructed on the basis of the above point cloud labeling network; its specific structure is shown in fig. 14. For an input point cloud, voxelization is first performed according to the scene size and a fixed number of points is randomly selected in each voxel for subsequent feature extraction. Simple voxel features are then obtained by feature extraction within the voxels based on the LGAB modules; after all non-empty voxel features are obtained (the features of empty voxels are zero-padded), multi-scale voxel feature extraction and fusion are performed based on the three-dimensional convolutional neural network (Conv3D, DeConv3D). The multi-scale voxel features are then expanded to all points by interpolation to obtain point-by-point features, the point features are input into the multilayer perceptron to obtain a preliminary labeling result, and finally back-end optimization is performed based on the CRF-RNN network structure. This network structure realizes the information interaction between the classification and recognition stage and the CRF optimization stage in point cloud labeling, and has an obvious effect on improving the point cloud labeling accuracy.
Based on the description of the three-dimensional point cloud marking method based on the fused voxels, the inventor also verifies the performance of the three-dimensional point cloud marking method based on the fused voxels, and evaluates the performance of the point cloud marking by evaluation indexes such as an Intersection over Union (IoU) and an Overall Accuracy (OA).
(1) Network implementation and parameter setting
In the point cloud data rasterization and sampling phase, different data sets need to be processed differently.
S3DIS: For the S3DIS dataset, the maximum extents of the scene along the Z, Y, and X axes are E = 8 m, H = 16 m, and W = 50 m respectively. To cover the entire scene, the size of the whole grid is 8 × 16 × 50, the size of each voxel is λ_E = 0.5 m and λ_H = 0.25 m (with λ_W chosen accordingly), and the voxel model size is E' = 16, H' = 64, W' = 256, with the extra voxels left empty. 32 points are selected in each voxel.
vKITTI: For the vKITTI dataset, the maximum extents of the scene along the Z, Y, and X axes are E = 33 m, H = 193 m, and W = 148 m respectively. Here the size of each voxel is set to λ_E = 2 m, λ_H = 1.6 m, λ_W = 1.2 m, and the voxel model size is E' = 16, H' = 128, W' = 128. Similarly, 32 points are selected in each voxel.
For the CRF-RNN network, in order to prevent overfitting, gradient vanishing and the like, the number of iterations is set to T' = 5 in the training phase and T' = 10 in the testing phase. The size of the Gaussian filter is consistent with the size of the point cloud data.
The method adopts a two-step training strategy: the first step trains Point-VoxelNet independently, and the second step fine-tunes the combined Point-VoxelNet + CRF-RNN network PVCRF. For the Point-VoxelNet network, training is optimized with the Adam optimization algorithm with a momentum value of 0.9, an initial learning rate of 0.001, and a training batch size of 16. For the PVCRF network, the Adam optimization algorithm with a momentum value of 0.6 is used, with an initial learning rate of 0.0001 and a training batch size of 16. During training, an early-stopping strategy is also adopted to obtain the optimal network parameters: the maximum number of training epochs is 100, and training is stopped if the network parameters have not improved for 10 consecutive epochs. In the testing stage, the algorithm is verified with 6-fold cross-validation, and the grouping of the training and testing data is shown in table 1.
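As an illustrative sketch only, the training configuration above might be expressed as follows with tf.keras, assuming the networks are built as Keras models, that the stated momentum value maps to Adam's beta_1 parameter, and that the datasets are already batched with size 16; the loss choice is also an assumption.

```python
import tensorflow as tf

def train_stage(model, train_ds, val_ds, learning_rate=0.001, momentum=0.9):
    # Adam optimization with the stated initial learning rate and momentum value.
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate, beta_1=momentum),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    # Early stopping: at most 100 epochs, stop after 10 epochs without improvement.
    early_stop = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
    model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])

# First step: train Point-VoxelNet alone; second step: fine-tune the combined PVCRF.
# train_stage(point_voxelnet, train_ds, val_ds, learning_rate=0.001,  momentum=0.9)
# train_stage(pvcrf,          train_ds, val_ds, learning_rate=0.0001, momentum=0.6)
```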
The three-dimensional voxel neural network structure and the CRF-RNN network structure are implemented with the TensorFlow deep learning framework based on the Python language. The experimental hardware environment is as follows: an Intel Core i7-6700K CPU, 48 GB of memory, and a GTX 1080Ti graphics card (supporting CUDA 8.0 and cuDNN 5.1).
(2) Analysis of quantitative results
Based on the two datasets, the application effects of the deep neural network models (Point-VoxelNet and PVCRF) in three-dimensional point cloud labeling are compared and analyzed against the currently better-performing labeling algorithms PointNet, MS+CU(2), SEGCloud and 3DContextNet. Only the rectangular coordinate information of the point cloud, namely the XYZ coordinates, is used for processing.
TABLE 1
TABLE 2
Table 1 and table 2 respectively show statistical labeling results of different network models on two small data sets, where the PVCRF model provided in this embodiment obtains superior average IoU on S3DIS and vKITTI data sets, respectively 51.8% and 39.1%, and also obtains better results on overall labeling accuracy OA, respectively 81.2% and 82.6%, indicating that the PVCRF model can better obtain complete feature expression of a point cloud, and proving that multi-scale feature extraction based on a three-dimensional voxel space and multi-scale feature extraction based on an euclidean space have a comparable function, and both can provide detailed information of a point cloud scene for three-dimensional point cloud labeling.
Comparing PVCRF with SEGCloud: SEGCloud directly adopts a binary voxel model for voxel feature learning, whereas PVCRF adopts a rasterized point cloud voxel model and uses the point cloud inside each voxel for voxel feature learning; both add back-end optimization based on a fully-connected CRF model. In addition, PVCRF includes a multi-scale feature extraction module based on the three-dimensional voxel space, so PVCRF achieves a higher average IoU, which fully demonstrates that the multi-scale feature extraction module can extract feature descriptions with stronger characterization capability by processing the point cloud data inside the voxels.
Comparing the Point-VoxelNet and PVCRF models, the PVCRF model performs better in terms of both average IoU and overall accuracy OA, which indicates that the fully-connected CRF model can model context information over a larger range and characterizes the adjacency relations of the point cloud more strongly.
Taking PointNet as the baseline on the different datasets, the network model PVCRF provided in this embodiment achieves a large performance improvement on the S3DIS dataset, where the point clouds are concentrated and generally dense, whereas on the vKITTI dataset the performance improvement is smaller because the point cloud distribution is generally sparse.
The labeling results obtained by the different algorithms on the S3DIS and vKITTI datasets, together with the ground-truth labeling results, are shown in fig. 15 and fig. 16; from left to right: the input color point cloud, the ground-truth labeling result, and the labeling results of PointNet, Point-VoxelNet and PVCRF. From the labeling results in fig. 15 and fig. 16, it can be seen that the results obtained with PVCRF are better than those of PointNet and Point-VoxelNet and are closer to the ground truth, which demonstrates the effectiveness of the multi-scale feature learning and the CRF back-end optimization in PVCRF. Comparing PointNet and Point-VoxelNet, the labeling result obtained with Point-VoxelNet is superior to that of PointNet, mainly because the introduction of multi-scale feature learning in Point-VoxelNet improves the feature learning capability of the network model.
In both indoor and outdoor scenes, the Point-VoxelNet and PVCRF models still have difficulty separating targets that overlap heavily or are closely connected, such as the wall board (board) and window (window) in fig. 15, and the road (road) and terrain (terrain) in fig. 16. Because multi-scale feature extraction is considered in both the Point-VoxelNet and PVCRF models, targets of larger size in the scene can be labeled well, such as the table (table) in fig. 15 and the building (building) in fig. 16.
Comparing Point-VoxelNet with PVCRF, the fully-connected CRF in PVCRF can extract large-scale context information from the scene point cloud, so that the details in the labeling result are more prominent, as shown in the segmentation results of the chair (chair) and the sofa (sofa) in the first row of fig. 15 and of the road (road) and the terrain (terrain) in fig. 16.
(4) Class confusion matrix analysis
Fig. 17 and fig. 18 respectively show the category confusion matrices obtained by the Point-VoxelNet and PVCRF network models on the S3DIS and vKITTI datasets, where the numerical value in a matrix cell is the category labeling accuracy and the cell color also represents the accuracy. Comparing the two results, it can be found that for the indoor dataset S3DIS, the segmentation accuracies of the Point-VoxelNet and PVCRF models are comparable across the various targets, and the introduction of the CRF model mainly reduces the confusion between the table (table) and the floor (floor). For the outdoor dataset vKITTI, the introduction of the CRF model mainly reduces the confusion between buildings (building) and trucks (van).
As can be seen from fig. 17(a), on the S3DIS dataset the recognition accuracy of the Point-VoxelNet network model for the ceiling (ceiling), floor (floor), door (door), column (column), beam (beam), window (window), bookcase (bookcase), wall board (board) and chair (chair) is above 52%; the labeling accuracy is lower for the sofa (sofa), wall (wall) and clutter (clutter), with accuracies between 35% and 46%; and the worst of the 13 target classes is the table (table), with an accuracy of 22%. The distribution of average class accuracy obtained by PVCRF on the S3DIS dataset is similar to that of the Point-VoxelNet network model, but the accuracy is generally improved; in particular, the accuracy of the table is improved to 30%, as shown in fig. 18(a).
Similar comparison results are also found in vKITTI datasets, as shown in FIGS. 17(b) and 18 (b). This comparison demonstrates the accurate understanding and modeling of the CRF model for the target details in the scene. Comparing different results on the two data sets in fig. 17 and fig. 18, the marking accuracy on the vKITTI data set is generally low, which is mainly because the point cloud in the vKITTI data set is generally sparse and uneven relative to the point cloud data in the S3DIS data set, so the neighborhood structure of the point is not so obvious, which further results in insufficient information extraction, and is not beneficial to realizing high-accuracy point cloud marking.
(5) Computational time statistical analysis
Experimental analysis was performed on the computational efficiency of the different algorithms on the S3DIS dataset. The S3DIS data set contains 272 independent point clouds, each point cloud is labeled by using different algorithms and the average test time is counted (only the time consumed by the neural network calculation is counted, and the time consumed by the data preparation stage is not counted), and the statistical results are shown in table 3. As can be seen from the statistical results in table 3, the calculation time of PointNet is about 1.8 s. The calculation of the network model PVCRF for fusing the three-dimensional voxel convolution neural network and the CRF rear-end optimization provided by the embodiment takes the largest time, namely 4.52s, because the calculation amount is obviously increased due to the introduction of the CRF.
TABLE 3
Finally, further optimization is performed using the FC-CRF implemented based on the convolutional neural network to obtain a fine labeling result.
Further, referring to fig. 19, the fused voxel-based three-dimensional point cloud labeling apparatus 100 provided in the embodiment of the present invention includes a voxel processing and feature extraction module 110, a multi-scale voxel feature calculation module 120, a feature expansion module 130, and a point cloud marking module 140.
A voxel processing and feature extraction module 110, configured to perform voxel processing on the three-dimensional point cloud data set, and perform voxel feature extraction in voxels based on a processing result to form a first voxel feature matrix; in this embodiment, the step S11 may be executed by the voxel processing and feature extraction module 110, that is, the detailed description of the voxel processing and feature extraction module 110 may refer to the step S11, which is not repeated herein. Optionally, as shown in fig. 19, in the present embodiment, the voxel processing and feature extracting module 110 includes a voxel dividing unit 111, a point cloud classifying unit 112, and a point cloud sampling unit 113.
The voxel dividing unit 111 is used for dividing the point cloud coordinate space into a plurality of voxels according to a preset voxel size; in this embodiment, the step S111 may be executed by the voxel dividing unit 111, that is, for a detailed description of the voxel dividing unit 111, reference may be made to the step S111, which is not described herein again.
A point cloud classifying unit 112, configured to classify each point in the three-dimensional point cloud data set into a corresponding voxel according to a grid parameter of the voxel; in this embodiment, the step S112 may be performed by the point cloud classifying unit 112, that is, the step S112 may be referred to for the detailed description of the point cloud classifying unit 112, and the detailed description is not repeated herein.
And the point cloud sampling unit 113 is configured to sample points in each classified voxel so that the number of points in the voxel reaches a first preset value. In this embodiment, the step S113 may be performed by the point cloud sampling unit 113, that is, the step S113 may be referred to for the detailed description of the point cloud sampling unit 113, and the detailed description is not repeated herein.
A multi-scale voxel characteristic calculation module 120, configured to calculate a multi-scale characteristic of a voxel by using the first voxel characteristic matrix as an input of a three-dimensional convolutional neural network, and perform characteristic tandem fusion on the multi-scale characteristic to obtain a second voxel characteristic matrix; in this embodiment, the step S12 may be executed by the multi-scale voxel characteristic calculating module 120, that is, the step S12 may be referred to for the detailed description of the multi-scale voxel characteristic calculating module 120, and the detailed description is not repeated here.
A feature expansion module 130, configured to expand the voxel features in the second voxel feature matrix to each point in the three-dimensional point cloud data set based on a feature interpolation algorithm to obtain a point cloud feature matrix; in this embodiment, the step S13 can be executed by the feature extension module 130, that is, the detailed description about the feature extension module 130 can refer to the step S13, and the detailed description is not repeated here.
And the point cloud marking module 140 is used for inputting the point cloud feature matrix into the multilayer perceptron to mark the attributes of the three-dimensional point cloud. In this embodiment, the step S14 can be executed by the point cloud marking module 140, that is, the detailed description about the point cloud marking module 140 can refer to the step S14, and the detailed description is not repeated here.
In summary, the method and the device for labeling the three-dimensional point cloud based on the fused voxel provided by the embodiment of the invention extract multi-scale voxel characteristics by constructing a multi-scale space on a regularized voxel model based on a voxel convolution neural network, and then expand the voxel characteristics to point characteristics by using a characteristic interpolation mode, thereby realizing relatively fine classification and point cloud labeling point by point.
In the description of the present invention, the terms "disposed", "connected" and "connected" should be interpreted broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art. In the embodiments provided in the embodiments of the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to a predetermined number of embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code. The module, segment, or portion of code, comprises one or a predetermined number of elements designed to implement a specified logical function.
It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.