CN111489358B - Three-dimensional point cloud semantic segmentation method based on deep learning - Google Patents
Three-dimensional point cloud semantic segmentation method based on deep learning
- Publication number
- CN111489358B CN111489358B CN202010190589.1A CN202010190589A CN111489358B CN 111489358 B CN111489358 B CN 111489358B CN 202010190589 A CN202010190589 A CN 202010190589A CN 111489358 B CN111489358 B CN 111489358B
- Authority
- CN
- China
- Prior art keywords
- point cloud
- features
- semantic segmentation
- feature
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000011218 segmentation Effects 0.000 title claims abstract description 79
- 238000000034 method Methods 0.000 title claims abstract description 23
- 238000013135 deep learning Methods 0.000 title claims abstract description 12
- 238000000605 extraction Methods 0.000 claims abstract description 45
- 238000012549 training Methods 0.000 claims abstract description 38
- 230000006870 function Effects 0.000 claims abstract description 27
- 238000003062 neural network model Methods 0.000 claims abstract description 16
- 230000001737 promoting effect Effects 0.000 claims abstract description 4
- 238000011176 pooling Methods 0.000 claims description 16
- 238000010586 diagram Methods 0.000 claims description 8
- 239000000284 extract Substances 0.000 claims description 7
- 238000004590 computer program Methods 0.000 claims description 2
- 230000002401 inhibitory effect Effects 0.000 claims description 2
- 239000011159 matrix material Substances 0.000 claims description 2
- 230000000694 effects Effects 0.000 abstract description 7
- 238000003909 pattern recognition Methods 0.000 abstract description 2
- 238000010606 normalization Methods 0.000 description 8
- 238000013527 convolutional neural network Methods 0.000 description 7
- 238000013528 artificial neural network Methods 0.000 description 5
- 230000004913 activation Effects 0.000 description 4
- 238000005070 sampling Methods 0.000 description 4
- 230000006872 improvement Effects 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 238000012795 verification Methods 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000012544 monitoring process Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 238000007796 conventional method Methods 0.000 description 1
- 230000008034 disappearance Effects 0.000 description 1
- 238000004880 explosion Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10004—Still image; Photographic image
- G06T2207/10012—Stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a three-dimensional point cloud semantic segmentation method based on deep learning, belonging to the fields of three-dimensional point clouds and pattern recognition. The method comprises the following steps: training a semantic segmentation neural network model on a three-dimensional point cloud training set, where the labels are the true semantic categories and the model comprises a feature extraction network and a semantic segmentation network; the feature extraction network extracts the global and local features of the three-dimensional point cloud; the semantic segmentation network fuses the global and local features of the point cloud, and its output feature map gives the probability that each point belongs to each semantic category; and inputting the point cloud to be segmented into the trained semantic segmentation neural network model to obtain the point cloud segmentation result. The method uses a local feature extraction module to extract local features of the point cloud at multiple scales, a channel attention boosting module to raise the attention paid to important feature channels and suppress unimportant ones, and a weighted multi-class loss function to improve training, thereby raising the accuracy of the semantic segmentation method.
Description
Technical Field
The invention belongs to the field of three-dimensional point cloud and pattern recognition, and particularly relates to a three-dimensional point cloud semantic segmentation method based on deep learning.
Background
Semantic segmentation of three-dimensional point clouds is the basis of semantic understanding and analysis of three-dimensional scenes, and is a research hotspot in fields such as navigation and positioning, pattern recognition, and autonomous driving. Three-dimensional point cloud semantic segmentation algorithms fall mainly into traditional feature-extraction algorithms and deep learning algorithms.
Three-dimensional point cloud segmentation algorithms based on traditional feature extraction cluster and classify the point cloud using extracted features such as boundary gradients, normal vectors, surface ratios and texture, and perform semantic segmentation on that basis.
Three-dimensional point cloud segmentation algorithms based on deep learning divide into voxel CNN, multi-view CNN and point cloud CNN approaches. Voxel CNN algorithms first convert the point cloud into a 3D grid and then apply three-dimensional convolutions analogous to the two-dimensional case; the added dimension makes the time and space complexity prohibitively high. Multi-view CNN algorithms project the three-dimensional point cloud into images from multiple viewpoints, segment the images with an image semantic segmentation algorithm, and fuse the results back into the three-dimensional point cloud; this ignores the spatial structure of the point cloud and is difficult to extend to tasks such as scene understanding. Point cloud CNN algorithms take the point cloud directly as input without conversion and allow end-to-end training and learning. They obtain a semantic segmentation model by training on labeled three-dimensional point cloud scenes and use that model to complete the segmentation; such algorithms have relatively high accuracy and generality and are the focus of current research, but for more precise semantic understanding the accuracy of semantic segmentation still has room for improvement.
Therefore, existing three-dimensional point cloud semantic segmentation algorithms still have room for improved accuracy.
Disclosure of Invention
The invention provides a three-dimensional point cloud semantic segmentation method based on deep learning, aiming to solve the problem of low accuracy in prior-art three-dimensional point cloud semantic segmentation algorithms and to improve their accuracy.
In order to achieve the above object, according to a first aspect of the present invention, there is provided a deep learning-based three-dimensional point cloud semantic segmentation method, including the following steps:
S1, training a semantic segmentation neural network model with a three-dimensional point cloud training set, wherein each training sample is a three-dimensional point cloud, each label is a true semantic category, and the semantic segmentation neural network model comprises a feature extraction network and a semantic segmentation network; the feature extraction network is used for extracting global features and local features of the three-dimensional point cloud; the semantic segmentation network is used for fusing the global and local features of the point cloud, and its output feature map gives the probability that each point in the point cloud belongs to each semantic category;
and S2, inputting the point cloud to be segmented into the trained semantic segmentation neural network model to obtain the point cloud segmentation result.
Preferably, the feature extraction network takes the point cloud as input and extracts features sequentially through an MLP(64), a first local feature extraction module, a first channel attention boosting module, a second local feature extraction module, a second channel attention boosting module, a connection structure, an MLP(1024) and a max pooling layer;
the local feature extraction module is used for extracting multi-scale local features of the point cloud;
the channel attention boosting module is used for extracting global features and point features of the point cloud, focusing attention on feature channels carrying a large amount of information and suppressing unimportant channel features;
the connection structure connects the local features output by the two local feature extraction modules with the point features output by the second channel attention boosting module, and the final features are obtained through MLP(1024) and max pooling;
here, the number in parentheses of MLP () represents the number of convolution kernels.
Preferably, the local feature extraction module includes four parallel branches, each extracting local neighborhood features at one scale, and each branch comprises a KNN local neighborhood search module, two 1×1 multilayer perceptrons (MLPs) and a Softmax function; the local feature extraction module concatenates the four parallel branch features to obtain the multi-scale local features.
Preferably, the specific structure of the channel attention boosting module is as follows:
a 1×1 MLP(C) for extracting C-dimensional features;
a first branch, in which the C-dimensional features pass through a global average pooling layer, a fully connected layer and a Sigmoid function to obtain a weight for each feature channel, and the weights are multiplied with the C-dimensional features to obtain the channel-attention-boosted point features;
a second branch, in which max pooling of the C-dimensional features yields a C-dimensional global feature; the global feature is copied back to the size of the original C-dimensional features and multiplied by the weights obtained in the first branch to obtain the channel-attention-boosted global feature;
here, the number in parentheses of MLP () represents the number of convolution kernels.
Preferably, the semantic segmentation sub-network of the semantic segmentation network obtains the probability of each semantic category of each point in the point cloud through a connection structure, MLP (512), MLP (256), MLP (128), MLP (c), and uses the category with the highest probability as the label of the point, thereby realizing semantic segmentation and obtaining a segmentation result, wherein the number in the parentheses of MLP () represents the number of convolution kernels.
Preferably, in the training stage, the current network weight parameter is calculated and updated, and the weighted multi-class loss function is:
where Loss is the loss function, f_{l(x)}(x) is the Softmax function, a_{l(x)}(x) is the value of the segmentation network's output feature map at the position of point x for its semantic category l(x), K is the number of semantic categories, w_{l(x)} is the weight of point x belonging to semantic category l(x), N is the total number of points in the point set Ω input into the network model, and N_{l(x)} is the number of points of category l(x).
To achieve the above object, according to a second aspect of the present invention, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method according to the first aspect.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The invention uses a local feature extraction module to extract local features of the point cloud. Local neighborhoods at different scales carry different useful information, so a parallel multi-scale network structure is used to extract multi-scale features and fuse them fully. A channel attention boosting module raises the attention paid to important feature channels and suppresses unimportant ones. A weighted multi-class loss function improves training: different classes receive different learning weights, which reduces the network's overfitting to the unbalanced class distribution during training; classes with fewer points receive larger learning weights and classes with more points receive smaller ones, raising the attention paid to under-represented classes; and the segmentation results improve as the Loss value is reduced during training.
(2) The invention trains directly on the point cloud as input and uses the neural network to extract global features of the input point cloud, addressing the unordered nature of point clouds. Finally, the global features and point features are concatenated and the classification probability of each point is obtained through MLP layers, realizing semantic segmentation of the three-dimensional point cloud.
(3) The invention extracts and fuses point cloud features through multilayer convolution operations and uses batch normalization and activation functions to optimize the network, improving the training effect; accuracy, loss and other quantities are monitored in real time during training.
(4) The invention preprocesses the S3DIS dataset, including splitting and sampling the rooms in the dataset, so that each point carries 9 dimensions of information; the expanded dimensionality helps improve the segmentation precision achieved by network training.
Drawings
FIG. 1 is a flowchart of an embodiment of the present invention for implementing a deep learning-based semantic segmentation method for three-dimensional point clouds;
FIG. 2 is a schematic structural diagram of a three-dimensional semantic segmented deep neural network provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a local feature extraction module in a deep neural network according to an embodiment of the present invention;
fig. 4 is a schematic diagram of KNN neighborhood search in the local feature extraction module according to the embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a channel attention boosting module in a deep neural network according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a connection structure of a convolutional layer, a batch normalization layer, and an activation layer in a deep neural network according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, the invention provides a deep learning-based three-dimensional point cloud semantic segmentation method, which includes:
s1, training a semantic segmentation neural network model by using a three-dimensional point cloud training set, wherein a training sample is a point cloud containing actual three-dimensional coordinates, RGB colors and normalized coordinate information, and a label is a real semantic category.
The indoor three-dimensional point cloud dataset is preprocessed by splitting, sampling and position normalization, and a training set and a validation set are established.
The indoor three-dimensional point cloud dataset Stanford Large-Scale 3D Indoor Spaces (S3DIS) contains the three-dimensional point clouds of 271 rooms, divided into six areas Area 1-Area 6; Area 1-Area 5 are used as training areas and Area 6 as the validation area. Each room is sampled and position-normalized: the room is split into a number of cuboid blocks with a 1 m × 1 m footprint; when the actual number of points N in a block exceeds 4096 the points are randomly downsampled, and when N is at most 4096 points are randomly duplicated, so that every block contains exactly 4096 points. Each point in each block is then position-normalized with respect to its room:
X_norm = X / X_room_max
Y_norm = Y / Y_room_max
Z_norm = Z / Z_room_max
where (X, Y, Z) are the coordinates of a point, X_room_max, Y_room_max and Z_room_max are the maximum values over the points of the room along the X, Y and Z axes respectively, and (X_norm, Y_norm, Z_norm) are the coordinates of the normalized point.
Further, each point in the dataset has 9 dimensions: the actual three-dimensional coordinates, the RGB color values and the normalized coordinates, i.e. [X, Y, Z, R, G, B, X_norm, Y_norm, Z_norm].
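A minimal NumPy sketch of this preprocessing, under the block size and point count stated above, might look as follows; the function name, the axis-aligned 1 m grid and the handling of empty blocks are illustrative assumptions rather than details taken from the patent.

```python
import numpy as np

def preprocess_room(points, colors, block_size=1.0, num_points=4096):
    """Split one room into 1 m x 1 m blocks, resample each block to a fixed
    point count, and append room-normalized coordinates.

    points: (N, 3) float array of XYZ coordinates
    colors: (N, 3) float array of RGB values
    Returns a list of (num_points, 9) arrays: [X, Y, Z, R, G, B, Xn, Yn, Zn].
    """
    room_max = points.max(axis=0)                      # per-room maxima for position normalization
    blocks = []
    x_bins = np.arange(points[:, 0].min(), points[:, 0].max() + block_size, block_size)
    y_bins = np.arange(points[:, 1].min(), points[:, 1].max() + block_size, block_size)
    for x0 in x_bins:
        for y0 in y_bins:
            mask = ((points[:, 0] >= x0) & (points[:, 0] < x0 + block_size) &
                    (points[:, 1] >= y0) & (points[:, 1] < y0 + block_size))
            idx = np.where(mask)[0]
            if idx.size == 0:
                continue                               # skip empty blocks (assumption)
            # Randomly downsample blocks with more than num_points points,
            # otherwise randomly duplicate points until num_points is reached.
            choice = np.random.choice(idx, num_points, replace=idx.size < num_points)
            xyz = points[choice]
            normed = xyz / room_max                    # normalization relative to the room
            blocks.append(np.concatenate([xyz, colors[choice], normed], axis=1))
    return blocks
```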
Further, the semantic categories of the point cloud in the S3DIS dataset fall into 13 classes: ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board, and clutter (others).
As shown in fig. 2, the semantic segmentation neural network model is composed of a feature extraction sub-network and a semantic segmentation sub-network. The network model of the invention adds a multi-scale local feature extraction module to extract multi-scale local features of the point cloud, a channel attention boosting module to raise the attention paid to important feature channels, and a weighted multi-class loss function to aid training, thereby improving the quality of three-dimensional semantic segmentation.
Feature extraction subnetwork
The feature extraction sub-network mainly comprises: two local feature extraction modules, two channel attention boosting modules, a connection structure, two multilayer perceptrons (a 1×9 MLP(64) and a 1×1 MLP(1024)), and a max pooling layer. The sub-network takes the point cloud directly as input and extracts features sequentially through the MLP(64), local feature extraction module 1, channel attention boosting module 1, local feature extraction module 2, channel attention boosting module 2, the connection structure, the MLP(1024) and the max pooling layer. N denotes the number of input points and is set to 4096, so the network input is 4096 × 9, i.e. 4096 points with 9 dimensions each.
The local feature extraction module extracts multi-scale local features of the point cloud. As shown in fig. 3, it contains four parallel branches, each extracting local neighborhood features at one scale; each branch consists of a KNN local neighborhood search, two 1×1 multilayer perceptrons (MLPs) and a Softmax function, and the module concatenates the four branch features to obtain the multi-scale local features. KNN(k) denotes a KNN neighborhood search, with the number in parentheses giving the number of points in each point's neighborhood; the local neighborhood search is illustrated in fig. 4, where a point's feature together with its edges to the features of the points in its local neighborhood forms the local neighborhood feature. The first branch extracts point features through a 1×1 MLP(O); its output feature dimension is O. The second branch searches the K-neighborhood of the point cloud with KNN(K), first extracts features through a 1×1 MLP(O), then extracts a weight for each edge in the local features through a 1×1 MLP(1) and a Softmax function, and multiplies the MLP(O) output with the weights to obtain the local features at this neighborhood scale; the output feature dimension is O. The third branch searches the 2K-neighborhood with KNN(2K) and otherwise matches the second branch, with output dimension O. The fourth branch searches the 3K-neighborhood with KNN(3K), likewise with output dimension O. Finally, the output features of the four branches are concatenated to give the multi-scale local feature of dimension 4×O. Two local feature extraction modules are used in the feature extraction sub-network: O is 64 in the first and 128 in the second, and K is 16 in both, so the extracted local features cover four scales, namely 1, 16, 32 and 48.
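The following PyTorch sketch illustrates one way such a four-branch module could be realized; it is not the patent's implementation. In particular, defining an edge as the difference between a neighbor's feature and the center point's feature, performing the KNN search in feature space, and using nn.Linear layers as 1×1 MLPs are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def knn(x, k):
    """x: (B, N, C) features; returns (B, N, k) indices of each point's k nearest neighbors."""
    dist = torch.cdist(x, x)                              # (B, N, N) pairwise distances
    return dist.topk(k, dim=-1, largest=False).indices

def gather_neighbors(x, idx):
    """x: (B, N, C), idx: (B, N, k) -> (B, N, k, C) neighbor features."""
    B, N, k = idx.shape
    batch = torch.arange(B, device=x.device).view(B, 1, 1).expand(B, N, k)
    return x[batch, idx]

class LocalBranch(nn.Module):
    """One KNN branch: edge MLP(O), per-edge weight via MLP(1) + Softmax, weighted aggregation."""
    def __init__(self, in_dim, out_dim, k):
        super().__init__()
        self.k = k
        self.mlp_o = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        self.mlp_1 = nn.Linear(out_dim, 1)                 # one weight logit per edge
    def forward(self, x):                                  # x: (B, N, in_dim)
        neighbors = gather_neighbors(x, knn(x, self.k))    # (B, N, k, in_dim)
        edges = neighbors - x.unsqueeze(2)                 # assumed edge feature: neighbor - center
        feat = self.mlp_o(edges)                           # (B, N, k, O)
        w = F.softmax(self.mlp_1(feat), dim=2)             # Softmax over the k neighbors
        return (w * feat).sum(dim=2)                       # (B, N, O) local feature at this scale

class MultiScaleLocalFeature(nn.Module):
    """Four parallel branches: point MLP(O) plus KNN(K), KNN(2K) and KNN(3K); output dim 4*O."""
    def __init__(self, in_dim, out_dim=64, k=16):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        self.branches = nn.ModuleList(LocalBranch(in_dim, out_dim, k * s) for s in (1, 2, 3))
    def forward(self, x):                                  # x: (B, N, in_dim)
        feats = [self.point_mlp(x)] + [branch(x) for branch in self.branches]
        return torch.cat(feats, dim=-1)                    # (B, N, 4*O)
```

With O = 64, K = 16 in the first module and O = 128, K = 16 in the second, the outputs would be 256- and 512-dimensional, consistent with the 4×O dimension stated above.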
As shown in fig. 5, the channel attention boosting module extracts high-dimensional features of the point cloud, focuses attention on the feature channels carrying the most information, and suppresses unimportant channel features. The module contains two branches and outputs two features: a global feature and a point feature. C-dimensional features are first extracted through a 1×1 MLP(C). Then, from bottom to top: in the first branch, the C-dimensional features pass through a global average pooling layer, a fully connected layer FC and a Sigmoid function to obtain a weight for each feature channel, and the weights are multiplied with the C-dimensional features to obtain the channel-attention-boosted point features; in the second branch, max pooling of the C-dimensional features yields a C-dimensional global feature, which is copied back to the size of the original C-dimensional features and multiplied by the weights from the first branch to obtain the channel-attention-boosted global feature. Two channel attention boosting modules are used in the feature extraction sub-network: the first uses C = 64 and outputs global and point features of dimension 64; the second uses C = 128 and outputs global and point features of dimension 128.
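A compact sketch of this module in the same PyTorch style is shown below; the single fully connected layer and the exact broadcasting of the channel weights are assumptions based on the description of fig. 5.

```python
import torch
import torch.nn as nn

class ChannelAttentionBoost(nn.Module):
    """Channel attention boosting: returns attention-weighted point features and global features."""
    def __init__(self, in_dim, c):
        super().__init__()
        self.mlp_c = nn.Sequential(nn.Linear(in_dim, c), nn.ReLU())   # 1x1 MLP(C)
        self.fc = nn.Linear(c, c)                                     # fully connected layer FC
    def forward(self, x):                                             # x: (B, N, in_dim)
        feat = self.mlp_c(x)                                          # (B, N, C)
        # First branch: global average pooling -> FC -> Sigmoid -> per-channel weights.
        w = torch.sigmoid(self.fc(feat.mean(dim=1)))                  # (B, C)
        point_feat = feat * w.unsqueeze(1)                            # boosted point features
        # Second branch: max pooling -> C-dim global feature, copied to all points and reweighted.
        global_feat = feat.max(dim=1).values * w                      # (B, C)
        global_feat = global_feat.unsqueeze(1).expand_as(feat)        # (B, N, C)
        return point_feat, global_feat
```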
The connection structure of the feature extraction sub-network connects the local features output by the two local feature extraction modules with the point features output by the second channel attention boosting module, and the final global features are then obtained through a 1×1 MLP(1024) and max pooling.
If only the point features and the global features were connected, the local features of the point cloud would not be fully utilized; the local structural information between points helps improve the precision of semantic segmentation.
Semantic segmentation sub-network
The semantic segmentation sub-network mainly comprises a connection structure and four 1×1 multilayer perceptrons MLP(512), MLP(256), MLP(128) and MLP(c), where c is the number of semantic categories, 13. The four 1×1 MLPs are convolution layers with 1×1 convolution kernels; the number in parentheses of MLP() is the number of convolution kernels, which is also the output feature dimension. N denotes the number of input points, 4096. The semantic segmentation sub-network obtains the segmentation result through the connection structure followed by MLP(512), MLP(256), MLP(128) and MLP(c).
The connection structure of the semantic segmentation sub-network connects the final global feature and the 64- and 128-dimensional global features from the two channel attention boosting modules with several levels of point features (the 896-dimensional features output by the connection structure in the feature extraction sub-network, the 1024-dimensional features output by MLP(1024), and the 1024-dimensional global feature output by max pooling); the result then passes sequentially through 1×1 MLP(512), 1×1 MLP(256), 1×1 MLP(128) and 1×1 MLP(c), where c is the number of semantic categories, to obtain each point's score for every semantic category. The category with the largest score is taken as the point's label, realizing semantic segmentation.
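The sketch below shows one way the segmentation head could broadcast global features to every point, concatenate them with per-point features, and apply the shared 1×1 MLPs; how the caller assembles the several feature levels, and hence the input dimensionality, is left open and is an assumption rather than the patent's code.

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Connection structure plus shared per-point MLPs: MLP(512) -> MLP(256) -> MLP(128) -> MLP(c)."""
    def __init__(self, in_dim, num_classes=13):
        super().__init__()
        layers = []
        for d_in, d_out in zip([in_dim, 512, 256], [512, 256, 128]):
            layers += [nn.Conv1d(d_in, d_out, kernel_size=1),   # 1x1 convolution = shared per-point MLP
                       nn.BatchNorm1d(d_out), nn.ReLU()]
        layers.append(nn.Conv1d(128, num_classes, kernel_size=1))
        self.head = nn.Sequential(*layers)
    def forward(self, point_feats, global_feats):
        # point_feats: (B, Cp, N) concatenated point-level features;
        # global_feats: (B, Cg) concatenated global features, broadcast to every point.
        # in_dim must equal Cp + Cg.
        B, _, N = point_feats.shape
        fused = torch.cat([point_feats, global_feats.unsqueeze(-1).expand(-1, -1, N)], dim=1)
        logits = self.head(fused)                      # (B, num_classes, N) per-class scores
        return logits.argmax(dim=1), logits            # per-point labels and raw scores
```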
Further, as shown in fig. 6, the convolution and deconvolution layers in the semantic segmentation model are followed by batch normalization and then a ReLU activation function. The batch normalization layer counters exploding or vanishing gradients during backpropagation and alleviates overfitting, while the activation layer adds nonlinearity to the neural network model so that it can approximate arbitrary functions.
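For reference, the convolution - batch normalization - ReLU connection of fig. 6 corresponds to a building block like the following sketch; the 1×1 kernel size and the Conv1d layer type follow the 1×1 convolutions used throughout the network and are assumptions.

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    """1x1 convolution followed by batch normalization and a ReLU activation, as in fig. 6."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=1),
        nn.BatchNorm1d(out_ch),     # counters exploding/vanishing gradients and eases overfitting
        nn.ReLU(inplace=True))      # adds the nonlinearity that lets the model approximate general functions
```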
In training the three-dimensional point cloud semantic segmentation deep neural network model, loss values are computed with a weighted multi-class loss function based on the true categories to obtain the prediction error; the error is back-propagated and the current network weight parameters are computed and updated. Repeating this over the training set yields the final network weight parameters and hence the trained three-dimensional point cloud semantic segmentation deep neural network model. The weighted multi-class loss function is:
Here Loss is a weighted Softmax cross-entropy loss function, which measures the difference between the predicted and true values; the smaller the cross entropy, the better the model's predictions. f_{l(x)}(x) is the Softmax function, a_{l(x)}(x) is the value of the segmentation network's output feature map at the position of point x for its class l(x), and K is the number of semantic categories; the Softmax function makes the multi-class prediction probabilities at each position of the feature map sum to 1. w_{l(x)} is the weight of point x belonging to semantic category l(x), N is the total number of points in the dataset, and N_{l(x)} is the number of points of category l(x). The weights give categories with relatively few points a large learning weight and categories with relatively many points a small learning weight. The segmentation results improve as the value of Loss is reduced during training.
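The exact weighting formula is not reproduced in the text above; the sketch below therefore uses a common inverse-frequency weighting as a stand-in, which matches the stated behavior (fewer points, larger weight) but is an assumption.

```python
import torch
import torch.nn.functional as F

def weighted_multiclass_loss(logits, labels, class_counts):
    """Weighted Softmax cross-entropy over per-point predictions.

    logits:       (B, K, N) raw scores a_k(x) from the segmentation network
    labels:       (B, N) ground-truth category indices l(x)
    class_counts: (K,) number of points N_l of each category in the training set
    """
    # Inverse-frequency weights: categories with fewer points get larger learning weights.
    counts = class_counts.float().clamp(min=1)
    weights = counts.sum() / (len(counts) * counts)
    return F.cross_entropy(logits, labels, weight=weights)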
The training set consists of Areas 1-5 of the preprocessed S3DIS dataset. The training parameters are configured as follows: the number of points N in each input point cloud is 4096, with nine channels comprising XYZ coordinates, RGB colors and position-normalized coordinates. An Adam optimizer is used with an initial learning rate of 0.001, momentum 0.9, batch size 24, decay rate 0.5, decay step 300000, and a maximum of 50 iterations. Loss, IoU, recall and other metrics are monitored during training; after each iteration the current IoU is compared with the historical maximum, and if it is larger the current model is saved and the historical maximum is updated, so that the model saved at the end of training is the one with the highest IoU.
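Putting the stated hyper-parameters together, a training loop could look like the sketch below; the model, data loaders and mIoU evaluation function are placeholders supplied by the caller, the loss is the weighted_multiclass_loss sketch above, and mapping the decay step to a PyTorch StepLR scheduler is an assumption.

```python
import torch

def train(model, train_loader, val_loader, class_counts, evaluate_miou):
    """Adam (lr 0.001), step decay 0.5 every 300000 steps, 50 epochs, best-IoU checkpointing."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300000, gamma=0.5)
    best_miou = 0.0
    for epoch in range(50):
        model.train()
        for points, labels in train_loader:            # batches of 24 blocks of 4096 x 9 points
            optimizer.zero_grad()
            _, logits = model(points)                  # model assumed to return (labels, logits)
            loss = weighted_multiclass_loss(logits, labels, class_counts)
            loss.backward()
            optimizer.step()
            scheduler.step()
        miou = evaluate_miou(model, val_loader)        # mean IoU on the held-out validation area
        if miou > best_miou:                           # keep only the highest-IoU weights
            best_miou = miou
            torch.save(model.state_dict(), "best_model.pth")
    return best_miou
```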
And S2, inputting the point cloud to be segmented into the trained three-dimensional point cloud semantic segmentation neural network model to obtain the point cloud segmentation result.
When the trained three-dimensional point cloud semantic segmentation model performs semantic segmentation, it effectively obtains the semantic category of each point in the point cloud and improves the accuracy of the segmentation method; training on Areas 1-5 of the S3DIS dataset and validating on Area 6, the accuracy reaches 90.14% and the mIoU reaches 72.83%.
In practical application, the method can more accurately perform semantic segmentation of the three-dimensional point cloud, can realize higher precision compared with the conventional method, and is suitable for complex scenes.
It will be understood by those skilled in the art that the foregoing is only an exemplary embodiment of the present invention, and is not intended to limit the invention to the particular forms disclosed, since various modifications, substitutions and improvements within the spirit and scope of the invention are possible and within the scope of the appended claims.
Claims (6)
1. A three-dimensional point cloud semantic segmentation method based on deep learning is characterized by comprising the following steps:
S1, training a semantic segmentation neural network model with a three-dimensional point cloud training set, wherein a training sample is a three-dimensional point cloud, a label is a true semantic category, and the semantic segmentation neural network model comprises: a feature extraction network and a semantic segmentation network;
the feature extraction network is used for extracting global features and local features of the three-dimensional point cloud;
the semantic segmentation network is used for fusing the global features and the local features of the point cloud, and the output feature map corresponds to the probability that each point in the point cloud belongs to each semantic category;
S2, inputting the point cloud to be segmented into the trained semantic segmentation neural network model to obtain a point cloud segmentation result;
the feature extraction network takes the point cloud as input and extracts features sequentially through a 1×9 MLP(64), a first local feature extraction module, a first channel attention boosting module, a second local feature extraction module, a second channel attention boosting module, a connection structure, a 1×1 MLP(1024) and a max pooling layer; wherein,
the local feature extraction module is used for extracting multi-scale local features of the point cloud;
the channel attention boosting module is used for extracting global features and point features of the point cloud, focusing attention on feature channels carrying a large amount of information and suppressing unimportant channel features;
the connection structure is used for connecting the local features output by the two local feature extraction modules with the point features output by the second channel attention boosting module, and obtaining the final features through a 1×1 MLP(1024) and the max pooling layer;
the number in parentheses of MLP () represents the number of convolution kernels.
2. The method of semantic segmentation of three-dimensional point clouds according to claim 1, wherein the local feature extraction module includes four parallel branches, each branch extracting a scale of local neighborhood features,
a first branch for extracting point features through a 1×1 MLP(O), the output feature dimension being O;
the second branch is used for searching the K-neighborhood of the point cloud through KNN(K), extracting features through a 1×1 MLP(O), extracting the weight of each edge in the local features through a 1×1 MLP(1) and a Softmax function, and performing matrix multiplication of the MLP(O) output with the weights to obtain the local features at this local neighborhood scale, the output feature dimension being O;
the third branch is used for searching the 2K-neighborhood of the point cloud through KNN(2K), the rest of the branch being the same as the second branch with output feature dimension O;
the fourth branch is used for searching the 3K-neighborhood of the point cloud through KNN(3K), the rest of the branch being the same as the second branch with output feature dimension O;
and finally, the output features of the four branches are concatenated to obtain the multi-scale local feature, whose output feature dimension is 4×O.
3. The three-dimensional point cloud semantic segmentation method according to claim 1, wherein the channel attention boosting module has a specific structure as follows:
a 1×1 MLP(C) for extracting C-dimensional features;
the first branch is used for enabling the C-dimensional features to pass through a global average pooling layer, a full connection layer and a Sigmoid function to obtain the weight of each feature channel, and multiplying the weight by the C-dimensional features to obtain point features with improved channel attention;
the second branch is used for obtaining a C-dimensional global feature by performing maximal pooling on the C-dimensional feature, copying the global feature to restore the size of the original C-dimensional feature, and multiplying the global feature by the weight obtained by the first branch to obtain the global feature with the channel attention promoted;
here, the number in parentheses of MLP () represents the number of convolution kernels.
4. The three-dimensional point cloud semantic segmentation method according to any one of claims 1 to 3, wherein the semantic segmentation sub-network of the semantic segmentation network obtains the probability of each point in the point cloud for each semantic category through a connection structure, a 1×1 MLP(512), a 1×1 MLP(256), a 1×1 MLP(128) and a 1×1 MLP(c), and takes the category with the highest probability as the label of the point, thereby realizing semantic segmentation and obtaining the segmentation result, wherein c represents the number of semantic categories.
5. The method for semantic segmentation of three-dimensional point cloud according to any one of claims 1 to 3, wherein in the training stage, the current network weight parameters are calculated and updated, and the weighted multi-class loss function is:
where Loss is the loss function, f_{l(x)}(x) is the Softmax function, a_{l(x)}(x) is the value of the segmentation network's output feature map at the position of point x for its semantic category l(x), K is the number of semantic categories, w_{l(x)} is the weight of point x belonging to semantic category l(x), N is the total number of points in the point set Ω input into the network model, and N_{l(x)} is the number of points of category l(x).
6. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the deep learning based three-dimensional point cloud semantic segmentation method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010190589.1A CN111489358B (en) | 2020-03-18 | 2020-03-18 | Three-dimensional point cloud semantic segmentation method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010190589.1A CN111489358B (en) | 2020-03-18 | 2020-03-18 | Three-dimensional point cloud semantic segmentation method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111489358A CN111489358A (en) | 2020-08-04 |
CN111489358B true CN111489358B (en) | 2022-06-14 |
Family
ID=71810781
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010190589.1A Active CN111489358B (en) | 2020-03-18 | 2020-03-18 | Three-dimensional point cloud semantic segmentation method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111489358B (en) |
Families Citing this family (60)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111931790A (en) * | 2020-08-10 | 2020-11-13 | 武汉慧通智云信息技术有限公司 | Laser point cloud extraction method and device |
CN111882558A (en) * | 2020-08-11 | 2020-11-03 | 上海商汤智能科技有限公司 | Image processing method and device, electronic equipment and storage medium |
CN112017196B (en) * | 2020-08-27 | 2022-02-22 | 重庆邮电大学 | Three-dimensional tooth model mesh segmentation method based on local attention mechanism |
CN111950658B (en) * | 2020-08-28 | 2024-02-09 | 南京大学 | Deep learning-based LiDAR point cloud and optical image priori coupling classification method |
CN112070760B (en) * | 2020-09-17 | 2022-11-08 | 安徽大学 | Bone mass detection method based on convolutional neural network |
CN112200248B (en) * | 2020-10-13 | 2023-05-12 | 北京理工大学 | Point cloud semantic segmentation method, system and storage medium based on DBSCAN clustering under urban road environment |
CN112233124B (en) * | 2020-10-14 | 2022-05-17 | 华东交通大学 | Point cloud semantic segmentation method and system based on countermeasure learning and multi-modal learning |
CN112257597B (en) * | 2020-10-22 | 2024-03-15 | 中国人民解放军战略支援部队信息工程大学 | Semantic segmentation method for point cloud data |
CN112330699B (en) * | 2020-11-14 | 2022-09-16 | 重庆邮电大学 | Three-dimensional point cloud segmentation method based on overlapping region alignment |
CN112418235B (en) * | 2020-11-20 | 2023-05-30 | 中南大学 | Point cloud semantic segmentation method based on expansion nearest neighbor feature enhancement |
EP4009236A1 (en) * | 2020-12-02 | 2022-06-08 | Aptiv Technologies Limited | Method for determining a semantic segmentation of an environment of a vehicle |
CN112541535B (en) * | 2020-12-09 | 2024-01-05 | 中国科学院深圳先进技术研究院 | Three-dimensional point cloud classification method based on complementary multi-branch deep learning |
CN112541908B (en) * | 2020-12-18 | 2023-08-29 | 广东工业大学 | Casting flash recognition method based on machine vision and storage medium |
CN112560965B (en) * | 2020-12-18 | 2024-04-05 | 中国科学院深圳先进技术研究院 | Image semantic segmentation method, storage medium and computer device |
CN112560865B (en) * | 2020-12-23 | 2022-08-12 | 清华大学 | Semantic segmentation method for point cloud under outdoor large scene |
CN113160414B (en) * | 2021-01-25 | 2024-06-07 | 北京豆牛网络科技有限公司 | Automatic goods allowance recognition method, device, electronic equipment and computer readable medium |
CN112836734A (en) * | 2021-01-27 | 2021-05-25 | 深圳市华汉伟业科技有限公司 | Heterogeneous data fusion method and device and storage medium |
CN112907602B (en) * | 2021-01-28 | 2022-07-19 | 中北大学 | Three-dimensional scene point cloud segmentation method based on improved K-nearest neighbor algorithm |
CN112819080B (en) * | 2021-02-05 | 2022-09-02 | 四川大学 | High-precision universal three-dimensional point cloud identification method |
CN112837420B (en) * | 2021-03-09 | 2024-01-09 | 西北大学 | Shape complement method and system for terracotta soldiers and horses point cloud based on multi-scale and folding structure |
CN113011430B (en) * | 2021-03-23 | 2023-01-20 | 中国科学院自动化研究所 | Large-scale point cloud semantic segmentation method and system |
CN113850811B (en) * | 2021-03-25 | 2024-05-28 | 北京大学 | Three-dimensional point cloud instance segmentation method based on multi-scale clustering and mask scoring |
CN113012177A (en) * | 2021-04-02 | 2021-06-22 | 上海交通大学 | Three-dimensional point cloud segmentation method based on geometric feature extraction and edge perception coding |
CN113421267B (en) * | 2021-05-07 | 2024-04-12 | 江苏大学 | Point cloud semantic and instance joint segmentation method and system based on improved PointConv |
CN113345101B (en) * | 2021-05-20 | 2023-07-25 | 北京百度网讯科技有限公司 | Three-dimensional point cloud labeling method, device, equipment and storage medium |
CN113159232A (en) * | 2021-05-21 | 2021-07-23 | 西南大学 | Three-dimensional target classification and segmentation method |
CN113177555B (en) * | 2021-05-21 | 2022-11-04 | 西南大学 | Target processing method and device based on cross-level, cross-scale and cross-attention mechanism |
CN113554654B (en) * | 2021-06-07 | 2024-03-22 | 之江实验室 | Point cloud feature extraction system and classification segmentation method based on graph neural network |
CN113554653B (en) * | 2021-06-07 | 2024-10-29 | 之江实验室 | Semantic segmentation method based on mutual information calibration point cloud data long tail distribution |
CN113435461B (en) * | 2021-06-11 | 2023-07-14 | 深圳市规划和自然资源数据管理中心(深圳市空间地理信息中心) | Point cloud local feature extraction method, device, equipment and storage medium |
CN113361538B (en) * | 2021-06-22 | 2022-09-02 | 中国科学技术大学 | Point cloud classification and segmentation method and system based on self-adaptive selection neighborhood |
CN113516663B (en) * | 2021-06-30 | 2022-09-27 | 同济大学 | Point cloud semantic segmentation method and device, electronic equipment and storage medium |
CN113657387B (en) * | 2021-07-07 | 2023-10-13 | 复旦大学 | Semi-supervised three-dimensional point cloud semantic segmentation method based on neural network |
CN113538474B (en) * | 2021-07-12 | 2023-08-22 | 大连民族大学 | 3D point cloud segmentation target detection system based on edge feature fusion |
CN113449744A (en) * | 2021-07-15 | 2021-09-28 | 东南大学 | Three-dimensional point cloud semantic segmentation method based on depth feature expression |
CN113744186B (en) * | 2021-07-26 | 2024-09-24 | 南开大学 | Method for detecting surface defects of workpiece by fusing projection point set segmentation network |
CN113486988B (en) * | 2021-08-04 | 2022-02-15 | 广东工业大学 | Point cloud completion device and method based on adaptive self-attention transformation network |
CN113627440A (en) * | 2021-08-14 | 2021-11-09 | 张冉 | Large-scale point cloud semantic segmentation method based on lightweight neural network |
CN113781432B (en) * | 2021-09-10 | 2023-11-21 | 浙江大学 | Laser scanning automatic laying on-line detection method and device based on deep learning |
CN113807233B (en) * | 2021-09-14 | 2023-04-07 | 电子科技大学 | Point cloud feature extraction method, classification method and segmentation method based on high-order term reference surface learning |
CN113870272B (en) * | 2021-10-12 | 2024-10-08 | 中国联合网络通信集团有限公司 | Point cloud segmentation method and device and computer readable storage medium |
CN114241226A (en) * | 2021-12-07 | 2022-03-25 | 电子科技大学 | Three-dimensional point cloud semantic segmentation method based on multi-neighborhood characteristics of hybrid model |
CN114419372B (en) * | 2022-01-13 | 2024-11-01 | 南京邮电大学 | Multi-scale point cloud classification method and system |
CN114529757B (en) * | 2022-01-21 | 2023-04-18 | 四川大学 | Cross-modal single-sample three-dimensional point cloud segmentation method |
CN114359562B (en) * | 2022-03-20 | 2022-06-17 | 宁波博登智能科技有限公司 | Automatic semantic segmentation and labeling system and method for four-dimensional point cloud |
CN114419570B (en) * | 2022-03-28 | 2023-04-07 | 苏州浪潮智能科技有限公司 | Point cloud data identification method and device, electronic equipment and storage medium |
CN114419381B (en) * | 2022-04-01 | 2022-06-24 | 城云科技(中国)有限公司 | Semantic segmentation method and road ponding detection method and device applying same |
CN114792372B (en) * | 2022-06-22 | 2022-11-04 | 广东工业大学 | Three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention |
CN115170585B (en) * | 2022-07-12 | 2024-06-14 | 上海人工智能创新中心 | Three-dimensional point cloud semantic segmentation method |
CN115294562B (en) * | 2022-07-19 | 2023-05-09 | 广西大学 | Intelligent sensing method for operation environment of plant protection robot |
CN115311274B (en) * | 2022-10-11 | 2022-12-23 | 四川路桥华东建设有限责任公司 | Weld joint detection method and system based on spatial transformation self-attention module |
CN115471513B (en) * | 2022-11-01 | 2023-03-31 | 小米汽车科技有限公司 | Point cloud segmentation method and device |
CN115578393B (en) * | 2022-12-09 | 2023-03-10 | 腾讯科技(深圳)有限公司 | Key point detection method, key point training method, key point detection device, key point training device, key point detection equipment, key point detection medium and key point detection medium |
CN116129207B (en) * | 2023-04-18 | 2023-08-04 | 江西师范大学 | Image data processing method for attention of multi-scale channel |
CN116246039B (en) * | 2023-05-12 | 2023-07-14 | 中国空气动力研究与发展中心计算空气动力研究所 | Three-dimensional flow field grid classification segmentation method based on deep learning |
CN116824188B (en) * | 2023-06-05 | 2024-04-09 | 腾晖科技建筑智能(深圳)有限公司 | Hanging object type identification method and system based on multi-neural network integrated learning |
CN116524197B (en) * | 2023-06-30 | 2023-09-29 | 厦门微亚智能科技股份有限公司 | Point cloud segmentation method, device and equipment combining edge points and depth network |
CN117708560A (en) * | 2023-11-27 | 2024-03-15 | 云南电网有限责任公司昭通供电局 | Multi-information PointNet++ fusion method for constructing DEM (digital elevation model) based on airborne laser radar data |
CN118154929B (en) * | 2024-01-03 | 2024-09-03 | 华中科技大学 | Rock-soil particle three-dimensional point cloud semantic classification method |
CN118351332B (en) * | 2024-06-18 | 2024-08-20 | 山东财经大学 | Automatic driving vehicle three-dimensional feature extraction method and system based on point cloud data |
-
2020
- 2020-03-18 CN CN202010190589.1A patent/CN111489358B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020036742A1 (en) * | 2018-08-17 | 2020-02-20 | Nec Laboratories America, Inc. | Dense three-dimensional correspondence estimation with multi-level metric learning and hierarchical matching |
CN110197223A (en) * | 2019-05-29 | 2019-09-03 | 北方民族大学 | Point cloud data classification method based on deep learning |
CN110660062A (en) * | 2019-08-31 | 2020-01-07 | 南京理工大学 | Point cloud instance segmentation method and system based on PointNet |
CN110827398A (en) * | 2019-11-04 | 2020-02-21 | 北京建筑大学 | Indoor three-dimensional point cloud automatic semantic segmentation algorithm based on deep neural network |
Non-Patent Citations (1)
Title |
---|
A multi-scale fully convolutional network for semantic labeling of 3D point clouds; Mohammed Yousefhussien et al.; ISPRS; 2018-05-16; pp. 191-204 *
Also Published As
Publication number | Publication date |
---|---|
CN111489358A (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111489358B (en) | Three-dimensional point cloud semantic segmentation method based on deep learning | |
CN110930454B (en) | Six-degree-of-freedom pose estimation algorithm based on boundary box outer key point positioning | |
CN109919108B (en) | Remote sensing image rapid target detection method based on deep hash auxiliary network | |
WO2022252274A1 (en) | Point cloud segmentation and virtual environment generation method and apparatus based on pointnet network | |
CN110532920B (en) | Face recognition method for small-quantity data set based on FaceNet method | |
CN111242208A (en) | Point cloud classification method, point cloud segmentation method and related equipment | |
CN111898432B (en) | Pedestrian detection system and method based on improved YOLOv3 algorithm | |
CN112907602B (en) | Three-dimensional scene point cloud segmentation method based on improved K-nearest neighbor algorithm | |
CN111583263A (en) | Point cloud segmentation method based on joint dynamic graph convolution | |
CN111625667A (en) | Three-dimensional model cross-domain retrieval method and system based on complex background image | |
CN111612008A (en) | Image segmentation method based on convolution network | |
CN113435461B (en) | Point cloud local feature extraction method, device, equipment and storage medium | |
CN110263855B (en) | Method for classifying images by utilizing common-basis capsule projection | |
CN110532409B (en) | Image retrieval method based on heterogeneous bilinear attention network | |
CN115222998B (en) | Image classification method | |
CN112364747B (en) | Target detection method under limited sample | |
CN113989340A (en) | Point cloud registration method based on distribution | |
CN111899203A (en) | Real image generation method based on label graph under unsupervised training and storage medium | |
CN114187506B (en) | Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network | |
CN115049833A (en) | Point cloud component segmentation method based on local feature enhancement and similarity measurement | |
CN114066844A (en) | Pneumonia X-ray image analysis model and method based on attention superposition and feature fusion | |
CN117671666A (en) | Target identification method based on self-adaptive graph convolution neural network | |
CN117173595A (en) | Unmanned aerial vehicle aerial image target detection method based on improved YOLOv7 | |
CN116597267A (en) | Image recognition method, device, computer equipment and storage medium | |
Li et al. | Few-shot meta-learning on point cloud for semantic segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |