CN111027559A - Point cloud semantic segmentation method based on expansion point convolution space pyramid pooling - Google Patents
Point cloud semantic segmentation method based on expansion point convolution space pyramid pooling
- Publication number
- CN111027559A (application CN201911048539.3A)
- Authority
- CN
- China
- Prior art keywords
- point
- point cloud
- convolution
- expansion
- input
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a point cloud semantic segmentation method based on expansion point convolution spatial pyramid pooling. The method first obtains the center points of point cloud subsets through a farthest point sampling algorithm and determines each subset's range with the KNN algorithm; it then extracts the features of each point cloud subset through the expansion point convolution spatial pyramid pooling, which enlarges the receptive field of the point convolution and enriches the feature extraction of multi-scale targets in the scene; next, a simple and effective decoding module performs feature decoding, improving the segmentation accuracy on sparse point clouds; finally, the label of each point is obtained through fully connected layers. The method offers high segmentation accuracy and adapts to a wide variety of scenes.
Description
Technical Field
The invention belongs to the field of computer vision, and relates to a 3D semantic segmentation method based on expansion point convolution spatial pyramid pooling.
Background
Point cloud semantic segmentation is one of the main research difficulties and hot spots in 3D scene analysis, and efficiently acquiring the local features, global features, and scene context information of a point cloud is an urgent problem. Point cloud semantic segmentation classifies the scene point by point based on the acquired point cloud features, achieving scene analysis. However, scene point clouds are unordered, sparse, and unevenly dense, and scene targets exhibit multi-scale characteristics, all of which seriously hinder feature acquisition. Current schemes for point cloud semantic segmentation fall into three categories: multi-view schemes, voxelization schemes, and schemes that process the point cloud directly. Multi-view schemes project the point cloud scene onto images from different views and feed them into a conventional 2D convolutional neural network; voxelization schemes divide the point cloud into 3D grids and extract features with a 3D convolutional neural network. Both convert the irregular point cloud into regular data, which avoids the limitations of point clouds to some extent, but they lose part of the geometric information of the scene point cloud, introduce quantization errors, and tie the segmentation accuracy to the performance of a conventional convolutional neural network. Schemes that process the point cloud directly have received increasing attention because they retain the point cloud information to the greatest extent. To obtain context information at different scales, point cloud semantic segmentation networks usually adopt multi-scale grouping of the point cloud, which incurs considerable computation cost. In addition, the ability of existing point cloud semantic segmentation networks to acquire local features and context information still needs improvement.
How to improve the acquisition of local features and context information while reducing the network computation cost is the main technical problem that urgently needs to be solved in this field.
Disclosure of Invention
Aiming at the problems, the invention provides a point cloud semantic segmentation method based on expanded point convolution space pyramid pooling.
A point cloud semantic segmentation method based on an expansion point convolution space pyramid comprises the following steps:
Step 1: obtain the center points of the point cloud subsets from the ScanNet dataset point cloud with a farthest point sampling algorithm;
The input point cloud of the network is P = {p1, p2, p3, …, pn}. A subset Psub = {pi1, pi2, pi3, …, pim} is selected from the input point cloud using an iterative farthest point sampling algorithm, such that each pij is the point farthest from the other points in the subset;
Step 2: determine the range of each point cloud subset with the nearest neighbor algorithm, based on the subset center points obtained in step 1;
The inputs are a P×(D+C) matrix and a P1×D matrix; the output is a P1×K1×(D+C) matrix, where P is the number of input points, P1 is the number of sampled center points, D is the dimension of each point's coordinate information, C is the dimension of each point's feature information, and K1 is the number of neighborhood points per center point;
The KNN algorithm finds the K1 neighborhood points closest to each center point; the neighborhood points are sorted and numbered by their distance to the center point. The K1 neighborhood points together with the center point constitute a center point neighborhood, also referred to as a local neighborhood.
Step 3: extract local neighborhood features F1 with the improved expansion point convolution spatial pyramid pooling, obtaining P1 abstract point clouds;
The input is a P1×K1×(D+C) matrix; the output is a P1×(D+C′) matrix, where C′ is the point feature dimension abstracted within the local neighborhood by the improved expansion point convolution spatial pyramid pooling;
Step 4: downsample and group the P1 abstract point clouds obtained in step 3.
The input of this step is a P1×(D+C′) matrix; the output is a P2×K2×(D+C′) matrix, where P2 is the number of center points in the second downsampling and K2 is the number of neighborhood points per center point in the second downsampling. Steps 1 and 2 are repeated on the P1 abstract point clouds to obtain P2 downsampled center points, each with K2 neighborhood points, yielding P2 local neighborhoods;
Step 5: extract local neighborhood features F2 with PointNet;
The input is a P2×K2×(D+C′) matrix; the output is a P2×(D+C″) matrix, where C″ is the point feature dimension abstracted by PointNet within the local neighborhood;
Step 6: decode the P2 abstract point clouds carrying features F2 to obtain P1 abstract point clouds carrying features F3;
The input is a P2×(D+C″) matrix; the output is a P1×(D+C‴) matrix, where C‴ is the feature dimension of the decoded point cloud;
Step 7: decode the P1 abstract point clouds carrying features F3 to obtain P abstract point clouds carrying features F4;
The input is a P1×(D+C‴) matrix; the output is a P×(D+C⁗) matrix, where C⁗ is the feature dimension of the decoded point cloud;
Step 8: obtain the label of each point through fully connected layers.
The input of this step is a P×(D+C⁗) matrix; the output is a P×k matrix, where k is the number of scene point cloud categories.
Further, the extraction of point cloud local neighborhood features by the improved expansion point convolution spatial pyramid pooling comprises the following four steps: 1) improve the conventional expansion point convolution; 2) extract local neighborhood features with improved expansion point convolution channels of different expansion rates; 3) fuse the features of all channels; 4) reduce the dimension of the features;
the specific extraction process is as follows:
1) replace the point convolution kernel function (an MLP) to obtain the improved expansion point convolution;
The continuous convolution of a conventional expansion point convolution is defined as

$(H * G)(p_i) = \int H(p_j)\, G(p_j - p_i)\, dp_j$

where H is a continuous feature function that assigns a feature value to point $p_j$, and G is a continuous kernel function that maps the distance from $p_j$ to $p_i$ to a kernel weight. Using Monte Carlo integration, the continuous convolution definition is converted into

$(H * G)(p_i) \approx \frac{1}{|\mathcal{N}_d(p_i)|} \sum_{p_j \in \mathcal{N}_d(p_i)} H(p_j)\, G(p_j - p_i)$

where d is the expansion rate of the expansion point convolution and $\mathcal{N}_d(p_i)$ denotes the dilated neighborhood of $p_i$. The continuous kernel function G(·) is replaced with a multi-layer perceptron:

$G(p) \approx \mathrm{MLP}_\theta(p)$

where p is the relative position of the neighborhood point with respect to the center point (using the Euclidean distance), and θ is the set of parameters of the MLP.

To obtain more local neighborhood features, the local neighborhood point features are abstracted to a higher dimension to obtain richer information; the improved expansion point convolution therefore replaces the kernel function G(·) with PointNet:

$G(p) \approx \mathrm{PN}_{\theta'}(p)$

where PN is a PointNet network and θ′ is the set of parameters of PN;
2) extract local neighborhood features with each improved expansion point convolution;
The input is a P1×K1×(D+C) matrix; the output is a P1×(D+Ci) matrix, where Ci is the point feature dimension abstracted within the local neighborhood by the i-th improved expansion point convolution;
The spatial pyramid pooling has i channels; each channel carries an improved expansion point convolution with expansion rate d1, d2, …, di respectively, yielding i groups of local neighborhood context information P1×(D+Ci) at different scales;
3) fuse the feature information extracted by each improved expansion point convolution;
The inputs are the P1×(D+Ci) matrices; the output is a ΣP1×(D+Ci) matrix, where i is the number of spatial pyramid pooling channels, which is also the number of improved expansion point convolutions. Context information at different scales is fused by concatenation;
4) reduce the dimension of the feature information;
The input is the ΣP1×(D+Ci) matrix; the output is a P1×(D+C′) matrix.
As the number of channels increases, the number of local neighborhood features grows i-fold, which raises the computation cost at the back end of the encoding layer. Therefore, a 1×1 convolution is applied to the fused feature information to reduce the feature dimensionality.
Compared with conventional point convolution, the improved expansion point convolution enlarges the receptive field and obtains more scene context information without increasing the convolution computation cost. In addition, spatial pyramid pooling effectively encodes multi-scale scene content information. The method therefore combines the advantages of the improved expansion point convolution and spatial pyramid pooling, keeping the convolution computation cost in check while efficiently extracting local neighborhood features and context information.
Further, PointNet is selected as a feature extractor of local neighborhood point cloud, and the working principle of the PointNet feature extractor is as follows:
Given a set of unordered local neighborhood points $\{p_{l1}, p_{l2}, \dots, p_{ln}\}$, a function f can be defined that maps the point set to a vector:

$f(\{p_{l1}, \dots, p_{ln}\}) = \gamma\big(\max_{i=1,\dots,n} h(p_{li})\big)$

where γ and h are typically MLPs.
PointNet [Document 1] is often used as a point cloud feature extractor: it guarantees permutation invariance over unordered point clouds and abstracts low-dimensional point features into rich high-dimensional semantic features, improving segmentation accuracy.
Further, the decoding process in step 6 is as follows:
1) interpolation upsampling: the P2 abstract point clouds are upsampled with an interpolation algorithm to obtain P1 abstract point clouds whose point cloud features are F2. The input of this step is a P2×(D+C″) matrix; the output is a P1×(D+C″) matrix;
2) skip-link feature fusion: through a skip link, the features F1 of the P1 abstract point clouds from step 3 are fused by concatenation with the features F2 of the upsampled P1 abstract point clouds. The inputs of this step are a P1×(D+C′) matrix and a P1×(D+C″) matrix; the output is a P1×(D+C″+C′) matrix;
3) decoding: the point cloud is decoded from the fused features using a unit PointNet;
The input of this step is a P1×(D+C″+C′) matrix; the output is a P1×(D+C‴) matrix.
Further, the decoding process in step 7 is as follows:
1) interpolation upsampling: the input is a P1×(D+C‴) matrix; the output is a P×(D+C‴) matrix;
2) skip-link feature fusion: the inputs are a P×(D+C‴) matrix and a P×(D+C) matrix; the output is a P×(D+C‴+C) matrix;
3) decoding: the input is a P×(D+C‴+C) matrix; the output is a P×(D+C⁗) matrix.
Advantageous effects
The point cloud semantic segmentation method based on expansion point convolution spatial pyramid pooling first obtains the center points of point cloud subsets through a farthest point sampling algorithm and determines each subset's range with the KNN algorithm; it then extracts the features of each point cloud subset through the expansion point convolution spatial pyramid pooling, which enlarges the receptive field of the point convolution and enriches the feature extraction of multi-scale targets in the scene; next, a simple and effective decoding module performs feature decoding, improving the segmentation accuracy on sparse point clouds; finally, the label of each point is obtained through fully connected layers. The method offers high segmentation accuracy and adapts to a wide variety of scenes.
The invention realizes semantic segmentation of irregular, sparse, and unevenly dense point clouds; it offers high segmentation accuracy, low computation cost, and adaptability to many scenes, effectively addressing the low acquisition efficiency and high computation cost of point cloud local features and context information in indoor and outdoor scene semantic segmentation.
Compared with the existing point cloud semantic segmentation network, the invention has the advantages that:
1) Combining expansion point convolution with PointNet, the invention proposes an improved expansion point convolution that strengthens the acquisition of point cloud local features and context information;
2) Inspired by spatial pyramid pooling, the invention proposes the improved expansion point convolution spatial pyramid pooling, which effectively encodes multi-scale context information, enriches point cloud features, and improves scene semantic segmentation accuracy.
3) The invention proposes an improved encoding layer that fuses the expansion point convolution spatial pyramid pooling. Placing the pyramid pooling at the front end of the encoding layer avoids the loss of high-dimensional point cloud features and benefits the segmentation of small scene targets.
4) The invention proposes a simple and effective decoding layer that upsamples the high-dimensional point cloud features of the encoding layer and fuses them with the low-dimensional point cloud features, enriching scene detail information and improving segmentation accuracy.
Drawings
FIG. 1 is a block diagram of the overall network of the present invention;
FIG. 2 is a diagram of a conventional point convolution, an extended point convolution and an improved extended point convolution;
FIG. 3 is an expanded point convolution spatial pyramid pooling;
FIG. 4 is the PointNet feature extractor;
FIG. 5 is a point cloud semantic segmentation network framework.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings.
The point cloud data used by the invention can come from common indoor scene datasets such as ScanNet and common outdoor scene datasets such as Semantic3D. The ScanNet dataset provides various indoor scenes such as offices, apartments, and bedrooms; its point cloud data is acquired with an RGB-D camera and contains the coordinate information, color information, and alpha channel of each point, P(x, y, z, r, g, b, α). The Semantic3D dataset provides many types of outdoor scenes such as farms, open fields, and castles; its point cloud data is collected with a static ground laser scanner and contains the coordinate information, laser reflection intensity, and color information of each point, P(x, y, z, intensity, r, g, b). As an application example, test results on the public ScanNet dataset are given.
Fig. 1 shows the flowchart of the invention. The point cloud semantic segmentation method based on expansion point convolution spatial pyramid pooling comprises the following steps:
Step 1: obtain the center points of the point cloud subsets from the ScanNet dataset point cloud with a farthest point sampling algorithm;
The input point cloud of the network is P = {p1, p2, p3, …, pn}. A subset Psub = {pi1, pi2, pi3, …, pim} is selected from the input point cloud using an iterative farthest point sampling algorithm, such that each pij is the point farthest from the other points in the subset. Given the same number of center points, the farthest point sampling algorithm covers the input point cloud better than a random sampling algorithm.
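The following is a minimal NumPy sketch of the iterative farthest point sampling described above; the function name, array shapes, and seed choice are illustrative assumptions, not the patent's reference implementation.

```python
import numpy as np

def farthest_point_sampling(points, n_centers):
    """points: (P, 3) coordinate array; returns indices of n_centers subset points."""
    n = points.shape[0]
    chosen = np.zeros(n_centers, dtype=np.int64)
    min_dist = np.full(n, np.inf)   # distance to the nearest already-chosen center
    chosen[0] = 0                   # arbitrary seed point
    for i in range(1, n_centers):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)
        chosen[i] = int(np.argmax(min_dist))  # farthest from all chosen so far
    return chosen
```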
Step 2: determining the range of the point cloud subset by using a nearest neighbor algorithm (KNN) based on the point cloud subset center point obtained in the step 1;
The inputs of this step are a P×(D+C) matrix and a P1×D matrix; the output is a P1×K1×(D+C) matrix, where P is the number of input points, P1 is the number of sampled center points, D is the dimension of each point's coordinate information, C is the dimension of each point's feature information, and K1 is the number of neighborhood points per center point.
The KNN algorithm finds the K1 neighborhood points closest to each center point; the neighborhood points are sorted and numbered by their distance to the center point. The K1 neighborhood points together with the center point constitute a center point neighborhood, also referred to as a local neighborhood.
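A small sketch of this grouping step, under the assumption that the first three columns of each point are its coordinates (D = 3); `knn_group` and its shapes are illustrative, not the patent's code.

```python
import numpy as np

def knn_group(points, centers, k):
    """points: (P, D+C); centers: (P1, D+C); returns (P1, k, D+C) neighborhoods."""
    # Pairwise distances between every center and every point (coords in cols 0-2).
    d = np.linalg.norm(centers[:, None, :3] - points[None, :, :3], axis=-1)
    # Indices of the k nearest points, already sorted (numbered) by distance.
    idx = np.argsort(d, axis=1)[:, :k]
    return points[idx]
```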
Step 3: extract local neighborhood features F1 with the improved expansion point convolution spatial pyramid pooling, obtaining P1 abstract point clouds;
The input of this step is a P1×K1×(D+C) matrix; the output is a P1×(D+C′) matrix, where C′ is the point feature dimension abstracted within the local neighborhood by the improved expansion point convolution spatial pyramid pooling.
Compared with conventional point convolution, the improved expansion point convolution enlarges the receptive field and obtains more scene context information without increasing the convolution computation cost. In addition, spatial pyramid pooling effectively encodes multi-scale scene content information. The method therefore combines the advantages of the improved expansion point convolution and spatial pyramid pooling, keeping the convolution computation cost in check while efficiently extracting local neighborhood features and context information. The extraction of point cloud local neighborhood features by the improved expansion point convolution spatial pyramid pooling comprises the following four steps: 1) improve the conventional expansion point convolution; 2) extract local neighborhood features with improved expansion point convolution channels of different expansion rates; 3) fuse the features of all channels; 4) reduce the dimension of the features. The specific extraction process is as follows:
1) replace the point convolution kernel function (an MLP) to obtain the improved expansion point convolution;
As shown in Fig. 2, the continuous convolution of a conventional expansion point convolution is defined as

$(H * G)(p_i) = \int H(p_j)\, G(p_j - p_i)\, dp_j$

where H is a continuous feature function that assigns a feature value to point $p_j$, and G is a continuous kernel function that maps the distance from $p_j$ to $p_i$ to a kernel weight. In most practical applications the feature function H is not completely known; using Monte Carlo integration, the continuous convolution definition can be approximately converted into

$(H * G)(p_i) \approx \frac{1}{|\mathcal{N}_d(p_i)|} \sum_{p_j \in \mathcal{N}_d(p_i)} H(p_j)\, G(p_j - p_i)$

where d is the expansion rate of the expansion point convolution and $\mathcal{N}_d(p_i)$ denotes the dilated neighborhood of $p_i$. Here the continuous kernel function G(·) is replaced with a multi-layer perceptron (MLP):

$G(p) \approx \mathrm{MLP}_\theta(p)$

where p is the relative position of the neighborhood point with respect to the center point (using the Euclidean distance), and θ is the set of parameters of the MLP.

To obtain more local neighborhood features, the local neighborhood point features are abstracted to a higher dimension to obtain richer information; the improved expansion point convolution therefore replaces the kernel function G(·) with PointNet (the PointNet feature extractor is described in detail in step 5):

$G(p) \approx \mathrm{PN}_{\theta'}(p)$

where PN is a PointNet network and θ′ is the set of parameters of PN.
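A hedged sketch of the improved expansion point convolution in its Monte Carlo form: every d-th of the k·d nearest neighbors forms the dilated neighborhood, and a small per-position MLP stands in for the kernel $\mathrm{PN}_{\theta'}$ (a simplification; all function names and weight shapes are assumptions, and the first three columns of each point are taken as coordinates).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dilated_neighborhood(points, center, k, d):
    """Keep every d-th of the k*d nearest neighbors of `center`."""
    dist = np.linalg.norm(points[:, :3] - center[:3], axis=1)
    idx = np.argsort(dist)[: k * d : d]     # k neighbors at dilation rate d
    return points[idx]

def improved_expansion_point_conv(points, center, k, d, w1, w2):
    """Monte Carlo form: mean_j H(p_j) * G(p_j - p_i) over the dilated neighborhood.
    w1: (3, hidden) and w2: (hidden, C) are toy kernel weights; C must match
    the feature width of the points (columns after the first three)."""
    neigh = dilated_neighborhood(points, center, k, d)
    rel = neigh[:, :3] - center[:3]         # relative positions p_j - p_i
    w = relu(relu(rel @ w1) @ w2)           # (k, C) kernel weights from the MLP
    feats = neigh[:, 3:]                    # H(p_j), shape (k, C)
    return (w * feats).mean(axis=0)         # (C,) aggregated local feature
```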
2) Extract local neighborhood features with each improved expansion point convolution;
The input of this step is a P1×K1×(D+C) matrix; the output is a P1×(D+Ci) matrix, where Ci is the point feature dimension abstracted within the local neighborhood by the i-th improved expansion point convolution.
As shown in FIG. 3, the spatial pyramid pooling has i channels, each of which carries an improved dilation point convolution with a dilation rate d1,d2,…,diThen, the content information P of i groups of local neighborhoods with different scales is obtained1×(D+Ci)。
3) Fuse the feature information extracted by each improved expansion point convolution;
The inputs of this step are the P1×(D+Ci) matrices; the output is a ΣP1×(D+Ci) matrix, where i is the number of spatial pyramid pooling channels, which is also the number of improved expansion point convolutions. Context information at different scales is fused by concatenation.
4) Reduce the dimension of the feature information.
The input of this step is the ΣP1×(D+Ci) matrix; the output is a P1×(D+C′) matrix. As the number of channels increases, the number of local neighborhood features grows i-fold, which raises the computation cost at the back end of the encoding layer. Therefore, a 1×1 convolution is applied to the fused feature information to reduce the feature dimensionality.
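Combining steps 2)-4), the pyramid pooling might look as sketched below; the per-channel convolution is passed in as a callable (e.g. `improved_expansion_point_conv` from the earlier sketch), the 1×1 convolution is realized as a per-point linear map, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def pyramid_pooling(points, centers, k, rates, kernels, w_reduce, conv):
    """rates: expansion rates [d1, ..., di]; kernels: matching [(w1, w2), ...]
    weight pairs; w_reduce: (sum(C_i), C') weights of the 1x1 convolution;
    conv: called as conv(points, center, k, d, w1, w2) -> (C_i,) feature."""
    rows = []
    for center in centers:
        # One improved expansion point convolution per pyramid channel.
        per_rate = [conv(points, center, k, d, w1, w2)
                    for d, (w1, w2) in zip(rates, kernels)]
        rows.append(np.concatenate(per_rate))   # splice the i channels
    fused = np.stack(rows)                      # (P1, sum(C_i))
    return fused @ w_reduce                     # 1x1 conv: reduce to C' dims
```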
Step 4: downsample and group the P1 abstract point clouds obtained in step 3.
The input of this step is a P1×(D+C′) matrix; the output is a P2×K2×(D+C′) matrix, where P2 is the number of center points in the second downsampling and K2 is the number of neighborhood points per center point in the second downsampling. Steps 1 and 2 are repeated on the P1 abstract point clouds to obtain P2 downsampled center points, each with K2 neighborhood points, yielding P2 local neighborhoods.
Step 5: extract local neighborhood features F2 with PointNet;
The input of this step is a P2×K2×(D+C′) matrix; the output is a P2×(D+C″) matrix, where C″ is the point feature dimension abstracted by PointNet within the local neighborhood.
PointNet [Document 1] is often used as a point cloud feature extractor: it guarantees permutation invariance over unordered point clouds and abstracts low-dimensional point features into rich high-dimensional semantic features, improving segmentation accuracy. The invention selects PointNet as the feature extractor for local neighborhood point clouds; as shown in Fig. 4, the PointNet feature extractor works as follows.
Given a set of unordered local neighborhood points $\{p_{l1}, p_{l2}, \dots, p_{ln}\}$, a function f can be defined that maps the point set to a vector:

$f(\{p_{l1}, \dots, p_{ln}\}) = \gamma\big(\max_{i=1,\dots,n} h(p_{li})\big)$

where γ and h are typically MLPs.
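A toy sketch of this mapping with single-layer γ and h; the weight shapes are assumptions.

```python
import numpy as np

def pointnet(points, wh, wg):
    """points: (n, c_in) unordered set; wh: (c_in, c_mid), wg: (c_mid, c_out)."""
    h = np.maximum(points @ wh, 0.0)     # shared per-point MLP h
    pooled = h.max(axis=0)               # symmetric max pooling: order-invariant
    return np.maximum(pooled @ wg, 0.0)  # gamma maps the pooled vector out
```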
Step 6: decode the P2 abstract point clouds carrying features F2 to obtain P1 abstract point clouds carrying features F3;
The input of this step is a P2×(D+C″) matrix; the output is a P1×(D+C‴) matrix, where C‴ is the feature dimension of the decoded point cloud. The decoding layer comprises three steps. 1) Interpolation upsampling: the P2 abstract point clouds are upsampled with an interpolation algorithm to obtain P1 abstract point clouds whose point cloud features are F2; this step inputs a P2×(D+C″) matrix and outputs a P1×(D+C″) matrix. 2) Skip-link feature fusion: through a skip link, the features F1 of the P1 abstract point clouds from step 3 are fused by concatenation with the features F2 of the upsampled P1 abstract point clouds; this step inputs a P1×(D+C′) matrix and a P1×(D+C″) matrix and outputs a P1×(D+C″+C′) matrix. 3) Decoding: the point cloud is decoded from the fused features using a unit PointNet [Document 1]; this step inputs a P1×(D+C″+C′) matrix and outputs a P1×(D+C‴) matrix.
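A hedged sketch of this decoding layer follows. The patent only states that an interpolation algorithm is used; the inverse-distance weighting over the three nearest coarse points below is the PointNet++ convention and is an assumption here, as are all names and shapes.

```python
import numpy as np

def interpolate_up(coarse_xyz, coarse_feat, fine_xyz, eps=1e-8):
    """Upsample coarse features onto fine points by inverse-distance weights."""
    d = np.linalg.norm(fine_xyz[:, None] - coarse_xyz[None], axis=-1)  # (P1, P2)
    idx = np.argsort(d, axis=1)[:, :3]                  # 3 nearest coarse points
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + eps)
    w /= w.sum(axis=1, keepdims=True)
    return (coarse_feat[idx] * w[..., None]).sum(axis=1)  # (P1, C'')

def decode_layer(coarse_xyz, f2, fine_xyz, f1, w_unit):
    """Interpolation upsampling, skip-link concatenation, then a unit PointNet
    (a shared per-point MLP, here a single linear layer with ReLU)."""
    up = interpolate_up(coarse_xyz, f2, fine_xyz)       # (P1, C'')
    fused = np.concatenate([up, f1], axis=1)            # skip link: (P1, C''+C')
    return np.maximum(fused @ w_unit, 0.0)              # (P1, C''')
```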
Step 7: decode the P1 abstract point clouds carrying features F3 to obtain P abstract point clouds carrying features F4;
The input of this step is a P1×(D+C‴) matrix; the output is a P×(D+C⁗) matrix, where C⁗ is the feature dimension of the decoded point cloud. The decoding process follows step 6, with the following data flow: 1) interpolation upsampling: the input is a P1×(D+C‴) matrix; the output is a P×(D+C‴) matrix. 2) skip-link feature fusion: the inputs are a P×(D+C‴) matrix and a P×(D+C) matrix; the output is a P×(D+C‴+C) matrix. 3) decoding: the input is a P×(D+C‴+C) matrix; the output is a P×(D+C⁗) matrix.
Step 8: obtain the label of each point through fully connected layers.
The input of this step is a P×(D+C⁗) matrix; the output is a P×k matrix, where k is the number of scene point cloud categories.
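A toy sketch of this final step with a single fully connected layer; weights and names are assumptions.

```python
import numpy as np

def classify_points(features, w_fc, b_fc):
    """features: (P, D+C'''') decoded per-point features; w_fc: (D+C'''', k);
    returns the per-point label in {0, ..., k-1}."""
    logits = features @ w_fc + b_fc   # fully connected layer -> (P, k) scores
    return logits.argmax(axis=1)      # label = highest-scoring category
```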
[Document 1] Charles R. Q., Hao S., Mo K., et al. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation [C] // IEEE Conference on Computer Vision & Pattern Recognition, 2017.
Claims (5)
1. A point cloud semantic segmentation method based on an expansion point convolution space pyramid is characterized by comprising the following steps:
Step 1: obtain the center points of the point cloud subsets from the ScanNet dataset point cloud with a farthest point sampling algorithm;
The input point cloud of the network is P = {p1, p2, p3, …, pn}. A subset Psub = {pi1, pi2, pi3, …, pim} is selected from the input point cloud using an iterative farthest point sampling algorithm, such that each pij is the point farthest from the other points in the subset;
Step 2: determine the range of each point cloud subset with the nearest neighbor algorithm, based on the subset center points obtained in step 1;
The inputs are a P×(D+C) matrix and a P1×D matrix; the output is a P1×K1×(D+C) matrix, where P is the number of input points, P1 is the number of sampled center points, D is the dimension of each point's coordinate information, C is the dimension of each point's feature information, and K1 is the number of neighborhood points per center point;
Step 3: extract local neighborhood features F1 with the improved expansion point convolution spatial pyramid pooling, obtaining P1 abstract point clouds;
The input is a P1×K1×(D+C) matrix; the output is a P1×(D+C′) matrix, where C′ is the point feature dimension abstracted within the local neighborhood by the improved expansion point convolution spatial pyramid pooling;
Step 4: downsample and group the P1 abstract point clouds obtained in step 3.
The input of this step is a P1×(D+C′) matrix; the output is a P2×K2×(D+C′) matrix, where P2 is the number of center points in the second downsampling and K2 is the number of neighborhood points per center point in the second downsampling. Steps 1 and 2 are repeated on the P1 abstract point clouds to obtain P2 downsampled center points, each with K2 neighborhood points, yielding P2 local neighborhoods;
Step 5: extract local neighborhood features F2 with PointNet;
The input is a P2×K2×(D+C′) matrix; the output is a P2×(D+C″) matrix, where C″ is the point feature dimension abstracted by PointNet within the local neighborhood;
Step 6: decode the P2 abstract point clouds carrying features F2 to obtain P1 abstract point clouds carrying features F3;
The input is a P2×(D+C″) matrix; the output is a P1×(D+C‴) matrix, where C‴ is the feature dimension of the decoded point cloud;
Step 7: decode the P1 abstract point clouds carrying features F3 to obtain P abstract point clouds carrying features F4;
The input is a P1×(D+C‴) matrix; the output is a P×(D+C⁗) matrix, where C⁗ is the feature dimension of the decoded point cloud;
Step 8: obtain the label of each point through fully connected layers;
The input is a P×(D+C⁗) matrix; the output is a P×k matrix, where k is the number of scene point cloud categories.
2. The method of claim 1, wherein the extraction of the point cloud local neighborhood features by the improved expansion point convolution spatial pyramid pooling comprises the following four steps: 1) improve the conventional expansion point convolution; 2) extract local neighborhood features with improved expansion point convolution channels of different expansion rates; 3) fuse the features of all channels; 4) reduce the dimension of the features;
the specific extraction process is as follows:
1) replace the point convolution kernel function (an MLP) to obtain the improved expansion point convolution;
The continuous convolution of a conventional expansion point convolution is defined as

$(H * G)(p_i) = \int H(p_j)\, G(p_j - p_i)\, dp_j$

where H is a continuous feature function that assigns a feature value to point $p_j$, and G is a continuous kernel function that maps the distance from $p_j$ to $p_i$ to a kernel weight. Using Monte Carlo integration, the continuous convolution definition is converted into

$(H * G)(p_i) \approx \frac{1}{|\mathcal{N}_d(p_i)|} \sum_{p_j \in \mathcal{N}_d(p_i)} H(p_j)\, G(p_j - p_i)$

where d is the expansion rate of the expansion point convolution and $\mathcal{N}_d(p_i)$ denotes the dilated neighborhood of $p_i$. The continuous kernel function G(·) is replaced with a multi-layer perceptron:

$G(p) \approx \mathrm{MLP}_\theta(p)$

where p is the relative position of the neighborhood point with respect to the center point (using the Euclidean distance), and θ is the set of parameters of the MLP.

To obtain more local neighborhood features, the local neighborhood point features are abstracted to a higher dimension to obtain richer information; the improved expansion point convolution therefore replaces the kernel function G(·) with PointNet:

$G(p) \approx \mathrm{PN}_{\theta'}(p)$

where PN is a PointNet network and θ′ is the set of parameters of PN;
2) extract local neighborhood features with each improved expansion point convolution;
The input is a P1×K1×(D+C) matrix; the output is a P1×(D+Ci) matrix, where Ci is the point feature dimension abstracted within the local neighborhood by the i-th improved expansion point convolution;
The spatial pyramid pooling has i channels; each channel carries an improved expansion point convolution with expansion rate d1, d2, …, di respectively, yielding i groups of local neighborhood context information P1×(D+Ci) at different scales;
3) fuse the feature information extracted by each improved expansion point convolution;
The inputs are the P1×(D+Ci) matrices; the output is a ΣP1×(D+Ci) matrix, where i is the number of spatial pyramid pooling channels, which is also the number of improved expansion point convolutions. Context information at different scales is fused by concatenation;
4) reduce the dimension of the feature information;
The input is the ΣP1×(D+Ci) matrix; the output is a P1×(D+C′) matrix.
3. The method of claim 2, wherein PointNet is selected as the feature extractor of the local neighborhood point cloud, and the working principle of the PointNet feature extractor is as follows:
Given a set of unordered local neighborhood points $\{p_{l1}, p_{l2}, \dots, p_{ln}\}$, a function f can be defined that maps the point set to a vector:

$f(\{p_{l1}, \dots, p_{ln}\}) = \gamma\big(\max_{i=1,\dots,n} h(p_{li})\big)$

where γ and h are typically MLPs.
4. The method of claim 1, wherein the decoding process in step 6 is as follows:
1) interpolation upsampling: the P2 abstract point clouds are upsampled with an interpolation algorithm to obtain P1 abstract point clouds whose point cloud features are F2. The input of this step is a P2×(D+C″) matrix; the output is a P1×(D+C″) matrix;
2) skip-link feature fusion: through a skip link, the features F1 of the P1 abstract point clouds from step 3 are fused by concatenation with the features F2 of the upsampled P1 abstract point clouds. The inputs of this step are a P1×(D+C′) matrix and a P1×(D+C″) matrix; the output is a P1×(D+C″+C′) matrix;
3) decoding: the point cloud is decoded from the fused features using a unit PointNet;
The input of this step is a P1×(D+C″+C′) matrix; the output is a P1×(D+C‴) matrix.
5. The method of claim 4, wherein the decoding process in step 7 is as follows:
1) interpolation upsampling: the input is a P1×(D+C‴) matrix; the output is a P×(D+C‴) matrix;
2) skip-link feature fusion: the inputs are a P×(D+C‴) matrix and a P×(D+C) matrix; the output is a P×(D+C‴+C) matrix;
3) decoding: the input is a P×(D+C‴+C) matrix; the output is a P×(D+C⁗) matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911048539.3A CN111027559A (en) | 2019-10-31 | 2019-10-31 | Point cloud semantic segmentation method based on expansion point convolution space pyramid pooling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911048539.3A CN111027559A (en) | 2019-10-31 | 2019-10-31 | Point cloud semantic segmentation method based on expansion point convolution space pyramid pooling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111027559A true CN111027559A (en) | 2020-04-17 |
Family
ID=70200756
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911048539.3A Pending CN111027559A (en) | 2019-10-31 | 2019-10-31 | Point cloud semantic segmentation method based on expansion point convolution space pyramid pooling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111027559A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860138A (en) * | 2020-06-09 | 2020-10-30 | 中南民族大学 | Three-dimensional point cloud semantic segmentation method and system based on full-fusion network |
CN112149725A (en) * | 2020-09-18 | 2020-12-29 | 南京信息工程大学 | Spectral domain graph convolution 3D point cloud classification method based on Fourier transform |
CN112418235A (en) * | 2020-11-20 | 2021-02-26 | 中南大学 | Point cloud semantic segmentation method based on expansion nearest neighbor feature enhancement |
CN112560965A (en) * | 2020-12-18 | 2021-03-26 | 中国科学院深圳先进技术研究院 | Image semantic segmentation method, storage medium and computer device |
CN112819833A (en) * | 2021-02-05 | 2021-05-18 | 四川大学 | Large scene point cloud semantic segmentation method |
CN112967296A (en) * | 2021-03-10 | 2021-06-15 | 重庆理工大学 | Point cloud dynamic region graph convolution method, classification method and segmentation method |
CN113378112A (en) * | 2021-06-18 | 2021-09-10 | 浙江工业大学 | Point cloud completion method and device based on anisotropic convolution |
CN113392841A (en) * | 2021-06-03 | 2021-09-14 | 电子科技大学 | Three-dimensional point cloud semantic segmentation method based on multi-feature information enhanced coding |
CN113486963A (en) * | 2021-07-12 | 2021-10-08 | 厦门大学 | Density self-adaptive point cloud end-to-end sampling method |
CN114693932A (en) * | 2022-04-06 | 2022-07-01 | 南京航空航天大学 | Large aircraft large component point cloud semantic segmentation method |
CN115496910A (en) * | 2022-11-07 | 2022-12-20 | 中国测绘科学研究院 | Point cloud semantic segmentation method based on full-connected graph coding and double-expansion residual error |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108319957A (en) * | 2018-02-09 | 2018-07-24 | 深圳市唯特视科技有限公司 | A kind of large-scale point cloud semantic segmentation method based on overtrick figure |
CN108345831A (en) * | 2017-12-28 | 2018-07-31 | 新智数字科技有限公司 | The method, apparatus and electronic equipment of Road image segmentation based on point cloud data |
CN109410307A (en) * | 2018-10-16 | 2019-03-01 | 大连理工大学 | A kind of scene point cloud semantic segmentation method |
US20190147302A1 (en) * | 2017-11-10 | 2019-05-16 | Nvidia Corp. | Bilateral convolution layer network for processing point clouds |
US10408939B1 (en) * | 2019-01-31 | 2019-09-10 | StradVision, Inc. | Learning method and learning device for integrating image acquired by camera and point-cloud map acquired by radar or LiDAR corresponding to image at each of convolution stages in neural network and testing method and testing device using the same |
CN110223348A (en) * | 2019-02-25 | 2019-09-10 | 湖南大学 | Robot scene adaptive bit orientation estimation method based on RGB-D camera |
- 2019-10-31 CN CN201911048539.3A patent/CN111027559A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190147302A1 (en) * | 2017-11-10 | 2019-05-16 | Nvidia Corp. | Bilateral convolution layer network for processing point clouds |
CN108345831A (en) * | 2017-12-28 | 2018-07-31 | 新智数字科技有限公司 | The method, apparatus and electronic equipment of Road image segmentation based on point cloud data |
CN108319957A (en) * | 2018-02-09 | 2018-07-24 | 深圳市唯特视科技有限公司 | A kind of large-scale point cloud semantic segmentation method based on overtrick figure |
CN109410307A (en) * | 2018-10-16 | 2019-03-01 | 大连理工大学 | A kind of scene point cloud semantic segmentation method |
US10408939B1 (en) * | 2019-01-31 | 2019-09-10 | StradVision, Inc. | Learning method and learning device for integrating image acquired by camera and point-cloud map acquired by radar or LiDAR corresponding to image at each of convolution stages in neural network and testing method and testing device using the same |
CN110223348A (en) * | 2019-02-25 | 2019-09-10 | 湖南大学 | Robot scene adaptive bit orientation estimation method based on RGB-D camera |
Non-Patent Citations (3)
Title |
---|
HONGSHAN YU 等: "Methods and datasets on semantic segmentation: A review", 《NEUROCOMPUTING》 * |
YUAN WANG 等: "PointSeg: Real-Time Semantic Segmentation Based on 3D LiDAR Point Cloud", 《ARXIV》 * |
张祥甫 等: "基于深度学习的语义分割问题研究综述", 《激光与光电子学进展》 * |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860138A (en) * | 2020-06-09 | 2020-10-30 | 中南民族大学 | Three-dimensional point cloud semantic segmentation method and system based on full-fusion network |
CN111860138B (en) * | 2020-06-09 | 2024-03-01 | 中南民族大学 | Three-dimensional point cloud semantic segmentation method and system based on full fusion network |
CN112149725A (en) * | 2020-09-18 | 2020-12-29 | 南京信息工程大学 | Spectral domain graph convolution 3D point cloud classification method based on Fourier transform |
CN112149725B (en) * | 2020-09-18 | 2023-08-22 | 南京信息工程大学 | Fourier transform-based spectrum domain map convolution 3D point cloud classification method |
CN112418235A (en) * | 2020-11-20 | 2021-02-26 | 中南大学 | Point cloud semantic segmentation method based on expansion nearest neighbor feature enhancement |
CN112560965B (en) * | 2020-12-18 | 2024-04-05 | 中国科学院深圳先进技术研究院 | Image semantic segmentation method, storage medium and computer device |
CN112560965A (en) * | 2020-12-18 | 2021-03-26 | 中国科学院深圳先进技术研究院 | Image semantic segmentation method, storage medium and computer device |
CN112819833A (en) * | 2021-02-05 | 2021-05-18 | 四川大学 | Large scene point cloud semantic segmentation method |
CN112819833B (en) * | 2021-02-05 | 2022-07-12 | 四川大学 | Large scene point cloud semantic segmentation method |
CN112967296A (en) * | 2021-03-10 | 2021-06-15 | 重庆理工大学 | Point cloud dynamic region graph convolution method, classification method and segmentation method |
CN112967296B (en) * | 2021-03-10 | 2022-11-15 | 重庆理工大学 | Point cloud dynamic region graph convolution method, classification method and segmentation method |
CN113392841A (en) * | 2021-06-03 | 2021-09-14 | 电子科技大学 | Three-dimensional point cloud semantic segmentation method based on multi-feature information enhanced coding |
CN113378112A (en) * | 2021-06-18 | 2021-09-10 | 浙江工业大学 | Point cloud completion method and device based on anisotropic convolution |
CN113486963A (en) * | 2021-07-12 | 2021-10-08 | 厦门大学 | Density self-adaptive point cloud end-to-end sampling method |
CN113486963B (en) * | 2021-07-12 | 2023-07-07 | 厦门大学 | Point cloud end-to-end sampling method with self-adaptive density |
CN114693932A (en) * | 2022-04-06 | 2022-07-01 | 南京航空航天大学 | Large aircraft large component point cloud semantic segmentation method |
CN115496910B (en) * | 2022-11-07 | 2023-04-07 | 中国测绘科学研究院 | Point cloud semantic segmentation method based on full-connected graph coding and double-expansion residual error |
CN115496910A (en) * | 2022-11-07 | 2022-12-20 | 中国测绘科学研究院 | Point cloud semantic segmentation method based on full-connected graph coding and double-expansion residual error |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111027559A (en) | Point cloud semantic segmentation method based on expansion point convolution space pyramid pooling | |
Qiu et al. | Semantic segmentation for real point cloud scenes via bilateral augmentation and adaptive fusion | |
Li et al. | Deep learning for remote sensing image classification: A survey | |
CN104036012B (en) | Dictionary learning, vision bag of words feature extracting method and searching system | |
Lin et al. | Local and global encoder network for semantic segmentation of Airborne laser scanning point clouds | |
CN113240683B (en) | Attention mechanism-based lightweight semantic segmentation model construction method | |
CN114792372A (en) | Three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention | |
CN113870286B (en) | Foreground segmentation method based on multi-level feature and mask fusion | |
CN111652273A (en) | Deep learning-based RGB-D image classification method | |
CN114299285A (en) | Three-dimensional point cloud semi-automatic labeling method and system, electronic equipment and storage medium | |
CN115272696A (en) | Point cloud semantic segmentation method based on self-adaptive convolution and local geometric information | |
CN116129118A (en) | Urban scene laser LiDAR point cloud semantic segmentation method based on graph convolution | |
CN116028663A (en) | Three-dimensional data engine platform | |
Zheng et al. | Person re-identification in the 3D space | |
Hazer et al. | Deep learning based point cloud processing techniques | |
Han et al. | A Large-Scale Network Construction and Lightweighting Method for Point Cloud Semantic Segmentation | |
Bashmal et al. | Language Integration in Remote Sensing: Tasks, datasets, and future directions | |
CN117765258A (en) | Large-scale point cloud semantic segmentation method based on density self-adaption and attention mechanism | |
CN111597367B (en) | Three-dimensional model retrieval method based on view and hash algorithm | |
CN116503746A (en) | Infrared small target detection method based on multilayer nested non-full-mapping U-shaped network | |
Wang et al. | Hierarchical Kernel Interaction Network for Remote Sensing Object Counting | |
Tan et al. | 3D detection transformer: Set prediction of objects using point clouds | |
CN115497085A (en) | Point cloud completion method and system based on multi-resolution dual-feature folding | |
Zhu et al. | Gradient-based graph attention for scene text image super-resolution | |
Huang et al. | Remote sensing data detection based on multiscale fusion and attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20200417 |