CN109410307A

CN109410307A - A scene point cloud semantic segmentation method

Info

Publication number: CN109410307A
Application number: CN201811204443.7A
Authority: CN
Inventors: 李坤; 杨鑫; 尹宝才; 张强; 魏小鹏
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2018-10-16
Filing date: 2018-10-16
Publication date: 2019-03-01
Anticipated expiration: 2038-10-16
Also published as: CN109410307B

Abstract

The invention belongs to the technical field of computer vision and provides a scene point cloud semantic segmentation method. It designs a deep-learning-based framework for semantic segmentation of large-scale dense scene point clouds. For an input large-scale dense scene point cloud, the three-dimensional information of the point cloud is converted, without loss of information, into two-dimensional information that convolution can process directly, and the point cloud semantic segmentation task is completed in combination with image semantic segmentation techniques. Under this framework, the semantic segmentation task for large-scale dense scene point clouds can be solved effectively. The semantic segmentation results obtained by the method can be used directly in tasks such as robot navigation and autonomous driving. The method is particularly effective in natural scenes that are not artificially synthesized.

Description

A scene point cloud semantic segmentation method

Technical field

The invention belongs to the technical field of computer vision, and in particular relates to a deep-learning-based method for semantic segmentation of large-scale dense point cloud scenes.

Background art

Modern computer vision is dominated by methods that process two-dimensional images with convolutional neural networks. A key factor in their success is that convolution can be carried out efficiently on images: in an image, convolution is defined on a regular grid, and a regular grid supports extremely efficient implementations of the convolution operation. This property makes it possible to process large, high-resolution datasets with powerful deep architectures.

When analyzing large-scale three-dimensional scenes, the direct extension of the above approach is three-dimensional convolution on a voxel grid. However, voxel-based methods have significant limitations, including cubic growth in memory consumption and poor computational efficiency. For this reason, voxel-based convolutional neural networks mostly run on low-resolution voxel grids, which limits their prediction accuracy. These problems can be alleviated by octree-based techniques, which define convolution on an octree and can handle somewhat higher-resolution data. However, this is still not enough to analyze large-scale three-dimensional scenes efficiently.

Data captured by 3D sensors such as RGB-D cameras and LiDAR typically represent the surfaces of objects, that is, two-dimensional structures embedded in three-dimensional space. This contrasts with truly volumetric three-dimensional data, such as medical images. Classical features for analyzing such data treat the point cloud as a latent surface structure of the object rather than as voxels.

The drawbacks of voxel-based three-dimensional data analysis are obvious. Some recent studies argue that voxel-based data structures are not the most natural form for three-dimensional convolution, and propose alternatives based on unordered point sets, graph structures, and spherical surface structures. Unfortunately, these methods have drawbacks of their own, such as limited sensitivity to local structure or reliance on restrictive topological assumptions.

(1) Three-dimensional point cloud semantic segmentation

Scene understanding for three-dimensional data, including point cloud semantic segmentation, has a long history in computer vision. Pioneering methods were based on hand-crafted features and were applied to aerial LiDAR data. These methods can also be combined with higher-level architectures. Popular pipelines make use of graphical models, including conditional random fields. Methods for interactive point cloud semantic segmentation have also been proposed in recent years.

(2) Development of deep learning on three-dimensional data

In recent years, the deep learning revolution in computer vision has spread to three-dimensional data analysis, and a number of deep learning methods for processing three-dimensional data have been proposed.

A common representation of three-dimensional data for deep learning is the voxel grid. However, its cubic time and space complexity means these methods can only run at low resolution, so their accuracy is limited. To overcome this limitation, researchers have proposed representations based on hierarchical spatial data structures, such as octrees and Kd-trees, which are more efficient in storage and computation and can therefore handle higher-resolution data.

Other deep learning networks take RGB-D images as input and process them with fully convolutional networks or graph-based neural networks, but they are generally unsuitable for unstructured point clouds whose sensor viewpoints are unknown. To address this problem, Boulch et al. render images from the point cloud using randomly placed virtual cameras and train convolutional neural networks on these images. In more controlled settings with fixed camera viewpoints, multi-view methods have been used successfully for shape segmentation, shape recognition, and shape synthesis.

Qi et al. proposed a network for analyzing unordered point clouds that processes each point independently and aggregates contextual information with max pooling. However, because communication between points is very weak, this approach runs into many difficulties when applied to large-scale scenes with complex topology.

Summary of the invention

To solve the technical problems that traditional point cloud scene understanding is limited by data resolution, that local features are not robust enough, and that large-scale dense point clouds are difficult to process, the invention designs a deep-learning-based framework for semantic segmentation of large-scale dense scene point clouds. For an input large-scale dense scene point cloud, the three-dimensional information of the point cloud is converted, without loss of information, into two-dimensional information that convolution can process directly, and the point cloud semantic segmentation task is completed in combination with image semantic segmentation techniques. Under this framework, the semantic segmentation task for large-scale dense scene point clouds can be solved effectively.

Technical solution of the present invention:

A scene point cloud semantic segmentation method, with the following steps:

(1) Construction of local coordinate-plane convolution: to build two-dimensional convolution directly on the point cloud, so that the model can extract robust local features of the point cloud at low computational complexity, the invention projects the point cloud onto the three coordinate planes obtained by decomposing the point cloud with PCA, and builds a convolution module on each of the three coordinate planes to extract local features of the point cloud. The local coordinate-plane convolution module is described in detail below.

(1.1) Local coordinate plane estimation:

For each point p in the point cloud, its local coordinate planes are first estimated by local covariance analysis. Specifically, let q range over the points in the ball ||p - q|| < R. For the tangent plane, the orientation of the tangent plane at p is determined by the eigenvectors of the covariance matrix ∑_q r r^T, where r = q - p. The eigenvector with the smallest eigenvalue gives the normal vector n_p of the tangent plane, and the other two eigenvectors i and j give the directions of the two coordinate axes of the tangent plane.
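The covariance analysis in step (1.1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's implementation: the function name `local_frame` is hypothetical, the neighbor search is brute force rather than a Kd-tree, and assigning i to the largest eigenvalue and j to the middle one is an assumption, since the text only says the two remaining eigenvectors span the tangent plane.

```python
import numpy as np

def local_frame(points, p, radius):
    """Estimate the normal n_p and in-plane axes i, j at point p (hypothetical helper)."""
    # Offsets r = q - p for every q inside the ball ||p - q|| < R.
    r = points - p
    r = r[np.linalg.norm(r, axis=1) < radius]
    # Covariance matrix sum_q r r^T, a symmetric 3 x 3 matrix.
    cov = r.T @ r
    # eigh returns eigenvalues of a symmetric matrix in ascending order,
    # with the matching eigenvectors as columns.
    _, eigvecs = np.linalg.eigh(cov)
    n_p = eigvecs[:, 0]  # smallest eigenvalue: tangent-plane normal
    j = eigvecs[:, 1]    # assumed ordering of the two in-plane axes
    i = eigvecs[:, 2]
    return n_p, i, j
```

For points sampled from a flat patch, the returned normal is perpendicular to the patch up to sign, because an eigenvector is only defined up to its direction.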

(1.2) Local coordinate plane convolution:

To extract local information by convolution on the point cloud, the three coordinate planes of each point are needed. Let q denote the set of points within the ball of radius R around p, and project q onto each of the three coordinate planes of p. For each point p, define F(p) as a function of p that encodes color, geometric features, or abstract features from an intermediate layer of the network. To construct the convolution on the tangent plane π_p of point p, define S(u) as the continuous signal at position u in the tangent plane and c(u) as the convolution kernel parameter at position u, where u ∈ R².

The convolution at point p is therefore defined as the integral of c(u)·S(u) over the tangent plane π_p (formula (1)).

(1.3) Signal interpolation:

For the tangent plane, the goal of signal interpolation is to use the signals F(q) of the neighborhood point set q of p to estimate the signal S(u) at each position of the tangent plane that takes part in the convolution. First project q onto the tangent plane of p, producing the projected point set v = (r^T i, r^T j), and define:

S(v) = F(q) (2)

The points v are thus scattered over the image plane, so their signals are interpolated to estimate S(u) at each position that takes part in the convolution:

S(u) = ∑_v(w(u,v)·S(v)) (3)

Here w(u, v) is the weight of the interpolation kernel and satisfies ∑_v w = 1. The invention uses a fairly simple interpolation scheme, nearest-neighbor (NN) interpolation, in which w(u, v) is 1 when v is the nearest neighbor of u and 0 otherwise (formula (4)).

Substituting the interpolated signal into the convolution gives the final tangent-plane convolution at point p: the integral over π_p of c(u)·∑_v(w(u,v)·S(v)) (formula (5)).

Note that the role of the tangent plane here becomes increasingly implicit: it provides the domain for u and the basis for deriving the weights w, but it does not need to be maintained explicitly. This allows the method to build deep networks on point clouds with millions of points.
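Steps (1.2) and (1.3) can be illustrated for a single plane and a single output channel. The sketch below is written under stated assumptions: the integral is replaced by a sum over a regular k × k grid of kernel positions u, the grid half-width `extent` and the function name `plane_conv_nn` are hypothetical, and nearest neighbors are found by brute force.

```python
import numpy as np

def plane_conv_nn(r, feats, i, j, kernel, extent):
    """Nearest-neighbor plane convolution at one point (illustrative sketch).

    r: (m, 3) offsets q - p; feats: (m, d) signals F(q);
    i, j: unit axes spanning the plane; kernel: (k, k) weights c(u);
    extent: assumed half-width of the kernel footprint in the plane.
    """
    # Project the neighbors into the plane: v = (r^T i, r^T j).
    v = np.stack([r @ i, r @ j], axis=1)
    k = kernel.shape[0]
    # Kernel sample positions u on a regular k x k grid (discretization assumption).
    ticks = np.linspace(-extent, extent, k)
    u = np.stack(np.meshgrid(ticks, ticks, indexing="ij"), axis=-1).reshape(-1, 2)
    # NN interpolation: w(u, v) is 1 for the closest v and 0 otherwise,
    # so S(u) is simply the signal of the nearest projected point.
    dist2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(axis=2)
    s_u = feats[dist2.argmin(axis=1)]  # shape (k * k, d)
    # Discrete counterpart of the integral of c(u) * S(u) over the plane.
    return kernel.reshape(-1) @ s_u
```

Under this reading, the three coordinate planes of p would be handled by calling the function once per pair of axes, for example (i, j), (i, n_p), and (j, n_p); which pairs the patent actually uses is not stated in the text.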

(1.4) Pooling layer:

Convolutional networks usually use pooling layers to aggregate signals over larger spatial regions. The invention implements pooling by hashing the point cloud signals onto a regular 3D grid. For the points that fall into the same grid cell, the signals are aggregated by average pooling. Consider a point set P = {p} and the corresponding signals {F(p)}. Let g denote a voxel grid cell and let V_g denote the set of points in P that are hashed into g. Assuming V_g is non-empty, the information of all its points is aggregated onto a single point by average pooling: the pooled signal is (1/|V_g|)·∑_{p∈V_g} F(p) (formula (6)).
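The grid-hash pooling described above can be sketched as follows. The function name `voxel_average_pool` and the cell size parameter `cell` are hypothetical; the text does not specify how the grid resolution is chosen or where the pooled point is placed, so the sketch returns only the occupied cell indices and their averaged signals.

```python
import numpy as np

def voxel_average_pool(points, feats, cell):
    """Average the signals of all points that hash into the same grid cell."""
    # Integer cell index of every point on a regular 3D grid.
    keys = np.floor(points / cell).astype(np.int64)
    # One row per non-empty cell g, plus the cell id of every point.
    cells, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    # |V_g| for every occupied cell.
    counts = np.bincount(inverse, minlength=len(cells))
    # Sum the signals per cell, then divide by the point count.
    pooled = np.zeros((len(cells), feats.shape[1]))
    np.add.at(pooled, inverse, feats)
    return cells, pooled / counts[:, None]
```

Only non-empty cells appear in the output, which matches the assumption in the text that V_g is non-empty.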

(2) Point cloud semantic segmentation module:

(2.1) Module input:

The input to the module is a large-scale dense scene point cloud, indoor or outdoor, with no limit on the number of points. The input features of the point cloud contain RGB and XYZ information, which is converted by preprocessing into RGB, D (depth), H (height), and N (normal vector) as the input features.

(2.2) Module architecture:

The point cloud semantic segmentation module is a convolutional neural network with an autoencoder (encoder-decoder) structure. Its role is to predict the semantic information of the input point cloud, as follows:

I_out = f_seg(I_in; θ_f) (7)

In the formula above, I_out is the network's prediction of the semantic information of the point cloud over n classes, and I_in is the input scene point cloud containing RGBDHN information. f_seg(·) denotes the convolutional neural network with the autoencoder structure, and θ_f denotes the weight parameters of the network model. The encoder of the network contains 2 pooling layers, whose purpose is to aggregate the features output by the convolution modules and to reduce the spatial dimension of the features. Each pooling layer is preceded by 3 convolution modules that capture local information of the point cloud. The decoder restores the spatial dimension of the features through upsampling layers, and each upsampling layer is likewise preceded by 3 convolution modules. Two skip connections are added between corresponding layers of the encoder and decoder so that the network can better predict the details of the target.

In each convolution module, local coordinate-plane convolution projects the input features onto three planes, which triples the number of feature channels and makes the features redundant. Therefore, after the local coordinate-plane convolution, a 1 × 1 convolution first expands the number of feature channels by a further factor of 2; a depthwise separable convolution (n single-channel kernels convolved one-to-one with the n channels of the input features) then decouples the redundant features; and finally a 1 × 1 convolution kernel fuses and compresses the features.
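The channel bookkeeping of one convolution module can be traced with a shape-level sketch. This is one possible reading of the paragraph above and of the Fig. 3 description, not a confirmed layout: it assumes each plane contributes an n × 3 × 3 × d tensor, that the depthwise step collapses the 3 × 3 footprint, and it omits biases, normalization, and nonlinearities, none of which are specified in the text. The function name `conv_module` and all weight names are hypothetical.

```python
import numpy as np

def conv_module(planes, w_expand, w_depth, w_fuse):
    """Shape-level sketch of one convolution module (assumed layout).

    planes:   three (n, 3, 3, d) tensors, one per coordinate plane
    w_expand: (3d, 6d) weights of the 1 x 1 convolution that doubles the channels
    w_depth:  (3, 3, 6d) one single-channel 3 x 3 kernel per channel
    w_fuse:   (6d, d_out) weights of the final 1 x 1 convolution
    """
    # Concatenate the three planes along the channel axis: (n, 3, 3, 3d).
    x = np.concatenate(planes, axis=-1)
    # A 1 x 1 convolution is a per-position matrix product: (n, 3, 3, 6d).
    x = x @ w_expand
    # Depthwise step: each channel is convolved only with its own kernel,
    # reducing the 3 x 3 footprint to one value per channel: (n, 6d).
    x = np.einsum("nhwc,hwc->nc", x, w_depth)
    # The final 1 x 1 convolution fuses and compresses the channels: (n, d_out).
    return x @ w_fuse
```

With d = 4 input channels per plane, the channel count goes 12, then 24, then down to whatever `d_out` is chosen for the final 1 × 1 convolution.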

(3) Training method

This patent uses the Semantic3D outdoor point cloud scene dataset, which contains 8 classes, and the S3DIS indoor scene dataset, which contains 13 classes. The model is trained in a data-driven manner. To address the shortage of three-dimensional data, the scene point clouds in the Semantic3D and S3DIS datasets are each rotated horizontally 10 times, increasing the number of samples tenfold.
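The horizontal-rotation augmentation can be sketched as a rotation about the vertical axis. The text does not state the rotation angles or which axis is vertical, so the evenly spaced angles and the choice of z as the up axis below are assumptions, and `rotate_horizontally` is a hypothetical name.

```python
import numpy as np

def rotate_horizontally(xyz, n_copies=10):
    """Return n_copies versions of an (m, 3) point array rotated about the z axis."""
    copies = []
    for k in range(n_copies):
        # Evenly spaced angles over a full turn (assumption, not from the source).
        theta = 2.0 * np.pi * k / n_copies
        c, s = np.cos(theta), np.sin(theta)
        rot = np.array([[c, -s, 0.0],
                        [s, c, 0.0],
                        [0.0, 0.0, 1.0]])
        # Rotate every point; heights (z) are unchanged.
        copies.append(xyz @ rot.T)
    return copies
```

Per-point colors and labels are unaffected by the rotation and can be reused for each copy; normals, if precomputed, would need the same rotation applied.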

The autoencoder-structured convolutional neural network in the point cloud semantic segmentation module is trained with backpropagation and stochastic gradient descent. For an input scene point cloud, a cross-entropy with class weights is used as the loss function L_seg. The weights are calculated with formula (8): the weight w_i of class i is the negative of the logarithm of the ratio of D_i, the number of points in the sample that belong to class i, to the number of points D_k summed over all classes k in the sample, that is, w_i = −log(D_i / ∑_k D_k). This alleviates the class imbalance in the dataset and prevents the classes with the most points from dominating the training of the network.
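The class weighting described for formula (8) is short enough to check by hand. The sketch below follows the verbal definition (negative logarithm of each class's share of the points); the natural logarithm is an assumption, since the base is not stated, and `class_weights` is a hypothetical name.

```python
import math
from collections import Counter

def class_weights(labels):
    """Weight of class i: -log(D_i / sum_k D_k), from per-point class labels."""
    counts = Counter(labels)          # D_i for every class present in the sample
    total = sum(counts.values())      # sum over all classes of D_k
    return {c: -math.log(n / total) for c, n in counts.items()}
```

For a sample with three points of class 0 and one point of class 1, class 1 receives the larger weight (log 4 versus log 4/3), which is the intended effect on rare classes. A class that covers every point gets weight 0 under this definition.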

The overall network error is calculated using formula (9), where N is the number of points in the scene point cloud, y_l is the score that the output for point l assigns to its true class, and w_l is the weight of the class to which point l belongs.
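Formula (9) itself is not reproduced in this text, so the following is only a plausible form of a class-weighted cross-entropy that is consistent with the symbols N, y_l, and w_l defined above. It assumes y_l is a probability in (0, 1], for example a softmax output at the true class, and that the loss is the mean of −w_l·log(y_l) over the N points. The function name is hypothetical.

```python
import math

def weighted_cross_entropy(true_class_probs, point_weights):
    """Mean of -w_l * log(y_l) over all points (assumed form of the loss)."""
    n = len(true_class_probs)
    total = sum(w * math.log(y) for y, w in zip(true_class_probs, point_weights))
    return -total / n
```

The loss is zero when every point assigns probability 1 to its true class, and a misclassified point of a rare class contributes more than an equally misclassified point of a common class because its w_l is larger.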

After the training error is obtained, the network parameters are updated in the direction opposite to the gradient, and this is iterated until convergence.

Compared with methods in the same field, the invention has significant advantages. For point cloud semantic segmentation, a relatively intuitive approach is to voxelize the entire point cloud scene and then process it with three-dimensional convolution kernels combined with fully convolutional techniques, but problems such as the curse of dimensionality and limited resolution mean that neither the computational efficiency nor the accuracy of this approach can be guaranteed. Neural network methods based on multi-layer perceptrons cannot extract local features of the point cloud effectively, so the network cannot predict the details of the point cloud scene well.

By contrast, the semantic segmentation method for large-scale dense scene point clouds proposed by the invention treats the point cloud as a surface structure of the object and builds two-dimensional convolution directly on the point cloud by projecting it onto local coordinate planes. This allows the method to extract local features of the point cloud effectively without loss of information. Skip connections built into the network let the model make full use of both the low-level texture features and the rich high-level semantic features when making predictions, which helps the network better predict the details of the point cloud scene. The semantic segmentation results obtained by the method can be used directly in tasks such as robot navigation and autonomous driving. The method is particularly effective in natural scenes that are not artificially synthesized.

Description of the drawings

Fig. 1(a) is the scene point cloud of a real meeting room, and Fig. 1(b) is the semantic segmentation ground truth for that meeting room scene point cloud.

Fig. 2 is the structure of the point cloud semantic segmentation network. It takes a scene point cloud as input and, after operations such as convolution, pooling, and upsampling, finally outputs the semantic segmentation result of the scene point cloud.

Fig. 3 is the internal structure of each convolution block: ① converts the feature vectors into the form required by local coordinate-plane convolution; ② concatenates three n × 3 × 3 × d tensors into one n × 3 × 3 × 3d tensor; ③ and ⑤ are 1 × 1 convolutions; ④ is a depthwise separable convolution; the dimension of the tensor is compressed at the end of the block.

Detailed description of the embodiments

The invention is described in further detail below with reference to specific embodiments, but the invention is not limited to these embodiments.

A deep-learning-based method for semantic segmentation of large-scale dense scene point clouds, comprising two parts: training the network model and running the model.

1. Training the network model

To train the semantic segmentation network for large-scale dense scene point clouds, sufficient point cloud data must first be prepared. Each scene point cloud sample should contain RGB and XYZ values and the semantic class of each point. Taking the S3DIS indoor scene dataset as an example, after data augmentation there are 2654 scene point cloud samples in the training set and 578 samples in the validation set.

Once enough data has been obtained, the features of each point are first converted by preprocessing into RGBDHN information as the input to the semantic segmentation network. A Kd-tree is then built to search, for each point in the point cloud, the points within the ball of radius R centered on it; the local coordinate system of each point is solved with PCA; and the points in the ball are converted, through projection and related steps, into a form on which local coordinate-plane convolution can be carried out.

The point cloud semantic segmentation network is then built according to Fig. 2, and the training data is fed to the network in batches for training. The class weights and the error of the point cloud semantic segmentation network are computed according to formula (8) and formula (9) respectively, and the parameters are updated iteratively by gradient backpropagation, accelerated on a GPU, until the network error falls below a set threshold or the number of iterations reaches the required count, at which point training stops.

2. Point cloud semantic segmentation process

For an indoor or outdoor scene point cloud, the point cloud is first fed into the preprocessing module and converted into a form on which local coordinate-plane convolution can be carried out. The preprocessed point cloud is then input to the trained point cloud semantic segmentation model to obtain the semantic information of the scene point cloud, which can subsequently be used in tasks such as autonomous driving and robot navigation. The process is shown in Fig. 2.

Claims

1. A scene point cloud semantic segmentation method, characterized by the following steps:

(1) Construction of local coordinate-plane convolution: the point cloud is projected onto the three coordinate planes obtained by decomposing the point cloud with PCA, and a convolution module is built on each of the three coordinate planes to extract local features of the point cloud;

(1.1) Local coordinate plane estimation:

For each point p in the point cloud, its local coordinate planes are first estimated by local covariance analysis; specifically, let q range over the points in the ball ||p - q|| < R; for the tangent plane, the orientation of the tangent plane at p is determined by the eigenvectors of the covariance matrix ∑_q r r^T, where r = q - p; the eigenvector with the smallest eigenvalue gives the normal vector n_p of the tangent plane, and the other two eigenvectors i and j give the directions of the two coordinate axes of the tangent plane;

(1.2) Local coordinate plane convolution:

Let q denote the set of points within the ball of radius R around p, and project q onto each of the three coordinate planes of p; for each point p, define F(p) as a function of p that encodes color, geometric features, or abstract features from an intermediate layer of the network; to construct the convolution on the tangent plane π_p of point p, define S(u) as the continuous signal at position u in the tangent plane and c(u) as the convolution kernel parameter at position u, where u ∈ R²;

The convolution at point p is therefore defined as the integral of c(u)·S(u) over the tangent plane π_p (formula (1));

(1.3) Signal interpolation:

For the tangent plane, the goal of signal interpolation is to use the signals F(q) of the neighborhood point set q of p to estimate the signal S(u) at each position of the tangent plane that takes part in the convolution; first project q onto the tangent plane of p, producing the projected point set v = (r^T i, r^T j), and define:

S(v) = F(q) (2)

The points v are thus scattered over the image plane, so their signals are interpolated to estimate S(u) at each position that takes part in the convolution:

S(u) = ∑_v(w(u,v)·S(v)) (3)

Here w(u, v) is the weight of the interpolation kernel and satisfies ∑_v w = 1; a fairly simple interpolation scheme, nearest-neighbor (NN) interpolation, is used, in which w(u, v) is 1 when v is the nearest neighbor of u and 0 otherwise (formula (4));

(1.4) Pooling layer:

Pooling is implemented by hashing the point cloud signals onto a regular 3D grid; for the points that fall into the same grid cell, the signals are aggregated by average pooling; consider a point set P = {p} and the corresponding signals {F(p)}, let g denote a voxel grid cell and let V_g denote the set of points in P that are hashed into g; assuming V_g is non-empty, the information of all its points is aggregated onto a single point by average pooling: the pooled signal is (1/|V_g|)·∑_{p∈V_g} F(p) (formula (6));

(2) Point cloud semantic segmentation module:

(2.1) Module input:

The input to the module is a large-scale dense scene point cloud, indoor or outdoor, with no limit on the number of points; the input features of the point cloud contain RGB and XYZ information, which is converted by preprocessing into RGB, depth D, height H, and normal vector N as the input features;

(2.2) Module architecture:

The point cloud semantic segmentation module is a convolutional neural network with an autoencoder (encoder-decoder) structure; its role is to predict the semantic information of the input point cloud, as follows:

I_out = f_seg(I_in; θ_f) (7)

In the formula above, I_out is the network's prediction of the semantic information of the point cloud over n classes, and I_in is the input scene point cloud containing RGBDHN information; f_seg(·) denotes the convolutional neural network with the autoencoder structure, and θ_f denotes the weight parameters of the network model; the encoder of the network contains 2 pooling layers, whose purpose is to aggregate the features output by the convolution modules and to reduce the spatial dimension of the features; each pooling layer is preceded by 3 convolution modules that capture local information of the point cloud; the decoder restores the spatial dimension of the features through upsampling layers, and each upsampling layer is likewise preceded by 3 convolution modules; two skip connections are added between corresponding layers of the encoder and decoder so that the network can better predict the details of the target;

In each convolution module, after the local coordinate-plane convolution, a 1 × 1 convolution first expands the number of feature channels by a further factor of 2, a depthwise separable convolution then decouples the redundant features, and finally a 1 × 1 convolution kernel fuses and compresses the features;

(3) Training method

The Semantic3D outdoor point cloud scene dataset, which contains 8 classes, and the S3DIS indoor scene dataset, which contains 13 classes, are used; the model is trained in a data-driven manner; to address the shortage of three-dimensional data, the scene point clouds in the Semantic3D outdoor point cloud scene dataset and the S3DIS indoor scene dataset are each rotated horizontally 10 times, increasing the number of samples tenfold;

The autoencoder-structured convolutional neural network in the point cloud semantic segmentation module is trained with backpropagation and stochastic gradient descent; for an input scene point cloud, a cross-entropy with class weights is used as the loss function L_seg; the weights are calculated with formula (8): the weight w_i of class i is the negative of the logarithm of the ratio of D_i, the number of points in the sample that belong to class i, to the number of points D_k summed over all classes k in the sample, that is, w_i = −log(D_i / ∑_k D_k); this alleviates the class imbalance in the dataset and prevents the classes with the most points from dominating the training of the network;

The overall network error is calculated using formula (9), where N is the number of points in the scene point cloud, y_l is the score that the output for point l assigns to its true class, and w_l is the weight of the class to which point l belongs;

After the training error is obtained, the network parameters are updated in the direction opposite to the gradient, and this is iterated until convergence.