
CN113205018A - High-resolution image building extraction method based on multi-scale residual error network model - Google Patents

High-resolution image building extraction method based on multi-scale residual error network model

Info

Publication number
CN113205018A
Authority
CN
China
Prior art keywords
building
image
layer
network
residual error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110434612.1A
Other languages
Chinese (zh)
Other versions
CN113205018B (en)
Inventor
眭海刚
杜卓童
李强
段志强
肖昶
王海涛
王挺
程旗
冯文卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202110434612.1A priority Critical patent/CN113205018B/en
Publication of CN113205018A publication Critical patent/CN113205018A/en
Application granted granted Critical
Publication of CN113205018B publication Critical patent/CN113205018B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/176Urban or other man-made structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a high-resolution image building extraction method based on a multi-scale residual network model. First, the types and characteristics of typical buildings in high-resolution remote sensing images are analyzed, a data augmentation strategy is designed to meet the large data demand of the deep learning network, and the hyperparameter ratio of the training and validation sample sets is determined. Second, a dense shortcut structure is incorporated into the basic unit of the symmetric U-Net architecture: a residual mapping unit is designed and the arrangement of the convolutional layers within the basic unit is improved, which facilitates model training. In addition, the input stage of the improved network is designed as a feature pyramid structure so that image features are learned at multiple scales; combined with the designed residual skip connections, multi-scale features are fused and the building segmentation result is refined through multi-level residual unit operations. This strengthens the reuse of multi-level features across network layers, effectively improves gradient propagation through the network, and accelerates model convergence.

Description

High-resolution image building extraction method based on multi-scale residual error network model
Technical Field
The invention relates to the technical field of remote sensing applications, and in particular to a high-resolution image building extraction method based on a multi-scale residual network model.
Background
Buildings are among the basic elements of urban structure; their intelligent extraction is a vital task for urban planning, monitoring and management, and has important application value for urban development analysis. Compared with medium- and low-resolution remote sensing imagery, detecting buildings in high-resolution remote sensing imagery has clear advantages. The imagery contains richer ground-object information: a man-made building that appears as a point target in medium- or low-resolution imagery becomes a distinct area target occupying many pixels in a high-resolution image. The spatial structure, texture and other characterization information of ground objects of the same type is also richer, and better reflects their local characteristics and internal detail differences. However, while high-resolution imagery brings rich detail, it also amplifies interference that is subtle and negligible at lower resolutions, creating new factors that affect building detection. Although the improved spatial resolution alleviates the mixed-pixel problem of low-resolution sensors, differences in building materials cause large variations in spectral response within the area corresponding to a single building. Buildings in complex backgrounds have variable structures and staggered heights, are easily confused with surrounding objects such as trees and roads, and show pronounced "different objects with the same spectrum" and "same object with different spectra" phenomena, which increases the difficulty of building extraction. Elevation discontinuities, increased occlusion in the image, and the shadows cast by complex building structures make detection even more challenging.
In recent years, with advances in computing power and deep learning algorithms, convolutional neural networks have gradually surpassed the best traditional algorithms in object detection, recognition and image semantic segmentation, and end-to-end deep network training has greatly improved the accuracy of building extraction from remote sensing imagery. Among these approaches, deep encoder-decoder networks have been widely applied to building extraction. The encoder mainly extracts deep abstract features; most encoders adopt classical models such as VGGNet, ResNet and DenseNet, discard the fully connected layers, and apply repeated pooling to the input image patches, so that the intermediate feature maps are compressed several times. The decoder learns from the features produced by the encoder and restores the image to obtain the building prediction map. Most current networks use upsampling and skip connections, passing features learned in shallow layers to higher layers to recover the detail lost during image restoration. However, simply connecting the feature maps extracted by the encoder directly to the symmetric decoder does not fully exploit the feature information at multiple levels, and the detailed position information of building targets is still not effectively recovered. In addition, deep models often place excessive demands on GPU memory and hardware; improving extraction efficiency while balancing accuracy and computational cost is another major problem.
Disclosure of Invention
To address the problems in the prior art, the invention adopts a strategy of multi-level feature integration and multi-scale feature fusion and designs a multi-scale residual-connection network model. It aims to solve the loss of building detail caused by pooling operations in deep networks and exploits rich multi-scale context information to achieve more precise building segmentation. At the same time, the deep network model constructed by the invention reduces the number of training parameters and the memory requirements.
The technical scheme of the invention is as follows: a high-resolution image building extraction method based on a multi-scale residual network model, comprising the following steps:
Step 1, analyzing the image characteristics of buildings of different types and styles in typical building areas of a high-resolution remote sensing image, expanding the samples with a data augmentation strategy, and determining the hyperparameter ratio of the training set to the validation set;
Step 2, designing the overall model structure of a multi-scale residual-connection deep network based on a basic symmetric convolutional neural network structure, a dense shortcut structure, residual skip connections and a feature pyramid input structure, comprising the following sub-steps:
step 2.1, the multi-scale residual-connection deep network as a whole comprises an encoder part and a decoder part;
step 2.2, the encoder part adopts a feature pyramid input structure to obtain images at m different scales; each image is processed by a convolutional layer so that the input of the next level matches the size of the feature map output by the previous level; the convolutional feature map output at the previous scale is merged with the image feature map produced by the convolutional layer at the current scale, and the result serves as the input of the next level, which is then processed by a residual mapping unit and a max-pooling layer;
the residual mapping unit comprises two branches: the main branch comprises several convolutional layer units and the shortcut branch comprises one convolutional layer unit, where each convolutional layer unit consists of a convolutional layer, a rectified linear unit and a batch normalization layer; let the input be x, denote the main branch mapping as F(x) and the shortcut branch mapping as G(x); the output of the residual mapping unit is then given by equation (2):

y = F(x) + G(x)    (2)
step 2.3, the decoder part comprises upsampling layers and residual mapping units corresponding to those of the encoder part;
step 2.4, at each scale, the deep feature map output by the encoder part is combined, via a residual skip connection, with the feature map obtained by the upsampling layer of the decoder part at the corresponding scale;
step 2.5, finally, the output of the decoder part is processed by a convolutional layer, and the two-dimensional feature map is converted into a classification map by a Sigmoid activation layer;
Step 3, training the multi-scale residual-connection deep network with the training sample set from Step 1, obtaining the optimal multi-scale residual-connection deep network model by means of the validation sample set, and finally performing high-resolution image building extraction on the test set with the optimal model.
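As an illustration of Step 3, the following is a minimal training sketch, assuming a binary cross-entropy loss and the Adam optimizer (neither is specified in the text) and selecting the weights with the lowest validation loss; `model` stands for any network that maps a batch of 512 × 512 image tiles to per-pixel building probabilities, such as the one designed in Step 2.

```python
# Minimal sketch of Step 3: train on the training set, select the best model on
# the validation set, then apply it to the test set. Loss and optimizer are assumptions.
import copy
import torch
import torch.nn as nn

def train_and_select(model, train_loader, val_loader, epochs=50, lr=1e-3, device="cuda"):
    model = model.to(device)
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state = float("inf"), None
    for epoch in range(epochs):
        model.train()
        for image, label in train_loader:              # label: (B, 1, 512, 512) in {0, 1}
            image, label = image.to(device), label.to(device).float()
            optimizer.zero_grad()
            loss = criterion(model(image), label)
            loss.backward()
            optimizer.step()
        # Validation pass: keep the weights with the lowest average loss.
        model.eval()
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for image, label in val_loader:
                image, label = image.to(device), label.to(device).float()
                val_loss += criterion(model(image), label).item() * image.size(0)
                n += image.size(0)
        if val_loss / n < best_val:
            best_val, best_state = val_loss / n, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model  # apply to the test set, e.g. building mask = (model(x) > 0.5)
```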
Further, Step 1 is specifically implemented as follows:
(1) analyzing the characteristics of typical building areas in the high-resolution remote sensing image:
(a) multi-storey residential areas of brick-concrete structure: the buildings are regularly arranged and well planned, with many storeys; buildings within the same residential area have a uniform layout and a uniform style;
(b) high-rise residential districts, single high-rise office buildings and commercial buildings with clear frame structures on the roof: the streets are neat, adjacent buildings are tall and cast long, narrow shadows, the spacing between buildings is large, and each building differs in height and appearance;
(c) suburban buildings: roofs are sparsely distributed, the houses are low and irregular in shape, interlocking in a jagged pattern and connected to one another;
(d) villas: arranged in order, all detached buildings of consistent length and width with short shadows; roof shapes and exterior wall materials are consistent, and each villa has its own garden;
(2) expanding the training sample set with several data augmentation strategies:
(a) randomly cropping the input image and the output label image;
(b) randomly rotating the input image and the output label image by an angle t_r ∈ [-5, 5];
(c) multiplying each band of the input image by a random value n, where n ∈ [0.5, 1];
(d) randomly flipping the input image and the output label image horizontally and vertically;
(3) data partition: dividing the data set of building instances and other surface objects covering various urban, suburban and rural areas into a training set, a validation set and a test set with a training-to-validation ratio of 5:1, and cutting the data samples into input images of size 512 × 512 for subsequent model training and evaluation of the training effect.
Further, in step 2.2, the convolution kernels of the convolutional layers in the main branch are of two sizes, 3 × 3 and 1 × 1, the stride is set to 1 and the padding to 1, and the convolutional layer in the shortcut branch uses 1 × 1 kernels.
Further, in step 2.2, a feature pyramid network input structure is adopted, and images at five different scales, namely 512 × 512 × 3, 256 × 256 × 3, 128 × 128 × 3, 64 × 64 × 3 and 32 × 32 × 3, are fed to convolutional layers for image feature learning at different scales.
Based on a multi-scale residual-connection deep network model, the invention studies a method for extracting individual buildings from high-resolution remote sensing imagery, and is characterized by the following:
(1) existing open-source building data sets mostly come from the same sensor or from images with similar acquisition times, so the data distributions of the test and training images are very close and deep network models generalize poorly; by analyzing building characteristics in multi-source, multi-temporal imagery, a data augmentation strategy is designed and the data hyperparameters for deep network training are determined, improving the generalization ability of the deep convolutional neural network;
(2) drawing on the advantages of the U-Net and ResNet architectures, the basic convolutional unit of the deep network is improved: a residual mapping unit is designed on the basis of the U-Net basic unit and the arrangement of its convolutional layers is improved, ensuring gradient propagation while improving training efficiency;
(3) because network layers with different receptive fields have different capabilities for representing geometric and semantic information, a feature pyramid input structure is designed to learn image features at multiple scales; and since directly connecting the feature maps extracted by the encoder to the symmetric decoder cannot fully fuse the feature information, a residual skip connection method is further developed to enhance the reuse of multi-level features across network layers.
Drawings
FIG. 1 is a flowchart of the method for extracting individual buildings from high-resolution remote sensing images based on the multi-scale residual-connection deep network model.
Fig. 2 shows the basic structural unit of residual mapping in the building extraction network, where "Conv" denotes a convolutional layer, "ReLU" a rectified linear unit, and "BN" a batch normalization layer.
Fig. 3 shows the designed "Res Path" skip connection scheme. Unlike the direct connection of encoder features to decoder features, multi-level residual unit operations are used to fuse the multi-level features.
Detailed Description
The invention provides a high-resolution image building extraction method based on a multi-scale residual network model. First, the types and characteristics of typical buildings in high-resolution remote sensing images are analyzed, a data augmentation strategy is designed to meet the large data demand of the deep learning network, and the hyperparameter ratio of the training and validation sample sets is determined. Second, a dense shortcut structure, namely a residual mapping unit, is incorporated into the basic unit of the symmetric U-Net architecture, and the arrangement of the convolutional layers within the basic unit is improved; this ensures gradient propagation, avoids vanishing gradients in the deep network, improves training efficiency, and facilitates model training. In addition, the input stage of the improved network is designed as a feature pyramid structure so that image features are learned at multiple scales; combined with the designed residual skip connections, multi-scale features are fused and the building segmentation result is refined through multi-level residual unit operations, strengthening the reuse of multi-level features across network layers, improving gradient propagation, and accelerating model convergence. The method adaptively learns and analyzes features from shallow local features to deep abstract features, and adopts the strategies of multi-level feature integration and multi-scale feature fusion to obtain rich multi-scale context information and achieve more precise building segmentation.
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and an embodiment. The flowchart is shown in Fig. 1, and the workflow of the embodiment comprises the following steps:
Step 1: analyze the image characteristics of buildings of different types and styles in typical building areas of the high-resolution remote sensing image, design a data augmentation strategy to meet the large data demand of the deep learning network, and determine the hyperparameter ratio of the training and validation sample sets. The sample-data hyperparameters for deep network training are obtained as follows:
(1) Data analysis. Analyze the different characteristics of typical building areas in the high-resolution remote sensing image:
(a) multi-storey residential areas of brick-concrete structure: generally regularly arranged and well planned, with many storeys; buildings within the same residential area have a fairly uniform layout and an essentially consistent style;
(b) high-rise residential districts, single high-rise office buildings and commercial buildings with clear frame structures on the roof: the streets are neat, adjacent buildings are tall and cast long, narrow shadows, the spacing between buildings is large, and each building differs in height and appearance;
(c) suburban buildings: roofs are sparsely distributed, mostly low houses with irregular shapes, interlocking in a jagged pattern and connected to one another;
(d) villas: generally arranged in order, all detached buildings of consistent length and width with short shadows; roof shapes and exterior wall materials are essentially uniform, and each villa has its own garden.
(2) Data augmentation. To give the deep convolutional neural network good generalization ability and avoid overfitting, a large number of training samples covering diverse building characteristics is required, so the training sample set is expanded with several augmentation strategies (a minimal code sketch follows this step):
(a) randomly crop the input image and the output label image;
(b) randomly rotate the input image and the output label image by an angle t_r ∈ [-5, 5];
(c) multiply each band of the input image by a random value n ∈ [0.5, 1];
(d) randomly flip the input image and the output label image horizontally and vertically.
(3) Data partition. Divide the data sets of building instances and other surface objects covering various urban, suburban and rural areas into a training set, a validation set and a test set with a training-to-validation ratio of about 5:1, and cut the samples into 512 × 512 input images for subsequent model training and evaluation of the training effect.
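A minimal sketch of augmentation strategies (a)-(d), assuming the image tile and its label mask are NumPy arrays (image H × W × C, label H × W); the rotation unit (degrees), interpolation orders and boundary handling are assumptions not fixed by the text.

```python
# Joint augmentation of an image tile and its label mask, following (a)-(d) above.
import numpy as np
from scipy.ndimage import rotate

def augment(image, label, crop=512, rng=None):
    if rng is None:
        rng = np.random.default_rng()

    # (a) random crop to crop x crop
    h, w = label.shape
    y0 = rng.integers(0, h - crop + 1)
    x0 = rng.integers(0, w - crop + 1)
    image = image[y0:y0 + crop, x0:x0 + crop]
    label = label[y0:y0 + crop, x0:x0 + crop]

    # (b) random rotation, t_r in [-5, 5] (assumed to be degrees)
    angle = rng.uniform(-5, 5)
    image = rotate(image, angle, reshape=False, order=1, mode="reflect")
    label = rotate(label, angle, reshape=False, order=0, mode="reflect")

    # (c) multiply each band by a random factor n in [0.5, 1]
    factors = rng.uniform(0.5, 1.0, size=image.shape[-1])
    image = image * factors

    # (d) random horizontal / vertical flips
    if rng.random() < 0.5:
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:
        image, label = image[::-1, :], label[::-1, :]
    return image.copy(), label.copy()
```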
Step 2: design the basic structural unit of the convolutional layers of the multi-scale residual-connection deep network model based on a symmetric structure and dense shortcut connections, promoting information propagation through the network, avoiding vanishing gradients in the deep network, and facilitating model training. The improved residual mapping unit of the network is obtained as follows:
(1) Design the basic structural unit of the network's convolutional layers, i.e., residual shortcut blocks. The residual mapping basic unit has two branches, which promotes information propagation, accelerates model convergence and facilitates training (a minimal code sketch follows this subsection):
(a) Main branch design:
The convolution kernel sizes (kernel) of the convolutional layers are 3 × 3 and 1 × 1, the stride parameter (stride) is set to 1 and the padding parameter (padding) to 1, and each convolution is followed by a rectified linear unit (ReLU) and a batch normalization layer (BN). Let the input image be x and denote the main branch mapping as F(x), as shown in the left branch of Fig. 2. The ReLU activation function is given by equation (1):

f(x) = max(0, x)    (1)
(b) Shortcut branch design:
The convolutional layer uses a 1 × 1 kernel and is followed by a ReLU layer and a BN layer; the shortcut branch mapping is denoted G(x), as shown in the right branch of Fig. 2.
(2) The output of the residual mapping basic unit is given by equation (2):

y = F(x) + G(x)    (2)
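A minimal PyTorch sketch of the residual mapping unit described above: the main branch F(x) stacks Conv-ReLU-BN units with 3 × 3 kernels, stride 1 and padding 1, the shortcut branch G(x) uses a single 1 × 1 convolution unit, and the output is their sum as in equation (2). The number of units per branch and the channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ResidualMappingUnit(nn.Module):
    """Two-branch residual block: y = F(x) + G(x), cf. Fig. 2 and equation (2)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Main branch F(x): two 3x3 Conv -> ReLU -> BN units, stride 1, padding 1.
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
        )
        # Shortcut branch G(x): a single 1x1 Conv -> ReLU -> BN unit.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.main(x) + self.shortcut(x)  # y = F(x) + G(x)
```

Because the shortcut is a learned 1 × 1 mapping rather than a plain identity, the two branches can change the channel dimension while still behaving like a residual connection.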
Step 3: learn image features at different scales with a feature pyramid network input structure, perform multi-scale feature fusion with the designed residual skip connections, and refine the building segmentation result through multi-level residual unit operations to improve network training performance. Multi-scale feature representations are fused via the residual skip connections as follows:
(1) A feature pyramid network input structure is adopted: images at five different scales, namely 512 × 512 × 3, 256 × 256 × 3, 128 × 128 × 3, 64 × 64 × 3 and 32 × 32 × 3, are taken as inputs of the convolutional layers for feature learning at each scale, and the input of each level is kept consistent in size with the feature map output by the previous level.
(2) The convolutional feature map output at the previous scale is merged with the image input at the current scale, and the merged result is used as the input of a new convolutional layer for feature learning, thereby achieving multi-scale feature fusion.
(3) A residual skip connection (Res Path) is designed to replace the conventional practice of directly connecting the encoder feature map to the symmetric decoder part: the output feature map of the encoder is first passed through residual unit operations (equation (2)) and then connected to the corresponding upsampled feature of the decoder, so that the low-level feature map is integrated with the symmetric high-level feature map into a new tensor for subsequent computation and processing. The structure of the residual skip connection is shown in Fig. 3 (a minimal code sketch follows).
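As a hedged sketch of the Res Path idea in (3), the skip connection below refines the encoder feature map with a chain of residual mapping units (reusing the ResidualMappingUnit class from the previous sketch) before it is concatenated with the decoder's upsampled feature map; the number of units per path is an assumption and may differ from the configuration shown in Fig. 3 and Table 1.

```python
import torch.nn as nn

class ResPath(nn.Module):
    """Skip connection that refines the encoder feature map with a chain of
    residual mapping units before it meets the decoder (cf. equation (2) and Fig. 3)."""
    def __init__(self, in_ch, out_ch, n_units=2):
        super().__init__()
        units = [ResidualMappingUnit(in_ch, out_ch)]
        units += [ResidualMappingUnit(out_ch, out_ch) for _ in range(n_units - 1)]
        self.path = nn.Sequential(*units)

    def forward(self, x):
        return self.path(x)

# In the decoder, the refined skip is fused with the upsampled feature, e.g.:
#   fused = torch.cat([res_path(encoder_feat), upsampled_decoder_feat], dim=1)
```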
Step 4: design the overall model structure of the multi-scale residual-connection deep network based on the basic convolutional structural unit, the residual skip connection and the feature pyramid input structure; the specific parameters are given in Table 1. The overall structure of the multi-scale residual-connection deep network model is obtained as follows:
(1) Network encoder structure design:
The encoder part adopts the strategies of multi-level feature integration and multi-scale feature fusion and is built by stacking successive convolutional layers, residual mapping units (shortcut blocks) and max-pooling layers.
(2) Network decoder structure design:
The decoder part is built by stacking the corresponding upsampling layers, dense shortcut units (dense shortcut blocks) and convolutional layers; a Sigmoid activation layer converts the two-dimensional deep feature map into a classification map.
(3) Based on the residual skip connection designed in Step 3 (3), five additional Res Path skip connections are added to the five pairs of the down-sampling/up-sampling symmetric structure, fusing the pixel position information of the shallow encoder feature maps with the semantic information of the upsampled decoder feature maps, thereby refining the building segmentation result and improving network training performance (an end-to-end code sketch follows Table 1).
Table 1. Overall model parameters of the multi-scale residual-connection deep network
[Table 1 is provided as an image in the original patent document and is not reproduced here.]
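The following is a compact end-to-end sketch in the spirit of the overall structure described above, assembling the pyramid image inputs, the residual mapping units and Res Paths from the earlier sketches, max-pooling in the encoder, and upsampling plus a Sigmoid head in the decoder. The channel widths, the use of transposed convolutions for upsampling, and the use of four skip connections (the text describes five Res Paths) are assumptions rather than the configuration of Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleResidualNet(nn.Module):
    def __init__(self, img_ch=3, base_ch=64):
        super().__init__()
        chs = [base_ch * 2 ** i for i in range(5)]               # assumed widths 64..1024
        # One convolutional branch per pyramid scale to process the resized image.
        self.img_convs = nn.ModuleList(
            [nn.Conv2d(img_ch, c, 3, padding=1) for c in chs])
        # Encoder: one residual mapping unit per level; below the top level the input
        # merges the previous pooled features with the current-scale image features.
        self.enc = nn.ModuleList(
            [ResidualMappingUnit(c if i == 0 else chs[i - 1] + c, c)
             for i, c in enumerate(chs)])
        self.pool = nn.MaxPool2d(2)
        # Res Path skip connections for the encoder/decoder pairs in this sketch.
        self.res_paths = nn.ModuleList([ResPath(c, c) for c in chs[:4]])
        # Decoder: transposed-conv upsampling followed by residual mapping units.
        self.up = nn.ModuleList(
            [nn.ConvTranspose2d(chs[i + 1], chs[i], 2, stride=2) for i in range(4)])
        self.dec = nn.ModuleList(
            [ResidualMappingUnit(chs[i] * 2, chs[i]) for i in range(4)])
        self.head = nn.Conv2d(chs[0], 1, kernel_size=1)          # final 1x1 conv

    def forward(self, x):
        # Feature-pyramid inputs: the image at full, 1/2, 1/4, 1/8 and 1/16 resolution.
        pyramid = [x] + [F.interpolate(x, scale_factor=0.5 ** i, mode="bilinear",
                                       align_corners=False) for i in range(1, 5)]
        skips, feat = [], None
        for i in range(5):
            img_feat = self.img_convs[i](pyramid[i])
            inp = img_feat if i == 0 else torch.cat([feat, img_feat], dim=1)
            feat = self.enc[i](inp)
            if i < 4:
                skips.append(self.res_paths[i](feat))            # refine the skip via Res Path
                feat = self.pool(feat)
        for i in reversed(range(4)):
            feat = self.up[i](feat)
            feat = self.dec[i](torch.cat([skips[i], feat], dim=1))
        return torch.sigmoid(self.head(feat))                    # per-pixel building probability

# Usage (shapes only):
#   model = MultiScaleResidualNet()
#   prob = model(torch.randn(1, 3, 512, 512))                   # -> (1, 1, 512, 512)
```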
The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications, additions or substitutions to the described embodiments without departing from the spirit or scope of the invention as defined in the appended claims.

Claims (4)

1. A high-resolution image building extraction method based on a multi-scale residual network model, characterized by comprising the following steps:
Step 1, analyzing the image characteristics of buildings of different types and styles in typical building areas of a high-resolution remote sensing image, expanding the samples with a data augmentation strategy, and determining the hyperparameter ratio of the training set to the validation set;
Step 2, designing the overall model structure of a multi-scale residual-connection deep network based on a basic symmetric convolutional neural network structure, a dense shortcut structure, residual skip connections and a feature pyramid input structure, comprising the following sub-steps:
step 2.1, the multi-scale residual-connection deep network as a whole comprises an encoder part and a decoder part;
step 2.2, the encoder part adopts a feature pyramid input structure to obtain images at m different scales; each image is processed by a convolutional layer so that the input of the next level matches the size of the feature map output by the previous level; the convolutional feature map output at the previous scale is merged with the image feature map produced by the convolutional layer at the current scale, and the result serves as the input of the next level, which is then processed by a residual mapping unit and a max-pooling layer;
the residual mapping unit comprises two branches: the main branch comprises several convolutional layer units and the shortcut branch comprises one convolutional layer unit, where each convolutional layer unit consists of a convolutional layer, a rectified linear unit and a batch normalization layer; let the input be x, denote the main branch mapping as F(x) and the shortcut branch mapping as G(x); the output of the residual mapping unit is then given by equation (2):

y = F(x) + G(x)    (2)
step 2.3, the decoder part comprises upsampling layers and residual mapping units corresponding to those of the encoder part;
step 2.4, at each scale, the deep feature map output by the encoder part is combined, via a residual skip connection, with the feature map obtained by the upsampling layer of the decoder part at the corresponding scale;
step 2.5, finally, the output of the decoder part is processed by a convolutional layer, and the two-dimensional feature map is converted into a classification map by a Sigmoid activation layer;
Step 3, training the multi-scale residual-connection deep network with the training sample set from Step 1, obtaining the optimal multi-scale residual-connection deep network model by means of the validation sample set, and finally performing high-resolution image building extraction on the test set with the optimal model.
2. The high-resolution image building extraction method based on a multi-scale residual network model according to claim 1, characterized in that Step 1 is implemented as follows:
(1) analyzing the characteristics of typical building areas in the high-resolution remote sensing image:
(a) multi-storey residential areas of brick-concrete structure: the buildings are regularly arranged and well planned, with many storeys; buildings within the same residential area have a uniform layout and a uniform style;
(b) high-rise residential districts, single high-rise office buildings and commercial buildings with clear frame structures on the roof: the streets are neat, adjacent buildings are tall and cast long, narrow shadows, the spacing between buildings is large, and each building differs in height and appearance;
(c) suburban buildings: roofs are sparsely distributed, the houses are low and irregular in shape, interlocking in a jagged pattern and connected to one another;
(d) villas: arranged in order, all detached buildings of consistent length and width with short shadows; roof shapes and exterior wall materials are consistent, and each villa has its own garden;
(2) expanding the training sample set with several data augmentation strategies:
(a) randomly cropping the input image and the output label image;
(b) randomly rotating the input image and the output label image by an angle t_r ∈ [-5, 5];
(c) multiplying each band of the input image by a random value n, where n ∈ [0.5, 1];
(d) randomly flipping the input image and the output label image horizontally and vertically;
(3) data partition: dividing the data set of building instances and other surface objects covering various urban, suburban and rural areas into a training set, a validation set and a test set with a training-to-validation ratio of 5:1, and cutting the data samples into input images of size 512 × 512 for subsequent model training and evaluation of the training effect.
3. The high-resolution image building extraction method based on a multi-scale residual network model according to claim 1, characterized in that: in step 2.2, the convolution kernels of the convolutional layers in the main branch are of sizes 3 × 3 and 1 × 1, the stride is set to 1 and the padding to 1, and the convolutional layer in the shortcut branch uses 1 × 1 kernels.
4. The high-resolution image building extraction method based on a multi-scale residual network model according to claim 3, characterized in that: in step 2.2, a feature pyramid network input structure is adopted, and images at five different scales, namely 512 × 512 × 3, 256 × 256 × 3, 128 × 128 × 3, 64 × 64 × 3 and 32 × 32 × 3, are fed to convolutional layers for image feature learning at different scales.
CN202110434612.1A 2021-04-22 2021-04-22 High-resolution image building extraction method based on multi-scale residual error network model Active CN113205018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110434612.1A CN113205018B (en) 2021-04-22 2021-04-22 High-resolution image building extraction method based on multi-scale residual error network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110434612.1A CN113205018B (en) 2021-04-22 2021-04-22 High-resolution image building extraction method based on multi-scale residual error network model

Publications (2)

Publication Number Publication Date
CN113205018A true CN113205018A (en) 2021-08-03
CN113205018B CN113205018B (en) 2022-04-29

Family

ID=77027900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110434612.1A Active CN113205018B (en) 2021-04-22 2021-04-22 High-resolution image building extraction method based on multi-scale residual error network model

Country Status (1)

Country Link
CN (1) CN113205018B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657480A (en) * 2021-08-13 2021-11-16 江南大学 Clothing analysis method based on feature fusion network model
CN113902792A (en) * 2021-11-05 2022-01-07 长光卫星技术有限公司 Building height detection method and system based on improved RetinaNet network and electronic equipment
CN114580564A (en) * 2022-03-21 2022-06-03 滁州学院 Dominant tree species remote sensing classification method and classification system based on unmanned aerial vehicle image
CN115393868A (en) * 2022-08-18 2022-11-25 中化现代农业有限公司 Text detection method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109934153A (en) * 2019-03-07 2019-06-25 张新长 Building extracting method based on gate depth residual minimization network
US20200111214A1 (en) * 2018-10-03 2020-04-09 Merck Sharp & Dohme Corp. Multi-level convolutional lstm model for the segmentation of mr images
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN111898543A (en) * 2020-07-31 2020-11-06 武汉大学 Building automatic extraction method integrating geometric perception and image understanding

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200111214A1 (en) * 2018-10-03 2020-04-09 Merck Sharp & Dohme Corp. Multi-level convolutional lstm model for the segmentation of mr images
CN109934153A (en) * 2019-03-07 2019-06-25 张新长 Building extracting method based on gate depth residual minimization network
CN111127493A (en) * 2019-11-12 2020-05-08 中国矿业大学 Remote sensing image semantic segmentation method based on attention multi-scale feature fusion
CN111898543A (en) * 2020-07-31 2020-11-06 武汉大学 Building automatic extraction method integrating geometric perception and image understanding

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
He Daiyi et al., "Building Extraction from Remote Sensing Images Based on Improved Mask-RCNN", Computer Systems & Applications *
Liu Yifan et al., "Building Extraction from Remote Sensing Images Using Deep Residual Networks", Remote Sensing Information *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657480A (en) * 2021-08-13 2021-11-16 江南大学 Clothing analysis method based on feature fusion network model
CN113902792A (en) * 2021-11-05 2022-01-07 长光卫星技术有限公司 Building height detection method and system based on improved RetinaNet network and electronic equipment
CN113902792B (en) * 2021-11-05 2024-06-11 长光卫星技术股份有限公司 Building height detection method, system and electronic equipment based on improved RETINANET network
CN114580564A (en) * 2022-03-21 2022-06-03 滁州学院 Dominant tree species remote sensing classification method and classification system based on unmanned aerial vehicle image
CN115393868A (en) * 2022-08-18 2022-11-25 中化现代农业有限公司 Text detection method and device, electronic equipment and storage medium
CN115393868B (en) * 2022-08-18 2023-05-26 中化现代农业有限公司 Text detection method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113205018B (en) 2022-04-29

Similar Documents

Publication Publication Date Title
CN113205018B (en) High-resolution image building extraction method based on multi-scale residual error network model
CN117078943B (en) Remote sensing image road segmentation method integrating multi-scale features and double-attention mechanism
CN111126202B (en) Optical remote sensing image target detection method based on void feature pyramid network
CN115601549B (en) River and lake remote sensing image segmentation method based on deformable convolution and self-attention model
CN111259906B (en) Method for generating remote sensing image target segmentation countermeasures under condition containing multilevel channel attention
CN113850825A (en) Remote sensing image road segmentation method based on context information and multi-scale feature fusion
CN111598045B (en) Remote sensing farmland change detection method based on object spectrum and mixed spectrum
CN110263705A (en) Towards two phase of remote sensing technology field high-resolution remote sensing image change detecting method
CN114187450B (en) Remote sensing image semantic segmentation method based on deep learning
CN113486764B (en) Pothole detection method based on improved YOLOv3
CN115223063B (en) Deep learning-based unmanned aerial vehicle remote sensing wheat new variety lodging area extraction method and system
CN113345082A (en) Characteristic pyramid multi-view three-dimensional reconstruction method and system
CN114092697B (en) Building facade semantic segmentation method with attention fused with global and local depth features
CN113657326A (en) Weed detection method based on multi-scale fusion module and feature enhancement
CN110334719B (en) Method and system for extracting building image in remote sensing image
CN115223017B (en) Multi-scale feature fusion bridge detection method based on depth separable convolution
CN109740485A (en) Reservoir or dyke recognition methods based on spectrum analysis and depth convolutional neural networks
CN114663439A (en) Remote sensing image land and sea segmentation method
CN112883887B (en) Building instance automatic extraction method based on high spatial resolution optical remote sensing image
CN108256464A (en) High-resolution remote sensing image urban road extracting method based on deep learning
CN114821069A (en) Building semantic segmentation method for double-branch network remote sensing image fused with rich scale features
CN115359366A (en) Remote sensing image target detection method based on parameter optimization
CN110147780B (en) Real-time field robot terrain identification method and system based on hierarchical terrain
Guo et al. Monitoring the spatiotemporal change of Dongting Lake wetland by integrating Landsat and MODIS images, from 2001 to 2020
Liu et al. A new multi-channel deep convolutional neural network for semantic segmentation of remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant