CN114549555A - Human ear image anatomical segmentation method based on a semantic segmentation network - Google Patents
Human ear image anatomical segmentation method based on a semantic segmentation network
- Publication number
- CN114549555A (application CN202210179495.3A)
- Authority
- CN
- China
- Prior art keywords
- image
- human ear
- network
- decoder
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/11—Region-based segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10024—Color image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20016—Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30004—Biomedical image processing
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a human ear image anatomical segmentation method based on a semantic segmentation network, which comprises the following steps: acquiring human ear images and constructing a human ear anatomical segmentation data set; improving the U-Net network to obtain an ear segmentation model; training the model on the data set; and performing anatomical segmentation on a human ear image to be examined using the trained segmentation model. The invention can segment the human ear image and improve segmentation accuracy.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a human ear image anatomical segmentation method based on a semantic segmentation network.
Background
Modern people pay increasing attention to health and health care, and auricular point therapy is gradually gaining popularity as a simple and effective health-care method. Auricular point localization is an important prerequisite for auricular point therapy, but the auricular points of the human ear are numerous, each covers only a small area, and their positions differ with the shape of each ear, so localization is difficult. Most auricular points are distributed over eight regions divided according to the anatomical structure of the auricle, namely the helix, the periauricular region, the fossa trigonalis, the antihelix, the concha, the antitragus, the tragus and the earlobe. Accurately dividing the human ear into these eight anatomical regions is therefore an important prerequisite for accurate auricular point localization, yet the dense distribution of the points and the variation in auricle shape make localization difficult for beginners in traditional Chinese medicine and for non-professionals.
In the prior art, the human ear cannot be accurately segmented into eight parts according to its anatomical features, and the ability to extract semantic, detail and edge information from human ear images is insufficient, so segmentation accuracy is low.
Disclosure of Invention
To address the deficiencies of the prior art, the invention provides a human ear image anatomical segmentation method based on a semantic segmentation network. By segmenting the auricle into anatomical regions, the search area for auricular point localization is greatly reduced, laying an important foundation for automatic auricular point localization. The method also provides a reference for helping non-professionals understand auricular point therapy, for developing intelligent instruments related to auricular point therapy, and for teaching auricular anatomy and more fine-grained human ear recognition.
In order to achieve this purpose, the technical scheme adopted by the invention is as follows:
A human ear image anatomical segmentation method based on a semantic segmentation network comprises the following steps:
S101: select 200 left-ear images from different individuals in the USTB-Helloear database as training data. Using labelme as the annotation tool, label the eight regions of each image: the helix, periauricular region, fossa trigonalis, antihelix, concha, antitragus, tragus and earlobe. Build a human ear segmentation data set divided into a training set, a validation set and a test set, then apply a series of preprocessing operations to the training set, including data enhancement: random flipping, brightness adjustment and affine transformation.
S102: improve the U-Net network according to a preset method, which comprises: adding an attention mechanism to the decoder part of the symmetric encoder-decoder structure of the U-Net network, which serves as the feature extraction network; and constructing a spatial convolution pooling pyramid in the U-Net network, performing parallel sampling with atrous convolutions at different sampling rates to introduce more context information and enlarge the receptive field.
S103: train a semantic segmentation model on the preprocessed human ear images using the improved U-Net network;
S104: detect the test data set with the semantic segmentation model, dividing each human ear image into eight regions according to its anatomical features;
S105: evaluate and verify the effectiveness of the improved U-Net network model on the detection results using the mIoU metric.
Further, of the 200 left-ear images, 160 are selected as the training set, 20 as the validation set and 20 as the test set;
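The 160/20/20 split can be sketched in a few lines of Python; the function name, seed and file-name pattern below are illustrative and not part of the patent:

```python
import random

def split_dataset(image_paths, n_train=160, n_val=20, n_test=20, seed=0):
    """Shuffle and split image paths into train/validation/test subsets."""
    assert len(image_paths) == n_train + n_val + n_test
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # deterministic shuffle for reproducibility
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

train, val, test = split_dataset([f"ear_{i:03d}.png" for i in range(200)])
```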
further, randomly flipping: including horizontal flipping, vertical flipping, and diagonal flipping. Turning an n-by-n two-dimensional matrix by 180 degrees left and right, rotating the matrix by 180 degrees up and down and rotating the matrix by 180 degrees clockwise respectively;
further, the brightness was adjusted to: in the HSV color space of an image, the saturation, brightness, and contrast are randomly changed. Wherein the brightness is the brightness of the image; saturation refers to how much of the image color category; the contrast is the difference between the maximum gray level and the minimum gray level of the image;
firstly, normalizing the digital image, changing the digital image into a floating point type, converting a color space BGR into an HLS, wherein the HLS space and three channels are respectively as follows: hue, Lightness, saturation; then, the brightness and the saturation are respectively processed by linear change, and two sliding bars are created to manually adjust the brightness and the saturation respectively.
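The linear change applied to a lightness or saturation channel can be sketched as follows; the gain/bias parameterization is an assumption for illustration, with channel values taken as normalized floats:

```python
def adjust_channel(values, gain=1.0, bias=0.0):
    """Apply the linear change v -> gain*v + bias to one HLS channel.

    `values` are assumed to be normalized floats in [0, 1]; the result is
    clipped back into that range, as a slider-driven adjustment would do.
    """
    return [min(1.0, max(0.0, gain * v + bias)) for v in values]

# Brightening the lightness channel by a constant bias:
lightness = [0.2, 0.5, 0.9]
brighter = adjust_channel(lightness, gain=1.0, bias=0.2)
```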
Further, affine transformation: a linear transformation from two-dimensional coordinates to two-dimensional coordinates, realized by compositing a series of atomic transformations, specifically translation, scaling, rotation and flipping. Applying translation, scaling and rotation to an image simultaneously requires an M matrix, which OpenCV can solve automatically from three point correspondences before and after the transformation.
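Recovering the 2×3 affine matrix M from three point correspondences (the computation behind OpenCV's `getAffineTransform`) can be sketched in pure Python; the solver below is a minimal illustration assuming three non-collinear points:

```python
def solve3(A, b):
    # Gaussian elimination with partial pivoting for a 3x3 linear system.
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            for c in range(i, 4):
                M[r][c] -= f * M[i][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        x[i] = (M[i][3] - sum(M[i][c] * x[c] for c in range(i + 1, 3))) / M[i][i]
    return x

def affine_from_points(src, dst):
    """Recover the 2x3 matrix M with [x', y'] = M @ [x, y, 1] from three pairs."""
    A = [[x, y, 1.0] for x, y in src]
    return [solve3(A, [p[0] for p in dst]),   # first row produces x'
            solve3(A, [p[1] for p in dst])]   # second row produces y'
```

Three non-collinear points determine the transform uniquely; for a pure translation by (2, 3) the recovered matrix is [[1, 0, 2], [0, 1, 3]].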
Further, the method of adding the attention mechanism in S102 is specifically: attention gates are used in the decoder part of the U-Net network, and the features extracted by the encoder pass through an attention gate before entering the decoder. This readjusts the output features of the encoder and suppresses feature responses in irrelevant background regions, without training multiple models or a large number of additional model parameters.
Specifically, the decoder matrix and the encoder matrix are each multiplied by a weight matrix before entering the next stage; through back-propagation the weight matrix continuously learns the importance of the decoder and encoder matrices according to the target. The input elements are the matrices to be added during encoding, which are scaled by the attention coefficients computed in the attention gate (AG). Activation and context information provided by the pooled matrices are needed to select spatial regions, and grid resampling of the attention coefficients is done using trilinear interpolation.
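An additive attention gate of this kind can be sketched on per-position scalar features; the scalar weights below stand in for the learned 1×1 convolutions of a real attention gate and are purely illustrative:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def attention_gate(x, g, wx=1.0, wg=1.0, psi=1.0):
    """Additive attention gate over per-position features.

    x  : skip-connection features from the encoder (one value per position)
    g  : gating features from the decoder (same length)
    wx, wg, psi : scalar weights standing in for the learned 1x1 convolutions
    Returns the encoder features rescaled by attention coefficients in (0, 1).
    """
    q = [max(0.0, wx * xi + wg * gi) for xi, gi in zip(x, g)]  # ReLU of the sum
    alpha = [sigmoid(psi * qi) for qi in q]                    # attention coefficients
    gated = [ai * xi for ai, xi in zip(alpha, x)]
    return gated, alpha
```

Positions whose combined encoder/decoder response is low receive coefficients near 0.5 or below, damping background features before they reach the decoder.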
Further, in S102, the spatial convolution pooling pyramid is constructed in the U-Net network as follows:
the spatial convolution pooling pyramid has 5 layers in total, a layer of 1 × 1 ordinary convolution and three layers of 3 × 3 cavity convolution and pooling layers with sampling rates of 1, 6 and 12 respectively, after the human ear image is subjected to feature extraction through an encoder network, the human ear image is used as input and sent into a cavity convolution pyramid module, and then after five layers of concat, the human ear image is subjected to convolution through 1 × 1 to obtain the size which is the same as that of the input feature, and fusion of shallow features and deep features is realized.
Further, the improved U-Net network structure is divided into: encoder, atrous convolution pyramid, attention mechanism and decoder. To obtain the characteristics of the human ear image, an atrous convolution pyramid is constructed between the encoder and the decoder, introducing more context information through global pooling and feature fusion and increasing the receptive field.
When a human ear image is input, basic feature extraction is first performed by the encoder; this process goes through 4 downsampling steps to obtain higher-level feature maps. The output features are then fed into the atrous convolution feature pyramid, which acquires more context information through global pooling and feature fusion and produces a high-level feature mapping of the original image; the feature-map resolution is not changed in this process. Finally, the decoder upsamples the feature maps 4 times to gradually recover the original image resolution. An attention mechanism is added at each upsampling step: the input of each encoder stage is connected to the output of the corresponding decoder stage through a weight matrix, which increases the weight of useful information, suppresses useless information, recovers the spatial information lost during downsampling, reduces the decoder parameters and makes the network more efficient. Each decoder stage is then taken as input to a 3×3 convolution, followed by bilinear upsampling and a sigmoid function, to obtain the side output of each layer's feature map.
Further, in S103, a semantic segmentation model is trained with the improved U-Net network, specifically comprising the following steps:
Step 1: set the number of training epochs to 100 and the batch size to 32; adjust the learning rate with a cosine annealing strategy, with an initial learning rate of 0.1 and a lower limit of 0.001; set the weight decay and momentum to 1e-4 and 0.9, respectively.
Step 2: expand the data set by conventional means, including random flipping, brightness adjustment and affine transformation, then input the data set iteratively.
Step 3: for each batch of input data, extract the feature maps of the images through the encoder and the atrous convolution pyramid module of the network, and pass them through the attention gates into the decoder to obtain the region class of each pixel;
Step 4: calculate the error between the prediction and the label. When the error is larger than the expected value, propagate it back toward the input layer by layer and compute the error of every neuron in the network; when the error is less than or equal to the expected value, training ends;
Step 5: update the weights according to the calculated errors. SGD is used to perform a gradient update for each sample, and the cosine annealing strategy reduces the learning rate as the loss approaches its minimum, so that the model does not overshoot. Then return to the iterative data input of Step 2 and continue the cycle.
Step 6: train the improved U-Net network according to Steps 1 to 5, repeating the cycle until the loss no longer decreases, to obtain the optimal training weights.
Further, in S104, the test data set is detected based on the semantic segmentation model; the specific steps are as follows:
Step 1: read the optimal weight combination from the training process to initialize the network structure;
Step 2: input the data set iteratively;
Step 3: for each batch of input data, extract the feature maps of the images through the encoder and the spatial convolution pyramid module of the network, and pass them through the attention gates into the decoder to obtain the region class of each pixel;
Step 4: calculate the IoU between the prediction and the label, accumulate all IoU values and take their average as the performance evaluation index of the model on the test set;
Step 5: judge whether the entire test set has been predicted; if not, return to Step 2; otherwise, end the prediction process.
Compared with the prior art, the invention has the advantages that:
the present invention uses 200 left ear images of different individuals as training data. Labeling eight areas of an helix, a periauricular area, a fossa trigonalis, an antihelix, a concha, an antitragus, an tragus and an earlobe of each image by using labelme as a labeling tool, manufacturing a data set, wherein each data set is divided into three parts, namely a training set, a verification machine and a test set, and then performing a series of preprocessing operations including data enhancement on the training set in the data set; improving the U-Net network according to a preset method, and training a semantic segmentation model for the preprocessed human ear images by using the improved U-Net network; detecting a test data set by utilizing an improved semantic segmentation model trained by a U-Net network, and segmenting human ears into eight parts according to anatomical features; and adopting an mIou evaluation mode to carry out effectiveness evaluation and verification on the improved U-Net network model. By adopting the invention, the anatomy segmentation of the human ear image can be realized, and good segmentation precision can be ensured.
Drawings
Fig. 1 is a schematic flow chart of a human ear image anatomical segmentation method based on a semantic segmentation network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a portion of an image of a human ear according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a portion of a labeled human ear image provided in an embodiment of the present invention;
FIG. 4 is an original diagram of an image of a human ear according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a human ear image after being randomly flipped according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an adjusted brightness of an ear image according to an embodiment of the present invention;
fig. 7 is a schematic diagram of an ear image after affine transformation according to an embodiment of the present invention;
FIG. 8 is a schematic attention diagram provided in accordance with an embodiment of the present invention;
FIG. 9 is a schematic diagram of a spatial convolution pooling pyramid according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of an improved U-Net network structure provided by an embodiment of the present invention;
FIG. 11 is a diagram illustrating a segmentation result of a human ear image according to an embodiment of the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings by way of examples.
As shown in fig. 1, an embodiment of the present invention provides a human ear image anatomical segmentation method based on a semantic segmentation network, including:
S101: the data set used is the USTB-Helloear database, collected and established by the ear recognition laboratory of the University of Science and Technology Beijing together with Xian woodworker Electronic Science and Technology Co., Ltd. It contains 612661 ear images from 1570 individuals, covering different angles, illumination changes and varying degrees of natural occlusion. 200 left-ear images from different individuals are selected as training data. Using labelme as the annotation tool, the eight regions of each image (helix, periauricular region, fossa trigonalis, antihelix, concha, antitragus, tragus and earlobe) are labeled. A human ear segmentation data set is built and divided into three parts, namely a training set, a validation set and a test set; a series of preprocessing operations including data enhancement is then applied to the training set:
s102, improving the U-Net network according to a preset method, and training a semantic segmentation model for the preprocessed human ear images by using the improved U-Net network;
s103, detecting a test data set based on a semantic segmentation model obtained by improved U-Net network training, and dividing the human ear image into 8 regions according to anatomical features.
And S104, performing effectiveness evaluation and verification on the improved U-Net network model by adopting an mIou evaluation mode.
As shown in fig. 2, 200 left-ear images from different individuals are used as training data. Using labelme as the annotation tool, the eight regions of each image (helix, periauricular region, fossa trigonalis, antihelix, concha, antitragus, tragus and earlobe) are annotated, and a human ear segmentation data set is built and divided into three parts, namely a training set, a validation set and a test set, specifically as follows:
Finally, of the 200 left-ear pictures, 160 are selected as the training set, 20 as the validation set and 20 as the test set; the labeling results are shown in fig. 3.
in this embodiment, the training set preprocessing operation is mainly data enhancement, and the specific implementation process is as follows:
acquiring a human ear image, and performing data enhancement operation on the human ear image;
the data enhancement mode in this embodiment includes: random overturning, brightness adjustment and affine transformation.
In this embodiment, multiple data enhancement modes such as random flipping, brightness adjustment and affine transformation may be used to perform data enhancement processing on the ear images, so as to increase data under different conditions and give the model better generalization and robustness.
In this embodiment, in order to better understand the data enhancement method, the following description is made:
and (4) random overturning: including horizontal flipping, vertical flipping, and diagonal flipping. Turning an n-by-n two-dimensional matrix by 180 degrees left and right, rotating the matrix by 180 degrees up and down and rotating the matrix by 180 degrees clockwise respectively;
adjusting the brightness: in the HSV color space of an image, the saturation, brightness, and contrast are randomly changed. Wherein the brightness is the brightness of the image; saturation refers to how much of the image color category; the contrast is the difference between the maximum gray level and the minimum gray level of the image;
affine transformation: a linear transformation from two-dimensional coordinates to two-dimensional coordinates is realized through a series of atomic transformation composites, and the linear transformation method specifically comprises the following steps: translation, zooming, rotation, and flipping.
Specifically, in this embodiment, fig. 4 is an original human ear image, which is flipped horizontally, vertically and diagonally; the flipping results are shown in fig. 5.
In this embodiment, the original image is data-enhanced by brightness adjustment, using lightness and saturation adjustment in HSL space. Brightness adjustment increases or decreases the intensity of the pixels as a whole, while saturation changes the color vividness between the maximum and minimum gray levels of the image, making the image look more vivid and widening the display range in a given region.
Specifically, the digital image is normalized and converted to floating point, and the color space is converted from BGR to HLS, whose three channels are hue, lightness and saturation; then a linear change is applied to the lightness and the saturation respectively, and two slider bars are created so that each can be adjusted manually. Fig. 6 is a schematic diagram of the human ear after manual adjustment; as can be seen from the figure, brightness adjustment changes the luminance and imaging style of the image, and L and S values suitable for the human ear region can be selected.
In this embodiment, fig. 7 shows the result of affine transformation of the human ear image. The original image is affine transformed, that is, simultaneously translated, scaled and rotated, which requires an M matrix that OpenCV can solve automatically from the correspondence of three points before and after the transformation. The positions given in this example are [[0,0], [cols-1,0], [0,rows-1]] before and [[cols*0.2, rows*0.1], [cols*0.9, rows*0.2], [cols*0.1, rows*0.9]] after, where cols and rows respectively represent the width and height of the input image, both of size 1024.
In this embodiment, after a series of preprocessing operations are performed on the training set, the implementation process of S102 is as follows:
and improving the U-Net network according to a preset method, and training an ear segmentation model for the pre-processed ear image by using the improved U-Net network.
Specifically, this step performs human ear segmentation with the improved U-Net network. The U-Net network as originally proposed is a typical end-to-end encoder-decoder structure; it supports training with a small amount of data and segments quickly. The encoder part consists of 4 encoding blocks, each containing two repeated 3×3 convolutions, each convolution followed by a ReLU function, and each block followed by a max-pooling operation with stride 2. The network uses these modules to extract the features of the image.
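Assuming 'same' padding (as the embodiment's resolution-recovering decoder implies), each 3×3 convolution preserves the spatial size while each stride-2 max pooling halves it, so the feature-map side lengths through the encoder can be sketched as follows; input size 1024 follows the embodiment, and the helper is illustrative:

```python
def encoder_side_lengths(size, blocks=4):
    """Side length of the feature map after each encoder block.

    Two 'same'-padded 3x3 convolutions keep the size; the 2x2 max pooling
    with stride 2 that ends each block halves it.
    """
    sizes = [size]
    for _ in range(blocks):
        size //= 2
        sizes.append(size)
    return sizes
```

Starting from a 1024×1024 input, the four blocks produce side lengths 512, 256, 128 and 64, the last of which feeds the atrous convolution pyramid.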
As shown in fig. 8, the first improvement of this embodiment is to add an attention mechanism: attention gates are used in the decoder part of the U-Net network, and the features extracted by the encoder pass through an attention gate before entering the decoder. This readjusts the output features of the encoder and suppresses feature responses in irrelevant background regions, without training multiple models or a large number of additional model parameters.
Specifically, the decoder matrix and the encoder matrix are each multiplied by a weight matrix before entering the next stage; through back-propagation the weight matrix continuously learns the importance of the decoder and encoder matrices according to our target. The input elements are the matrices to be added during encoding, which are scaled by the attention coefficients computed in the attention gate (AG). Activation and context information provided by the pooled matrices are needed to select spatial regions, and grid resampling of the attention coefficients is done using trilinear interpolation.
The attention mechanism is adopted in order to capture a sufficiently large receptive field, and thus semantic context information, to enhance useful information, suppress useless information and improve segmentation accuracy.
The second improvement to the U-Net network in this embodiment is the spatial convolution pooling pyramid, as shown in fig. 9. The pyramid is constructed in the U-Net network and performs parallel sampling with atrous convolutions at different sampling rates, introducing more context information and enlarging the receptive field.
Specifically, in this embodiment, the spatial convolution pooling pyramid has 5 layers in total: one 1×1 ordinary convolution, three 3×3 atrous convolutions with sampling rates of 1, 6 and 12, and a pooling layer. After feature extraction by the encoder network, the human ear features are fed into the atrous convolution pyramid module as input; the five branches are then concatenated and passed through a 1×1 convolution to obtain the same size as the input features, realizing fusion of shallow and deep features.
As shown in fig. 10, the improved U-Net network structure in this embodiment is mainly divided into four parts: encoder, atrous convolution pyramid, attention mechanism and decoder. To obtain the characteristics of the human ear image, an atrous convolution pyramid is constructed between the encoder and the decoder, introducing more context information through global pooling and feature fusion and increasing the receptive field.
Specifically, in this embodiment, when a human ear image is input, basic feature extraction is first performed by the encoder structure; this process goes through 4 downsampling steps to obtain higher-level feature maps. The output features are then fed into the atrous convolution feature pyramid, which acquires more context information through global pooling and feature fusion and produces a high-level feature mapping of the original image; the feature-map resolution is not changed in this process. Finally, the decoder upsamples the feature maps 4 times to gradually recover the original image resolution. An attention mechanism is added at each upsampling step: the input of each encoder stage is connected to the output of the corresponding decoder stage through a weight matrix, which increases the weight of useful information, suppresses useless information, recovers the spatial information lost during downsampling, reduces the decoder parameters and makes the network more efficient. Each decoder stage is then taken as input to a 3×3 convolution, followed by bilinear upsampling and a sigmoid function, to obtain the side output of each layer's feature map.
Specifically, the training process of the human ear image anatomy segmentation model is as follows:
and training the improved U-Net network according to the steps, and repeatedly circulating until the loss is not converged to obtain the optimal training weight.
Further, in this embodiment, the test data set is detected based on a semantic segmentation model obtained by training the ear image by the improved U-Net network, and the detection result is shown in fig. 11, where the specific detection process is as follows:
building detection is carried out on the large airplane image by the improved U-Net network and various network structures according to the steps, the detection result is qualitatively compared with the MIOU index, and the comparison result is shown in Table 1
TABLE 1 comparison of MIOU indicators for test results
In Table 1, epoch is the number of passes over all samples in the training set, i.e. the number of training epochs required for each model to reach its optimal weights.
In Table 1 above, compared with the U-Net network, the U-Net + ASPP network and the U-Net + Attention network, the MIOU evaluation index obtained in this example (U-Net + Attention + ASPP) is improved by 32.44%, 21.35% and 2.65%, respectively. This comparison further shows that the proposed network structure is effective for the problem of segmenting the anatomical features of the human ear image: anatomical segmentation of the ear image is achieved with good segmentation precision.
In summary, on the basis of the human ear image information, human ear images are first acquired and annotated with the labelme tool to produce a data set divided into three parts: a training set, a validation set and a test set; a series of preprocessing operations, including data enhancement, is then applied to the training set; the U-Net network is improved according to the preset method, and a semantic segmentation model is trained on the preprocessed human ear images with the improved U-Net network; finally, the test data set is detected based on the semantic segmentation model obtained from the improved U-Net training, realizing anatomical segmentation of the human ear image with good segmentation precision.
It will be appreciated by those of ordinary skill in the art that the examples described herein are intended to assist the reader in understanding the manner in which the invention is practiced, and it is to be understood that the scope of the invention is not limited to such specifically recited statements and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.
Claims (10)
1. A human ear image anatomical segmentation method based on a semantic segmentation network, characterized by comprising the following steps:
S101: selecting 200 left-ear images from different individuals in the USTB-Helloear database as training data; labeling eight regions of each image (helix, periauricular region, triangular fossa, antihelix, concha, antitragus, tragus and earlobe) using labelme as the annotation tool; producing a human ear segmentation data set divided into a training set, a validation set and a test set; then carrying out a series of preprocessing operations on the training set of the data set, the preprocessing including data enhancement, where the data enhancement comprises: random flipping, brightness adjustment and affine transformation;
S102: improving the U-Net network according to a preset method, which comprises: adopting a method of adding an attention mechanism, where an attention mechanism newly added to the decoder part of the encoder-decoder symmetric structure of the U-Net network serves as the feature extraction network; and constructing a spatial pyramid of atrous convolution and pooling in the U-Net network, performing parallel sampling with atrous convolutions at different sampling rates, thereby introducing more context information and enlarging the receptive field;
s103: training a semantic segmentation model for the preprocessed human ear images by using an improved U-Net network;
S104: detecting the test data set based on the semantic segmentation model, and dividing the human ear image into eight regions according to its anatomical features;
S105: evaluating and verifying the effectiveness of the improved U-Net network model on the detection results using the mIoU evaluation metric.
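The data-set split of S101 (160/20/20, as claim 2 below specifies) can be sketched as follows; the file names and fixed seed are illustrative assumptions, not part of the USTB-Helloear database.

```python
# Hypothetical sketch of the 160/20/20 train/validation/test split; the file
# names below are placeholders, not actual USTB-Helloear entries.
import random

def split_dataset(images, n_train=160, n_val=20, n_test=20, seed=0):
    assert len(images) == n_train + n_val + n_test
    rng = random.Random(seed)      # fixed seed for a reproducible split
    shuffled = images[:]
    rng.shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

images = [f"left_ear_{i:03d}.png" for i in range(200)]
train, val, test = split_dataset(images)
```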
2. The human ear image anatomical segmentation method based on the semantic segmentation network as claimed in claim 1, wherein of the 200 left-ear images, 160 are selected as the training set, 20 as the validation set, and 20 as the test set.
3. The human ear image anatomical segmentation method based on the semantic segmentation network as claimed in claim 1, wherein the random flipping comprises horizontal flipping, vertical flipping and diagonal flipping: the n x n two-dimensional image matrix is respectively mirrored left-to-right, mirrored up-and-down, and rotated 180 degrees clockwise.
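The three flips named in this claim can be illustrated on a plain nested list as follows (a toy sketch; the real pipeline would operate on image arrays):

```python
# Illustrative sketch of the three flips in claim 3 on a nested list:
# horizontal (mirror each row), vertical (reverse the row order), and
# diagonal (180-degree rotation, i.e. both mirrors combined).

def flip_horizontal(img):
    return [row[::-1] for row in img]

def flip_vertical(img):
    return img[::-1]

def flip_diagonal(img):
    # rotating 180 degrees equals a horizontal flip followed by a vertical flip
    return flip_vertical(flip_horizontal(img))

img = [[1, 2],
       [3, 4]]
```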
4. The human ear image anatomical segmentation method based on the semantic segmentation network as claimed in claim 1, wherein the brightness adjustment is: randomly changing the saturation, lightness and contrast of the image in its color space; here brightness is the luminance of the image, saturation refers to the purity of the image colors, and contrast is the difference between the maximum and minimum gray levels of the image;
first the digital image is normalized and converted to floating point, and the color space is converted from BGR to HLS, whose three channels are Hue, Lightness and Saturation; the lightness and saturation are then each processed with a linear transformation, and two slider bars are created to adjust the lightness and saturation manually and respectively.
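The linear lightness/saturation change described above can be sketched as follows; the `alpha`/`beta` values stand in for the two slider positions and are illustrative assumptions:

```python
# Minimal sketch (assumed, not the patent's code) of the linear channel change
# in claim 4: after normalizing to float HLS values in [0, 1], each channel
# value v is rescaled as alpha * v + beta and clipped back into range.

def linear_adjust(channel, alpha=1.2, beta=0.05):
    """Apply v -> alpha*v + beta to a flat list of channel values, clipped to [0, 1]."""
    return [min(1.0, max(0.0, alpha * v + beta)) for v in channel]

lightness = [0.0, 0.5, 0.9]
adjusted = linear_adjust(lightness)   # a brighter image overall
```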
5. The human ear image anatomical segmentation method based on the semantic segmentation network as claimed in claim 1, wherein the affine transformation is a linear transformation from two-dimensional coordinates to two-dimensional coordinates, realized by compositing a series of atomic transformations, specifically: translation, scaling, rotation and flipping; to translate, scale and rotate the image simultaneously, a matrix M is required, which is solved automatically with the correspondence functionality provided by OpenCV from three points before and after the transformation.
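What OpenCV's `cv2.getAffineTransform` computes from the three point correspondences can be sketched in plain Python as follows; the point values are illustrative, and Cramer's rule is used here only to make the underlying linear system visible:

```python
# Hedged sketch of the 2x3 affine matrix M that maps three source points onto
# three destination points (the same quantity cv2.getAffineTransform returns),
# solved with Cramer's rule so no library is needed.

def get_affine_transform(src, dst):
    """src, dst: three (x, y) point pairs; returns M as two rows [a, b, c]."""
    (x0, y0), (x1, y1), (x2, y2) = src
    det = x0 * (y1 - y2) - y0 * (x1 - x2) + (x1 * y2 - x2 * y1)
    rows = []
    for coord in range(2):                       # solve a*x + b*y + c = u for each axis
        u0, u1, u2 = dst[0][coord], dst[1][coord], dst[2][coord]
        a = (u0 * (y1 - y2) - y0 * (u1 - u2) + (u1 * y2 - u2 * y1)) / det
        b = (x0 * (u1 - u2) - u0 * (x1 - x2) + (x1 * u2 - x2 * u1)) / det
        c = (x0 * (y1 * u2 - y2 * u1) - y0 * (x1 * u2 - x2 * u1)
             + u0 * (x1 * y2 - x2 * y1)) / det
        rows.append([a, b, c])
    return rows

def apply_affine(M, pt):
    x, y = pt
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])

src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(1.0, 1.0), (3.0, 1.0), (1.0, 2.0)]   # translate by (1, 1), scale x by 2
M = get_affine_transform(src, dst)
```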
6. The human ear image anatomical segmentation method based on the semantic segmentation network as claimed in claim 1, wherein the method of adding the attention mechanism adopted in step S102 is specifically: Attention Gates (AG) are used in the decoder part of the U-Net network; the features extracted by the encoder pass through an attention gate before entering the decoder, re-weighting the encoder output; this requires neither training multiple models nor a large number of additional model parameters, and suppresses the feature responses of irrelevant background regions;
specifically, the decoder matrix and the encoder matrix are each multiplied by a weight matrix before entering the next stage; through back-propagation the weight matrix continuously learns, according to the target, the importance of the decoder and encoder matrices; the input element is the matrix to be added in the encoding path, scaled by the attention coefficients computed in the AG; activation and context information provided by the pooled matrices are required to select spatial regions, and grid resampling of the attention coefficients is done with trilinear interpolation.
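The additive attention gating described in this claim can be sketched on toy vectors as follows; the weights and feature values are illustrative assumptions, not trained parameters, and a real AG operates on feature maps rather than flat vectors:

```python
# Toy sketch of an additive attention gate: the gating signal g (from the
# decoder) and the skip feature x (from the encoder) are linearly mapped,
# summed, passed through ReLU, projected to a scalar, and squashed with a
# sigmoid to give an attention coefficient in (0, 1) that rescales x.
import math

def attention_gate(x, g, w_x, w_g, psi):
    """x, g: feature vectors; w_x, w_g, psi: per-element weights (same length)."""
    hidden = [max(0.0, wx * xi + wg * gi)            # ReLU(W_x x + W_g g)
              for xi, gi, wx, wg in zip(x, g, w_x, w_g)]
    score = sum(p * h for p, h in zip(psi, hidden))  # psi^T hidden
    alpha = 1.0 / (1.0 + math.exp(-score))           # sigmoid -> (0, 1)
    return alpha, [alpha * xi for xi in x]           # rescaled skip feature

x = [1.0, 2.0, 0.5]
g = [0.5, 0.1, 1.0]
alpha, gated = attention_gate(x, g, w_x=[0.3, 0.3, 0.3],
                              w_g=[0.2, 0.2, 0.2], psi=[0.5, 0.5, 0.5])
```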
7. The human ear image anatomical segmentation method based on the semantic segmentation network as claimed in claim 6, wherein in S102 the spatial pyramid of atrous convolution and pooling is constructed in the U-Net network as follows:
the pyramid has 5 branches in total: one 1 x 1 ordinary convolution, three 3 x 3 atrous convolutions with sampling rates of 1, 6 and 12 respectively, and a pooling layer; after the human ear image has passed through the encoder network for feature extraction, the result is fed as input into the atrous convolution pyramid module; the five branch outputs are concatenated and passed through a 1 x 1 convolution to obtain the same size as the input features, realizing the fusion of shallow and deep features.
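The atrous-convolution idea behind the pyramid can be illustrated in one dimension as follows: a 3-tap kernel at dilation rate r covers 2r + 1 input positions without extra parameters. This is a toy sketch, not the patent's 2-D implementation:

```python
# Minimal sketch of atrous (dilated) convolution in 1-D: the kernel taps are
# spaced `rate` positions apart, enlarging the receptive field to 2*rate + 1
# for a 3-tap kernel, with zero padding outside the signal.

def dilated_conv1d(signal, kernel, rate):
    """'Same'-size dilated convolution with implicit zero padding."""
    k = len(kernel)
    center = (k - 1) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * rate
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out

signal = [0.0] * 13
signal[6] = 1.0                           # unit impulse in the middle
resp = dilated_conv1d(signal, [1.0, 1.0, 1.0], rate=6)
# the impulse influences positions 0, 6 and 12: receptive field 2*6 + 1 = 13
```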
8. The human ear image anatomical segmentation method based on the semantic segmentation network as claimed in claim 7, wherein the improved U-Net network structure comprises: an encoder, an atrous convolution pyramid, an attention mechanism and a decoder; in order to capture the features of the human ear image, the atrous convolution pyramid is constructed between the encoder and the decoder, introducing more context information through global pooling and feature fusion and enlarging the receptive field;
when the human ear image is input, basic feature extraction is first performed on it by the encoder; this stage applies 4 downsampling operations in total to obtain higher-level feature maps; the encoder output is then fed into the atrous convolution feature pyramid, which acquires more context information through global pooling and feature fusion and produces a high-level feature mapping of the original image, the resolution of the feature map being unchanged during this stage; finally, the decoder performs 4 upsampling operations on the feature map to gradually restore it to the resolution of the original image; an attention mechanism is added at each upsampling step: the input of each encoder stage is connected to the output of the corresponding decoder stage through a weight matrix, which increases the weight of useful information, suppresses useless information and recovers the spatial information lost during downsampling, while also reducing the decoder parameters and making the network more efficient; in addition, each decoder stage is taken as the input of a 3 x 3 convolution, followed by bilinear upsampling and a sigmoid function, to obtain the side output of each layer's feature map.
9. The human ear image anatomical segmentation method based on the semantic segmentation network as claimed in claim 1, wherein in S103 the semantic segmentation model is trained with the improved U-Net network, specifically comprising the following steps:
Step 1: the training batch is set to 100 and the maximum number of iterations to 32; the learning rate is adjusted with a cosine annealing strategy, with an initial learning rate of 0.1 and a lower bound of 0.001; the weight decay and momentum are set to 1e-4 and 0.9, respectively;
Step 2: the data set is expanded by conventional means, including random flipping, brightness adjustment and affine transformation, and then input iteratively;
Step 3: for each batch of input data, the feature mapping of the image is extracted in the network structure through the encoder and the atrous convolution pyramid module, and passed together with the attention gate into the decoder to obtain the region class of each pixel;
Step 4: the error between the prediction result and the label is calculated; when the error is larger than the expected value, it is back-propagated layer by layer toward the input layer, and the error of each neuron in the network is computed; when the error is less than or equal to the expected value, training ends;
Step 5: the weights are updated according to the computed error, where SGD performs a gradient update for each sample and the cosine annealing strategy reduces the learning rate, so that the learning rate shrinks as the loss gradually approaches its minimum and the model does not overshoot the optimum; the process then returns to the iterative data input of step 2 and repeats;
Step 6: the improved U-Net network is trained according to steps 1 to 5, and the cycle is repeated until the loss converges, so as to obtain the optimal training weights.
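Assuming the standard cosine annealing formula lr(t) = lr_min + (lr_max - lr_min)(1 + cos(pi t / T)) / 2 with the values stated in step 1 (initial rate 0.1, lower bound 0.001), the schedule can be sketched as:

```python
# Hedged sketch of the cosine annealing schedule from step 1 of claim 9; the
# formula is the standard one, assumed (not quoted) from the patent text.
import math

def cosine_annealing_lr(t, t_max, lr_max=0.1, lr_min=0.001):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / t_max))

# the rate starts at lr_max, ends at lr_min, and passes through the midpoint
start = cosine_annealing_lr(0, 100)    # 0.1
mid = cosine_annealing_lr(50, 100)     # 0.0505
end = cosine_annealing_lr(100, 100)    # 0.001
```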
10. The human ear image anatomical segmentation method based on the semantic segmentation network as claimed in claim 9, wherein in S104 the test data set is detected based on the semantic segmentation model, specifically comprising the following steps:
Step 1: the optimal weight combination from the training process is read to initialize the network structure;
Step 2: the data set is input iteratively;
Step 3: for each batch of input data, the feature mapping of the image is extracted in the network structure through the encoder and the spatial pyramid convolution module, and passed together with the attention gate into the decoder to obtain the region class of each pixel;
Step 4: the IoU between the prediction result and the label is calculated for each sample; all IoU values are accumulated and averaged to obtain the mIoU, which serves as the performance evaluation index of the model on the test set;
Step 5: whether all test-set samples have been predicted is checked; if not, the process returns to step 2; otherwise, the prediction process ends.
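The IoU/mIoU computation of step 4 can be sketched as follows on flat per-pixel label lists; the tiny example uses three class ids, whereas the actual model evaluates the eight ear regions:

```python
# Illustrative sketch of per-class IoU and mean IoU (mIoU) on flat lists of
# per-pixel class labels; real evaluation would use 2-D segmentation masks.

def iou_per_class(pred, label, cls):
    inter = sum(1 for p, l in zip(pred, label) if p == cls and l == cls)
    union = sum(1 for p, l in zip(pred, label) if p == cls or l == cls)
    return inter / union if union else None      # class absent from both

def mean_iou(pred, label, num_classes):
    ious = [iou_per_class(pred, label, c) for c in range(num_classes)]
    ious = [v for v in ious if v is not None]
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 2, 2]
label = [0, 1, 1, 1, 2, 0]
# class 0: IoU 1/3; class 1: IoU 2/3; class 2: IoU 1/2 -> mIoU 0.5
```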
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210179495.3A CN114549555A (en) | 2022-02-25 | 2022-02-25 | Human ear image planning and division method based on semantic division network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114549555A true CN114549555A (en) | 2022-05-27 |
Family
ID=81679369
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210179495.3A Pending CN114549555A (en) | 2022-02-25 | 2022-02-25 | Human ear image planning and division method based on semantic division network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114549555A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115273044A (en) * | 2022-07-15 | 2022-11-01 | 哈尔滨市科佳通用机电股份有限公司 | Vehicle door damage fault identification and detection method based on improved graph convolution network |
CN115641441A (en) * | 2022-11-16 | 2023-01-24 | 中国科学院国家空间科学中心 | Method for detecting maximum value of soft X-ray photon number of magnetic layer system |
CN116452813A (en) * | 2023-06-14 | 2023-07-18 | 泉州装备制造研究所 | Image processing method, system, equipment and medium based on space and semantic information |
CN116486230A (en) * | 2023-04-21 | 2023-07-25 | 哈尔滨工业大学(威海) | Image detection method based on semi-recursion characteristic pyramid structure and storage medium |
CN118628840A (en) * | 2024-08-12 | 2024-09-10 | 杭州医尔睿信息技术有限公司 | Human body meridian point position visualization method and device based on AI image recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||