
CN116843715B - Multi-view collaborative image segmentation method and system based on deep learning - Google Patents

Multi-view collaborative image segmentation method and system based on deep learning

Info

Publication number
CN116843715B
CN116843715B (application CN202310742724.2A)
Authority
CN
China
Prior art keywords
view
feature
features
image
segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310742724.2A
Other languages
Chinese (zh)
Other versions
CN116843715A (en)
Inventor
石霏
周健
夏文涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University filed Critical Suzhou University
Priority to CN202310742724.2A priority Critical patent/CN116843715B/en
Publication of CN116843715A publication Critical patent/CN116843715A/en
Application granted granted Critical
Publication of CN116843715B publication Critical patent/CN116843715B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/194Segmentation; Edge detection involving foreground-background segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-view collaborative image segmentation method and system based on deep learning. The method comprises the following steps: step S1: acquiring a multi-view image; step S2: inputting the multi-view image into a deep learning model, wherein the deep learning model realizes image segmentation of the same target in the multi-view images through the consistency of their intrinsic features. The deep learning model is provided with an MVFI module, and the MVFI module is used for realizing feature interaction among the multi-view images. The deep learning model constructed by the invention can fully utilize the feature consistency among multiple views to optimize the view features, thereby improving the segmentation accuracy.

Description

Multi-view collaborative image segmentation method and system based on deep learning
Technical Field
The invention relates to the technical field of image collaborative segmentation, in particular to a multi-view collaborative image segmentation method and system based on deep learning.
Background
Image segmentation is an important research direction in the field of computer vision; its task is to assign each pixel in an image to the category it belongs to, so that a computer can better understand and process the image information. Collaborative segmentation refers to segmenting objects of a common category contained in a set of images, and multi-view collaborative segmentation simultaneously segments images of different perspectives that contain the same object. Multi-view collaborative segmentation was proposed to solve the problem that single-view segmentation in complex scenes is susceptible to noise and therefore inaccurate. By acquiring image information from different viewing angles, multi-view collaborative segmentation improves both the segmentation accuracy and the multi-view consistency.
Conventional multi-view collaborative segmentation algorithms typically extract hand-crafted features from an image, such as color histograms, Gabor filter responses and SIFT descriptors, then build a matching model based on the extracted features, and finally classify each pixel using a classification model. In recent years, multi-view collaborative segmentation algorithms based on deep learning have gradually developed; these methods generally adopt a convolutional neural network (CNN) to extract features from the input multi-view images, then perform feature interaction among the views, and finally obtain the multi-view collaborative segmentation result through feature decoding. Other deep-learning-based multi-view collaborative segmentation algorithms integrate the manual segmentation results of a small number of views into the model as prior information, so that the model learns to optimize the segmentation consistency among the multiple views.
The prior art has the following defects:
Multi-view collaborative segmentation techniques based on traditional image processing methods have certain limitations. They perform well in specific scenes with clear features, but perform poorly in complex scenes where features are harder to extract, mainly because traditional image segmentation algorithms are strongly affected by noise and manually designed features are not suited to complex scenes. In addition, multi-view collaborative segmentation based on traditional methods cannot fully utilize the multi-view information, so the segmentation accuracy is not high.
Multi-view collaborative segmentation techniques based on deep learning extract features well, but their accuracy depends heavily on the structural design and parameter settings of the model. In the prior art, simple multi-view feature interaction operations often fail to effectively improve collaborative segmentation performance, while designing complex models introduces a large amount of computation and a large number of parameters.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defect of the prior art that the inherent feature-consistency information among multiple views cannot be fully utilized, which leads to low collaborative segmentation accuracy.
In order to solve the technical problems, the invention provides a multi-view collaborative image segmentation method based on deep learning, which comprises the following steps:
Step S1: acquiring a multi-view image;
Step S2: inputting the multi-view image into a deep learning model, wherein the deep learning model realizes image segmentation of the same target in the multi-view image through the consistency of the internal features of the multi-view image;
The deep learning model is provided with an MVFI module, and the MVFI module is used for realizing feature interaction among the multi-view images.
In one embodiment of the present invention, in step S2 the multi-view image is input into the deep learning model, and the deep learning model performs image segmentation of the same target in the multi-view images through the consistency of their intrinsic features, wherein the method includes:
Let the three-channel multi-view image be I ∈ R^(B×3×N×H×W), where B is the batch size, N is the number of views, H is the image height, and W is the image width; after view merging, X ∈ R^(BN×3×H×W) is output, where view merging converts the view dimension into the batch dimension; X is input into an encoder to obtain depth features F ∈ R^(BN×c×h×w); view separation is performed on the depth features F to obtain F_i ∈ R^(B×c×h×w), i = 1, 2, ..., N, where view separation converts the batch dimension back into the view dimension, c is the feature dimension, h is the feature height, and w is the feature width;
The separated features F_i are input into the MVFI module to yield outputs F_i' ∈ R^(B×c×h×w), i = 1, 2, ..., N;
View merging is applied to F_i' to obtain the optimized feature output F' ∈ R^(BN×c×h×w); F' is input into a decoder to obtain the final output O ∈ R^(BN×1×H×W); finally, view separation is performed on O to obtain the segmentation result O_i ∈ R^(B×1×H×W) of each view in the batch, i = 1, 2, ..., N.
In one embodiment of the present invention, the MVFI module includes:
Obtaining N features F_i ∈ R^(B×c×h×w) through view separation, which correspond respectively to the views V_i, i = 1, 2, ..., N, in the multi-view image I ∈ R^(B×3×N×H×W); for the N features F_i, each feature accepts the consistency guidance of the remaining N-1 features.
In one embodiment of the present invention, the method by which each of the N features F_i accepts the consistency guidance of the remaining N-1 features includes:
Setting view V_i as the reference view, downsampling the prior segmentation result P_i of view V_i so that its size is consistent with F_i, and multiplying the downsampled result with F_i to mask off the background, thereby obtaining a pseudo-mask feature; performing global pooling on the pseudo-mask feature to obtain a feature pg_i ∈ R^(B×c×1×1) of length c representing the foreground target to be segmented, where pg denotes a pseudo global feature;
Performing a feature offset on pg_i with an LRL module to obtain a real class feature g_i ∈ R^(B×c×1×1); computing the cosine similarity between g_i and each pixel position of each of the remaining N-1 view features F_j to obtain a similarity matrix M;
Multiplying each obtained similarity matrix M with the corresponding view feature F_j after it has passed through a CBR module, fusing the multiplication result with the class feature g_i and reducing the dimension to obtain the final output f_ij, where f_ij denotes the consistency-guided output of the j-th view feature with the i-th view feature as the reference view; all consistency-guided output features form a matrix G; fusing the multiplication result with the class feature g_i and reducing the dimension to obtain the final output f_ij includes: concatenating the multiplication result with the class feature g_i, and applying a 1×1 convolution to the concatenated result to obtain the final output f_ij;
All features in each column of the matrix G are added and fused to give F_i' ∈ R^(B×c×h×w), i = 1, 2, ..., N.
In one embodiment of the present invention, the cosine similarity is calculated between g_i and each pixel position of the remaining N-1 view features F_j as follows:
M_ij = cos_similarity(g_i, F_j), i, j = 1, 2, ..., N, i ≠ j
where M_ij is the element in the i-th row and j-th column of the similarity matrix M, and cos_similarity(·,·) denotes the cosine similarity.
In one embodiment of the present invention, the expression of the matrix G is:
G = [f_ij], i, j = 1, 2, ..., N, i ≠ j
i.e., an N×N matrix whose (i, j)-th entry is the consistency-guided output f_ij of the j-th view feature with the i-th view as the reference view (the diagonal entries are empty).
In one embodiment of the invention, the expression of F_i' ∈ R^(B×c×h×w), i = 1, 2, ..., N is:
F_i' = Σ_{j≠i} f_ji, i = 1, 2, ..., N
i.e., the features in the i-th column of the matrix G are added and fused.
In one embodiment of the invention, the LRL module includes a linear layer, a ReLU activation layer, and a linear layer connected in sequence; the CBR module comprises a convolution layer, a normalization layer and a ReLU activation layer connected in sequence, as sketched below.
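A minimal PyTorch sketch of these two building blocks, under the assumption that both operate on a feature dimension c; the class names and layer widths are illustrative, not taken from the patent:

```python
import torch.nn as nn

class LRL(nn.Module):
    """Linear -> ReLU -> Linear, applied to a (B, c) class-feature vector."""
    def __init__(self, c):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(c, c), nn.ReLU(inplace=True), nn.Linear(c, c))

    def forward(self, x):
        return self.net(x)

class CBR(nn.Module):
    """Convolution -> BatchNorm -> ReLU, applied to a (B, c, h, w) feature map."""
    def __init__(self, c, kernel_size=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c, c, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm2d(c),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)
```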
In one embodiment of the invention, the encoder is a ResNet network for extracting features; the decoder consists of a 3×3 convolution, a BN layer, a ReLU activation function and an upsampling operation connected in sequence for feature recovery.
In order to solve the technical problems, the invention provides a multi-view collaborative image segmentation system based on deep learning, which comprises:
An acquisition module for acquiring a multi-view image;
A segmentation module for inputting the multi-view image into a deep learning model, wherein the deep learning model realizes image segmentation of the same target in the multi-view images through the consistency of their intrinsic features;
The deep learning model is provided with a MVFI module, and the MVFI module is used for realizing characteristic interaction among multi-view images.
Compared with the prior art, the technical scheme of the invention has the following advantages:
By constructing the MVCSNet network and adding the MVFI module to it, the invention fuses the single-view segmentation results as prior information, makes full use of the feature consistency among multiple views to optimize the view features, improves the robustness of feature reconstruction, and makes the subsequent image segmentation more accurate;
By performing view merging and view separation on the feature maps, the invention effectively reduces the amount of computation;
Experiments prove that the MVCSNet of the invention has good image segmentation performance and improves on existing segmentation methods; it is particularly suitable for the segmentation of multi-view skin scar images, achieves good segmentation performance, and lays a foundation for subsequent quantitative scar analysis.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings.
FIG. 1 is a schematic diagram of a MVCSNet network architecture in an embodiment of the present invention;
FIG. 2 is a schematic diagram of MVFI in an embodiment of the present invention;
FIG. 3 is a feature visualization of a strip scar passed through MVCSNet in an embodiment of the present invention;
FIG. 4 is a feature visualization of a sheet scar passed through MVCSNet in an embodiment of the present invention;
FIG. 5 is a graph showing the comparison of segmentation results corresponding to the different methods in Table 2 according to the embodiment of the present invention;
FIG. 6 is a graph showing the comparison of the segmentation results corresponding to the different methods in Table 3 according to the embodiment of the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
Example 1
The invention relates to a multi-view collaborative image segmentation method based on deep learning, which comprises the following steps:
Step S1: acquiring a multi-view image;
step S2: inputting the multi-view image into a deep learning model (MVCSNet), wherein the deep learning model realizes image segmentation of the same target in the multi-view image through consistency of internal features of the multi-view image;
The deep learning model is provided with a MVFI module, and the MVFI module is used for realizing characteristic interaction among multi-view images.
By constructing the MVCSNet network and adding the MVFI module to it, the invention fuses the single-view segmentation results as prior information, makes full use of the feature consistency among multiple views to optimize the view features, improves the robustness of feature reconstruction, and makes the subsequent image segmentation more accurate.
The present embodiment is described in detail below:
This embodiment provides a Multi-View Co-Segmentation Network (MVCSNet) for realizing multi-view collaborative segmentation based on deep learning. To take full advantage of the inherent feature consistency between multiple views, a Multi-view Feature Interaction (MVFI) module is inserted in the middle of MVCSNet. The MVCSNet network proposed in this embodiment is suitable for collaborative segmentation of multi-view images.
(1) Network structure
As shown in Fig. 1, MVCSNet is an end-to-end network that mainly consists of an encoder for extracting features, the MVFI module, and a decoder for feature recovery. The encoder is a ResNet34 pre-trained model, and the decoder comprises 3×3 convolutions, BN layers, ReLU activation functions and upsampling operations. Considering that too small a resolution of the depth features loses most of the detail information, the encoder part of MVCSNet removes the first max-pooling layer; since this noticeably increases the computational cost of the network, the fourth residual block of the encoder is discarded as a trade-off. MVCSNet therefore downsamples the original image three times, i.e., by a factor of 8, obtaining 64×64 depth features from a 512×512 image. The input of MVCSNet is the different views of a plurality of samples, and its output is the segmentation results of those views; the merging and separation of the view dimension and the batch dimension are realized through the view merging and view separation operations. In the MVFI module, the segmentation result of a pre-trained network is required as the segmentation prior.
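As a hedged illustration of the encoder/decoder arrangement just described, the sketch below modifies a torchvision ResNet34 by skipping the first max-pooling layer and dropping the fourth residual stage, and pairs it with a simple 3×3 convolution + BN + ReLU + upsampling decoder block; the torchvision usage, channel counts and class names are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class Encoder(nn.Module):
    """ResNet34 backbone without the first max-pooling layer and without layer4."""
    def __init__(self):
        super().__init__()
        net = resnet34(weights=None)  # in practice, load ImageNet weights here
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)          # stride 2, maxpool skipped
        self.stages = nn.Sequential(net.layer1, net.layer2, net.layer3)  # strides 1, 2, 2

    def forward(self, x):                    # x: (BN, 3, 512, 512)
        return self.stages(self.stem(x))     # (BN, 256, 64, 64), i.e. 8x downsampling

class DecoderBlock(nn.Module):
    """One decoder step: 3x3 conv -> BN -> ReLU -> 2x upsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        return self.block(x)
```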
(2) View merging and splitting
The present embodiment merges the view dimension (i.e., the number of images per sample) into the batch dimension (i.e., the number of samples) through view merging. Let the original three-channel image be I ∈ R^(B×3×N×H×W), where B is the batch size, N is the number of views, H is the image height, and W is the image width. After view merging, I becomes X ∈ R^(BN×3×H×W). X is input into the encoder to obtain depth features F ∈ R^(BN×c×h×w), which are then separated according to the view dimension to obtain F_i ∈ R^(B×c×h×w), i = 1, 2, ..., N. The separated features are input into the multi-view feature interaction module, producing outputs F_i' ∈ R^(B×c×h×w), i = 1, 2, ..., N. All the outputs are then view-merged to obtain the optimized feature output F' ∈ R^(BN×c×h×w). F' is input into the decoder to obtain the final output O ∈ R^(BN×1×H×W). Finally, view separation is performed on O, resulting in the segmentation result O_i ∈ R^(B×1×H×W), i = 1, 2, ..., N, for each view in the batch. The purpose of view merging and splitting is mainly to reduce computation. The multi-view feature interaction module needs the depth features of different views, and the input image dimensions are (B, 3, N, H, W); a two-dimensional convolutional neural network can only accept 4-dimensional tensors, so without view merging each view would have to be encoded independently, requiring N forward passes, whereas with view merging the depth features of all viewing angles are obtained in a single forward pass.
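A minimal sketch of the view merging and separation reshapes described above, using the (B, 3, N, H, W) notation of the text; the encoder call in the usage comment is a placeholder:

```python
import torch

def merge_views(images):                  # images: (B, 3, N, H, W)
    B, C, N, H, W = images.shape
    x = images.permute(0, 2, 1, 3, 4)     # (B, N, 3, H, W)
    return x.reshape(B * N, C, H, W)      # X: (BN, 3, H, W)

def separate_views(features, num_views):  # features: (BN, c, h, w)
    BN, c, h, w = features.shape
    f = features.reshape(BN // num_views, num_views, c, h, w)
    return [f[:, i] for i in range(num_views)]   # N tensors F_i of shape (B, c, h, w)

# Usage sketch: X = merge_views(I); F = encoder(X); F_1, ..., F_N = separate_views(F, N)
```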
(3) Multi-view feature interaction module
Based on the idea of mask averaging, the invention provides a multi-view feature interaction (MVFI) module, which fuses single-view segmentation results as prior information, makes full use of the feature consistency among multiple views to optimize the view features, and improves the robustness of feature reconstruction. The depth features obtained by the encoder are F ∈ R^(BN×c×h×w), and the N features F_i ∈ R^(B×c×h×w) obtained by view separation correspond to the views V_i, i = 1, 2, ..., N, respectively. For these N features, each feature needs to be optimized by the remaining N-1 features, i.e., it needs to accept the consistency guidance of the other N-1 features, which in this embodiment is measured with cosine similarity. The cosine similarity is calculated as shown in (Equation 1):
cos_similarity(F_x, F_y) = ( Σ_i F_x^i · F_y^i ) / ( √(Σ_i (F_x^i)²) · √(Σ_i (F_y^i)²) )   (Equation 1)
where F_x and F_y are feature maps of two arbitrary viewing angles, and F_x^i and F_y^i are respectively the values of their i-th points.
Thus, the MVFI module requires N×(N-1) similarity calculations in total, resulting in N×(N-1) optimized feature maps. Fig. 2 is a schematic diagram of the optimization with V_1 as the reference view. Taking V_i as the reference view, the prior segmentation result P_i of V_i is downsampled to the size of F_i and then multiplied with F_i to obtain a pseudo-mask feature. Global pooling of the pseudo-mask feature yields a feature pg_i ∈ R^(B×c×1×1) of length c, where pg denotes a pseudo global feature (it is "pseudo" because the prior segmentation result is a pseudo label). Since the multiplication of F_i with the prior segmentation result P_i masks off the background, pg_i is a feature representing the foreground object to be segmented, i.e., a class feature. Because pseudo labels are used and mis-segmentation may exist, only a pseudo class feature is obtained through masked global pooling. In this embodiment, the LRL module (i.e., a linear layer, a ReLU activation layer and a linear layer) performs a feature offset on pg_i, mapping pg_i to a more consistent distribution and thereby obtaining a real class feature g_i ∈ R^(B×c×1×1). The cosine similarity is then computed between g_i and each pixel position of each of the other N-1 view features F_j: in regions of high similarity the response of the similarity map is large, and in regions of low similarity the response is small. Let the similarity matrix be M, whose element in the i-th row and j-th column is calculated as shown in (Equation 2):
M_ij = cos_similarity(g_i, F_j), i, j = 1, 2, ..., N, i ≠ j   (Equation 2)
where M_ij is the element in the i-th row and j-th column of the similarity matrix M, and cos_similarity(·,·) denotes the cosine similarity.
Each obtained similarity matrix M is multiplied with the corresponding view feature F_j after it has passed through the CBR module (a convolution layer, a normalization layer and a ReLU activation layer), which is equivalent to enhancing the features of the high-similarity regions and suppressing the features of the low-similarity regions. Finally, the multiplication result and the class feature g_i are fused and the dimension is reduced to obtain the final output f_ij (that is, the multiplication result and the class feature g_i are concatenated, and a 1×1 convolution is applied to the concatenated result to obtain f_ij). f_ij denotes the consistency-guided output of the j-th view feature with the i-th view feature as the reference view. The matrix formed by all consistency-guided features is shown in (Equation 3):
G = [f_ij], i, j = 1, 2, ..., N, i ≠ j   (Equation 3)
The features in each column of the matrix G are added and fused, i.e., all the results of optimizing each view by the other views are fused, giving the optimized features F', calculated as shown in (Equation 4):
F_i' = Σ_{j≠i} f_ji, i = 1, 2, ..., N   (Equation 4)
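The following is a hedged PyTorch sketch of the MVFI computation described above (pseudo-mask pooling, LRL feature offset, per-pixel cosine similarity, CBR weighting, concatenation with g_i and 1×1 fusion, followed by column-wise fusion of G); the module layout, layer widths and the use of F.interpolate / F.cosine_similarity are implementation assumptions, not the patent's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MVFI(nn.Module):
    def __init__(self, c):
        super().__init__()
        # LRL and CBR blocks (see the earlier sketch), inlined so this module is self-contained
        self.lrl = nn.Sequential(nn.Linear(c, c), nn.ReLU(inplace=True), nn.Linear(c, c))
        self.cbr = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(2 * c, c, 1)   # 1x1 conv after concatenating with g_i

    def forward(self, feats, priors):
        # feats:  list of N tensors F_i, each (B, c, h, w)
        # priors: list of N prior masks P_i, each (B, 1, H, W), values in [0, 1]
        N = len(feats)
        B, c, h, w = feats[0].shape
        guided = [[None] * N for _ in range(N)]               # matrix G of f_ij
        for i in range(N):                                    # view i is the reference view
            p = F.interpolate(priors[i], size=(h, w), mode="bilinear", align_corners=False)
            pg = (feats[i] * p).mean(dim=(2, 3))              # pseudo global feature pg_i, (B, c)
            g = self.lrl(pg).view(B, c, 1, 1)                 # class feature g_i
            for j in range(N):
                if j == i:
                    continue
                sim = F.cosine_similarity(g, feats[j], dim=1).unsqueeze(1)  # (B, 1, h, w)
                weighted = self.cbr(feats[j]) * sim           # enhance high-similarity pixels
                cat = torch.cat([weighted, g.expand(-1, -1, h, w)], dim=1)
                guided[i][j] = self.fuse(cat)                 # f_ij, (B, c, h, w)
        # fuse each column of G: view i is optimized by every other reference view j
        return [sum(guided[j][i] for j in range(N) if j != i) for i in range(N)]
```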
(4) Loss function
The present embodiment uses a joint loss function L_seg composed of the binary Focal loss and the Dice loss, as shown in (Equation 5):
L_seg = L_focal + L_dice   (Equation 5)
The Focal loss function is shown in (Equation 6):
L_focal = -(1/n) Σ_i [ y_i (1 - ŷ_i)^γ log(ŷ_i) + (1 - y_i) ŷ_i^γ log(1 - ŷ_i) ]   (Equation 6)
where γ is an adjusting factor with γ > 0, and its value is 0.5 in the invention.
The Dice loss function measures the similarity between the network prediction and the manual annotation, as shown in (Equation 7):
L_dice = 1 - (2 Σ_i y_i ŷ_i + ε) / (Σ_i y_i + Σ_i ŷ_i + ε)   (Equation 7)
In (Equation 6) and (Equation 7), ŷ_i is the predicted value of the i-th pixel, y_i is the true value of the i-th pixel, n is the number of pixels, and ε is a smoothing factor that prevents the numerator or denominator from being zero, with a value of 10^-6.
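A minimal sketch of the joint loss L_seg = L_focal + L_dice for a single-channel sigmoid output, assuming the standard binary Focal and Dice forms with γ = 0.5 and ε = 10^-6 as stated above; the reduction over pixels is an implementation assumption:

```python
import torch

def joint_seg_loss(pred, target, gamma=0.5, eps=1e-6):
    # pred:   (B, 1, H, W) foreground probabilities after sigmoid
    # target: (B, 1, H, W) binary ground truth
    pred = pred.clamp(eps, 1.0 - eps)
    focal = -(target * (1 - pred) ** gamma * torch.log(pred)
              + (1 - target) * pred ** gamma * torch.log(1 - pred)).mean()
    inter = (pred * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    return focal + dice
```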
(5) Data augmentation based on view shuffling
During training, the present embodiment generates training data by randomly shuffling the view order, i.e., the order of the views is randomized each time the data of a sample is fed into the network. This can be regarded as a data augmentation scheme and helps improve the generalization performance of the network.
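A minimal sketch of the view-shuffling augmentation described above: each time a sample is drawn for training, its views (and their masks) are permuted randomly; the per-sample tensor layout (3, N, H, W) is an assumption following the notation in the text:

```python
import torch

def shuffle_views(images, masks):
    # images: (3, N, H, W), masks: (N, 1, H, W) for a single sample
    perm = torch.randperm(images.shape[1])
    return images[:, perm], masks[perm]
```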
The experimental analysis is specifically as follows:
In order to verify the effectiveness of the method of the invention, multi-view images of real skin scars were acquired.
1) Overview
Quantitative analysis of skin scars is helpful for injury identification, treatment evaluation, clinical research and the like. Skin scar image segmentation remains a current challenge due to the complexity of the scene in which the scar image is acquired. Scar quantification based on two-dimensional vision has a great limitation. Three-dimensional information can be introduced by using a multi-view scheme, so that accuracy of scar quantitative analysis is improved. Whereas multi-view based scar three-dimensional segmentation often depends on the accuracy and view consistency of the two-dimensional segmentation. Therefore, the multi-view collaborative segmentation of the skin scar image is very important, helping to accurately locate the scar in three-dimensional space.
2) Data set
The data used in this example are all clinical data from Shanghai market supervisor method identification sciences. 744 strip-shaped skin scar images are acquired by adopting a smart phone photographing mode, 130 scar samples are contained, 159 sheet-shaped skin scar images are contained, 18 scar samples are contained, and each sample has 3-10 different visual angles. All images come from the same acquisition device: hua is a smart phone HUAWEI Nova 7 with a resolution of 3456×4608. The gold standard of all images is manually noted by a person with relevant expertise. The resolution is uniformly adjusted to 512×512 for the input image of the network. And carrying out data amplification by adopting modes of online turning, rotation and the like. The sensitivity (Sen) was used as a segmentation evaluation index using the Dice Similarity Coefficient (DSC), cross-over ratio (IoU). The definition is as follows:
TP, FP, FN respectively represent true positive, false positive and false negative, scar area is used as positive sample, and background area is used as negative sample. TP represents that the true value is positive and the predicted value is positive. FP represents a true value as negative, while the predicted value is positive. FN represents true values as positive, but predicted values as negative.
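A small sketch of the three evaluation indices computed from the TP/FP/FN counts of a binary prediction, as defined above; thresholding at 0.5 and the smoothing term are implementation assumptions:

```python
import torch

def segmentation_metrics(pred, target, eps=1e-6):
    # pred, target: (H, W) or (B, 1, H, W); the scar region is the positive class
    p = (pred > 0.5).float()
    t = (target > 0.5).float()
    tp = (p * t).sum()
    fp = (p * (1 - t)).sum()
    fn = ((1 - p) * t).sum()
    dsc = (2 * tp + eps) / (2 * tp + fp + fn + eps)
    iou = (tp + eps) / (tp + fp + fn + eps)
    sen = (tp + eps) / (tp + fn + eps)
    return dsc.item(), iou.item(), sen.item()
```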
3) Results
The Baseline adopted by the invention is a U-shaped network obtained by removing the MVFI module from MVCSNet, i.e., a ResNet34 with the first max-pooling layer and the last-stage residual blocks removed is used as the encoder. Fully supervised training of the Baseline on the dataset yields a pre-trained model that provides the segmentation priors.
In order to examine the improvement brought by the view-shuffling augmentation scheme, two training strategies, fixed view order and shuffled view order, are compared on the strip scar dataset. Fixed view order means that each sample keeps the same view order in every training iteration. Table 1 shows the results of the comparative experiment; it can be seen that after data augmentation with view shuffling, IoU improves by 0.68%, DSC by 0.42% and Sen by 0.46%.
Table 1 Comparative experiment on view order vs. view shuffling
Network structure | View shuffling | IoU | DSC | Sen
Baseline+MVCSNet | No | 81.95±1.79 | 89.84±1.11 | 90.62±1.05
Baseline+MVCSNet | Yes | 82.63±1.90 | 90.26±1.18 | 91.08±1.11
In order to examine the effectiveness of the MVFI module and MVCSNet, an ablation experiment on the MVFI module and a comparative experiment between different multi-view collaborative segmentation methods were designed. Table 2 lists the relevant experiments of MVCSNet on the strip scar dataset, including the ablation experiments on the MVFI module and the single-view, multi-view and cross-sample comparison experiments. Here, single view means that the input data of each batch is randomly shuffled and may contain multiple samples, i.e., the conventional training pattern. Multi-view means that each network input consists of multiple views of the same sample. In the Baseline with prior segmentation, the prior segmentation result is directly fused to the input data by addition. Cross-sample means that the input to the MVFI module contains multiple samples, i.e., feature consistency is calculated between different samples.
Comparing the first and second rows, or the third and fourth rows, of Table 2, it can be seen that the prior segmentation information has a significant impact on network performance: without the prior information, the DSC of single-view segmentation is 1.84% lower and that of multi-view segmentation is 3.69% lower. Comparing the first and third rows of Table 2, without prior segmentation, the number of samples input to the network at a time is obviously smaller for multi-view input than for single-view input; even with the view-shuffling data enhancement, the data diversity is far less than in the single-view case, so the network trained with multi-view input performs worse than with single-view input. Comparing the second and fourth rows of Table 2, after the prior segmentation is added, the segmentation performance of single view is slightly higher than that of multi-view, which indicates that simply fusing the prior segmentation result for multi-view segmentation does not fully utilize the consistency information between views and cannot improve the network performance well. Comparing the second and fifth rows of Table 2, MVCSNet with cross-sample input does not outperform the Baseline with prior segmentation added, indicating that feature consistency optimization across samples does not improve network performance well. Comparing the fourth and sixth rows of Table 2, MVCSNet with multi-view input scores higher than the Baseline with multi-view input and fused prior segmentation results, which illustrates the effectiveness of the MVFI module: the network performance is improved by feature consistency optimization. Comparing the fifth and sixth rows of Table 2, MVCSNet with multi-view input is better than with cross-sample input, indicating that there is indeed a difference between the depth features of different samples and that feature consistency within the same sample is better, so feature interaction across multiple views of the same sample is more effective.
Table 3 lists the related experiments of MVCSNet on the sheet scar dataset; the experimental setup is the same as for the strip scars. As can be seen from Table 3, on the sheet scar data MVCSNet behaves essentially as it does on the strip scars. As can be seen from the first and second rows, or the third and fourth rows, of Table 3, the prior segmentation information has a significant impact on network performance. In particular, for multi-view input without prior segmentation the Baseline DSC is only 71.51%, 14.74% lower than that of single view. Comparing the second and fourth rows of Table 3, after the prior segmentation is added, the segmentation performance of single view is slightly lower than that of multi-view, which differs from the strip scar data but still shows that simply fusing the prior segmentation information without fully utilizing the consistency information between views cannot improve the network performance well. Comparing Tables 2 and 3, the improvement brought by the MVFI module is more significant on the small-sample sheet scar data than on the strip scar data.
Table 2 Ablation and comparison experiments of the MVFI module on the strip scar dataset
Method | IoU | DSC | Sen
Baseline (single view) | 79.32±2.26 | 88.01±1.70 | 89.08±1.00
Baseline (single view + prior segmentation) | 81.96±2.21 | 89.85±1.43 | 90.84±1.12
Baseline (multi-view) | 76.33±1.82 | 85.51±1.43 | 85.61±2.94
Baseline (multi-view + prior segmentation) | 81.02±2.15 | 89.20±1.40 | 90.02±1.18
MVCSNet (cross-sample + prior segmentation) | 80.80±2.10 | 89.11±1.40 | 89.47±2.00
MVCSNet (multi-view + prior segmentation) | 82.63±1.90 | 90.26±1.18 | 91.08±1.11
Table 3 Ablation and comparison experiments of the MVFI module on the sheet scar dataset
Method | IoU | DSC | Sen
Baseline (single view) | 76.96±10.65 | 86.25±7.09 | 91.45±2.14
Baseline (single view + prior segmentation) | 79.25±7.84 | 87.85±5.23 | 91.69±2.86
Baseline (multi-view) | 59.76±11.17 | 71.54±9.22 | 74.35±6.37
Baseline (multi-view + prior segmentation) | 79.60±7.95 | 88.07±5.26 | 91.55±2.56
MVCSNet (cross-sample + prior segmentation) | 78.76±7.68 | 87.54±5.14 | 91.35±2.51
MVCSNet (multi-view + prior segmentation) | 85.49±4.16 | 91.69±2.73 | 93.89±1.00
Fig. 3 is a visualization of strip scar features. The second row shows view features that are not optimized by the MVFI module, i.e., the 64×64-resolution features in the Baseline. In these unoptimized features, only the features near the two scale markers are distinct, while the features in the middle of the scar are weak; this situation is likely to cause missed regions in the segmentation result, i.e., a break in the middle of the segmented scar. The third row shows the view features optimized by the MVFI module, i.e., the 64×64-resolution features in MVCSNet. It can be seen that the less distinct regions of the second row, i.e., the features with lower pixel confidence, become quite distinct, and the separation between background and foreground increases. This illustrates the effectiveness of the feature similarity matrix for feature enhancement: parts with high similarity are enhanced and parts with low similarity are suppressed. Fig. 4 is a visualization of sheet scar features; it can be seen that MVFI has a significant effect both on background suppression and on enhancement of the foreground target. For the sheet scars, the edge information of the features is relatively blurred, so the confidence of the network segmentation result is not high. At the same feature level, the features that are not optimized by the MVFI module cannot accurately represent the scar region, whereas the features optimized in MVCSNet represent it more accurately.
Fig. 5 shows the segmentation results of the different methods in Table 2 on a strip scar image ((a) original image, (b) gold standard, (c) prior segmentation result, (d) segmentation result of Baseline (single view), (e) segmentation result of Baseline (single view + prior segmentation), (f) segmentation result of Baseline (multi-view), (g) segmentation result of Baseline (multi-view + prior segmentation), (h) segmentation result of MVCSNet (cross-sample + prior segmentation), (i) segmentation result of MVCSNet (multi-view + prior segmentation)). As can be seen from Fig. 5, the segmentation prior misses part of the scar in view 3 of this sample. Among the methods in Table 2, all methods other than the multi-view-based MVCSNet presented herein show more severe missed segmentation in view 3. Baseline single-view training gives the worst result, with missed regions appearing in views 1, 3 and 5. Comparing view 3 of methods (c), (d) and (e), Baseline single-view training, even with the segmentation prior fused in, does not improve much on the missed regions. Comparing view 3 of methods (c), (f) and (g), for Baseline multi-view training the location and extent of the missed region differ before and after the prior information is introduced, and the missed region of method (g) is larger than that of method (f); thus, simply fusing the prior information can cause the model to learn wrong information and even reduce its performance. Comparing methods (c) and (h), MVCSNet with cross-sample training improves the confidence of a portion of the pixels in the missed region, thereby improving on the segmentation prior, but does not really solve the problem. Comparing methods (c), (h) and (i), MVCSNet with multi-view training fully utilizes the feature consistency within the same sample and plays a remarkable role in enhancing hard-to-classify foreground pixels, i.e., regions with lower confidence. Therefore, MVCSNet based on multi-view training can better handle the multi-view collaborative segmentation task for strip scars.
Fig. 6 shows the segmentation results of the different methods in Table 3 on a sheet scar image ((a) original image, (b) gold standard, (c) prior segmentation result, (d) segmentation result of Baseline (single view), (e) segmentation result of Baseline (single view + prior segmentation), (f) segmentation result of Baseline (multi-view), (g) segmentation result of Baseline (multi-view + prior segmentation), (h) segmentation result of MVCSNet (cross-sample + prior segmentation), (i) segmentation result of MVCSNet (multi-view + prior segmentation)). As can be seen from Fig. 6, the prior segmentation result contains more mis-segmentation for this sample. Among the methods in Table 3, Baseline multi-view training performs the worst, with serious mis-segmentation appearing in all views, because the sheet scar dataset is small and, once multiple views are grouped per sample, the total number of training samples becomes even smaller, so the network performs poorly. Comparing (c), (d) and (e), Baseline single-view training gives a worse segmentation in view 4 after the segmentation prior is fused in. Comparing view 3 of methods (c), (f) and (g), for Baseline multi-view training the segmentation result is obviously improved after the prior information is fused, but some missed points still appear inside the scar in views 2 and 3. Comparing methods (c) and (h), MVCSNet with cross-sample training does reduce some mis-segmentation but introduces new mis-segmentation, which is evident in views 1 and 3. Comparing methods (c), (h) and (i), MVCSNet with multi-view training reduces both the mis-segmentation rate and the miss rate and comprehensively improves the performance of the model. Therefore, MVCSNet based on multi-view training can better handle the multi-view segmentation task for sheet scars.
To this end, a novel multi-view collaborative segmentation network (MVCSNet) suitable for multi-view images of skin scars has been implemented and validated. The multi-view collaborative image segmentation method based on deep learning solves the problem of inconsistent features among views in multi-view segmentation, and improves the robustness of feature reconstruction by letting each view accept the consistency optimization of the features of all other views. The invention was verified by comprehensive experiments on the collected 744 strip-shaped skin scar images and 159 sheet-shaped skin scar images, and the experimental results show that the method achieves better performance on the strip scar image data.
Example two
This embodiment provides a multi-view collaborative image segmentation system based on deep learning, comprising:
An acquisition module for acquiring a multi-view image;
A segmentation module for inputting the multi-view image into a deep learning model, wherein the deep learning model realizes image segmentation of the same target in the multi-view images through the consistency of their intrinsic features;
The deep learning model is provided with a MVFI module, and the MVFI module is used for realizing characteristic interaction among multi-view images.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It is apparent that the above examples are given by way of illustration only and are not limiting of the embodiments. Other variations and modifications of the present invention will be apparent to those of ordinary skill in the art in light of the foregoing description. It is not necessary here nor is it exhaustive of all embodiments. And obvious variations or modifications thereof are contemplated as falling within the scope of the present invention.

Claims (3)

1. A multi-view collaborative image segmentation method based on deep learning, characterized by comprising the following steps:
Step S1: acquiring a multi-view image;
Step S2: inputting the multi-view image into a deep learning model, wherein the deep learning model realizes image segmentation of the same target in the multi-view image through the consistency of the internal features of the multi-view image;
The deep learning model is provided with a MVFI module, and the MVFI module is used for realizing characteristic interaction among multi-view images;
In step S2, the multi-view image is input into the deep learning model, and the deep learning model performs image segmentation of the same target in the multi-view images through the consistency of their intrinsic features, wherein the method includes:
Let the three-channel multi-view image be I ∈ R^(B×3×N×H×W), where B is the batch size, N is the number of views, H is the image height, and W is the image width; after view merging, X ∈ R^(BN×3×H×W) is output, where view merging converts the view dimension into the batch dimension; X is input into an encoder to obtain depth features F ∈ R^(BN×c×h×w); view separation is performed on the depth features F to obtain F_i ∈ R^(B×c×h×w), i = 1, 2, ..., N, where view separation converts the batch dimension back into the view dimension, c is the feature dimension, h is the feature height, and w is the feature width;
The separated features F_i are input into the MVFI module to yield outputs F_i' ∈ R^(B×c×h×w), i = 1, 2, ..., N;
View merging is applied to F_i' to obtain the optimized feature output F' ∈ R^(BN×c×h×w); F' is input into a decoder to obtain the final output O ∈ R^(BN×1×H×W); finally, view separation is performed on O to obtain the segmentation result O_i ∈ R^(B×1×H×W) of each view in the batch, i = 1, 2, ..., N;
The MVFI module includes:
Obtaining N features F_i ∈ R^(B×c×h×w) through view separation, which correspond respectively to the views V_i, i = 1, 2, ..., N, in the multi-view image I ∈ R^(B×3×N×H×W); for the N features F_i, each feature accepts the consistency guidance of the remaining N-1 features;
The method by which each of the N features F_i accepts the consistency guidance of the remaining N-1 features includes:
Setting view V_i as the reference view, downsampling the prior segmentation result P_i of view V_i so that its size is consistent with F_i, and multiplying the downsampled result with F_i to mask off the background, thereby obtaining a pseudo-mask feature; performing global pooling on the pseudo-mask feature to obtain a feature pg_i ∈ R^(B×c×1×1) of length c representing the foreground target to be segmented, where pg denotes a pseudo global feature;
Performing a feature offset on pg_i with an LRL module to obtain a real class feature g_i ∈ R^(B×c×1×1); computing the cosine similarity between g_i and each pixel position of each of the remaining N-1 view features F_j to obtain a similarity matrix M;
Multiplying each obtained similarity matrix M with the corresponding view feature F_j after it has passed through a CBR module, fusing the multiplication result with the class feature g_i and reducing the dimension to obtain the final output f_ij, where f_ij denotes the consistency-guided output of the j-th view feature with the i-th view feature as the reference view; all consistency-guided output features form a matrix G; fusing the multiplication result with the class feature g_i and reducing the dimension to obtain the final output f_ij includes: concatenating the multiplication result with the class feature g_i, and applying a 1×1 convolution to the concatenated result to obtain the final output f_ij;
All features in each column of the matrix G are added and fused to obtain F_i' ∈ R^(B×c×h×w), i = 1, 2, ..., N;
The cosine similarity is calculated between g_i and each pixel position of the remaining N-1 view features F_j as follows:
M_ij = cos_similarity(g_i, F_j), i, j = 1, 2, ..., N, i ≠ j
where M_ij is the element in the i-th row and j-th column of the similarity matrix M, and cos_similarity(·,·) denotes the cosine similarity;
The expression of the matrix G is:
G = [f_ij], i, j = 1, 2, ..., N, i ≠ j
The expression of F_i' ∈ R^(B×c×h×w), i = 1, 2, ..., N is:
F_i' = Σ_{j≠i} f_ji, i = 1, 2, ..., N
The LRL module comprises a linear layer, a ReLU activation layer and a linear layer which are sequentially connected;
The CBR module comprises a convolution layer, a normalization layer and a ReLU activation layer which are sequentially connected.
2. The deep-learning-based multi-view collaborative image segmentation method according to claim 1, wherein:
the encoder is a ResNet network for extracting features;
the decoder consists of a 3×3 convolution, a BN layer, a ReLU activation function and an upsampling operation connected in sequence for feature recovery.
3. A multi-view collaborative image segmentation system based on deep learning, characterized by comprising:
An acquisition module for acquiring a multi-view image;
A segmentation module for inputting the multi-view image into a deep learning model, wherein the deep learning model realizes image segmentation of the same target in the multi-view images through the consistency of their intrinsic features;
The deep learning model is provided with a MVFI module, and the MVFI module is used for realizing characteristic interaction among multi-view images;
The segmentation module inputs the multi-view image into the deep learning model, and the deep learning model realizes image segmentation of the same target in the multi-view images through the consistency of their intrinsic features, as follows:
Let the three-channel multi-view image be I ∈ R^(B×3×N×H×W), where B is the batch size, N is the number of views, H is the image height, and W is the image width; after view merging, X ∈ R^(BN×3×H×W) is output, where view merging converts the view dimension into the batch dimension; X is input into an encoder to obtain depth features F ∈ R^(BN×c×h×w); view separation is performed on the depth features F to obtain F_i ∈ R^(B×c×h×w), i = 1, 2, ..., N, where view separation converts the batch dimension back into the view dimension, c is the feature dimension, h is the feature height, and w is the feature width;
The separated features F_i are input into the MVFI module to yield outputs F_i' ∈ R^(B×c×h×w), i = 1, 2, ..., N;
View merging is applied to F_i' to obtain the optimized feature output F' ∈ R^(BN×c×h×w); F' is input into a decoder to obtain the final output O ∈ R^(BN×1×H×W); finally, view separation is performed on O to obtain the segmentation result O_i ∈ R^(B×1×H×W) of each view in the batch, i = 1, 2, ..., N;
The MVFI module includes:
Obtaining N features F_i ∈ R^(B×c×h×w) through view separation, which correspond respectively to the views V_i, i = 1, 2, ..., N, in the multi-view image I ∈ R^(B×3×N×H×W); for the N features F_i, each feature accepts the consistency guidance of the remaining N-1 features;
The method by which each of the N features F_i accepts the consistency guidance of the remaining N-1 features includes:
Setting view V_i as the reference view, downsampling the prior segmentation result P_i of view V_i so that its size is consistent with F_i, and multiplying the downsampled result with F_i to mask off the background, thereby obtaining a pseudo-mask feature; performing global pooling on the pseudo-mask feature to obtain a feature pg_i ∈ R^(B×c×1×1) of length c representing the foreground target to be segmented, where pg denotes a pseudo global feature;
Performing a feature offset on pg_i with an LRL module to obtain a real class feature g_i ∈ R^(B×c×1×1); computing the cosine similarity between g_i and each pixel position of each of the remaining N-1 view features F_j to obtain a similarity matrix M;
Multiplying each obtained similarity matrix M with the corresponding view feature F_j after it has passed through a CBR module, fusing the multiplication result with the class feature g_i and reducing the dimension to obtain the final output f_ij, where f_ij denotes the consistency-guided output of the j-th view feature with the i-th view feature as the reference view; all consistency-guided output features form a matrix G; fusing the multiplication result with the class feature g_i and reducing the dimension to obtain the final output f_ij includes: concatenating the multiplication result with the class feature g_i, and applying a 1×1 convolution to the concatenated result to obtain the final output f_ij;
All features in each column of the matrix G are added and fused to obtain F_i' ∈ R^(B×c×h×w), i = 1, 2, ..., N;
The cosine similarity is calculated between g_i and each pixel position of the remaining N-1 view features F_j as follows:
M_ij = cos_similarity(g_i, F_j), i, j = 1, 2, ..., N, i ≠ j
where M_ij is the element in the i-th row and j-th column of the similarity matrix M, and cos_similarity(·,·) denotes the cosine similarity;
The expression of the matrix G is:
G = [f_ij], i, j = 1, 2, ..., N, i ≠ j
The expression of F_i' ∈ R^(B×c×h×w), i = 1, 2, ..., N is:
F_i' = Σ_{j≠i} f_ji, i = 1, 2, ..., N
The LRL module comprises a linear layer, a ReLU activation layer and a linear layer which are sequentially connected;
The CBR module comprises a convolution layer, a normalization layer and a ReLU activation layer which are sequentially connected.
CN202310742724.2A 2023-06-21 2023-06-21 Multi-view collaborative image segmentation method and system based on deep learning Active CN116843715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310742724.2A CN116843715B (en) 2023-06-21 2023-06-21 Multi-view collaborative image segmentation method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310742724.2A CN116843715B (en) 2023-06-21 2023-06-21 Multi-view collaborative image segmentation method and system based on deep learning

Publications (2)

Publication Number Publication Date
CN116843715A CN116843715A (en) 2023-10-03
CN116843715B true CN116843715B (en) 2024-08-02

Family

ID=88160936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310742724.2A Active CN116843715B (en) 2023-06-21 2023-06-21 Multi-view collaborative image segmentation method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN116843715B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118155196B (en) * 2024-02-22 2024-11-05 青岛博什兰物联技术有限公司 Cross-scale retrieval algorithm for instrument panel recognition

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767468A (en) * 2021-02-05 2021-05-07 中国科学院深圳先进技术研究院 Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107481250A (en) * 2017-08-30 2017-12-15 吉林大学 A kind of image partition method and its evaluation method and image interfusion method
CN114821256A (en) * 2022-04-21 2022-07-29 河钢数字技术股份有限公司 Scrap steel classification method based on small target data enhancement and multi-view collaborative reasoning
CN115170505A (en) * 2022-07-04 2022-10-11 山东建筑大学 Mammary gland molybdenum target image segmentation method and system based on multi-view self-supervision deep learning
CN116206105A (en) * 2023-01-09 2023-06-02 江南大学 Collaborative learning enhanced colon polyp segmentation method integrating deep learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767468A (en) * 2021-02-05 2021-05-07 中国科学院深圳先进技术研究院 Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement

Also Published As

Publication number Publication date
CN116843715A (en) 2023-10-03

Similar Documents

Publication Publication Date Title
CN107564025B (en) Electric power equipment infrared image semantic segmentation method based on deep neural network
CN111160407B (en) Deep learning target detection method and system
CN112818862A (en) Face tampering detection method and system based on multi-source clues and mixed attention
CN110084238B (en) Finger vein image segmentation method and device based on LadderNet network and storage medium
Niloy et al. CFL-Net: image forgery localization using contrastive learning
CN104574375A (en) Image significance detection method combining color and depth information
WO2021164280A1 (en) Three-dimensional edge detection method and apparatus, storage medium and computer device
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN111027377A (en) Double-flow neural network time sequence action positioning method
CN110827312A (en) Learning method based on cooperative visual attention neural network
CN113673562B (en) Feature enhancement method, object segmentation method, device and storage medium
CN117523194A (en) Image segmentation method based on sparse labeling
CN106529441B (en) Depth motion figure Human bodys' response method based on smeared out boundary fragment
CN111738295A (en) Image segmentation method and storage medium
CN116843715B (en) Multi-view collaborative image segmentation method and system based on deep learning
CN112149526B (en) Lane line detection method and system based on long-distance information fusion
CN110852292B (en) Sketch face recognition method based on cross-modal multi-task depth measurement learning
CN112686247A (en) Identification card number detection method and device, readable storage medium and terminal
CN114743257A (en) Method for detecting and identifying image target behaviors
CN112819834B (en) Method and device for classifying stomach pathological images based on artificial intelligence
CN111881803B (en) Face recognition method based on improved YOLOv3
CN116894943B (en) Double-constraint camouflage target detection method and system
CN118229548A (en) Infrared and visible light image fusion method based on progressive multi-branching and improved UNet3+ deep supervision
CN106650629A (en) Kernel sparse representation-based fast remote sensing target detection and recognition method
CN111931689B (en) Method for extracting video satellite data identification features on line

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant