
CN118172561B - Complex image segmentation model for unmanned aerial vehicle scene and segmentation method - Google Patents

Complex image segmentation model for unmanned aerial vehicle scene and segmentation method

Info

Publication number
CN118172561B
CN118172561B (application CN202410605358.0A)
Authority
CN
China
Prior art keywords
edge
mask
image
features
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410605358.0A
Other languages
Chinese (zh)
Other versions
CN118172561A (en)
Inventor
魏玲
胥志伟
李庆华
杨晓刚
赵天旭
刘振
王胜科
李嘉宁
孙杰洪
张为蛟
蔡敏鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Weiran Intelligent Technology Co ltd
Original Assignee
Shandong Weiran Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Weiran Intelligent Technology Co ltd filed Critical Shandong Weiran Intelligent Technology Co ltd
Priority to CN202410605358.0A priority Critical patent/CN118172561B/en
Publication of CN118172561A publication Critical patent/CN118172561A/en
Application granted granted Critical
Publication of CN118172561B publication Critical patent/CN118172561B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/17Terrestrial scenes taken from planes or by drones
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a complex image segmentation model and segmentation method for unmanned aerial vehicle scenes, belonging to the technical field of unmanned aerial vehicle image recognition and segmentation based on computer vision. The invention proposes a new EG-SAM model built on the SAM model, which uses SAM together with edge supervision to improve the model's perception of unmanned aerial vehicle image edges, explores the integration of valuable additional object-related edge semantic information, and guides the learning process of the model. The edge branch comprises two important modules: the gradient edge attention module cooperatively generates an accurate edge mask from the early-stage and late-stage semantic features, and the edge driving module integrates and fuses the edge information through a dual attention mechanism, forcing the model to generate features that emphasize the object structure, so that excellent segmentation performance is maintained even when targets in the coastal-zone ecological environment observed by the unmanned aerial vehicle are very complex.

Description

Complex image segmentation model for unmanned aerial vehicle scene and segmentation method
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle image recognition and segmentation based on computer vision, and particularly relates to a complex image segmentation model for an unmanned aerial vehicle scene and a segmentation method.
Background
Unmanned aerial vehicles (UAVs) are widely used in marine environment monitoring tasks owing to their low cost, flexible shooting, large field of view, and high resolution. However, spilled oil, for example, is highly similar to different kinds of objects such as the sea surface, Spartina alterniflora, and algae, so general segmentation algorithms perform poorly, which poses new challenges for semantic segmentation in unmanned aerial vehicle scenes. Moreover, from the UAV viewpoint, coastal-zone ecological plants are criss-crossed and unevenly distributed; certain kinds of plants and sea-surface oil spills are hollow in their distribution, and their edges are in most cases complex structures. Conventional general-purpose segmentation models therefore often produce adhesion, splitting, or misjudgment when segmenting such targets.
For proprietary segmentation models such as GSCNN (Gated Shape CNNs for Semantic Segmentation), although a new two-stream CNN architecture for semantic segmentation is proposed that explicitly routes shape information into a separate processing branch, i.e., a shape stream that processes information in parallel with the classical stream, the features in the edge stream are often poorly extracted and the resulting edge details are poor. The generally accepted reason is that the gates filter out many details of the texture features while removing noise. Moreover, such proprietary segmentation models lack generalization and perform poorly when migrated to other tasks.
For general segmentation models such as HQ-SAM (Segment Anything in High Quality), performance is unsatisfactory on hollowed-out and multi-slender-structured objects, and many false positives occur. One main reason is that HQ-SAM does not fully exploit the early local fine-grained features, which often contain much redundant low-level information such as edges and textures; simply extracting the early output features and fusing them directly with the late features, without further processing, often loses many important fine-grained features because the semantic information contained in the early and late features does not match.
Disclosure of Invention
In view of the above, the present invention provides a new EG-SAM model (An Edge-Guided SAM for Accurate Complex Object Segmentation, an edge-supervised segmentation model for complex targets) that decouples the edge information and the overall features of unmanned aerial vehicle images and balances the distribution between thin-structure pixels and non-thin-structure pixels using edge supervision, so that accurate segmentation masks can be predicted even in very challenging cases; to avoid degrading SAM's powerful zero-shot capability, the encoder structure is frozen, the decoder is fine-tuned, and their outputs are reused.
The first aspect of the invention provides a complex image segmentation model for unmanned aerial vehicle scenes, which is constructed on the basis of the SAM model and comprises an image encoder, a prompt encoder, a mask decoder, and a multi-layer perceptron; a gradient edge attention module, an edge driving module, and a feature fusion module are added to construct an edge-supervised segment-anything model, namely the EG-SAM model;
The gradient edge attention module is located after the image encoder, and the image encoder feeds its intermediate and final results into this module; that is, the gradient edge attention module uses the outputs of the sixth and twenty-fourth layers of the image encoder to extract edge features of the unmanned aerial vehicle image and generate a fine edge mask;
The feature fusion module is used for receiving mask features from the mask decoder, edge features of the gradient edge attention module and original output features of sixth and twenty-fourth encoder blocks of the image encoder;
The edge driving module is arranged after the gradient edge attention module and the mask decoder and is used for integrating the edge mask information, the mask features generated by the mask decoder, and the final output features of the image encoder, improving the expressive capacity of the model and reducing the risk of information loss; the finally output features then pass through the multi-layer perceptron to output the final unmanned aerial vehicle image mask.
Preferably, in the image encoder, the other structures of the image encoder are frozen, and only its early- and late-stage semantic outputs are reused. The specific flow is as follows: for an unmanned aerial vehicle coastal-zone picture with input size (c, h, w), the image encoder processes the picture only once; the picture is scaled so that the input size is unified to (c, 1024, 1024), where c denotes the number of channels; the input image is then divided into a series of image patches, which are linearly mapped to obtain their respective sequence vectors; the image encoder uses an MAE pre-trained ViT-L model containing 24 Transformer blocks, each of which processes the output sequence of the previous block, and in addition to the image encoder normally outputting its final result, the results of layers 6 and 24 are also extracted and fed to the gradient edge attention module for generating fine edge information.
Preferably, the gradient edge attention module has the following specific structure:
First, the layer-6 features F_6 and the layer-24 features F_24 are converted to the same number of channels by 1×1 convolution layers, their gradients are computed separately, and the results are concatenated, namely:
F_g = Cat(∇(Conv_1×1(F_6)), ∇(Conv_1×1(F_24)))
where Cat denotes the Concat operation, Conv_1×1 denotes a convolution using a 1×1 convolution block, and ∇ denotes gradient computation;
the two groups of features are then concatenated and fused, and after the channel numbers are adjusted, point-wise multiplication is performed, namely:
F_m = Conv_1×1(Cat(F_6, F_24)) ⊙ Conv_1×1(F_g)
where Conv_1×1 denotes a convolution using a 1×1 convolution block, Cat denotes the Concat operation, and ⊙ denotes element-wise multiplication;
finally, the resulting features are passed through two consecutive 3×3 convolution blocks, which helps the network learn finer-grained image features and capture local texture and shape information; the edge features of the segmented object are then obtained through a 1×1 convolution layer and a Sigmoid activation layer, which can be expressed as:
F_e = σ(Conv_1×1(Conv_3×3(Conv_3×3(F_m))))
where σ denotes the sigmoid activation function, Conv_1×1 denotes a convolution using a 1×1 convolution block, and Conv_3×3 denotes a convolution using a 3×3 convolution block.
Preferably, the mask decoder receives the image embedding from the image encoder and the positional-encoding information from the prompt encoder, and the final output target mask information is obtained after passing through the multi-layer perceptron;
to improve the output quality of the mask decoder for the final mask, an EG-SAM token is introduced to perform mask prediction: the weights of a dynamic MLP are predicted, dynamic mask prediction is then performed with the mask features, and a new mask prediction layer is introduced to perform high-quality mask prediction.
Preferably, the edge driving module uses a dual interleaved attention mechanism to comprehensively capture the complex relations between local edge or texture features and the overall global semantic features, and integrates the edge mask information extracted by the gradient edge attention module, the mask features generated by the mask decoder, and the final output features of the image encoder, thereby improving the expressive capacity of the model and reducing the risk of information loss. Its specific structure and processing flow are as follows:
the edge feature F_e is downsampled to the same size as the fused feature F_f; in the upper branch, D(F_e) is multiplied element-wise with F_f and F_f is then added element-wise; the lower branch performs the same operation with D(F_e) as the addend, denoted as:
F_up = (D(F_e) ⊙ F_f) + F_f
F_low = (D(F_e) ⊙ F_f) + D(F_e)
where D denotes the downsampling operation, ⊙ denotes element-wise multiplication, and + denotes element-wise addition;
subsequently, the two outputs are fused and passed through two separate 3×3 convolution operations; the resulting features are then adaptively average-pooled to adjust the output size; next, a 1×1 convolution block and a sigmoid layer are applied; this output is further multiplied element-wise with the preceding convolution output to produce the final output feature F_out, expressed as:
F_c = Conv_3×3(Conv_3×3(Cat(F_up, F_low)))
F_out = σ(Conv_1×1(P(F_c))) ⊙ F_c
where P denotes a global pooling operation, σ denotes the sigmoid activation layer, and Conv_3×3 and Conv_1×1 denote 3×3 and 1×1 convolution operations, respectively.
Preferably, a loss function is designed for the constructed EG-SAM model with two types of annotation supervision, namely the object mask M_g and the object edge E_g; for mask supervision, the feature map is randomly sampled using an uncertainty function; the overlap between the sampled regions and the ground-truth regions is then measured with the DICE function, which effectively alleviates the class-imbalance problem; meanwhile, the binary cross entropy (BCE) loss function is used to compute the classification loss, with the formula:
L_mask = DICE(S(M_p), S(M_g)) + BCE(S(M_p), S(M_g))
where S is a random sampling function, M_p denotes the predicted mask result, and M_g denotes the ground-truth mask result;
for edge supervision, the DICE loss function is adopted directly to address the imbalance between the positive and negative classes, namely:
L_edge = DICE(E_p, E_g)
where E_p denotes the predicted edge and E_g the ground-truth edge; thus,
L_total = L_mask + L_edge
is the total loss function.
The second aspect of the invention provides a complex image segmentation method for an unmanned aerial vehicle scene, which comprises the following steps:
shooting by an unmanned aerial vehicle to obtain an image;
Inputting an image into the complex image segmentation model for the unmanned aerial vehicle scene according to the first aspect for image segmentation processing;
outputting the mask image after the segmentation processing.
A third aspect of the present invention provides a complex image segmentation apparatus for an unmanned aerial vehicle scene, the apparatus comprising at least one processor and at least one memory, the processor and memory being coupled; a computer-executable program of the complex image segmentation model for an unmanned aerial vehicle scene according to the first aspect is stored in the memory; when the processor executes the computer-executable program stored in the memory, the processor is caused to execute a complex image segmentation method for the unmanned aerial vehicle scene.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein a computer program or instructions of the complex image segmentation model for an unmanned aerial vehicle scene according to the first aspect, which when executed by a processor, causes the processor to perform a complex image segmentation method for an unmanned aerial vehicle scene.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a new EG-SAM model based on the SAM model, which uses SAM together with edge supervision to improve the model's perception of unmanned aerial vehicle image edges, explores the integration of valuable additional object-related edge semantic information, and guides the learning process of the model. The edge branch comprises two important modules: the gradient edge attention module cooperatively generates an accurate edge mask from the early-stage and late-stage semantic features, and the edge driving module integrates and fuses the edge information through a dual attention mechanism, forcing the model to generate features that emphasize the object structure, so that EG-SAM can be guided to maintain high-quality segmentation predictions even in very complex cases.
The EG-SAM model maintains excellent segmentation performance and accuracy even when the target, such as the coastal-zone ecological environment observed by the unmanned aerial vehicle, is very complex, for example multi-slender structures, intricate hollowed-out structures, and camouflaged structures whose boundaries are difficult to distinguish. The key point of the invention is that the fine-tuned general segmentation model SAM not only retains its original zero-shot generalization capability while only a small number of parameters are added, but also makes full use of the early- and late-stage semantic features of the object in the edge branch to generate an accurate edge mask, thereby providing segmentation guidance for the overall complex structure.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is evident that the drawings described below show only one embodiment of the invention, and that other drawings can be obtained from these drawings by a person skilled in the art without inventive effort.
FIG. 1 is a block diagram showing the overall structure of the EG-SAM model of the present invention.
Fig. 2 is a schematic diagram of an input picture (model input) and an output mask (model output) in embodiment 1.
FIG. 3 is a schematic diagram of a gradient edge attention module structure according to the present invention.
Fig. 4 is a schematic diagram of an edge driving module structure according to the present invention.
Fig. 5 is a simple structure diagram of the complex image segmentation apparatus in embodiment 2.
Detailed Description
Example 1:
the invention will be further described with reference to specific examples.
The invention provides a new EG-SAM model based on the SAM model, which uses SAM together with edge supervision to improve the model's perception of unmanned aerial vehicle image edges, explores the integration of valuable additional object-related edge semantic information, and guides the learning process of the model. The edge branch comprises two important modules: the gradient edge attention module cooperatively generates an accurate edge mask from the early-stage and late-stage semantic features, and the edge driving module integrates and fuses the edge information through a dual attention mechanism, forcing the model to generate features that emphasize the object structure, so that EG-SAM can be guided to maintain high-quality segmentation predictions even in very complex cases.
The overall framework of the EG-SAM model is shown in FIG. 1. The invention freezes the ViT image encoder in SAM and reuses its outputs. Specifically, the output features of the first and last Transformer blocks in the ViT that use global attention are chosen, i.e., the sixth- and twenty-fourth-layer outputs of the ViT are reused. They are fed into the gradient edge attention module, which generates the edge mask information, and into the final feature fusion module; after the edge driving module receives the edge output features from the gradient edge attention module and the fused features from the feature fusion module, the information between them is mined and integrated. The finally output features then pass through the multi-layer perceptron to output the final mask. An example of the unmanned aerial vehicle image input and output is shown in FIG. 2.
The feature fusion module is configured to receive the mask features from the mask decoder, the edge features of the gradient edge attention module, and the original output features of the sixth and twenty-fourth encoder blocks of the image encoder. The numbers of channels of these inputs are unified through convolution blocks, their fused features are then obtained through element-wise addition, and the fused features are input into the edge driving module for feature integration, as sketched below. The purpose of this is to prevent the segmentation details from being covered by other strong semantic regions during feature fusion.
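As a concrete illustration of this fusion step, the following minimal sketch unifies the channel counts with 1×1 convolution blocks and adds the aligned features element-wise. The class name, the default channel sizes, and the assumption that spatial sizes are already aligned by the caller are illustrative choices, not details taken from the patent.

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    # Channel-align four inputs with 1x1 convolutions, then fuse by
    # element-wise addition to obtain F_f for the edge driving module.
    def __init__(self, mask_ch=32, edge_ch=1, enc_ch=1024, out_ch=256):
        super().__init__()
        self.proj_mask = nn.Conv2d(mask_ch, out_ch, kernel_size=1)
        self.proj_edge = nn.Conv2d(edge_ch, out_ch, kernel_size=1)
        self.proj_enc6 = nn.Conv2d(enc_ch, out_ch, kernel_size=1)
        self.proj_enc24 = nn.Conv2d(enc_ch, out_ch, kernel_size=1)

    def forward(self, mask_feat, edge_feat, enc6, enc24):
        # All inputs are assumed to share the same spatial resolution.
        return (self.proj_mask(mask_feat) + self.proj_edge(edge_feat)
                + self.proj_enc6(enc6) + self.proj_enc24(enc24))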
(I) Image encoder
In this structure, the structure of the image encoder is frozen, and only its early- and late-stage semantic outputs are reused. The specific flow is as follows: for an unmanned aerial vehicle coastal-zone picture with input size (c, h, w), the image encoder processes the picture only once; the image is scaled so that the input size is unified to (c, 1024, 1024), where c denotes the number of channels. The input image is then divided into a series of image patches, which are linearly mapped (using a convolution layer) to obtain their respective sequence vectors. Since the image encoder uses an MAE (Masked AutoEncoder) pre-trained ViT-L model, it contains 24 Transformer blocks, each of which processes the output sequence of the previous block. In addition to the image encoder normally outputting its final result, the results of layers 6 and 24 are also extracted and fed to the gradient edge attention module for generating fine edge information.
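A minimal sketch of reusing the frozen encoder's intermediate outputs is given below. The attribute names (patch_embed, pos_embed, blocks, neck) follow the public SAM ViT implementation and are assumptions about the wrapped encoder; the sketch is illustrative rather than the patent's own implementation.

import torch
import torch.nn as nn

class FrozenViTWithTaps(nn.Module):
    # Wraps a SAM-style ViT image encoder, freezes it, and keeps the outputs
    # of selected Transformer blocks (here layers 6 and 24).
    def __init__(self, vit, tap_layers=(6, 24)):
        super().__init__()
        self.vit = vit
        self.tap_layers = set(tap_layers)
        for p in self.vit.parameters():
            p.requires_grad = False                  # freeze the whole encoder

    @torch.no_grad()
    def forward(self, x):
        taps = []
        x = self.vit.patch_embed(x)                  # (B, H/16, W/16, C) patch tokens
        if self.vit.pos_embed is not None:
            x = x + self.vit.pos_embed
        for i, blk in enumerate(self.vit.blocks, start=1):
            x = blk(x)
            if i in self.tap_layers:                 # reuse layer-6 and layer-24 outputs
                taps.append(x.permute(0, 3, 1, 2))   # to (B, C, H/16, W/16)
        image_embedding = self.vit.neck(x.permute(0, 3, 1, 2))
        return image_embedding, taps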
(II) Gradient edge attention module
The slender, hollowed-out, or camouflaged objects in unmanned aerial vehicle images typically have unique shape features and significant edge variations. Introducing edge supervision therefore helps the model distinguish the boundary between object and background more easily. However, features extracted from a single hierarchy have certain limitations. By effectively integrating visual semantic information from different layers, the accuracy and fineness of the segmented object edges can be significantly improved. Thus, the gradient edge attention module uses the outputs of layers 6 and 24 of the ViT to extract edge features, as shown in FIG. 3.
The early features effectively capture the local edge semantic information of the input image and are very sensitive to detected local features such as object contours. The later features carry rich global semantic information, enabling the model to better understand the relationship between the whole scene and the object and thereby improving the perception of the overall shape and structure of the object. First, the layer-6 features F_6 and the layer-24 features F_24 are converted to the same number of channels by 1×1 convolution layers, and a concatenation is performed after computing their gradients separately. These gradients provide information about the rate of change at each location in the image, helping to capture edge details more accurately. Namely:
F_g = Cat(∇(Conv_1×1(F_6)), ∇(Conv_1×1(F_24)))
where Cat denotes the Concat operation, Conv_1×1 denotes a convolution using a 1×1 convolution block, and ∇ denotes gradient computation.
The two groups of features are then concatenated and fused, and after the channel numbers are adjusted, point-wise multiplication is performed. Namely:
F_m = Conv_1×1(Cat(F_6, F_24)) ⊙ Conv_1×1(F_g)
where Conv_1×1 denotes a convolution using a 1×1 convolution block, Cat denotes the Concat operation, and ⊙ denotes element-wise multiplication.
Finally, the resulting features are passed through two consecutive 3×3 convolution blocks, which helps the network learn finer-grained image features and capture local texture and shape information. The edge features of the segmented object are then obtained through a 1×1 convolution layer and a Sigmoid activation layer, which can be expressed as:
F_e = σ(Conv_1×1(Conv_3×3(Conv_3×3(F_m))))
where σ denotes the sigmoid activation function, Conv_1×1 denotes a convolution using a 1×1 convolution block, and Conv_3×3 denotes a convolution using a 3×3 convolution block.
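A minimal PyTorch sketch of the module as described above follows. It is not the patent's reference implementation: inputs are assumed to be in (B, C, H, W) layout, the gradient operator is assumed to be a fixed Sobel filter, and the ReLU activations and the intermediate channel count are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def sobel_gradient(x):
    # Per-channel gradient magnitude; a Sobel filter is an assumed stand-in
    # for the unspecified gradient operator.
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device, dtype=x.dtype).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = x.shape[1]
    gx = F.conv2d(x, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(x, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx * gx + gy * gy + 1e-6)

class GradientEdgeAttention(nn.Module):
    def __init__(self, in_ch=1024, mid_ch=64):
        super().__init__()
        self.reduce6 = nn.Conv2d(in_ch, mid_ch, 1)     # channel alignment of F_6
        self.reduce24 = nn.Conv2d(in_ch, mid_ch, 1)    # channel alignment of F_24
        self.fuse_grad = nn.Conv2d(2 * mid_ch, mid_ch, 1)
        self.fuse_feat = nn.Conv2d(2 * mid_ch, mid_ch, 1)
        self.refine = nn.Sequential(                   # two consecutive 3x3 blocks
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(mid_ch, 1, 1)            # 1x1 conv before the sigmoid

    def forward(self, f6, f24):
        f6, f24 = self.reduce6(f6), self.reduce24(f24)
        grad = self.fuse_grad(torch.cat([sobel_gradient(f6),
                                         sobel_gradient(f24)], dim=1))  # F_g
        feat = self.fuse_feat(torch.cat([f6, f24], dim=1))
        edge = self.refine(feat * grad)                # point-wise multiplication, then refinement
        return torch.sigmoid(self.head(edge))          # fine edge mask F_e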
(III) Prompt encoder
In addition to the necessary unmanned aerial vehicle image, the EG-SAM requires prompt information to be provided. The prompt information is divided into two types: sparse prompts, including point prompts, box prompts, and text prompts; and dense prompts, i.e., image masks. Sparse prompts only need to be encoded to obtain vector embeddings. For dense prompts, the image mask is downscaled by a factor of 4 to the resolution of the image embedding output by the image encoder, implemented by two 2×2, stride-2 convolutions with output channels of 4 and 16; a 1×1 convolution then maps these channels to 256 channels, each layer being followed by layer normalization, and the mask embedding is then added element-wise to the image embedding. In the case where no mask prompt is presented, a learned embedding representing "no mask" is added to each image embedding location.
After the corresponding embeddings are obtained, they are all used as positionally encoded prompt information for the subsequent mask decoder.
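The dense-prompt path described above can be sketched as follows. The GELU activations follow the public SAM implementation and, like the class and layer names, are assumptions rather than details taken from the patent text.

import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    # Channel-wise LayerNorm for (B, C, H, W) tensors.
    def __init__(self, ch, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(ch))
        self.bias = nn.Parameter(torch.zeros(ch))
        self.eps = eps

    def forward(self, x):
        mu = x.mean(1, keepdim=True)
        var = (x - mu).pow(2).mean(1, keepdim=True)
        x = (x - mu) / torch.sqrt(var + self.eps)
        return self.weight[:, None, None] * x + self.bias[:, None, None]

class DenseMaskEmbedding(nn.Module):
    # Downscale a 1-channel mask prompt by 4x (two stride-2 2x2 convolutions
    # with 4 and 16 channels), project to 256 channels with a 1x1 convolution,
    # and add the result element-wise to the image embedding.
    def __init__(self, embed_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=2, stride=2), LayerNorm2d(4), nn.GELU(),
            nn.Conv2d(4, 16, kernel_size=2, stride=2), LayerNorm2d(16), nn.GELU(),
            nn.Conv2d(16, embed_dim, kernel_size=1))

    def forward(self, mask, image_embedding):
        # mask is expected at 4x the spatial resolution of image_embedding.
        return image_embedding + self.net(mask)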
(IV) Mask decoder
The mask decoder receives the image embedding from the image encoder and the positional-encoding information from the prompt encoder, and the final output target mask information is obtained after passing through the multi-layer perceptron. Specifically, the native SAM mask decoder performs a self-attention operation over the output tokens and the input prompt tokens; the result is then used as query tokens for a cross-attention operation with the input image embedding; the tokens are then updated by an MLP; finally, the image embedding acts as the query for a cross-attention operation against the tokens updated by the MLP, so that the image embedding is ultimately updated with the prompt information.
To improve the output quality of the mask decoder for the final mask, an EG-SAM token is introduced to perform mask prediction: the weights of a dynamic MLP are predicted, dynamic mask prediction is then performed with the mask features, and a new mask prediction layer is introduced to perform high-quality mask prediction.
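A hedged sketch of this token-driven dynamic mask prediction is shown below, modeled on the HQ-SAM style of dynamic MLPs; the 3-layer MLP depth and the feature dimensions are assumptions, since the text does not specify them.

import torch
import torch.nn as nn

class DynamicMaskHead(nn.Module):
    # Turn the EG-SAM output token into dynamic weights, then take a per-pixel
    # dot product with the mask features to produce the mask logits.
    def __init__(self, token_dim=256, feat_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(token_dim, token_dim), nn.ReLU(inplace=True),
            nn.Linear(token_dim, token_dim), nn.ReLU(inplace=True),
            nn.Linear(token_dim, feat_dim))

    def forward(self, eg_token, mask_features):
        # eg_token: (B, token_dim); mask_features: (B, feat_dim, H, W)
        weights = self.mlp(eg_token)                               # dynamic per-image weights
        b, c, h, w = mask_features.shape
        logits = (weights.unsqueeze(-1) * mask_features.flatten(2)).sum(dim=1)
        return logits.view(b, h, w)                                # predicted mask logits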
(V) Edge driving module
In general, the local edge information and the global contextual semantic information of ecological targets in the unmanned aerial vehicle coastal zone are unevenly distributed, so when performing attention-related operations the model may focus too much on certain feature-rich non-thin-structure regions while inevitably ignoring important information in other thin-structure regions. Therefore, an edge driving module (EDM) is designed, which uses a dual interleaved attention mechanism to comprehensively capture the complex relationship between local edge or texture features and the overall global semantic features, and integrates the edge mask information extracted by the gradient edge attention module (GEA) with the mask features generated by the mask decoder and the final output features of the image encoder, thereby improving the expressive capacity of the model and reducing the risk of information loss.
The edge driving module (EDM) aims to integrate the extracted edge mask information with the mask features generated by the mask decoder and the final output features of the image encoder, and to enhance the model's understanding and expression of complex structural objects through the guidance of edge information. Because the distribution of edge information and global semantic information is non-uniform, attention-related operations may focus too much on certain areas while ignoring other important information; the dual-attention mechanism module therefore comprehensively captures the complex relationship between the two, improving the expressive capacity of the model and reducing the risk of information loss.
As shown in FIG. 4, the edge driving module first downsamples the edge feature F_e to the same size as the fused feature F_f. Taking the upper branch as an example, D(F_e) is multiplied element-wise with F_f and F_f is then added element-wise; the lower branch performs the same operation with D(F_e) as the addend, which can be expressed as:
F_up = (D(F_e) ⊙ F_f) + F_f
F_low = (D(F_e) ⊙ F_f) + D(F_e)
where D denotes the downsampling operation, ⊙ denotes element-wise multiplication, and + denotes element-wise addition.
Subsequently, the two outputs are fused and passed through two separate 3×3 convolution operations. The resulting features are then adaptively average-pooled to adjust the output size. Next, a 1×1 convolution block and a sigmoid layer are applied. This output is further multiplied element-wise with the preceding convolution output to produce the final output feature F_out. The formulas are as follows:
F_c = Conv_3×3(Conv_3×3(Cat(F_up, F_low)))
F_out = σ(Conv_1×1(P(F_c))) ⊙ F_c
where P denotes a global pooling operation, σ denotes the sigmoid activation layer, and Conv_3×3 and Conv_1×1 denote 3×3 and 1×1 convolution operations, respectively.
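A minimal PyTorch sketch of this flow is given below. The bilinear downsampling, the exact lower-branch wiring, and the channel counts are assumptions where the text and the original formulas leave room for interpretation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeDrivingModule(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.conv3a = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.conv3b = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv1 = nn.Conv2d(ch, ch, 1)

    def forward(self, edge_feat, fused_feat):
        # edge_feat: 1-channel edge mask F_e from the GEA module (broadcast over
        # channels); fused_feat: fused feature F_f from the feature fusion module.
        e = F.interpolate(edge_feat, size=fused_feat.shape[-2:],
                          mode='bilinear', align_corners=False)   # D(F_e)
        upper = e * fused_feat + fused_feat        # multiply, then add F_f
        lower = fused_feat * e + e                 # interleaved counterpart (assumed)
        x = self.conv3b(self.conv3a(torch.cat([upper, lower], dim=1)))
        attn = torch.sigmoid(self.conv1(F.adaptive_avg_pool2d(x, 1)))
        return x * attn                            # re-weight the convolution output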
(VI) Loss function
The model is supervised by two types of annotations: the object mask M_g and the object edge E_g. For mask supervision, the feature map is randomly sampled using an uncertainty function. The DICE function then measures the overlap between the sampled regions and the ground-truth regions, which effectively alleviates the class-imbalance problem and enhances robustness. Meanwhile, a classification loss is computed using the binary cross entropy (BCE) loss function. The formula is as follows:
L_mask = DICE(S(M_p), S(M_g)) + BCE(S(M_p), S(M_g))
where S is a random sampling function, M_p denotes the predicted mask result, and M_g denotes the ground-truth mask result;
for edge supervision, the DICE loss function is adopted directly to address the imbalance between the positive and negative classes, namely:
L_edge = DICE(E_p, E_g)
where E_p denotes the predicted edge and E_g the ground-truth edge. Thus,
L_total = L_mask + L_edge
is the total loss function.
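The following sketch implements the combined loss under the formulas above. The uncertainty-based sampling is approximated here by picking the points whose logits are closest to zero, since the patent does not specify the uncertainty function; the number of sampled points is likewise an assumption.

import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1.0):
    # pred, target: (B, N) with values in [0, 1].
    inter = (pred * target).sum(dim=-1)
    union = pred.sum(dim=-1) + target.sum(dim=-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def sample_uncertain_points(logits, target, num_points=1024):
    # Assumed uncertainty sampling: keep the points with the smallest |logit|.
    flat_logits = logits.flatten(1)
    flat_target = target.flatten(1).float()
    k = min(num_points, flat_logits.shape[1])
    idx = flat_logits.abs().topk(k, dim=1, largest=False).indices
    return flat_logits.gather(1, idx), flat_target.gather(1, idx)

def eg_sam_loss(mask_logits, mask_gt, edge_pred, edge_gt):
    # L_mask = DICE + BCE on sampled points; L_edge = DICE on the edge mask.
    s_logits, s_gt = sample_uncertain_points(mask_logits, mask_gt)
    l_mask = dice_loss(torch.sigmoid(s_logits), s_gt) \
             + F.binary_cross_entropy_with_logits(s_logits, s_gt)
    l_edge = dice_loss(edge_pred.flatten(1), edge_gt.flatten(1).float())
    return l_mask + l_edge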
Experimental results
Table 1. Inference results on three very fine-grained segmentation datasets
Model: model; Baseline: baseline; DIS, COIFT, ThinObject: dataset names; mIoU: mean intersection over union; mBIoU: mean boundary intersection over union.
As shown in Table 1, this embodiment tests the model of the invention on three very fine-grained datasets. Compared with the SAM and HQ-SAM models, EG-SAM achieves excellent results on both the standard mask metric mIoU and the boundary metric mBIoU.
Table 2. Inference results on three camouflage datasets
Table 2 shows the quantitative comparison of the EG-SAM model of the invention with 10 existing SOTA methods on three camouflage datasets. The data show that the model of the invention clearly outperforms the others. To demonstrate the effectiveness of the model in camouflaged target detection, experiments were performed on three challenging camouflaged target datasets: COD10K (2026 pictures), NC4K (4121 pictures), and CAMO (250 pictures). The model was compared with 10 state-of-the-art methods, including UJSC, SINetV2, DGNet, SegMaR, ZoomNet, FDNet, BGNet, MFFN, FSPNet, and EAMNet. Four well-known evaluation metrics were adopted: the S-measure (structural measure), the weighted F-measure, the mean E-measure, and the mean absolute error (M). An upward arrow "↑" indicates that a larger result is better, and a downward arrow "↓" indicates that a smaller result is better. For fairness, all experimental data above come from the official figures provided for each model.
Based on the above model, in an actual scene, the complex image segmentation method for an unmanned aerial vehicle scene comprises the following steps:
shooting by an unmanned aerial vehicle to obtain an image;
Inputting the image into the EG-SAM model for image segmentation processing;
outputting the mask image after the segmentation processing.
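An end-to-end usage sketch of these steps is shown below. EGSAM and its call signature are placeholders for the assembled model rather than an API defined by the patent, and prompt handling and input normalization are assumed to happen inside the placeholder model.

import torch
import numpy as np
from PIL import Image

def segment_drone_image(model, image_path, device="cuda"):
    # Load the UAV image, resize to the model's unified 1024x1024 input,
    # run the model, and return a binary mask.
    img = Image.open(image_path).convert("RGB").resize((1024, 1024))
    x = torch.from_numpy(np.asarray(img)).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        mask_logits = model(x.unsqueeze(0).to(device))   # assumed (1, H, W) logits
    return (mask_logits.sigmoid() > 0.5).squeeze(0).cpu().numpy()

# Example (hypothetical constructor and checkpoint):
# mask = segment_drone_image(EGSAM().eval().to("cuda"), "drone_shot.jpg")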
Example 2:
As shown in fig. 5, the present application also provides a complex image segmentation device for an unmanned aerial vehicle scene, which comprises at least one processor and at least one memory, as well as a communication interface and an internal bus; a computer-executable program of the EG-SAM model described in embodiment 1 is stored in the memory; when the processor executes the computer-executable program stored in the memory, the processor can be caused to execute a complex image segmentation method for the unmanned aerial vehicle scene. The internal bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, the buses in the drawings of the present application are not limited to only one bus or one type of bus. The memory may include a high-speed RAM memory and may further include a non-volatile memory (NVM), such as at least one magnetic disk memory, and may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, or an optical disk.
The device may be provided as a terminal, server or other form of device.
Fig. 5 is a block diagram of an apparatus shown for illustration. The device may include one or more of the following components: a processing component, a memory, a power component, a multimedia component, an audio component, an input/output (I/O) interface, a sensor component, and a communication component. The processing component generally controls overall operation of the electronic device, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component may include one or more processors to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component may include one or more modules that facilitate interactions between the processing component and other components. For example, the processing component may include a multimedia module to facilitate interaction between the multimedia component and the processing component.
The memory is configured to store various types of data to support operations at the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply assembly provides power to the various components of the electronic device. Power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic devices. The multimedia assembly includes a screen between the electronic device and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia assembly includes a front camera and/or a rear camera. When the electronic device is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component is configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may be further stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals. The I/O interface provides an interface between the processing assembly and a peripheral interface module, which may be a keyboard, click wheel, button, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly includes one or more sensors for providing status assessment of various aspects of the electronic device. For example, the sensor assembly may detect an on/off state of the electronic device, a relative positioning of the assemblies, such as a display and keypad of the electronic device, a change in position of the electronic device or one of the assemblies of the electronic device, the presence or absence of user contact with the electronic device, an orientation or acceleration/deceleration of the electronic device, and a change in temperature of the electronic device. The sensor assembly may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly may further include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component is configured to facilitate communication between the electronic device and other devices in a wired or wireless manner. The electronic device may access a wireless network based on a communication standard, such as WiFi,2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
Example 3:
The present invention also provides a computer-readable storage medium in which a computer program or instructions of the EG-SAM model according to embodiment 1 are stored; when the computer program or instructions are executed by a processor, the processor is caused to perform a complex image segmentation method for an unmanned aerial vehicle scene.
In particular, a system, apparatus, or device may be provided with a readable storage medium on which software program code implementing the functions of any of the above embodiments is stored, and its computer or processor may be caused to read and execute the instructions stored in the readable storage medium. In this case, the program code read from the readable medium can itself implement the functions of any of the above embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW), magnetic tape, and the like. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
It should be understood that the above processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor for execution, or executed by a combination of hardware and software modules in a processor.
It should be understood that the storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an Application-Specific Integrated Circuit (ASIC). The processor and the storage medium may also reside as discrete components in a terminal or server.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++ as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer-readable program instructions, the electronic circuitry being able to execute the computer-readable program instructions.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
While the foregoing describes the embodiments of the present invention, it should be understood that the present invention is not limited to the embodiments, and that various modifications and changes can be made by those skilled in the art without any inventive effort.

Claims (8)

1. A complex image segmentation system for an unmanned aerial vehicle scene, characterized in that: the system is constructed based on the SAM model and comprises an image encoder, a prompt encoder, a mask decoder, and a multi-layer perceptron, and a gradient edge attention module, an edge driving module, and a feature fusion module are added to construct an edge-supervised segment-anything model, namely the EG-SAM model;
The gradient edge attention module is located after the image encoder, and the image encoder feeds its intermediate and final results into this module; that is, the gradient edge attention module uses the outputs of the sixth and twenty-fourth layers of the image encoder to extract edge features of the unmanned aerial vehicle image and generate a fine edge mask;
The feature fusion module is used for receiving mask features from the mask decoder, edge features of the gradient edge attention module and original output features of sixth and twenty-fourth encoder blocks of the image encoder;
The edge driving module is arranged behind the gradient edge attention module and the mask decoder and is used for integrating the edge mask information, the mask characteristics generated by the mask decoder and the final output characteristics of the image encoder, improving the expression capacity of the model and slowing down the risk of information loss; the finally output characteristics pass through a multi-layer perceptron to output a final unmanned aerial vehicle image mask;
the mask decoder receives the image embedding from the image encoder and the positional-encoding information from the prompt encoder, and the final output target mask information is obtained after passing through the multi-layer perceptron;
to improve the output quality of the mask decoder for the final mask, an EG-SAM token is introduced to perform mask prediction: the weights of a dynamic MLP are predicted, dynamic mask prediction is then performed with the mask features, and a new mask prediction layer is introduced to perform high-quality mask prediction.
2. The complex image segmentation system for an unmanned aerial vehicle scene as defined in claim 1, wherein: in the image encoder, the other structures of the image encoder are frozen, and only its early- and late-stage semantic outputs are reused; the specific flow is as follows: for an unmanned aerial vehicle coastal-zone picture with input size (c, h, w), the image encoder processes the picture only once; the picture is scaled so that the input size is unified to (c, 1024, 1024), where c denotes the number of channels; the input image is then divided into a series of image patches, which are linearly mapped to obtain their respective sequence vectors; the image encoder uses an MAE pre-trained ViT-L model containing 24 Transformer blocks, each of which processes the output sequence of the previous block, and in addition to the image encoder normally outputting its final result, the results of layers 6 and 24 are also extracted and fed to the gradient edge attention module for generating fine edge information.
3. The complex image segmentation system for an unmanned aerial vehicle scene according to claim 1, wherein the gradient edge attention module has the following specific structure:
first, the layer-6 features F_6 and the layer-24 features F_24 are converted to the same number of channels by 1×1 convolution layers, their gradients are computed separately, and the results are concatenated, namely:
F_g = Cat(∇(Conv_1×1(F_6)), ∇(Conv_1×1(F_24)))
where Cat denotes the Concat operation, Conv_1×1 denotes a convolution using a 1×1 convolution block, and ∇ denotes gradient computation;
the two groups of features are then concatenated and fused, and after the channel numbers are adjusted, point-wise multiplication is performed, namely:
F_m = Conv_1×1(Cat(F_6, F_24)) ⊙ Conv_1×1(F_g)
where Conv_1×1 denotes a convolution using a 1×1 convolution block, Cat denotes the Concat operation, and ⊙ denotes element-wise multiplication;
finally, the resulting features are passed through two consecutive 3×3 convolution blocks, which helps the network learn finer-grained image features and capture local texture and shape information; the edge features of the segmented object are then obtained through a 1×1 convolution layer and a Sigmoid activation layer, which can be expressed as:
F_e = σ(Conv_1×1(Conv_3×3(Conv_3×3(F_m))))
where σ denotes the sigmoid activation function, Conv_1×1 denotes a convolution using a 1×1 convolution block, and Conv_3×3 denotes a convolution using a 3×3 convolution block.
4. The complex image segmentation system for an unmanned aerial vehicle scene as defined in claim 1, wherein: the edge driving module uses a dual interleaved attention mechanism to comprehensively capture the complex relations between local edge or texture features and the overall global semantic features, and integrates the edge mask information extracted by the gradient edge attention module, the mask features generated by the mask decoder, and the final output features of the image encoder, thereby improving the expressive capacity of the model and reducing the risk of information loss; its specific structure and processing flow are as follows:
the edge feature F_e is downsampled to the same size as the fused feature F_f; in the upper branch, D(F_e) is multiplied element-wise with F_f and F_f is then added element-wise; the lower branch performs the same operation with D(F_e) as the addend, denoted as:
F_up = (D(F_e) ⊙ F_f) + F_f
F_low = (D(F_e) ⊙ F_f) + D(F_e)
where D denotes the downsampling operation, ⊙ denotes element-wise multiplication, and + denotes element-wise addition;
subsequently, the two outputs are fused and passed through two separate 3×3 convolution operations; the resulting features are then adaptively average-pooled to adjust the output size; next, a 1×1 convolution block and a sigmoid layer are applied; this output is further multiplied element-wise with the preceding convolution output to produce the final output feature F_out, expressed as:
F_c = Conv_3×3(Conv_3×3(Cat(F_up, F_low)))
F_out = σ(Conv_1×1(P(F_c))) ⊙ F_c
where P denotes a global pooling operation, σ denotes the sigmoid activation layer, and Conv_3×3 and Conv_1×1 denote 3×3 and 1×1 convolution operations, respectively.
5. The complex image segmentation system for an unmanned aerial vehicle scene as defined in claim 1, wherein: a loss function is designed for the constructed EG-SAM model with two types of annotation supervision, namely the object mask M_g and the object edge E_g; for mask supervision, the feature map is randomly sampled using an uncertainty function; the DICE function is then used to measure the overlap between the sampled regions and the ground-truth regions, which effectively alleviates the class-imbalance problem; meanwhile, the binary cross entropy (BCE) loss function is used to compute the classification loss, with the formula:
L_mask = DICE(S(M_p), S(M_g)) + BCE(S(M_p), S(M_g))
where S is a random sampling function, M_p denotes the predicted mask result, and M_g denotes the ground-truth mask result;
for edge supervision, the DICE loss function is adopted directly to address the imbalance between the positive and negative classes, namely:
L_edge = DICE(E_p, E_g)
where E_p denotes the predicted edge and E_g the ground-truth edge; thus,
L_total = L_mask + L_edge
is the total loss function.
6. A complex image segmentation method for an unmanned aerial vehicle scene, characterized by comprising the following steps:
shooting by an unmanned aerial vehicle to obtain an image;
Inputting an image into the complex image segmentation system for the unmanned aerial vehicle scene according to any one of claims 1 to 5 for image segmentation processing;
outputting the mask image after the segmentation processing.
7. A complex image segmentation device for an unmanned aerial vehicle scene, characterized in that: the device comprises at least one processor and at least one memory, the processor and the memory being coupled; a computer-executable program of the complex image segmentation system for an unmanned aerial vehicle scene according to any one of claims 1 to 5 is stored in the memory; when the processor executes the computer-executable program stored in the memory, the processor is caused to execute a complex image segmentation method for the unmanned aerial vehicle scene.
8. A computer-readable storage medium having stored therein a computer program or instructions of the complex image segmentation system for an unmanned aerial vehicle scene according to any one of claims 1 to 5, which when executed by a processor, causes the processor to perform a complex image segmentation method for an unmanned aerial vehicle scene.
CN202410605358.0A 2024-05-16 2024-05-16 Complex image segmentation model for unmanned aerial vehicle scene and segmentation method Active CN118172561B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410605358.0A CN118172561B (en) 2024-05-16 2024-05-16 Complex image segmentation model for unmanned aerial vehicle scene and segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410605358.0A CN118172561B (en) 2024-05-16 2024-05-16 Complex image segmentation model for unmanned aerial vehicle scene and segmentation method

Publications (2)

Publication Number Publication Date
CN118172561A CN118172561A (en) 2024-06-11
CN118172561B true CN118172561B (en) 2024-07-23

Family

ID=91359210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410605358.0A Active CN118172561B (en) 2024-05-16 2024-05-16 Complex image segmentation model for unmanned aerial vehicle scene and segmentation method

Country Status (1)

Country Link
CN (1) CN118172561B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908442A (en) * 2023-01-06 2023-04-04 山东巍然智能科技有限公司 Image panorama segmentation method for unmanned aerial vehicle ocean monitoring and model building method
CN116206112A (en) * 2023-03-17 2023-06-02 西安电子科技大学 Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3625767B1 (en) * 2017-09-27 2021-03-31 Google LLC End to end network model for high resolution image segmentation
US11216988B2 (en) * 2017-10-24 2022-01-04 L'oreal System and method for image processing using deep neural networks
CN117036686A (en) * 2023-06-29 2023-11-10 南京邮电大学 Semantic segmentation method based on self-attention and convolution feature fusion
CN117333497A (en) * 2023-10-30 2024-01-02 大连理工大学 Mask supervision strategy-based three-dimensional medical image segmentation method for efficient modeling
CN117808834A (en) * 2024-01-03 2024-04-02 安徽理工大学 SAM-based cross-modal domain generalization medical image segmentation method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908442A (en) * 2023-01-06 2023-04-04 山东巍然智能科技有限公司 Image panorama segmentation method for unmanned aerial vehicle ocean monitoring and model building method
CN116206112A (en) * 2023-03-17 2023-06-02 西安电子科技大学 Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM

Also Published As

Publication number Publication date
CN118172561A (en) 2024-06-11

Similar Documents

Publication Publication Date Title
US11538246B2 (en) Method and apparatus for training feature extraction model, computer device, and computer-readable storage medium
US11443438B2 (en) Network module and distribution method and apparatus, electronic device, and storage medium
CN108121952A (en) Face key independent positioning method, device, equipment and storage medium
CN109189879B (en) Electronic book display method and device
CN113792207A (en) Cross-modal retrieval method based on multi-level feature representation alignment
CN115100472B (en) Training method and device for display object recognition model and electronic equipment
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN115641518B (en) View perception network model for unmanned aerial vehicle and target detection method
CN113590881A (en) Video clip retrieval method, and training method and device of video clip retrieval model
CN112163428A (en) Semantic tag acquisition method and device, node equipment and storage medium
CN111753895A (en) Data processing method, device and storage medium
CN116187398B (en) Method and equipment for constructing lightweight neural network for unmanned aerial vehicle ocean image detection
CN116824533A (en) Remote small target point cloud data characteristic enhancement method based on attention mechanism
CN116863286B (en) Double-flow target detection method and model building method thereof
CN114332149A (en) Image segmentation method and device, electronic equipment and storage medium
CN114677517A (en) Semantic segmentation network model for unmanned aerial vehicle and image segmentation identification method
CN113496237B (en) Domain adaptive neural network training and traffic environment image processing method and device
CN115578683B (en) Construction method of dynamic gesture recognition model and dynamic gesture recognition method
CN118172561B (en) Complex image segmentation model for unmanned aerial vehicle scene and segmentation method
CN116704385A (en) Method for detecting and tracking moving object target under unmanned airport scene and model thereof
CN116310633A (en) Key point detection model training method and key point detection method
CN111274389A (en) Information processing method and device, computer equipment and storage medium
CN113706506B (en) Method and device for detecting assembly state, electronic equipment and storage medium
Lv et al. A lightweight fire detection algorithm for small targets based on YOLOv5s
Hernandez An Application of Rule-Based Classification with Fuzzy Logic to Image Subtraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant