CN118172561B - Complex image segmentation model for unmanned aerial vehicle scene and segmentation method - Google Patents
Complex image segmentation model for unmanned aerial vehicle scene and segmentation method Download PDFInfo
- Publication number
- CN118172561B CN118172561B CN202410605358.0A CN202410605358A CN118172561B CN 118172561 B CN118172561 B CN 118172561B CN 202410605358 A CN202410605358 A CN 202410605358A CN 118172561 B CN118172561 B CN 118172561B
- Authority
- CN
- China
- Prior art keywords
- edge
- mask
- image
- features
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/17—Terrestrial scenes taken from planes or by drones
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Remote Sensing (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a complex image segmentation model and a segmentation method for an unmanned aerial vehicle scene, and belongs to the technical field of unmanned aerial vehicle image recognition and segmentation based on computer vision. The invention provides a novel EG-SAM model based on the SAM model, which uses the SAM and edge supervision to improve the model's perception of unmanned aerial vehicle image edges, explores the integration of valuable additional edge semantic information related to the object, and guides the learning process of the model. The edge branch comprises two important modules: the gradient edge attention module cooperatively generates an accurate edge mask using the early-stage and late-stage semantic features, and the edge driving module integrates and fuses edge information through a dual attention mechanism, forcing the model to generate features that emphasize object structure. Excellent segmentation performance can be maintained even when the targets in the coastal-zone ecological environment imaged by the unmanned aerial vehicle are very complex.
Description
Technical Field
The invention belongs to the technical field of unmanned aerial vehicle image recognition and segmentation based on computer vision, and particularly relates to a complex image segmentation model and a segmentation method for an unmanned aerial vehicle scene.
Background
Unmanned aerial vehicles are widely used in marine environment monitoring tasks because of their low cost, flexible shooting, large field of view and high resolution. For example, the similarity between spilled oil and different kinds of objects such as the sea surface, Spartina alterniflora and algae is high, and general segmentation algorithms perform poorly on such targets, which brings new challenges to semantic segmentation in the unmanned aerial vehicle scene. Moreover, from the unmanned aerial vehicle's viewing angle, coastal-zone ecological plants are criss-crossed and unevenly distributed; for certain types of plants or for sea-surface oil spills, the distribution is hollowed-out and the edges are in most cases complex structures. Traditional general segmentation models often produce adhesion, splitting or misjudgment when segmenting such targets.
For proprietary segmentation models such as GSCNN (Gated Shape CNNs for Semantic Segmentation), although a new dual-stream CNN architecture for semantic segmentation is proposed that explicitly routes shape information into a separate processing branch, i.e., a shape stream, which processes the information in parallel with the classical stream, the features in the edge stream are often poorly extracted and the resulting edge details are poor. The commonly accepted reason is that the gate filters out many details of the texture features while removing noise. Moreover, such proprietary segmentation models do not generalize and perform poorly when migrated to other tasks.
For a general segmentation model such as HQ-SAM (Segment Anything in High Quality), performance is unsatisfactory on hollowed-out and multi-elongated structured objects, and many false positives occur. One main reason is that HQ-SAM does not fully exploit the early local fine-grained features, which often contain much redundant low-level information such as edges and textures; simply extracting the early output features and fusing them directly with the late features, without processing the early output features, often loses many important fine-grained details because the semantic information contained in the early and late features does not match.
Disclosure of Invention
In view of the above, the present invention provides a new EG-SAM model (An Edge-Guided SAM for Accurate Complex Object Segmentation, an edge-supervised segmentation model for complex targets), which decouples the edge information and the overall features of unmanned aerial vehicle images, balances the distribution between thin-structure pixels and non-thin-structure pixels using edge supervision, and predicts accurate segmentation masks even in very challenging cases. In order to avoid reducing SAM's powerful zero-shot capability, the encoder structure is frozen, the decoder is fine-tuned, and their outputs are reused.
The first aspect of the invention provides a complex image segmentation model for an unmanned aerial vehicle scene, which is constructed based on the SAM model and comprises an image encoder, a prompt encoder, a mask decoder and a multi-layer perceptron; a gradient edge attention module, an edge driving module and a feature fusion module are added to construct an edge-supervised segment-anything model, namely the EG-SAM model;
The gradient edge attention module is positioned behind the image encoder, and the image encoder sends the intermediate and final results it generates into this module; that is, the gradient edge attention module uses the outputs of the sixth layer and the twenty-fourth layer of the image encoder to extract the edge features of the unmanned aerial vehicle image and to generate a fine edge mask;
The feature fusion module is used for receiving mask features from the mask decoder, edge features of the gradient edge attention module and original output features of sixth and twenty-fourth encoder blocks of the image encoder;
The edge driving module is arranged behind the gradient edge attention module and the mask decoder and is used for integrating the edge mask information, the mask characteristics generated by the mask decoder and the final output characteristics of the image encoder, improving the expression capacity of the model and slowing down the risk of information loss; and finally outputting the final unmanned aerial vehicle image mask after the finally output characteristics pass through the multi-layer perceptron.
Preferably, in the image encoder, the other structures of the image encoder are frozen and only the early-stage and late-stage semantic output results are reused. The specific flow is as follows: for a picture of the unmanned aerial vehicle coastal zone with input size (c, h, w), the image encoder processes the picture only once, scaling it to the unified input size (c, 1024, 1024), where c represents the number of channels; the input image is then divided into a series of image blocks, and linear mapping is applied to the image blocks to obtain their respective sequence vectors; the image encoder uses an MAE pre-trained ViT-L model that contains 24 Transformer Blocks, each processing the image output sequence from the previous block, and in addition to the normal final output of the image encoder, the results of the 6th and 24th layers are extracted and sent to the gradient edge attention module for generating fine edge information.
Preferably, the gradient edge attention module has the specific structure as follows:
First, the layer-6 features $F_6$ and the layer-24 features $F_{24}$ are converted to the same number of channels by a 1×1 convolution layer, their gradients are calculated separately, and the results are concatenated, namely:

$$F_{grad} = \mathrm{Cat}\big(\nabla(\mathrm{Conv}_{1\times 1}(F_6)),\ \nabla(\mathrm{Conv}_{1\times 1}(F_{24}))\big)$$

where Cat stands for the Concat operation, $\mathrm{Conv}_{1\times 1}$ represents a convolution operation using a 1×1 convolution block, and $\nabla$ represents the gradient calculation;

the two are then concatenated and fused, and after the number of channels is adjusted, a point-by-point multiplication is carried out, namely:

$$F_{m} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Cat}(F_6, F_{24})\big) \otimes F_{grad}$$

where $\mathrm{Conv}_{1\times 1}$ represents a convolution operation using a 1×1 convolution block, Cat represents the Concat operation, and $\otimes$ represents element-wise multiplication;

finally, the resulting features are passed into two consecutive 3×3 convolution blocks, which helps the network learn finer-grained image features and capture local texture and shape information; a 1×1 convolution layer and a Sigmoid activation layer then yield the edge features of the segmented object, which can be expressed as:

$$F_{edge} = \sigma\Big(\mathrm{Conv}_{1\times 1}\big(\mathrm{Conv}_{3\times 3}(\mathrm{Conv}_{3\times 3}(F_{m}))\big)\Big)$$

where $\sigma$ represents the sigmoid activation function, $\mathrm{Conv}_{1\times 1}$ represents a convolution operation using a 1×1 convolution block, and $\mathrm{Conv}_{3\times 3}$ represents a convolution operation using a 3×3 convolution block.
Preferably, the mask decoder receives the image embedding from the image encoder and the position-encoding information from the prompt encoder, and the final output target mask information is obtained after passing through the multi-layer perceptron;

to improve the output quality of the mask decoder for the final mask, an EG-SAM token is introduced to perform mask prediction: the weights of a dynamic MLP are predicted, dynamic mask prediction is then performed with the mask features, and a new mask prediction layer is introduced to perform high-quality mask prediction.
Preferably, the edge driving module utilizes a double staggered attention mechanism to comprehensively capture complex relations between local edges or texture features and overall global semantic features, integrates edge mask information extracted by the gradient edge attention module, mask features generated by a mask decoder and final output features of an image encoder, improves the expression capability of a model, and slows down the risk of information loss, and the specific structure and the processing flow are as follows:
The edge features $F_{edge}$ are downsampled to the same size as the fused features $F_{fuse}$. In the upper branch, $D(F_{edge})$ and $F_{fuse}$ are multiplied element by element, and $F_{fuse}$ is then added element by element; the lower branch performs the same operation with the two inputs interchanged, which is denoted as:

$$F_{up} = \big(D(F_{edge}) \otimes F_{fuse}\big) \oplus F_{fuse}, \qquad F_{low} = \big(D(F_{edge}) \otimes F_{fuse}\big) \oplus D(F_{edge})$$

where $D$ represents the downsampling operation, $\otimes$ represents element-wise multiplication, and $\oplus$ represents element-wise addition;

subsequently, the two outputs are fused and passed through two separate 3×3 convolution operations; adaptive average pooling is then applied to the obtained features to adjust the output size; next, a 1×1 convolution block and a sigmoid layer are applied, and this output is further multiplied element-wise with the previous convolution output to produce the final output features $F_{out}$, expressed as:

$$F_{c} = \mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{3\times 3}(\mathrm{Cat}(F_{up}, F_{low}))\big), \qquad F_{out} = \sigma\big(\mathrm{Conv}_{1\times 1}(P(F_{c}))\big) \otimes F_{c}$$

where $P$ represents the global pooling operation, $\sigma$ denotes the sigmoid activation layer, and $\mathrm{Conv}_{3\times 3}$ and $\mathrm{Conv}_{1\times 1}$ represent the 3×3 and 1×1 convolution operations, respectively.
Preferably, a loss function is designed for the constructed EG-SAM model with two types of annotation supervision, namely the object mask $G_m$ and the object edge $G_e$. For mask supervision, the feature map is randomly sampled using an uncertainty function; the DICE function is then used to measure the overlap between the sampled regions and the ground-truth regions, which effectively alleviates the class-imbalance problem; meanwhile, the binary cross-entropy (BCE) loss function is used to calculate the classification loss, with the formula:

$$\mathcal{L}_{mask} = \mathcal{L}_{DICE}\big(S(P_m), S(G_m)\big) + \mathcal{L}_{BCE}\big(S(P_m), S(G_m)\big)$$

where $S$ is the random sampling function, $P_m$ represents the predicted mask result and $G_m$ represents the ground-truth mask result;

for edge supervision, the DICE loss function is adopted directly to address the imbalance between positive and negative classes, namely:

$$\mathcal{L}_{edge} = \mathcal{L}_{DICE}(P_e, G_e)$$

Thus, the total loss function is

$$\mathcal{L}_{total} = \mathcal{L}_{mask} + \mathcal{L}_{edge}$$
The second aspect of the invention provides a complex image segmentation method for an unmanned aerial vehicle scene, which comprises the following steps:
shooting by an unmanned aerial vehicle to obtain an image;
Inputting an image into the complex image segmentation model for the unmanned aerial vehicle scene according to the first aspect for image segmentation processing;
outputting the mask image after the segmentation processing.
A third aspect of the present invention provides a complex image segmentation device for an unmanned aerial vehicle scene, the device comprising at least one processor and at least one memory, the processor and memory being coupled; a computer-executable program of the complex image segmentation model for an unmanned aerial vehicle scene according to the first aspect is stored in the memory; when the processor executes the computer-executable program stored in the memory, the processor is caused to execute a complex image segmentation method for the unmanned aerial vehicle scene.
A fourth aspect of the present invention provides a computer-readable storage medium having stored therein a computer program or instructions of the complex image segmentation model for an unmanned aerial vehicle scene according to the first aspect, which when executed by a processor, causes the processor to perform a complex image segmentation method for an unmanned aerial vehicle scene.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a novel EG-SAM model based on the SAM model, which utilizes the SAM and the edge supervision to improve the perception of the image edge of the unmanned aerial vehicle by the model, explores the integration of valuable additional edge semantic information related to the object, and guides the learning process of the model. The edge branch comprises two important modules, the gradient edge attention module cooperatively generates an accurate edge mask by utilizing the front-stage semantic features and the rear-stage semantic features, and the edge driving module integrates and fuses edge information through a dual attention mechanism to force the model to generate features for emphasizing the object structure, so that EG-SAM can be guided to keep high-quality segmentation prediction even under the very complex condition.
The EG-SAM model can maintain excellent segmentation performance and accurate segmentation precision even when targets such as the coastal-zone ecological environment imaged by the unmanned aerial vehicle are very complex, for example multi-elongated structures, complicated hollowed-out structures and camouflaged structures with boundaries that are difficult to distinguish. The key point of the invention is that fine-tuning the general segmentation model SAM not only maintains the original zero-shot generalization capability while adding only a small number of parameters, but also makes full use of the early and late semantic features of the object in the edge branch to generate an accurate edge mask, thereby providing segmentation guidance for the overall complex structure.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will be given simply with reference to the accompanying drawings, which are used in the description of the embodiments or the prior art, it being evident that the following description is only one embodiment of the invention, and that other drawings can be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram showing the overall structure of the EG-SAM model of the present invention.
Fig. 2 is a schematic diagram of an input picture (model input) and an output mask (model output) in embodiment 1.
FIG. 3 is a schematic diagram of a gradient edge attention module structure according to the present invention.
Fig. 4 is a schematic diagram of an edge driving module structure according to the present invention.
Fig. 5 is a simple structure diagram of the complex image segmentation apparatus in embodiment 2.
Detailed Description
Example 1:
the invention will be further described with reference to specific examples.
The invention provides a novel EG-SAM model based on the SAM model, which utilizes the SAM and the edge supervision to improve the perception of the image edge of the unmanned aerial vehicle by the model, explores the integration of valuable additional edge semantic information related to the object, and guides the learning process of the model. The edge branch comprises two important modules, the gradient edge attention module cooperatively generates an accurate edge mask by utilizing the front-stage semantic features and the rear-stage semantic features, and the edge driving module integrates and fuses edge information through a dual attention mechanism to force the model to generate features for emphasizing the object structure, so that EG-SAM can be guided to keep high-quality segmentation prediction even under the very complex condition.
The overall framework of the EG-SAM model is shown in FIG. 1. The present invention freezes the ViT image encoder in the SAM and reuses its output. Specifically, the output features of the first and last Transformer Blocks in the ViT that use global attention are chosen, i.e., the sixth-layer and twenty-fourth-layer outputs of the ViT are reused. They are then fed into the gradient edge attention module, which generates edge mask information, and into the final feature fusion module; after the edge driving module receives the edge output features of the gradient edge attention module and the fused features from the feature fusion module, the information between them is mined and integrated. The finally output features pass through the multi-layer perceptron to output the final mask. An example of the unmanned aerial vehicle image input and output is shown in FIG. 2.
The feature fusion module is configured to receive the mask features from the mask decoder, the edge features of the gradient edge attention module, and the original output features of the sixth and twenty-fourth encoder blocks of the image encoder. The numbers of channels of these inputs are unified through a convolution block, their fused features are then obtained through element-by-element addition, and the fused features are input into the edge driving module for feature integration. The purpose of this is to prevent the segmentation details from being covered by other strong semantic regions during feature fusion.
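As an illustration of the feature-fusion step just described, the following minimal PyTorch sketch unifies channel counts with 1×1 convolutions and fuses the four inputs by element-by-element addition. The class name, channel widths and the choice of 1×1 convolutions as the channel-unifying blocks are assumptions made for this example, not details taken from the patented implementation.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Minimal sketch: unify channel counts with 1x1 convolutions, then fuse
    the mask features, edge features and the two reused encoder features by
    element-wise addition."""
    def __init__(self, in_channels=(256, 1, 1024, 1024), out_channels=256):
        super().__init__()
        # one 1x1 convolution per input stream to unify channel counts
        self.proj = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, mask_feat, edge_feat, enc_feat6, enc_feat24):
        feats = [mask_feat, edge_feat, enc_feat6, enc_feat24]
        fused = 0
        for proj, f in zip(self.proj, feats):
            fused = fused + proj(f)   # element-by-element addition after projection
        return fused

if __name__ == "__main__":
    fusion = FeatureFusion()
    mask_feat = torch.randn(1, 256, 64, 64)   # toy shapes, for illustration only
    edge_feat = torch.randn(1, 1, 64, 64)
    enc6 = torch.randn(1, 1024, 64, 64)
    enc24 = torch.randn(1, 1024, 64, 64)
    print(fusion(mask_feat, edge_feat, enc6, enc24).shape)  # (1, 256, 64, 64)
```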
(I) Image encoder
In this structure, the structure of the image encoder is frozen, and only its early-stage and late-stage semantic output results are reused. The specific flow is as follows: for a picture of the unmanned aerial vehicle coastal zone with input size (c, h, w), the image encoder processes the picture only once, scaling the image to the unified input size (c, 1024, 1024), where c represents the number of channels. The input image is then divided into a series of image blocks, which are linearly mapped (using a convolution layer) to obtain their respective sequence vectors. Since the image encoder uses an MAE (Masked AutoEncoder) pre-trained ViT-L model, it contains 24 Transformer Blocks, each of which processes the image output sequence from the previous block. In addition to the normal final output of the image encoder, the results of layers 6 and 24 are also extracted and fed to the gradient edge attention module for generating fine edge information.
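A minimal sketch of how the frozen encoder's intermediate outputs can be reused follows: forward hooks capture the outputs of the 6th and 24th blocks without modifying the encoder. The ToyViT stand-in, its `blocks` attribute and the reduced demo input size are assumptions for illustration; this is not the SAM ViT-L code itself.

```python
import torch
import torch.nn as nn

class ToyViT(nn.Module):
    """Stand-in for the 24-block ViT-L image encoder that EG-SAM freezes.
    ViT-L uses an embedding width of 1024; the demo input below is reduced
    to 256x256 only to keep the example fast."""
    def __init__(self, dim=1024, depth=24):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):
        x = self.patch_embed(x)                     # (B, dim, H/16, W/16)
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)            # (B, N, dim) token sequence
        for blk in self.blocks:
            x = blk(x)
        return x.transpose(1, 2).reshape(b, c, h, w)

encoder = ToyViT()
for p in encoder.parameters():                      # the encoder stays frozen in EG-SAM
    p.requires_grad_(False)

captured = {}
def save_output(name):
    def hook(_module, _inputs, output):
        captured[name] = output                     # keep the block output for the edge branch
    return hook

# reuse the outputs of the 6th and 24th blocks (indices 5 and 23)
encoder.blocks[5].register_forward_hook(save_output("layer6"))
encoder.blocks[23].register_forward_hook(save_output("layer24"))

with torch.no_grad():
    final = encoder(torch.randn(1, 3, 256, 256))
print(final.shape, captured["layer6"].shape, captured["layer24"].shape)
```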
(II) Gradient edge attention module
Elongated, hollowed-out or camouflaged objects in unmanned aerial vehicle images typically have unique shape features and significant edge variations. Introducing edge supervision therefore helps the model distinguish the boundary between object and background more easily. However, features extracted from a single hierarchy may have certain limitations; by effectively integrating visual semantic information from different layers, the accuracy and fineness of the edges of the segmented object can be markedly improved. Thus, the gradient edge attention module uses the outputs of layers 6 and 24 of the ViT to extract edge features, as shown in FIG. 3.
The early features can effectively capture local edge semantic information of the input image and are very sensitive to detected local features such as object contours. The later features carry rich global semantic information, so the model can better understand the relationship between the whole scene and the object, thereby improving the perception of the overall shape and structure of the object. First, the layer-6 features $F_6$ and the layer-24 features $F_{24}$ are converted to the same number of channels through a 1×1 convolution layer, and a concatenation is performed after calculating their gradients separately. These gradients provide information about the rate of change at each location in the image, helping to capture edge details more accurately. Namely:

$$F_{grad} = \mathrm{Cat}\big(\nabla(\mathrm{Conv}_{1\times 1}(F_6)),\ \nabla(\mathrm{Conv}_{1\times 1}(F_{24}))\big)$$

where Cat stands for the Concat operation, $\mathrm{Conv}_{1\times 1}$ represents a convolution operation using a 1×1 convolution block, and $\nabla$ represents the gradient calculation.

The two are then concatenated and fused, and point-by-point multiplication is carried out after the number of channels is adjusted. Namely:

$$F_{m} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Cat}(F_6, F_{24})\big) \otimes F_{grad}$$

where $\mathrm{Conv}_{1\times 1}$ represents a convolution operation using a 1×1 convolution block, Cat represents the Concat operation, and $\otimes$ represents element-wise multiplication.

Finally, the resulting features are passed into two consecutive 3×3 convolution blocks, which helps the network learn finer-grained image features and capture local texture and shape information. Then, by means of a 1×1 convolution layer and a Sigmoid activation layer, the edge features of the segmented object can be obtained, which can be expressed as:

$$F_{edge} = \sigma\Big(\mathrm{Conv}_{1\times 1}\big(\mathrm{Conv}_{3\times 3}(\mathrm{Conv}_{3\times 3}(F_{m}))\big)\Big)$$

where $\sigma$ represents the sigmoid activation function, $\mathrm{Conv}_{1\times 1}$ represents a convolution operation using a 1×1 convolution block, and $\mathrm{Conv}_{3\times 3}$ represents a convolution operation using a 3×3 convolution block.
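A minimal PyTorch sketch of a gradient edge attention module following the flow above is shown below. The finite-difference gradient operator, the ReLU activations inside the 3×3 blocks and the channel widths are assumptions for illustration, since the text specifies only the operations, not their exact parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def spatial_gradient(x):
    """Approximate per-pixel gradient magnitude with finite differences
    (the exact gradient operator is an assumption of this sketch)."""
    dx = F.pad(x[:, :, :, 1:] - x[:, :, :, :-1], (0, 1, 0, 0))
    dy = F.pad(x[:, :, 1:, :] - x[:, :, :-1, :], (0, 0, 0, 1))
    return torch.sqrt(dx ** 2 + dy ** 2 + 1e-6)

class GradientEdgeAttention(nn.Module):
    """Sketch of the GEA module: fuse layer-6 and layer-24 features with
    their gradients and predict a single-channel edge mask."""
    def __init__(self, in_ch=1024, mid_ch=256):
        super().__init__()
        self.reduce6 = nn.Conv2d(in_ch, mid_ch, 1)        # unify channel counts
        self.reduce24 = nn.Conv2d(in_ch, mid_ch, 1)
        self.fuse = nn.Conv2d(2 * mid_ch, 2 * mid_ch, 1)  # channel adjust after Concat
        self.refine = nn.Sequential(                      # two consecutive 3x3 conv blocks
            nn.Conv2d(2 * mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(mid_ch, 1, 1)               # 1x1 conv + sigmoid -> edge mask

    def forward(self, f6, f24):
        f6, f24 = self.reduce6(f6), self.reduce24(f24)
        grad = torch.cat([spatial_gradient(f6), spatial_gradient(f24)], dim=1)
        fused = self.fuse(torch.cat([f6, f24], dim=1)) * grad   # point-by-point multiplication
        return torch.sigmoid(self.head(self.refine(fused)))

if __name__ == "__main__":
    gea = GradientEdgeAttention()
    f6 = torch.randn(1, 1024, 64, 64)
    f24 = torch.randn(1, 1024, 64, 64)
    print(gea(f6, f24).shape)  # torch.Size([1, 1, 64, 64])
```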
(III) Prompt encoder
In addition to the necessary unmanned aerial vehicle image, EG-SAM requires prompt information to be provided. The prompt information is divided into two types: sparse prompts, including point prompts, box prompts and text prompts; and dense prompts, i.e., image masks. For sparse prompts, only encoding is needed to obtain the vector encodings. For dense prompts, the image mask is downscaled by a factor of 4, implemented by two 2×2, stride-2 convolutions with output channels 4 and 16; a 1×1 convolution then maps these channels to 256 channels, each layer followed by layer normalization, and the resulting mask embedding is added element by element to the image embedding output by the image encoder. In the case where no mask prompt is presented, a learned embedding representing "no mask" is added at each image embedding location.
After the corresponding embeddings are obtained, they are all passed on as prompt and position-encoding information for the mask decoder.
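The dense-prompt path can be sketched as follows, closely following the description above (two 2×2 stride-2 convolutions with 4 and 16 output channels, a 1×1 convolution to 256 channels, layer normalization, and element-by-element addition to the image embedding). The GELU activations and the helper LayerNorm2d class are assumptions added to make the sketch runnable.

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """Channel-wise layer normalization for NCHW tensors (helper for the sketch)."""
    def __init__(self, num_channels, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x):
        mu = x.mean(1, keepdim=True)
        var = (x - mu).pow(2).mean(1, keepdim=True)
        x = (x - mu) / torch.sqrt(var + self.eps)
        return x * self.weight[:, None, None] + self.bias[:, None, None]

class DenseMaskEmbedding(nn.Module):
    """Sketch of the dense-prompt (mask) embedding path."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.downscale = nn.Sequential(
            nn.Conv2d(1, 4, kernel_size=2, stride=2), LayerNorm2d(4), nn.GELU(),
            nn.Conv2d(4, 16, kernel_size=2, stride=2), LayerNorm2d(16), nn.GELU(),
            nn.Conv2d(16, embed_dim, kernel_size=1),
        )
        # learned embedding used when no mask prompt is given
        self.no_mask_embed = nn.Embedding(1, embed_dim)

    def forward(self, image_embedding, mask=None):
        b, c, h, w = image_embedding.shape
        if mask is None:
            dense = self.no_mask_embed.weight.reshape(1, c, 1, 1).expand(b, c, h, w)
        else:
            dense = self.downscale(mask)      # mask resolution must be 4x the embedding grid
        return image_embedding + dense        # element-by-element addition

if __name__ == "__main__":
    pe = DenseMaskEmbedding()
    img_emb = torch.randn(1, 256, 64, 64)
    mask = torch.randn(1, 1, 256, 256)        # 4x the 64x64 embedding grid
    print(pe(img_emb, mask).shape, pe(img_emb).shape)
```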
(IV) Mask decoder
The mask decoder receives the image embedding from the image encoder and the position-encoding information from the prompt encoder, and the final output target mask information is obtained after passing through the multi-layer perceptron. Specifically, the native SAM mask decoder performs a self-attention operation over the output tokens and the input prompt tokens. The resulting output is then used as query tokens for cross-attention with the input image embedding. The tokens are then updated by an MLP, and finally the image embedding is used as the query for cross-attention with the tokens obtained from the MLP, so that the image embedding is updated with the prompt information.

To improve the output quality of the mask decoder for the final mask, an EG-SAM token is introduced into it to perform mask prediction: the weights of a dynamic MLP are predicted, dynamic mask prediction is then performed with the mask features, and a new mask prediction layer is introduced to perform high-quality mask prediction.
(V) Edge driving module
In general, the distribution of local edge information and global context semantic information of the ecological target object of the unmanned aerial vehicle coastal zone is uneven, so that the model may be too focused on certain non-thin structural areas with rich features while important information of other thin structural areas is inevitably ignored when performing attention-related operations. Therefore, an Edge Driving Module (EDM) is designed, which utilizes a double staggered attention mechanism to comprehensively capture the complex relation between local edges or texture features and overall global semantic features, integrates the edge mask information extracted by the gradient edge attention module (GEA) with mask features generated by a mask decoder and final output features of an image encoder, thereby improving the expression capability of a model and reducing the risk of information loss.
An Edge Driving Module (EDM) aims to integrate the extracted edge mask information with mask features generated by a mask decoder and final output features of an image encoder. The module aims to enhance the understanding and expression of the complex structural object by the model through the guidance of the edge information. The distribution of edge information and global semantic information is non-uniform. In performing attention-related operations, it is possible to focus too much on certain areas while ignoring other important information. Therefore, a dual-attention mechanism module is designed to comprehensively capture the complex relationship between the two, so that the expression capacity of the model is improved, and the risk of information loss is reduced.
As shown in FIG. 4, the edge driving module first downsamples the edge features $F_{edge}$ to the same dimensions as the fused features $F_{fuse}$. Taking the upper branch as an example, $D(F_{edge})$ and $F_{fuse}$ are multiplied element by element and $F_{fuse}$ is then added element by element; the lower branch performs the same operation with the two inputs interchanged, which can be expressed as:

$$F_{up} = \big(D(F_{edge}) \otimes F_{fuse}\big) \oplus F_{fuse}, \qquad F_{low} = \big(D(F_{edge}) \otimes F_{fuse}\big) \oplus D(F_{edge})$$

where $D$ represents the downsampling operation, $\otimes$ represents element-wise multiplication, and $\oplus$ represents element-wise addition.

Subsequently, the two outputs are fused and passed through two separate 3×3 convolution operations. The resulting features are then adaptively average-pooled to adjust the output size. Next, a 1×1 convolution block and a sigmoid layer are applied, and this output is further multiplied element-wise with the previous convolution output to produce the final output features $F_{out}$. The formula is as follows:

$$F_{c} = \mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{3\times 3}(\mathrm{Cat}(F_{up}, F_{low}))\big), \qquad F_{out} = \sigma\big(\mathrm{Conv}_{1\times 1}(P(F_{c}))\big) \otimes F_{c}$$

where $P$ represents the global pooling operation, $\sigma$ denotes the sigmoid activation layer, and $\mathrm{Conv}_{3\times 3}$ and $\mathrm{Conv}_{1\times 1}$ represent the 3×3 and 1×1 convolution operations, respectively.
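A minimal PyTorch sketch of the edge driving module following the processing flow above is given below. The exact interleaving of the two branches, the 1×1 projection used to lift the single-channel edge mask to the feature width, and the ReLU activations are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeDrivingModule(nn.Module):
    """Sketch of the EDM: two interleaved attention branches followed by
    convolution, adaptive average pooling and a sigmoid re-weighting."""
    def __init__(self, edge_ch=1, feat_ch=256, out_size=64):
        super().__init__()
        self.edge_proj = nn.Conv2d(edge_ch, feat_ch, 1)   # lift edge mask to feature width
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(out_size)
        self.gate = nn.Conv2d(feat_ch, feat_ch, 1)

    def forward(self, edge_feat, fused_feat):
        # downsample the edge features to the size of the fused features
        e = F.interpolate(edge_feat, size=fused_feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        e = self.edge_proj(e)
        upper = e * fused_feat + fused_feat        # branch 1: fused features as residual
        lower = e * fused_feat + e                 # branch 2: edge features as residual
        x = self.conv3x3(torch.cat([upper, lower], dim=1))
        x = self.pool(x)
        gate = torch.sigmoid(self.gate(x))         # 1x1 conv + sigmoid
        return gate * x                            # element-wise re-weighting -> final features

if __name__ == "__main__":
    edm = EdgeDrivingModule()
    edge = torch.randn(1, 1, 256, 256)
    fused = torch.randn(1, 256, 64, 64)
    print(edm(edge, fused).shape)                  # torch.Size([1, 256, 64, 64])
```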
(VI) Loss function
The model is supervised by two types of annotations: the object mask $G_m$ and the object edge $G_e$. For mask supervision, the feature map is randomly sampled using an uncertainty function. The DICE function is then used to measure the overlap between the sampled regions and the ground-truth regions, which effectively addresses the class-imbalance problem and enhances robustness. Meanwhile, a classification loss is calculated using the binary cross-entropy (BCE) loss function. The formula is as follows:

$$\mathcal{L}_{mask} = \mathcal{L}_{DICE}\big(S(P_m), S(G_m)\big) + \mathcal{L}_{BCE}\big(S(P_m), S(G_m)\big)$$

where $S$ is the random sampling function, $P_m$ represents the predicted mask result and $G_m$ represents the ground-truth mask result.

For edge supervision, the DICE loss function is adopted directly to address the imbalance between positive and negative classes, namely:

$$\mathcal{L}_{edge} = \mathcal{L}_{DICE}(P_e, G_e)$$

Thus, the total loss function is

$$\mathcal{L}_{total} = \mathcal{L}_{mask} + \mathcal{L}_{edge}$$
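The supervision described above can be written as the following sketch. The uncertainty-based sampling is simplified here to plain random point sampling, and the DICE and BCE terms use their standard formulations; both simplifications are assumptions rather than the patented training code.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1.0):
    """Standard DICE loss on probabilities in [0, 1]."""
    pred, target = pred.flatten(1), target.flatten(1)
    inter = (pred * target).sum(dim=1)
    union = pred.sum(dim=1) + target.sum(dim=1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def sample_points(logits, target, num_points=1024):
    """Simplified stand-in for uncertainty-based sampling: pick random points
    from the prediction and the ground truth at the same locations."""
    b = logits.shape[0]
    flat_logits, flat_target = logits.flatten(1), target.flatten(1)
    idx = torch.randint(0, flat_logits.shape[1], (b, num_points), device=logits.device)
    return flat_logits.gather(1, idx), flat_target.gather(1, idx)

def eg_sam_loss(mask_logits, mask_gt, edge_pred, edge_gt):
    """Total loss: sampled DICE + BCE on the mask branch, DICE on the edge branch."""
    sampled_logits, sampled_gt = sample_points(mask_logits, mask_gt)
    l_mask = dice_loss(torch.sigmoid(sampled_logits), sampled_gt) + \
             F.binary_cross_entropy_with_logits(sampled_logits, sampled_gt)
    l_edge = dice_loss(edge_pred, edge_gt)
    return l_mask + l_edge

if __name__ == "__main__":
    mask_logits = torch.randn(2, 1, 256, 256)
    mask_gt = (torch.rand(2, 1, 256, 256) > 0.5).float()
    edge_pred = torch.rand(2, 1, 256, 256)          # already passed through sigmoid
    edge_gt = (torch.rand(2, 1, 256, 256) > 0.9).float()
    print(eg_sam_loss(mask_logits, mask_gt, edge_pred, edge_gt).item())
```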
Experimental verification
TABLE 1 Inference results on three very fine-grained segmentation datasets

Model: model;

Baseline: baseline;

DIS, COIFT, ThinObject: dataset names;

mIoU: mean intersection over union;

mBIoU: mean boundary intersection over union.
As shown in Table 1, this example tested the model of the present invention on three very fine-grained datasets. Compared with the SAM and HQ-SAM models, EG-SAM achieves excellent results on both the standard mask metric mIoU and the boundary metric mBIoU.
TABLE 2 inference results on three camouflage datasets
Table 2 shows the quantitative comparison of the EG-SAM model of the present invention with 10 existing SOTA methods on three camouflage datasets. From the data, the model of the invention exhibits clearly superior results. To demonstrate the effectiveness of the present model in camouflaged target detection, experiments were performed on three challenging camouflaged target datasets: COD10K (2026 pictures), NC4K (4121 pictures) and CAMO (250 pictures). The present model was compared with 10 state-of-the-art methods, including UJSC, SINetV2, DGNet, SegMaR, ZoomNet, FDNet, BGNet, MFFN, FSPNet and EAMNet. Four well-known evaluation metrics were adopted: the structure measure (S-measure, $S_\alpha$), the weighted F-measure ($F_\beta^w$), the mean E-measure ($E_\phi$) and the mean absolute error (M). The arrow "↑" indicates that a larger result is better, and the arrow "↓" indicates that a smaller result is better. For fairness, all experimental data provided above are from the official data provided by each model.
Based on the above model, in an actual scene, the complex image segmentation method for an unmanned aerial vehicle scene comprises the following steps:
shooting by an unmanned aerial vehicle to obtain an image;
Inputting the image into the EG-SAM model for image segmentation processing;
outputting the mask image after the segmentation processing.
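As a usage illustration of the three steps above, the sketch below reads a UAV photograph, runs a segmentation model and writes the predicted mask to disk. The loaded `model` object, its assumed output format (a probability mask) and the file names are hypothetical placeholders; only the surrounding image I/O is concrete.

```python
import numpy as np
import torch
from PIL import Image

def segment_uav_image(model, image_path, mask_path, device="cpu"):
    """Step 1: read the UAV photograph; step 2: run the segmentation model;
    step 3: save the binary mask image."""
    image = Image.open(image_path).convert("RGB").resize((1024, 1024))
    x = torch.from_numpy(np.asarray(image)).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        # `model` is assumed to map a (1, 3, 1024, 1024) tensor to a (1, 1, H, W) probability mask
        mask = model(x.to(device))
    mask_img = (mask[0, 0].cpu().numpy() > 0.5).astype(np.uint8) * 255
    Image.fromarray(mask_img).save(mask_path)

# usage (hypothetical trained model object and file names):
# model = torch.load("eg_sam.pt", map_location="cpu").eval()
# segment_uav_image(model, "uav_frame.jpg", "uav_frame_mask.png")
```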
Example 2:
As shown in FIG. 5, the present application also provides a complex image segmentation device for an unmanned aerial vehicle scene, which comprises at least one processor and at least one memory, and further comprises a communication interface and an internal bus; a computer-executable program of the EG-SAM model described in embodiment 1 is stored in the memory; when the processor executes the computer-executable program stored in the memory, the processor can be caused to execute a complex image segmentation method for the unmanned aerial vehicle scene. The internal bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The buses may be divided into address buses, data buses, control buses, etc. For ease of illustration, the buses in the drawings of the present application are not limited to only one bus or to one type of bus. The memory may include a high-speed RAM memory, and may further include a non-volatile memory (NVM), such as at least one magnetic disk memory, and may also be a USB flash disk, a removable hard disk, a read-only memory, a magnetic disk, or an optical disk.
The device may be provided as a terminal, server or other form of device.
Fig. 5 is a block diagram of an apparatus shown for illustration. The device may include one or more of the following components: a processing component, a memory, a power component, a multimedia component, an audio component, an input/output (I/O) interface, a sensor component, and a communication component. The processing component generally controls overall operation of the electronic device, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component may include one or more processors to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component may include one or more modules that facilitate interactions between the processing component and other components. For example, the processing component may include a multimedia module to facilitate interaction between the multimedia component and the processing component.
The memory is configured to store various types of data to support operations at the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and the like. The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply assembly provides power to the various components of the electronic device. Power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic devices. The multimedia assembly includes a screen between the electronic device and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia assembly includes a front camera and/or a rear camera. When the electronic device is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component is configured to output and/or input an audio signal. For example, the audio component includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may be further stored in a memory or transmitted via a communication component. In some embodiments, the audio assembly further comprises a speaker for outputting audio signals. The I/O interface provides an interface between the processing assembly and a peripheral interface module, which may be a keyboard, click wheel, button, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly includes one or more sensors for providing status assessment of various aspects of the electronic device. For example, the sensor assembly may detect an on/off state of the electronic device, a relative positioning of the assemblies, such as a display and keypad of the electronic device, a change in position of the electronic device or one of the assemblies of the electronic device, the presence or absence of user contact with the electronic device, an orientation or acceleration/deceleration of the electronic device, and a change in temperature of the electronic device. The sensor assembly may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor assembly may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly may further include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component is configured to facilitate communication between the electronic device and other devices in a wired or wireless manner. The electronic device may access a wireless network based on a communication standard, such as WiFi,2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further comprises a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
Example 3:
The present invention also provides a computer-readable storage medium, in which a computer program or instructions of the EG-SAM model according to embodiment 1 are stored; when the computer program or instructions are executed by a processor, the processor is caused to perform a complex image segmentation method for an unmanned aerial vehicle scene.
In particular, there may be provided a system, apparatus or device equipped with a readable storage medium on which software program code implementing the functions of any of the above embodiments is stored, whose computer or processor reads and executes the instructions stored in the readable storage medium. In this case, the program code itself read from the readable medium can implement the functions of any of the above embodiments, and thus the machine-readable code and the readable storage medium storing the machine-readable code form part of the present invention.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW), magnetic tape, and the like. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
It should be understood that the above processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in a processor for execution.
It should be understood that the storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The processor and the storage medium may also reside as discrete components in a terminal or server.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs) or programmable logic arrays (PLAs), with state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
While the foregoing describes the embodiments of the present invention, it should be understood that the present invention is not limited to the embodiments, and that various modifications and changes can be made by those skilled in the art without any inventive effort.
Claims (8)
1. A complex image segmentation system for an unmanned aerial vehicle scene, characterized in that: the system is constructed based on the SAM model and comprises an image encoder, a prompt encoder, a mask decoder and a multi-layer perceptron, and a gradient edge attention module, an edge driving module and a feature fusion module are added to construct an edge-supervised segment-anything model, namely the EG-SAM model;
The gradient edge attention module is positioned behind the image encoder, and the image encoder sends the intermediate and final results it generates into this module; that is, the gradient edge attention module uses the outputs of the sixth layer and the twenty-fourth layer of the image encoder to extract the edge features of the unmanned aerial vehicle image and to generate a fine edge mask;
The feature fusion module is used for receiving mask features from the mask decoder, edge features of the gradient edge attention module and original output features of sixth and twenty-fourth encoder blocks of the image encoder;
The edge driving module is arranged behind the gradient edge attention module and the mask decoder and is used for integrating the edge mask information, the mask characteristics generated by the mask decoder and the final output characteristics of the image encoder, improving the expression capacity of the model and slowing down the risk of information loss; the finally output characteristics pass through a multi-layer perceptron to output a final unmanned aerial vehicle image mask;
the mask decoder receives the image embedding from the image encoder and the position-encoding information from the prompt encoder, and the final output target mask information is obtained after passing through the multi-layer perceptron;

to improve the output quality of the mask decoder for the final mask, an EG-SAM token is introduced to perform mask prediction: the weights of a dynamic MLP are predicted, dynamic mask prediction is then performed with the mask features, and a new mask prediction layer is introduced to perform high-quality mask prediction.
2. The complex image segmentation system for an unmanned aerial vehicle scene as defined in claim 1, wherein: in the image encoder, the other structures of the image encoder are frozen and only the early-stage and late-stage semantic output results are reused, with the following specific flow: for a picture of the unmanned aerial vehicle coastal zone with input size (c, h, w), the image encoder processes the picture only once, scaling it to the unified input size (c, 1024, 1024), where c represents the number of channels; the input image is then divided into a series of image blocks, and linear mapping is applied to the image blocks to obtain their respective sequence vectors; the image encoder uses an MAE pre-trained ViT-L model that contains 24 Transformer Blocks, each processing the image output sequence from the previous block, and in addition to the normal final output of the image encoder, the results of the 6th and 24th layers are extracted and sent to the gradient edge attention module for generating fine edge information.
3. The complex image segmentation system for an unmanned aerial vehicle scene according to claim 1, wherein the gradient edge attention module has the following specific structure:
First, the layer-6 features $F_6$ and the layer-24 features $F_{24}$ are converted to the same number of channels by a 1×1 convolution layer, their gradients are calculated separately, and the results are concatenated, namely:

$$F_{grad} = \mathrm{Cat}\big(\nabla(\mathrm{Conv}_{1\times 1}(F_6)),\ \nabla(\mathrm{Conv}_{1\times 1}(F_{24}))\big)$$

where Cat stands for the Concat operation, $\mathrm{Conv}_{1\times 1}$ represents a convolution operation using a 1×1 convolution block, and $\nabla$ represents the gradient calculation;

the two are then concatenated and fused, and after the number of channels is adjusted, a point-by-point multiplication is carried out, namely:

$$F_{m} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Cat}(F_6, F_{24})\big) \otimes F_{grad}$$

where $\mathrm{Conv}_{1\times 1}$ represents a convolution operation using a 1×1 convolution block, Cat represents the Concat operation, and $\otimes$ represents element-wise multiplication;

finally, the resulting features are passed into two consecutive 3×3 convolution blocks, which helps the network learn finer-grained image features and capture local texture and shape information; a 1×1 convolution layer and a Sigmoid activation layer then yield the edge features of the segmented object, which can be expressed as:

$$F_{edge} = \sigma\Big(\mathrm{Conv}_{1\times 1}\big(\mathrm{Conv}_{3\times 3}(\mathrm{Conv}_{3\times 3}(F_{m}))\big)\Big)$$

where $\sigma$ represents the sigmoid activation function, $\mathrm{Conv}_{1\times 1}$ represents a convolution operation using a 1×1 convolution block, and $\mathrm{Conv}_{3\times 3}$ represents a convolution operation using a 3×3 convolution block.
4. The complex image segmentation system for an unmanned aerial vehicle scene as defined in claim 1, wherein: the edge driving module uses a dual interleaved attention mechanism to comprehensively capture the complex relationship between local edge or texture features and the overall global semantic features, and integrates the edge mask information extracted by the gradient edge attention module, the mask features generated by the mask decoder and the final output features of the image encoder, thereby improving the expressive capacity of the model and reducing the risk of information loss; the specific structure and processing flow are as follows:
The edge features $F_{edge}$ are downsampled to the same size as the fused features $F_{fuse}$. In the upper branch, $D(F_{edge})$ and $F_{fuse}$ are multiplied element by element, and $F_{fuse}$ is then added element by element; the lower branch performs the same operation with the two inputs interchanged, which is denoted as:

$$F_{up} = \big(D(F_{edge}) \otimes F_{fuse}\big) \oplus F_{fuse}, \qquad F_{low} = \big(D(F_{edge}) \otimes F_{fuse}\big) \oplus D(F_{edge})$$

where $D$ represents the downsampling operation, $\otimes$ represents element-wise multiplication, and $\oplus$ represents element-wise addition;

subsequently, the two outputs are fused and passed through two separate 3×3 convolution operations; adaptive average pooling is then applied to the obtained features to adjust the output size; next, a 1×1 convolution block and a sigmoid layer are applied, and this output is further multiplied element-wise with the previous convolution output to produce the final output features $F_{out}$, expressed as:

$$F_{c} = \mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{3\times 3}(\mathrm{Cat}(F_{up}, F_{low}))\big), \qquad F_{out} = \sigma\big(\mathrm{Conv}_{1\times 1}(P(F_{c}))\big) \otimes F_{c}$$

where $P$ represents the global pooling operation, $\sigma$ denotes the sigmoid activation layer, and $\mathrm{Conv}_{3\times 3}$ and $\mathrm{Conv}_{1\times 1}$ represent the 3×3 and 1×1 convolution operations, respectively.
5. The complex image segmentation system for an unmanned aerial vehicle scene as defined in claim 1, wherein: to design the loss function of the constructed EG-SAM model, two types of annotation supervision are used, namely the object mask and the object edge; for mask supervision, the feature map is randomly sampled using an uncertainty function, and the DICE function is then used to measure the overlap between the sampled regions and the ground-truth regions, effectively alleviating the class-imbalance problem; meanwhile, the binary cross-entropy (BCE) loss is used to compute the classification loss, with the formula:
L_mask = DICE(S(M_pred), S(M_gt)) + BCE(S(M_pred), S(M_gt))
where S is the random sampling function, M_pred denotes the predicted mask result, and M_gt denotes the ground-truth mask result;
for edge supervision, the DICE loss function is directly adopted to address the imbalance between positive and negative classes, namely:
L_edge = DICE(E_pred, E_gt)
where E_pred and E_gt denote the predicted and ground-truth edge maps; thus the total loss function is
L_total = L_mask + L_edge.
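The loss described in this claim could be sketched as follows. The uncertainty function is assumed here to keep the points whose mask logits lie closest to the decision boundary, which is one common reading; the helper names (sample_uncertain_points, eg_sam_loss) and the number of sampled points are illustrative only.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1.0):
    # pred: probabilities in [0, 1]; target: binary ground truth (float), same shape (B, N).
    inter = (pred * target).sum(dim=-1)
    union = pred.sum(dim=-1) + target.sum(dim=-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def sample_uncertain_points(logits, target, num_points=1024):
    # S(.): keep the points whose logits are closest to the decision boundary.
    flat_logits = logits.flatten(1)              # (B, H*W)
    flat_target = target.flatten(1)
    idx = flat_logits.abs().topk(num_points, dim=1, largest=False).indices
    return flat_logits.gather(1, idx), flat_target.gather(1, idx)

def eg_sam_loss(mask_logits, mask_gt, edge_pred, edge_gt):
    # L_mask = DICE + BCE over uncertainty-sampled points; L_edge = DICE over the edge map.
    s_logits, s_gt = sample_uncertain_points(mask_logits, mask_gt)
    l_mask = dice_loss(torch.sigmoid(s_logits), s_gt) \
             + F.binary_cross_entropy_with_logits(s_logits, s_gt)
    l_edge = dice_loss(edge_pred.flatten(1), edge_gt.flatten(1))
    return l_mask + l_edge
```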
6. A complex image segmentation method for an unmanned aerial vehicle scene, characterized by comprising the following steps:
capturing an image by unmanned aerial vehicle photography;
inputting the image into the complex image segmentation system for the unmanned aerial vehicle scene according to any one of claims 1 to 5 for image segmentation processing;
outputting the mask image after the segmentation processing.
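For orientation only, a hypothetical inference flow tying the earlier sketches together might look like the following; it assumes the FrozenViTEncoder and GradientEdgeAttention sketches above are in scope, and the final thresholding is only a placeholder for the mask decoder and edge-driven fusion that produce the real output mask.

```python
import torch

encoder = FrozenViTEncoder()                     # frozen ViT-style backbone (sketch)
geam = GradientEdgeAttention(c6=1024, c24=1024)  # gradient edge attention (sketch)

image = torch.rand(1, 3, 720, 1280)              # frame captured by the UAV

def tokens_to_map(t, dim=1024, side=64):
    # (B, N, C) token sequence -> (B, C, side, side) feature map
    return t.transpose(1, 2).reshape(t.shape[0], dim, side, side)

with torch.no_grad():
    tokens, f6, f24 = encoder(image)
    edge = geam(tokens_to_map(f6), tokens_to_map(f24))
    mask = (edge > 0.5).float()                  # placeholder for the decoded mask output
```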
7. A complex image segmentation device for an unmanned aerial vehicle scene, characterized in that: the device comprises at least one processor and at least one memory, the processor and the memory being coupled; a computer-executable program of the complex image segmentation system for an unmanned aerial vehicle scene according to any one of claims 1 to 5 is stored in the memory; when the processor executes the computer-executable program stored in the memory, the processor is caused to perform the complex image segmentation method for the unmanned aerial vehicle scene.
8. A computer-readable storage medium having stored therein a computer program or instructions of the complex image segmentation system for an unmanned aerial vehicle scene according to any one of claims 1 to 5, which when executed by a processor, causes the processor to perform a complex image segmentation method for an unmanned aerial vehicle scene.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410605358.0A CN118172561B (en) | 2024-05-16 | 2024-05-16 | Complex image segmentation model for unmanned aerial vehicle scene and segmentation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118172561A CN118172561A (en) | 2024-06-11 |
CN118172561B (en) | 2024-07-23
Family
ID=91359210
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410605358.0A Active CN118172561B (en) | 2024-05-16 | 2024-05-16 | Complex image segmentation model for unmanned aerial vehicle scene and segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118172561B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115908442A (en) * | 2023-01-06 | 2023-04-04 | 山东巍然智能科技有限公司 | Image panorama segmentation method for unmanned aerial vehicle ocean monitoring and model building method |
CN116206112A (en) * | 2023-03-17 | 2023-06-02 | 西安电子科技大学 | Remote sensing image semantic segmentation method based on multi-scale feature fusion and SAM |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3625767B1 (en) * | 2017-09-27 | 2021-03-31 | Google LLC | End to end network model for high resolution image segmentation |
US11216988B2 (en) * | 2017-10-24 | 2022-01-04 | L'oreal | System and method for image processing using deep neural networks |
CN117036686A (en) * | 2023-06-29 | 2023-11-10 | 南京邮电大学 | Semantic segmentation method based on self-attention and convolution feature fusion |
CN117333497A (en) * | 2023-10-30 | 2024-01-02 | 大连理工大学 | Mask supervision strategy-based three-dimensional medical image segmentation method for efficient modeling |
CN117808834A (en) * | 2024-01-03 | 2024-04-02 | 安徽理工大学 | SAM-based cross-modal domain generalization medical image segmentation method |
Also Published As
Publication number | Publication date |
---|---|
CN118172561A (en) | 2024-06-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||