
CN116935296A - Orchard environment scene detection method and terminal based on multitask deep learning - Google Patents

Orchard environment scene detection method and terminal based on multitask deep learning

Info

Publication number
CN116935296A
Authority
CN
China
Prior art keywords
module
representing
features
orchard
orchard environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310901547.8A
Other languages
Chinese (zh)
Inventor
赵文锋
林暖晨
江政文
梁升濠
刘易迪
蓝海洋
黄袁爵
钟敏悦
李振源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Agricultural University
Original Assignee
South China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Agricultural University filed Critical South China Agricultural University
Priority to CN202310901547.8A priority Critical patent/CN116935296A/en
Publication of CN116935296A publication Critical patent/CN116935296A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an orchard environment scene detection method and terminal based on multi-task deep learning. The method comprises: collecting orchard environment images and constructing a data set; inputting an orchard environment image into an improved MobileNetv3 backbone network and obtaining an output feature map through a CBS layer and improved bneck modules in sequence; generating and fusing features of different scales from the output feature map through a spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through a feature pyramid network (FPN) module; and decoding the fused multi-scale and multi-level features simultaneously through a target detection decoding head and a semantic segmentation decoding head, the target detection decoding head outputting the detected targets and the semantic segmentation decoding head segmenting the drivable region. By jointly processing the semantic segmentation and target detection tasks, the invention realizes simultaneous recognition of the drivable region and obstacles in the orchard.

Description

Orchard environment scene detection method and terminal based on multitask deep learning
Technical Field
The invention relates to the technical field of orchard management, in particular to an orchard environment scene detection method and terminal based on multi-task deep learning.
Background
With the expansion of orchard planting area in China, the progress of agricultural mechanization and the rise of labor costs, standardized development and intelligent management of orchards are both necessary and an inevitable trend of future development. To improve orchard management efficiency, reduce labor intensity and lower production costs, fruit growers need to adopt automatic driving technology for the orchard environment. This technology relies on perception of the orchard environment to provide key information to the automatic driving decision module, and image recognition is a very important part of that perception.
Conventional road recognition algorithms typically extract surface features such as texture, color and shape. However, they lack the extraction and expression of deep features and high-level semantic information, and therefore perform poorly on complex, unstructured orchard road scenes. Existing orchard environment recognition algorithms use separate means to handle the semantic segmentation and target detection tasks, which achieves high accuracy; however, processing the tasks sequentially is more time-consuming than handling them in a single pass.
Therefore, how to provide a method and a terminal for detecting an orchard environment scene through multi-task deep learning is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides an orchard environment scene detection method and terminal based on multi-task deep learning, which jointly perform drivable-region segmentation and obstacle detection with a shared backbone network, achieving good real-time performance, high accuracy and short processing time.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the orchard environment scene detection method based on multitask deep learning comprises the following steps:
collecting an orchard environment image and constructing a data set;
inputting an orchard environment image into an improved MobileNetv3 backbone network and obtaining an output feature map through a CBS layer and improved bneck modules in sequence;
generating and fusing features of different scales from the output feature map through a spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through a feature pyramid network (FPN) module;
and decoding the fused multi-scale and multi-level features simultaneously through a target detection decoding head and a semantic segmentation decoding head, obtaining the detected targets from the target detection decoding head and segmenting the drivable region with the semantic segmentation decoding head.
Preferably, the specific processing procedure of the improved bneck module is as follows:
step a: the input features are first expanded in dimension by a 1×1 pointwise convolution, the expanded features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
step b: in the ECA-Net attention module, global average pooling is applied to each channel of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the input feature map and the attention map are combined by a Hadamard (element-wise) product;
the ECA-Net attention module generates weights ω for each channel by a one-dimensional convolution of size K:
ω=σ(C1 D K (y))
wherein ,C1DK Representing one-dimensional convolution with a convolution kernel of size K, y representing the channel, σ representing the sigmoid activation function; the mapping relationship between the channel dimensions C and K is as follows:
C=Φ(K)≈exp(γ×K-b)
i.e. given the channel dimension C, the convolution kernel size K is adaptively determined:
wherein ,|t|odd An odd number representing the nearest distance t, γ and b representing constants, the value of γ being set to 2, the value of b being set to 1;
step c: a 1×1 pointwise convolution projects the features back to the dimension of the input feature map, and the projected feature map is finally added to the original input feature map to obtain the output feature map.
Preferably, the detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_asp = 1 − IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/c_w² + ρ²(h, h^gt)/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss and L_asp denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^gt denote the center-point coordinates of the predicted and ground-truth bounding boxes, w and w^gt the predicted and ground-truth widths, and h and h^gt the predicted and ground-truth heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-region segmentation loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t·(1 − p_t)^(υ_ω)·log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used respectively to modulate the positive/negative sample weights and to control the weighting of hard and easy samples;
the total loss function L_total of the model is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
Preferably, collecting the orchard environment images and constructing the data set specifically comprises:
step 1.1, photographing orchard environment images in the natural environment at different times of day, under different illumination angles and from different viewing angles;
step 1.2, applying target detection labels to the fruit tree targets in the orchard environment and semantic segmentation labels to the drivable region;
and step 1.3, performing data enhancement and augmentation on the orchard environment images through brightness adjustment, Gaussian blur, affine transformation, mirror flipping and feathering, and constructing the data set.
An orchard environment scene detection terminal based on multi-task deep learning comprises: a camera, a processor, a navigation decision module and a robot main body, wherein a deep learning model is deployed on the processor, the deep learning model comprising an improved MobileNetv3 backbone network, a spatial pyramid pooling (SPP) module, a feature pyramid network (FPN) module, a target detection decoding head and a semantic segmentation decoding head, and the improved MobileNetv3 backbone network comprising a CBS layer and improved bneck modules;
the camera is mounted on the robot main body and is used for collecting orchard environment images;
the processor is arranged inside the robot main body and is used for inputting an orchard environment image into the improved MobileNetv3 backbone network, an output feature map being obtained through the CBS layer and the improved bneck modules in sequence;
features of different scales are generated and fused from the output feature map through the spatial pyramid pooling (SPP) module, and features of different semantic levels are generated and fused through the feature pyramid network (FPN) module;
the fused multi-scale and multi-level features are decoded simultaneously through the target detection decoding head and the semantic segmentation decoding head, the detected targets being obtained from the target detection decoding head and the drivable region being segmented by the semantic segmentation decoding head;
the navigation decision module is arranged inside the robot main body and is used for calculating a corresponding path from the drivable region and the detected targets and for controlling the movement of the robot main body.
Preferably, the specific processing procedure of the improved bneck module is as follows:
step a: the input features are first expanded in dimension by a 1×1 pointwise convolution, the expanded features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
step b: in the ECA-Net attention module, global average pooling is applied to each channel of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the input feature map and the attention map are combined by a Hadamard (element-wise) product;
the ECA-Net attention module generates a weight ω for each channel through a one-dimensional convolution of kernel size K:
ω = σ(C1D_K(y))
where C1D_K denotes a one-dimensional convolution with kernel size K, y denotes the channel descriptor obtained by the global average pooling, and σ denotes the sigmoid activation function; the mapping between the channel dimension C and K is:
C = Φ(K) ≈ 2^(γ·K − b)
that is, given the channel dimension C, the convolution kernel size K is adaptively determined as:
K = ψ(C) = | log2(C)/γ + b/γ |_odd
where |t|_odd denotes the odd number nearest to t, and γ and b are constants, with γ set to 2 and b set to 1;
step c: a 1×1 pointwise convolution projects the features back to the dimension of the input feature map, and the projected feature map is finally added to the original input feature map to obtain the output feature map.
Preferably, the detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_asp = 1 − IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/c_w² + ρ²(h, h^gt)/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss and L_asp denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^gt denote the center-point coordinates of the predicted and ground-truth bounding boxes, w and w^gt the predicted and ground-truth widths, and h and h^gt the predicted and ground-truth heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-region segmentation loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t·(1 − p_t)^(υ_ω)·log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used respectively to modulate the positive/negative sample weights and to control the weighting of hard and easy samples;
the total loss function L_total of the model is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
A computer-readable medium stores instructions which, when executed, cause the orchard environment scene detection method based on multi-task deep learning to be performed.
A processing terminal comprises a memory and a processor, wherein a computer program capable of running on the processor is stored in the memory, and the processor realizes an orchard environment scene detection method based on multi-task deep learning when executing the computer program.
Compared with the prior art, the invention discloses an orchard environment scene detection method and terminal based on multi-task deep learning, which jointly process the semantic segmentation and target detection tasks, thereby realizing simultaneous recognition of the drivable region and obstacles in the orchard. The semantic segmentation task is mainly used for segmenting the drivable region, and the target detection task is mainly used for detecting obstacles. By adopting the method, orchard management efficiency can be improved, labor intensity reduced and production costs lowered.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an orchard environment scene detection method based on multi-task deep learning.
FIG. 2 is a schematic diagram of the deep learning model structure of the present invention.
FIG. 3 is a schematic diagram of a modified bneck module structure of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses an orchard environment scene detection method based on multi-task deep learning, which is shown in fig. 1 and comprises the following steps:
collecting an orchard environment image and constructing a data set;
inputting an orchard environment image into an improved MobileNetv3 backbone network and obtaining an output feature map through a CBS layer and improved bneck modules in sequence;
generating and fusing features of different scales from the output feature map through a spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through a feature pyramid network (FPN) module;
and decoding the fused multi-scale and multi-level features simultaneously through a target detection decoding head and a semantic segmentation decoding head, obtaining the detected targets from the target detection decoding head and segmenting the drivable region with the semantic segmentation decoding head. The drivable area is divided into two classes: drivable region and non-drivable region; the obstacle detection targets are classified into people, fruit trees and the like.
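The following is a minimal PyTorch-style sketch of this multi-task layout: one shared backbone feeding a spatial pyramid pooling module, an FPN neck and two parallel decoding heads. The class and attribute names (OrchardMultiTaskNet, det_head, seg_head and so on) are illustrative placeholders rather than the patent's actual implementation; the concrete submodules are assumed to be supplied by the caller.

import torch.nn as nn

class OrchardMultiTaskNet(nn.Module):
    """Shared backbone + SPP + FPN feeding a detection head and a segmentation head."""
    def __init__(self, backbone, spp, fpn, det_head, seg_head):
        super().__init__()
        self.backbone = backbone   # improved MobileNetv3 (CBS layer + improved bneck blocks)
        self.spp = spp             # spatial pyramid pooling on the deepest feature map
        self.fpn = fpn             # feature pyramid network fusing features across levels
        self.det_head = det_head   # target detection decoding head
        self.seg_head = seg_head   # semantic segmentation decoding head

    def forward(self, x):
        c3, c4, c5 = self.backbone(x)             # multi-scale feature maps from the shared backbone
        p5 = self.spp(c5)                         # fuse multi-scale context at the deepest level
        p3, p4, p5 = self.fpn([c3, c4, p5])       # fuse features across semantic levels
        detections = self.det_head([p3, p4, p5])  # obstacles: people, fruit trees, etc.
        drivable_mask = self.seg_head(p3)         # drivable / non-drivable region mask
        return detections, drivable_mask

Because both heads read the same fused features, one forward pass through the shared backbone serves both tasks, which is the source of the single-pass time saving discussed above.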
In this embodiment, the improved MobileNetv3 backbone network replaces the SE block of the original bneck with the ECA attention mechanism to enhance feature extraction.
As shown in fig. 2, the deep learning model includes an improved MobileNetv3 backbone network, a spatial pyramid pooling (SPP) module, a feature pyramid network (FPN) module, a target detection decoding head and a semantic segmentation decoding head, and the improved MobileNetv3 backbone network includes a CBS layer and improved bneck modules.
The CBS layer consists of a two-dimensional convolution layer, a BN layer and a SiLU activation function:
SiLU(x)=x·Sigmoid(x)
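As a small illustration, a CBS layer of this form can be written as the following PyTorch module; the default kernel size and stride are assumptions, not values specified by the patent.

import torch.nn as nn

class CBS(nn.Module):
    """Conv2d -> BatchNorm2d -> SiLU, matching the CBS layer described above."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, stride=s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()  # SiLU(x) = x * sigmoid(x)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))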
As shown in fig. 3, the specific processing procedure of the improved bneck module is as follows:
step a: the input features are first expanded in dimension by a 1×1 pointwise convolution, the expanded features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
step b: in the ECA-Net attention module, global average pooling is applied to each channel of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the input feature map and the attention map are combined by a Hadamard (element-wise) product;
the ECA-Net attention module generates a weight ω for each channel through a one-dimensional convolution of kernel size K:
ω = σ(C1D_K(y))
where C1D_K denotes a one-dimensional convolution with kernel size K, y denotes the channel descriptor obtained by the global average pooling, and σ denotes the sigmoid activation function; the mapping between the channel dimension C and K is:
C = Φ(K) ≈ 2^(γ·K − b)
that is, given the channel dimension C, the convolution kernel size K is adaptively determined as:
K = ψ(C) = | log2(C)/γ + b/γ |_odd
where |t|_odd denotes the odd number nearest to t, and γ and b are constants, with γ set to 2 and b set to 1;
step c: a 1×1 pointwise convolution projects the features back to the dimension of the input feature map, and the projected feature map is finally added to the original input feature map to obtain the output feature map.
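A hedged PyTorch sketch of steps a-c is given below. The ECA block follows the formulation above (global average pooling per channel, a one-dimensional convolution whose odd kernel size is chosen adaptively from the channel dimension with γ = 2 and b = 1, and a sigmoid gate), and the block structure follows expand, depthwise convolution, attention, project, residual add. The expansion ratio, activation choice and stride-1 assumption are illustrative, not values taken from the patent.

import math
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: GAP per channel, adaptive 1-D conv, sigmoid gate."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1                     # force an odd kernel size, as in |t|_odd
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        n, c, _, _ = x.shape
        y = self.pool(x).view(n, 1, c)                # one descriptor per channel
        w = self.gate(self.conv(y)).view(n, c, 1, 1)  # channel attention map
        return x * w                                  # Hadamard product with the input

class ImprovedBneck(nn.Module):
    """1x1 expand -> 3x3 depthwise -> ECA -> 1x1 project -> residual add (stride 1)."""
    def __init__(self, channels, expand=4):
        super().__init__()
        hidden = channels * expand
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU())
        self.dwise = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU())
        self.eca = ECA(hidden)
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, 1, bias=False), nn.BatchNorm2d(channels))

    def forward(self, x):
        out = self.project(self.eca(self.dwise(self.expand(x))))
        return out + x                                # add back the original feature map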
Preferably, the detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_asp = 1 − IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/c_w² + ρ²(h, h^gt)/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss and L_asp denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^gt denote the center-point coordinates of the predicted and ground-truth bounding boxes, w and w^gt the predicted and ground-truth widths, and h and h^gt the predicted and ground-truth heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-region segmentation loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t·(1 − p_t)^(υ_ω)·log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used respectively to modulate the positive/negative sample weights and to control the weighting of hard and easy samples;
the total loss function L_total of the model is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
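A minimal sketch of these three loss terms is shown below, assuming corner-format (x1, y1, x2, y2) boxes for the detection loss and per-pixel foreground probabilities with 0/1 targets for the segmentation loss; the hyperparameter defaults (v_t, v_w, gamma1, gamma2) are illustrative assumptions rather than values from the patent.

import torch

def detection_loss(pred, target, eps=1e-7):
    """Overlap + centre-distance + width/height terms, as in the formula above."""
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)
    iw = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    ih = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)                      # overlap term
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)   # width of the smallest enclosing box
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)   # height of the smallest enclosing box
    c2 = cw ** 2 + ch ** 2 + eps                     # squared diagonal c^2
    rho2 = ((px1 + px2 - tx1 - tx2) ** 2 + (py1 + py2 - ty1 - ty2) ** 2) / 4  # centre distance
    dw2 = ((px2 - px1) - (tx2 - tx1)) ** 2 / (cw ** 2 + eps)   # width term
    dh2 = ((py2 - py1) - (ty2 - ty1)) ** 2 / (ch ** 2 + eps)   # height term
    return (1 - iou) + rho2 / c2 + dw2 + dh2

def segmentation_loss(p, target, v_t=0.25, v_w=2.0, eps=1e-7):
    """Focal-style drivable-region loss: -v_t * (1 - p_t)^v_w * log(p_t)."""
    p_t = torch.where(target == 1, p, 1 - p)
    return -v_t * (1 - p_t) ** v_w * torch.log(p_t + eps)

def total_loss(det, seg, gamma1=1.0, gamma2=1.0):
    """Weighted sum of the mean detection and segmentation losses."""
    return gamma1 * det.mean() + gamma2 * seg.mean()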
In this embodiment, collecting the orchard environment images and constructing the data set specifically comprises:
step 1.1, photographing orchard environment images in the natural environment at different times of day, under different illumination angles and from different viewing angles;
step 1.2, applying target detection labels to the fruit tree targets in the orchard environment and semantic segmentation labels to the drivable region, with manual annotation performed using the labelme annotation tool;
and step 1.3, performing data enhancement and augmentation on the orchard environment images through brightness adjustment, Gaussian blur, affine transformation, mirror flipping and simulated rainfall, constructing the data set, and dividing it into a training set, a test set and a validation set.
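Purely as an illustration, the photometric and geometric augmentations named in step 1.3 could be assembled with torchvision roughly as follows; the parameter values are assumptions, and feathering and simulated rainfall, which have no standard torchvision transform, are omitted. In a real detection-plus-segmentation pipeline the geometric transforms (affine, flip) must of course also be applied to the boxes and masks.

import torchvision.transforms as T

augment = T.Compose([
    T.ColorJitter(brightness=0.3),                       # brightness adjustment
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),     # Gaussian blur
    T.RandomAffine(degrees=10, translate=(0.05, 0.05)),  # affine transformation
    T.RandomHorizontalFlip(p=0.5),                       # mirror flipping
])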
The embodiment provides an orchard environment scene detection terminal based on multi-task deep learning, which comprises: a camera, a processor, a navigation decision module and a robot main body, wherein a deep learning model is deployed on the processor, the deep learning model comprising an improved MobileNetv3 backbone network, a spatial pyramid pooling (SPP) module, a feature pyramid network (FPN) module, a target detection decoding head and a semantic segmentation decoding head, and the improved MobileNetv3 backbone network comprising a CBS layer and improved bneck modules;
the camera is mounted on the robot main body and is used for collecting orchard environment images;
the processor is arranged inside the robot main body and is used for inputting an orchard environment image into the improved MobileNetv3 backbone network, an output feature map being obtained through the CBS layer and the improved bneck modules in sequence;
features of different scales are generated and fused from the output feature map through the spatial pyramid pooling (SPP) module, and features of different semantic levels are generated and fused through the feature pyramid network (FPN) module;
the fused multi-scale and multi-level features are decoded simultaneously through the target detection decoding head and the semantic segmentation decoding head, the detected targets being obtained from the target detection decoding head and the drivable region being segmented by the semantic segmentation decoding head;
the navigation decision module is arranged inside the robot main body and is used for calculating a corresponding path from the drivable region and the detected targets and for controlling the movement of the robot main body.
In this embodiment, the specific processing procedure of the improved bneck module is as follows:
step a: the input features are first expanded in dimension by a 1×1 pointwise convolution, the expanded features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
step b: in the ECA-Net attention module, global average pooling is applied to each channel of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the input feature map and the attention map are combined by a Hadamard (element-wise) product;
the ECA-Net attention module generates a weight ω for each channel through a one-dimensional convolution of kernel size K:
ω = σ(C1D_K(y))
where C1D_K denotes a one-dimensional convolution with kernel size K, y denotes the channel descriptor obtained by the global average pooling, and σ denotes the sigmoid activation function; the mapping between the channel dimension C and K is:
C = Φ(K) ≈ 2^(γ·K − b)
that is, given the channel dimension C, the convolution kernel size K is adaptively determined as:
K = ψ(C) = | log2(C)/γ + b/γ |_odd
where |t|_odd denotes the odd number nearest to t, and γ and b are constants, with γ set to 2 and b set to 1;
step c: a 1×1 pointwise convolution projects the features back to the dimension of the input feature map, and the projected feature map is finally added to the original input feature map to obtain the output feature map.
The detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_asp = 1 − IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/c_w² + ρ²(h, h^gt)/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss and L_asp denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^gt denote the center-point coordinates of the predicted and ground-truth bounding boxes, w and w^gt the predicted and ground-truth widths, and h and h^gt the predicted and ground-truth heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-region segmentation loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t·(1 − p_t)^(υ_ω)·log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used respectively to modulate the positive/negative sample weights and to control the weighting of hard and easy samples;
the total loss function L_total of the model is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
The invention can be used for:
1. Orchard planting management: fruit growers can use the invention to detect the condition of orchard roads, discover problems in time and repair them, thereby improving orchard management efficiency and fruit yield.
2. Orchard research and analysis: the method can be used for orchard research and analysis, for example analyzing the influence of road distribution and shape on fruit tree growth, so as to improve fruit quality and yield.
3. Automatic fruit-picking robots: the invention can help a picking machine identify roads and paths in the orchard, avoid entering fruit tree areas by mistake, ensure safe operation, and improve the efficiency and precision of automatic picking. It can also help the machine plan an optimal travelling path, minimizing travelling distance and time and improving working efficiency and economic benefit.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the other embodiments; for identical or similar parts between the embodiments, reference may be made to the other embodiments. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. The orchard environment scene detection method based on multitask deep learning is characterized by comprising the following steps of:
collecting an orchard environment image and constructing a data set;
inputting an orchard environment image into an improved MobileNetv3 backbone network and obtaining an output feature map through a CBS layer and improved bneck modules in sequence;
generating and fusing features of different scales from the output feature map through a spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through a feature pyramid network (FPN) module;
and decoding the fused multi-scale and multi-level features simultaneously through a target detection decoding head and a semantic segmentation decoding head, obtaining the detected targets from the target detection decoding head and segmenting the drivable region with the semantic segmentation decoding head.
2. The orchard environment scene detection method based on multi-task deep learning according to claim 1, wherein the specific processing procedure of the improved bneck module is as follows:
step a: the input features are first expanded in dimension by a 1×1 pointwise convolution, the expanded features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
step b: in the ECA-Net attention module, global average pooling is applied to each channel of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the input feature map and the attention map are combined by a Hadamard (element-wise) product;
the ECA-Net attention module generates a weight ω for each channel through a one-dimensional convolution of kernel size K:
ω = σ(C1D_K(y))
where C1D_K denotes a one-dimensional convolution with kernel size K, y denotes the channel descriptor obtained by the global average pooling, and σ denotes the sigmoid activation function; the mapping between the channel dimension C and K is:
C = Φ(K) ≈ 2^(γ·K − b)
that is, given the channel dimension C, the convolution kernel size K is adaptively determined as:
K = ψ(C) = | log2(C)/γ + b/γ |_odd
where |t|_odd denotes the odd number nearest to t, and γ and b are constants, with γ set to 2 and b set to 1;
step c: a 1×1 pointwise convolution projects the features back to the dimension of the input feature map, and the projected feature map is finally added to the original input feature map to obtain the output feature map.
3. The orchard environment scene detection method based on multi-task deep learning according to claim 1, wherein the detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_asp = 1 − IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/c_w² + ρ²(h, h^gt)/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss and L_asp denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^gt denote the center-point coordinates of the predicted and ground-truth bounding boxes, w and w^gt the predicted and ground-truth widths, and h and h^gt the predicted and ground-truth heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-region segmentation loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t·(1 − p_t)^(υ_ω)·log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used respectively to modulate the positive/negative sample weights and to control the weighting of hard and easy samples;
the total loss function L_total of the model is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
4. The orchard environment scene detection method based on multi-task deep learning according to claim 1, wherein collecting the orchard environment images and constructing the data set specifically comprises:
step 1.1, photographing orchard environment images in the natural environment at different times of day, under different illumination angles and from different viewing angles;
step 1.2, applying target detection labels to the fruit tree targets in the orchard environment and semantic segmentation labels to the drivable region;
and step 1.3, performing data enhancement and augmentation on the orchard environment images through brightness adjustment, Gaussian blur, affine transformation, mirror flipping and feathering, and constructing the data set.
5. An orchard environment scene detection terminal based on multi-task deep learning, characterized by comprising: a camera, a processor, a navigation decision module and a robot main body, wherein a deep learning model is deployed on the processor, the deep learning model comprising an improved MobileNetv3 backbone network, a spatial pyramid pooling (SPP) module, a feature pyramid network (FPN) module, a target detection decoding head and a semantic segmentation decoding head, and the improved MobileNetv3 backbone network comprising a CBS layer and improved bneck modules;
the camera is mounted on the robot main body and is used for collecting orchard environment images;
the processor is arranged inside the robot main body and is used for inputting an orchard environment image into the improved MobileNetv3 backbone network, an output feature map being obtained through the CBS layer and the improved bneck modules in sequence;
features of different scales are generated and fused from the output feature map through the spatial pyramid pooling (SPP) module, and features of different semantic levels are generated and fused through the feature pyramid network (FPN) module;
the fused multi-scale and multi-level features are decoded simultaneously through the target detection decoding head and the semantic segmentation decoding head, the detected targets being obtained from the target detection decoding head and the drivable region being segmented by the semantic segmentation decoding head;
the navigation decision module is arranged inside the robot main body and is used for calculating a corresponding path from the drivable region and the detected targets and for controlling the movement of the robot main body.
6. The orchard environment scene detection terminal based on multi-task deep learning according to claim 5, wherein the specific processing procedure of the improved bneck module is as follows:
step a: the input features are first expanded in dimension by a 1×1 pointwise convolution, the expanded features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
step b: in the ECA-Net attention module, global average pooling is applied to each channel of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the input feature map and the attention map are combined by a Hadamard (element-wise) product;
the ECA-Net attention module generates a weight ω for each channel through a one-dimensional convolution of kernel size K:
ω = σ(C1D_K(y))
where C1D_K denotes a one-dimensional convolution with kernel size K, y denotes the channel descriptor obtained by the global average pooling, and σ denotes the sigmoid activation function; the mapping between the channel dimension C and K is:
C = Φ(K) ≈ 2^(γ·K − b)
that is, given the channel dimension C, the convolution kernel size K is adaptively determined as:
K = ψ(C) = | log2(C)/γ + b/γ |_odd
where |t|_odd denotes the odd number nearest to t, and γ and b are constants, with γ set to 2 and b set to 1;
step c: a 1×1 pointwise convolution projects the features back to the dimension of the input feature map, and the projected feature map is finally added to the original input feature map to obtain the output feature map.
7. The orchard environment scene detection terminal based on multi-task deep learning according to claim 5, wherein the detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_asp = 1 − IoU + ρ²(b, b^gt)/c² + ρ²(w, w^gt)/c_w² + ρ²(h, h^gt)/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss and L_asp denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^gt denote the center-point coordinates of the predicted and ground-truth bounding boxes, w and w^gt the predicted and ground-truth widths, and h and h^gt the predicted and ground-truth heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-region segmentation loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t·(1 − p_t)^(υ_ω)·log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used respectively to modulate the positive/negative sample weights and to control the weighting of hard and easy samples;
the total loss function L_total of the model is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
8. A computer-readable medium having instructions stored thereon which, when executed, cause the method according to any one of claims 1-4 to be performed.
9. A processing terminal comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the method according to any of claims 1-4 when executing the computer program.
CN202310901547.8A 2023-07-21 2023-07-21 Orchard environment scene detection method and terminal based on multitask deep learning Pending CN116935296A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310901547.8A CN116935296A (en) 2023-07-21 2023-07-21 Orchard environment scene detection method and terminal based on multitask deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310901547.8A CN116935296A (en) 2023-07-21 2023-07-21 Orchard environment scene detection method and terminal based on multitask deep learning

Publications (1)

Publication Number Publication Date
CN116935296A true CN116935296A (en) 2023-10-24

Family

ID=88385817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310901547.8A Pending CN116935296A (en) 2023-07-21 2023-07-21 Orchard environment scene detection method and terminal based on multitask deep learning

Country Status (1)

Country Link
CN (1) CN116935296A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118333825A (en) * 2024-04-22 2024-07-12 仲恺农业工程学院 Pitaya orchard road navigation identification method based on machine vision


Similar Documents

Publication Publication Date Title
CN111046880B (en) Infrared target image segmentation method, system, electronic equipment and storage medium
CN110222626B (en) Unmanned scene point cloud target labeling method based on deep learning algorithm
Bao et al. UAV remote sensing detection of tea leaf blight based on DDMA-YOLO
Blok et al. The effect of data augmentation and network simplification on the image‐based detection of broccoli heads with Mask R‐CNN
Liu et al. Automatic segmentation of overlapped poplar seedling leaves combining Mask R-CNN and DBSCAN
Wang et al. Tea picking point detection and location based on Mask-RCNN
CN112598713A (en) Offshore submarine fish detection and tracking statistical method based on deep learning
Wan et al. A real-time branch detection and reconstruction mechanism for harvesting robot via convolutional neural network and image segmentation
CN107918776A (en) A kind of plan for land method, system and electronic equipment based on machine vision
EP4174792A1 (en) Method for scene understanding and semantic analysis of objects
CN113312999B (en) High-precision detection method and device for diaphorina citri in natural orchard scene
Shuai et al. An improved YOLOv5-based method for multi-species tea shoot detection and picking point location in complex backgrounds
CN113408584A (en) RGB-D multi-modal feature fusion 3D target detection method
CN110163798A (en) Fishing ground purse seine damage testing method and system
CN116935296A (en) Orchard environment scene detection method and terminal based on multitask deep learning
CN110298366B (en) Crop distribution extraction method and device
CN116883650A (en) Image-level weak supervision semantic segmentation method based on attention and local stitching
Xiang et al. PhenoStereo: a high-throughput stereo vision system for field-based plant phenotyping-with an application in sorghum stem diameter estimation
Chen et al. Improved fast r-cnn with fusion of optical and 3d data for robust palm tree detection in high resolution uav images
Jiang et al. Thin wire segmentation and reconstruction based on a novel image overlap-partitioning and stitching algorithm in apple fruiting wall architecture for robotic picking
Wang et al. MeDERT: A metal surface defect detection model
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
Islam et al. QuanCro: a novel framework for quantification of corn crops’ consistency under natural field conditions
CN112232403A (en) Fusion method of infrared image and visible light image
CN115995017A (en) Fruit identification and positioning method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination