CN116935296A - Orchard environment scene detection method and terminal based on multitask deep learning - Google Patents
Info
- Publication number
- CN116935296A (application CN202310901547.8A)
- Authority
- CN
- China
- Prior art keywords
- module
- representing
- features
- orchard
- orchard environment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
Abstract
The invention discloses an orchard environment scene detection method and terminal based on multi-task deep learning. The method comprises: collecting orchard environment images and constructing a data set; inputting an orchard environment image into an improved MobileNetv3 backbone network, and obtaining an output feature map sequentially through a CBS layer and an improved bneck module; generating and fusing features of different scales from the output feature map through a spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through a feature pyramid network (FPN) module; and decoding the multi-scale features and the features of different semantic levels simultaneously through a target detection decoding head and a semantic segmentation decoding head, the target detection decoding head yielding the detected targets and the semantic segmentation decoding head segmenting the drivable area. By jointly processing the semantic segmentation and target detection tasks, the invention realizes simultaneous identification of the drivable area and obstacles in the orchard.
Description
Technical Field
The invention relates to the technical field of orchard management, in particular to an orchard environment scene detection method and terminal based on multi-task deep learning.
Background
With the expansion of orchard planting area in China, the advance of agricultural mechanization and the rise of labor costs, standardized development and intelligent management of orchards have become both a present need and an inevitable trend for the future. To improve orchard management efficiency while reducing labor intensity and production costs, fruit growers need automatic driving technology for the orchard environment. This technology relies on perception of the orchard environment to supply essential information to the automatic driving decision module, and image recognition is a key component of that perception.
Conventional road identification algorithms typically extract surface features such as texture, color and shape. However, such methods lack the extraction and expression of deep features and high-level semantic information, and therefore perform poorly on complex, unstructured orchard road scenes. Existing orchard environment recognition algorithms handle the semantic segmentation and target detection tasks separately with different means and can reach high accuracy, but processing the tasks sequentially is more time-consuming than processing them jointly in a single pass.
Therefore, how to provide a method and a terminal for detecting an orchard environment scene through multi-task deep learning is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides an orchard environment scene detection method and terminal based on multi-task deep learning, which jointly perform drivable-area segmentation and obstacle detection on a shared backbone network, achieving higher accuracy with better real-time performance and shorter processing time.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the orchard environment scene detection method based on multitask deep learning comprises the following steps:
collecting an orchard environment image and constructing a data set;
inputting an orchard environment image into an improved MobileNetv3 backbone network, and obtaining an output feature map sequentially through a CBS layer and an improved bneck module;
generating and fusing features of different scales from the output feature map through a spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through a feature pyramid network (FPN) module;
and decoding the multi-scale features and the features of different semantic levels simultaneously through a target detection decoding head and a semantic segmentation decoding head, obtaining the detection targets through the target detection decoding head and segmenting the drivable area through the semantic segmentation decoding head.
Preferably, the specific processing procedure of the improved bneck module is as follows:
Step a: the input features are first up-dimensioned by a 1×1 pointwise convolution, the up-dimensioned features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
Step b: in the ECA-Net attention module, global average pooling is applied to each channel of the feature matrix of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the Hadamard product of the input feature map and the attention map is computed;
the ECA-Net attention module generates a weight ω for each channel through a one-dimensional convolution of kernel size K:
ω = σ(C1D_K(y))
where C1D_K denotes a one-dimensional convolution with kernel size K, y denotes the channel descriptor obtained by global average pooling, and σ denotes the sigmoid activation function; the mapping between the channel dimension C and K is:
C = Φ(K) ≈ exp(γ×K − b)
that is, given the channel dimension C, the convolution kernel size K is adaptively determined as:
K = |(ln C + b)/γ|_odd
where |t|_odd denotes the odd number nearest to t, and γ and b are constants, with γ set to 2 and b set to 1;
Step c: the features are reduced back to the channel dimension of the input feature map by a 1×1 pointwise convolution, and finally the resulting feature map is added to the original input feature map to obtain the output feature map.
Preferably, the detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_wh, with L_IoU = 1 − IoU, L_dis = ρ²(b, b^(gt))/c², L_wh = ρ²(w, w^(gt))/c_w² + ρ²(h, h^(gt))/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss, and L_wh denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^(gt) denote the center-point coordinates of the predicted and ground-truth bounding boxes, respectively; w and w^(gt) denote the predicted and true widths; h and h^(gt) denote the predicted and true heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-area loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t (1 − p_t)^(υ_ω) log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used to modulate the positive/negative sample weights and to control the easy/hard sample weights, respectively;
the total loss function of the model L_total is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
Preferably, acquiring an image of an orchard environment and constructing a data set specifically includes:
step 1.1, photographing orchard environment images in the natural environment at different times of day, under different illumination angles and from different viewing angles;
step 1.2, carrying out target detection labeling on fruit tree targets in an orchard environment, and carrying out semantic segmentation labeling on a drivable area;
and step 1.3, performing data enhancement and augmentation on the orchard environment images through brightness adjustment, Gaussian blur, affine transformation, mirror flipping and edge feathering, and constructing the data set.
An orchard environment scene detection terminal based on multi-task deep learning comprises: a camera, a processor, a navigation decision module and a robot body, wherein the processor is provided with a deep learning model, the deep learning model comprises an improved MobileNetv3 backbone network, a spatial pyramid pooling (SPP) module, a feature pyramid network (FPN) module, a target detection decoding head and a semantic segmentation decoding head, and the improved MobileNetv3 backbone network comprises a CBS layer and an improved bneck module;
the camera is carried on the robot main body and used for collecting an orchard environment image;
the processor is arranged in the robot body and is used for inputting an orchard environment image into the improved MobileNetv3 backbone network, obtaining an output feature map sequentially through the CBS layer and the improved bneck module;
generating and fusing features of different scales from the output feature map through the spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through the feature pyramid network (FPN) module;
and decoding the multi-scale features and the features of different semantic levels simultaneously through the target detection decoding head and the semantic segmentation decoding head, obtaining the detection targets through the target detection decoding head and segmenting the drivable area through the semantic segmentation decoding head;
the navigation decision module is arranged inside the robot body and is used for calculating a corresponding path from the drivable area and the detection targets, and for controlling the movement of the robot body.
Preferably, the specific processing procedure of the improved bneck module is as follows:
Step a: the input features are first up-dimensioned by a 1×1 pointwise convolution, the up-dimensioned features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
Step b: in the ECA-Net attention module, global average pooling is applied to each channel of the feature matrix of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the Hadamard product of the input feature map and the attention map is computed;
the ECA-Net attention module generates a weight ω for each channel through a one-dimensional convolution of kernel size K:
ω = σ(C1D_K(y))
where C1D_K denotes a one-dimensional convolution with kernel size K, y denotes the channel descriptor obtained by global average pooling, and σ denotes the sigmoid activation function; the mapping between the channel dimension C and K is:
C = Φ(K) ≈ exp(γ×K − b)
that is, given the channel dimension C, the convolution kernel size K is adaptively determined as:
K = |(ln C + b)/γ|_odd
where |t|_odd denotes the odd number nearest to t, and γ and b are constants, with γ set to 2 and b set to 1;
Step c: the features are reduced back to the channel dimension of the input feature map by a 1×1 pointwise convolution, and finally the resulting feature map is added to the original input feature map to obtain the output feature map.
Preferably, the detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_wh, with L_IoU = 1 − IoU, L_dis = ρ²(b, b^(gt))/c², L_wh = ρ²(w, w^(gt))/c_w² + ρ²(h, h^(gt))/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss, and L_wh denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^(gt) denote the center-point coordinates of the predicted and ground-truth bounding boxes, respectively; w and w^(gt) denote the predicted and true widths; h and h^(gt) denote the predicted and true heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-area loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t (1 − p_t)^(υ_ω) log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used to modulate the positive/negative sample weights and to control the easy/hard sample weights, respectively;
the total loss function of the model L_total is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
A computer-readable medium stores instructions which, when executed, cause the orchard environment scene detection method based on multi-task deep learning to be performed.
A processing terminal comprises a memory and a processor, wherein the memory stores a computer program executable on the processor, and the processor implements the orchard environment scene detection method based on multi-task deep learning when executing the computer program.
Compared with the prior art, the invention discloses an orchard environment scene detection method and terminal based on multi-task deep learning, which process the semantic segmentation and target detection tasks jointly, so that the drivable area and the obstacles in the orchard are identified simultaneously. The semantic segmentation task is mainly used for segmenting the drivable area, and the target detection task is mainly used for detecting obstacles. By adopting the method, orchard management efficiency can be improved while labor intensity and production costs are reduced.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an orchard environment scene detection method based on multi-task deep learning.
FIG. 2 is a schematic diagram of the deep learning model structure of the present invention.
FIG. 3 is a schematic diagram of a modified bneck module structure of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses an orchard environment scene detection method based on multi-task deep learning, which is shown in fig. 1 and comprises the following steps:
collecting an orchard environment image and constructing a data set;
inputting an orchard environment image into an improved MobileNetv3 backbone network, and obtaining an output feature map sequentially through a CBS layer and an improved bneck module;
generating and fusing features of different scales from the output feature map through a spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through a feature pyramid network (FPN) module;
and decoding the multi-scale features and the features of different semantic levels simultaneously through a target detection decoding head and a semantic segmentation decoding head, obtaining the detection targets through the target detection decoding head and segmenting the drivable area through the semantic segmentation decoding head. The drivable-area segmentation has two classes, drivable region and non-drivable region; the obstacle detection targets include people, fruit trees and the like.
In this embodiment, in the improved MobileNetv3 backbone network, the ECA attention mechanism replaces the SE block in the original bneck, enhancing feature extraction.
As shown in fig. 2, the deep learning model includes the improved MobileNetv3 backbone network, the spatial pyramid pooling (SPP) module, the feature pyramid network (FPN) module, the target detection decoding head and the semantic segmentation decoding head, and the improved MobileNetv3 backbone network includes the CBS layer and the improved bneck module.
The CBS layer consists of a two-dimensional convolution layer, a BN layer and a SiLU activation function:
SiLU(x)=x·Sigmoid(x)
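As an illustrative sketch, a CBS layer matching this description could be written as follows in PyTorch; the default kernel size and stride are assumptions.

```python
import torch.nn as nn

class CBS(nn.Module):
    """Conv2d + BatchNorm + SiLU, as described for the CBS layer."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()  # SiLU(x) = x * sigmoid(x)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```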
as shown in fig. 3, the modified bneck module specifically processes:
step a: the input features are subjected to dimension increasing through 1X 1 point-by-point convolution in sequence, the features after dimension increasing are subjected to 3X 3 depth convolution, and then the features are processed through an ECA-Net attention module;
step b: in an ECA-Net attention module, global average pooling operation is adopted for each channel of a feature matrix of an input feature diagram, local cross-channel information interaction is realized through one-dimensional convolution, attention diagram is formed through a sigmoid activation function, and the input feature diagram and the attention diagram are subjected to Hadamard product;
the ECA-Net attention module generates weights ω for each channel by a one-dimensional convolution of size K:
ω=σ(C1D K (y))
wherein ,C1DK Representing one-dimensional convolution with a convolution kernel of size K, y representing the channel, σ representing the sigmoid activation function; the mapping relationship between the channel dimensions C and K is as follows:
C=Φ(K)≈exp(γ×K-b)
i.e. given the channel dimension C, the convolution kernel size K is adaptively determined:
wherein ,|t|odd An odd number representing the nearest distance t, γ and b representing constants, the value of γ being set to 2, the value of b being set to 1;
step c: and convolving the upscales to the sizes of the input feature images point by 1X 1, and finally adding the feature images after upscales and the original feature images to obtain the output feature images.
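For illustration, the improved bneck module could be sketched as below in PyTorch. The adaptive kernel size follows the exponential mapping given above, K = |(ln C + b)/γ|_odd; the SiLU activations and the expansion ratio are assumptions for the sketch, since the exact layer hyperparameters are not reproduced in this text.

```python
import math
import torch
import torch.nn as nn

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive 1-D kernel size from channel dimension C (nearest odd number),
    the inverse of C = exp(gamma*K - b) described above."""
    k = int(round((math.log(channels) + b) / gamma))
    return k if k % 2 == 1 else k + 1

class ECA(nn.Module):
    """Efficient Channel Attention: per-channel GAP, 1-D conv across channels, sigmoid gate."""

    def __init__(self, channels: int):
        super().__init__()
        k = eca_kernel_size(channels)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))                       # global average pooling -> (N, C)
        w = self.conv(y.unsqueeze(1)).squeeze(1)     # local cross-channel interaction
        w = torch.sigmoid(w).view(x.size(0), -1, 1, 1)
        return x * w                                 # Hadamard product with the attention map

class ECABneck(nn.Module):
    """Improved bneck: 1x1 expand -> 3x3 depthwise -> ECA -> 1x1 project -> residual add."""

    def __init__(self, channels: int, expand: int = 4):
        super().__init__()
        hidden = channels * expand
        self.expand = nn.Sequential(nn.Conv2d(channels, hidden, 1, bias=False),
                                    nn.BatchNorm2d(hidden), nn.SiLU())
        self.depthwise = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1,
                                                 groups=hidden, bias=False),
                                       nn.BatchNorm2d(hidden), nn.SiLU())
        self.eca = ECA(hidden)
        self.project = nn.Sequential(nn.Conv2d(hidden, channels, 1, bias=False),
                                     nn.BatchNorm2d(channels))

    def forward(self, x):
        out = self.project(self.eca(self.depthwise(self.expand(x))))
        return out + x  # residual connection back to the input feature map
```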
Preferably, the detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_wh, with L_IoU = 1 − IoU, L_dis = ρ²(b, b^(gt))/c², L_wh = ρ²(w, w^(gt))/c_w² + ρ²(h, h^(gt))/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss, and L_wh denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^(gt) denote the center-point coordinates of the predicted and ground-truth bounding boxes, respectively; w and w^(gt) denote the predicted and true widths; h and h^(gt) denote the predicted and true heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-area loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t (1 − p_t)^(υ_ω) log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used to modulate the positive/negative sample weights and to control the easy/hard sample weights, respectively;
the total loss function of the model L_total is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
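A hedged sketch of the multi-task loss follows. The detection term uses an EIoU-style decomposition consistent with the variables defined above (overlap, center-distance, and width/height terms normalized by the enclosing rectangle); since the original sub-expressions are not reproduced verbatim in this text, the exact forms below, as well as the default hyperparameter values, are assumptions.

```python
import torch

def focal_seg_loss(p: torch.Tensor, target: torch.Tensor,
                   v_t: float = 0.25, v_w: float = 2.0) -> torch.Tensor:
    """Focal-style drivable-area loss: -v_t * (1 - p_t)^v_w * log(p_t),
    with p_t = p for positive pixels and 1 - p for negative pixels."""
    p_t = torch.where(target.bool(), p, 1.0 - p).clamp_min(1e-7)
    return (-v_t * (1.0 - p_t) ** v_w * p_t.log()).mean()

def eiou_det_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Assumed EIoU-style detection loss for (cx, cy, w, h) boxes of shape (N, 4):
    (1 - IoU) + rho^2(b, b_gt)/c^2 + rho^2(w, w_gt)/c_w^2 + rho^2(h, h_gt)/c_h^2."""
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = gt.unbind(-1)
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2
    inter = (torch.min(p_x2, g_x2) - torch.max(p_x1, g_x1)).clamp_min(0) * \
            (torch.min(p_y2, g_y2) - torch.max(p_y1, g_y1)).clamp_min(0)
    iou = inter / (pw * ph + gw * gh - inter).clamp_min(1e-7)
    # smallest rectangle enclosing both boxes
    c_w = torch.max(p_x2, g_x2) - torch.min(p_x1, g_x1)
    c_h = torch.max(p_y2, g_y2) - torch.min(p_y1, g_y1)
    c2 = (c_w ** 2 + c_h ** 2).clamp_min(1e-7)            # squared diagonal length
    centre = (px - gx) ** 2 + (py - gy) ** 2              # squared center distance
    wh = (pw - gw) ** 2 / c_w.clamp_min(1e-7) ** 2 + (ph - gh) ** 2 / c_h.clamp_min(1e-7) ** 2
    return ((1 - iou) + centre / c2 + wh).mean()

def total_loss(l_det: torch.Tensor, l_seg: torch.Tensor,
               gamma1: float = 1.0, gamma2: float = 1.0) -> torch.Tensor:
    """Weighted sum of the two task losses: L_total = gamma1 * L_det + gamma2 * L_seg."""
    return gamma1 * l_det + gamma2 * l_seg
```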
In this embodiment, capturing an image of an orchard environment and constructing a data set specifically includes:
step 1.1, photographing orchard environment images in the natural environment at different times of day, under different illumination angles and from different viewing angles;
step 1.2, performing target detection labeling on fruit-tree targets in the orchard environment and semantic segmentation labeling on the drivable area; the labeling is performed manually using the labelme annotation software;
and step 1.3, performing data enhancement and augmentation on the orchard environment images through brightness adjustment, Gaussian blur, affine transformation, mirror flipping and simulated-rain processing, constructing the data set, and dividing it into a training set, a test set and a validation set.
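For reference, the photometric and geometric augmentations named in step 1.3 could be sketched with torchvision transforms as below; the parameter ranges are assumptions, and the feathering / simulated-rain effects would require custom transforms (indicated only by a comment). Note that for detection and segmentation data the geometric transforms must also be applied to the boxes and masks.

```python
from torchvision import transforms

# Sketch of the augmentations listed in step 1.3 for a single input image.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.4),                    # brightness adjustment
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # Gaussian blur
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05),
                            scale=(0.9, 1.1)),                 # affine transformation
    transforms.RandomHorizontalFlip(p=0.5),                    # mirror flipping
    # + custom edge-feathering / rain-simulation transforms as described above
])
```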
The embodiment provides an orchard environment scene detection terminal based on multi-task deep learning, which comprises: a camera, a processor, a navigation decision module and a robot body, wherein the processor is provided with a deep learning model, the deep learning model comprises an improved MobileNetv3 backbone network, a spatial pyramid pooling (SPP) module, a feature pyramid network (FPN) module, a target detection decoding head and a semantic segmentation decoding head, and the improved MobileNetv3 backbone network comprises a CBS layer and an improved bneck module;
the camera is carried on the robot main body and used for collecting an orchard environment image;
the processor is arranged in the robot body and is used for inputting an orchard environment image into the improved MobileNetv3 backbone network, obtaining an output feature map sequentially through the CBS layer and the improved bneck module;
generating and fusing features of different scales from the output feature map through the spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through the feature pyramid network (FPN) module;
and decoding the multi-scale features and the features of different semantic levels simultaneously through the target detection decoding head and the semantic segmentation decoding head, obtaining the detection targets through the target detection decoding head and segmenting the drivable area through the semantic segmentation decoding head;
the navigation decision module is arranged inside the robot body and is used for calculating a corresponding path from the drivable area and the detection targets, and for controlling the movement of the robot body.
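A hypothetical sketch of the terminal's runtime loop is shown below: each camera frame is passed through the multi-task model, and the resulting drivable mask and detections are handed to the navigation decision module. The function and module names (run_terminal, plan_path, the model object) are placeholders, and the preprocessing is deliberately simplified.

```python
import cv2
import torch

def run_terminal(model, plan_path, camera_index: int = 0, device: str = "cuda"):
    """Hypothetical glue code: camera -> multi-task model -> navigation decision module."""
    model.eval().to(device)
    cap = cv2.VideoCapture(camera_index)              # camera mounted on the robot body
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        x = torch.from_numpy(frame).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        with torch.no_grad():
            detections, drivable_logits = model(x.to(device))
        drivable_mask = drivable_logits.argmax(dim=1)  # 0 = non-drivable, 1 = drivable
        command = plan_path(drivable_mask, detections)  # navigation decision module
        # send `command` to the motion controller of the robot body here
    cap.release()
```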
In this embodiment, the specific processing procedure of the improved bneck module is as follows:
Step a: the input features are first up-dimensioned by a 1×1 pointwise convolution, the up-dimensioned features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
Step b: in the ECA-Net attention module, global average pooling is applied to each channel of the feature matrix of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the Hadamard product of the input feature map and the attention map is computed;
the ECA-Net attention module generates a weight ω for each channel through a one-dimensional convolution of kernel size K:
ω = σ(C1D_K(y))
where C1D_K denotes a one-dimensional convolution with kernel size K, y denotes the channel descriptor obtained by global average pooling, and σ denotes the sigmoid activation function; the mapping between the channel dimension C and K is:
C = Φ(K) ≈ exp(γ×K − b)
that is, given the channel dimension C, the convolution kernel size K is adaptively determined as:
K = |(ln C + b)/γ|_odd
where |t|_odd denotes the odd number nearest to t, and γ and b are constants, with γ set to 2 and b set to 1;
Step c: the features are reduced back to the channel dimension of the input feature map by a 1×1 pointwise convolution, and finally the resulting feature map is added to the original input feature map to obtain the output feature map.
The detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_wh, with L_IoU = 1 − IoU, L_dis = ρ²(b, b^(gt))/c², L_wh = ρ²(w, w^(gt))/c_w² + ρ²(h, h^(gt))/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss, and L_wh denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^(gt) denote the center-point coordinates of the predicted and ground-truth bounding boxes, respectively; w and w^(gt) denote the predicted and true widths; h and h^(gt) denote the predicted and true heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-area loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t (1 − p_t)^(υ_ω) log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used to modulate the positive/negative sample weights and to control the easy/hard sample weights, respectively;
the total loss function of the model L_total is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
The invention can be used for:
1. Orchard planting management: fruit growers can use the invention to monitor the condition of orchard roads, discover problems in time and repair them, thereby improving orchard management efficiency and fruit yield.
2. Orchard research and analysis: the method can be used for orchard research and analysis, for example analysing the influence of road distribution and shape on fruit tree growth, helping to improve fruit quality and yield.
3. Automatic fruit-picking robots: the invention helps the machine identify roads and paths in the orchard, avoid mistakenly entering fruit-tree areas and operate safely, improving the efficiency and precision of automatic picking. It also helps the picking machine plan an optimal travel path, minimizing travel distance and time and improving the machine's working efficiency and economic benefit.
In the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; for identical or similar parts, reference may be made between the embodiments. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief, and reference may be made to the description of the method where relevant.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (9)
1. The orchard environment scene detection method based on multitask deep learning is characterized by comprising the following steps of:
collecting an orchard environment image and constructing a data set;
inputting an orchard environment image into an improved MobileNetv3 backbone network, and obtaining an output feature map sequentially through a CBS layer and an improved bneck module;
generating and fusing features of different scales from the output feature map through a spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through a feature pyramid network (FPN) module;
and decoding the multi-scale features and the features of different semantic levels simultaneously through a target detection decoding head and a semantic segmentation decoding head, obtaining the detection targets through the target detection decoding head and segmenting the drivable area through the semantic segmentation decoding head.
2. The method for detecting the environment scene of the orchard based on the multi-task deep learning according to claim 1, wherein the specific processing procedure of the improved bneck module is as follows:
Step a: the input features are first up-dimensioned by a 1×1 pointwise convolution, the up-dimensioned features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
Step b: in the ECA-Net attention module, global average pooling is applied to each channel of the feature matrix of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the Hadamard product of the input feature map and the attention map is computed;
the ECA-Net attention module generates a weight ω for each channel through a one-dimensional convolution of kernel size K:
ω = σ(C1D_K(y))
where C1D_K denotes a one-dimensional convolution with kernel size K, y denotes the channel descriptor obtained by global average pooling, and σ denotes the sigmoid activation function; the mapping between the channel dimension C and K is:
C = Φ(K) ≈ exp(γ×K − b)
that is, given the channel dimension C, the convolution kernel size K is adaptively determined as:
K = |(ln C + b)/γ|_odd
where |t|_odd denotes the odd number nearest to t, and γ and b are constants, with γ set to 2 and b set to 1;
Step c: the features are reduced back to the channel dimension of the input feature map by a 1×1 pointwise convolution, and finally the resulting feature map is added to the original input feature map to obtain the output feature map.
3. The method for detecting an orchard environment scene based on multi-task deep learning according to claim 1, wherein the detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_wh, with L_IoU = 1 − IoU, L_dis = ρ²(b, b^(gt))/c², L_wh = ρ²(w, w^(gt))/c_w² + ρ²(h, h^(gt))/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss, and L_wh denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^(gt) denote the center-point coordinates of the predicted and ground-truth bounding boxes, respectively; w and w^(gt) denote the predicted and true widths; h and h^(gt) denote the predicted and true heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-area loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t (1 − p_t)^(υ_ω) log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used to modulate the positive/negative sample weights and to control the easy/hard sample weights, respectively;
the total loss function of the model L_total is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
4. The method for detecting an orchard environment scene based on multi-task deep learning according to claim 1, wherein the steps of acquiring an image of the orchard environment and constructing a data set comprise:
step 1.1, photographing orchard environment images in the natural environment at different times of day, under different illumination angles and from different viewing angles;
step 1.2, carrying out target detection labeling on fruit tree targets in an orchard environment, and carrying out semantic segmentation labeling on a drivable area;
and step 1.3, performing data enhancement and augmentation on the orchard environment images through brightness adjustment, Gaussian blur, affine transformation, mirror flipping and edge feathering, and constructing the data set.
5. An orchard environment scene detection terminal based on multi-task deep learning, characterized by comprising: a camera, a processor, a navigation decision module and a robot body, wherein the processor is provided with a deep learning model, the deep learning model comprises an improved MobileNetv3 backbone network, a spatial pyramid pooling (SPP) module, a feature pyramid network (FPN) module, a target detection decoding head and a semantic segmentation decoding head, and the improved MobileNetv3 backbone network comprises a CBS layer and an improved bneck module;
the camera is carried on the robot main body and used for collecting an orchard environment image;
the processor is arranged in the robot body and is used for inputting an orchard environment image into the improved MobileNetv3 backbone network, obtaining an output feature map sequentially through the CBS layer and the improved bneck module;
generating and fusing features of different scales from the output feature map through the spatial pyramid pooling (SPP) module, and generating and fusing features of different semantic levels through the feature pyramid network (FPN) module;
and decoding the multi-scale features and the features of different semantic levels simultaneously through the target detection decoding head and the semantic segmentation decoding head, obtaining the detection targets through the target detection decoding head and segmenting the drivable area through the semantic segmentation decoding head;
the navigation decision module is arranged inside the robot body and is used for calculating a corresponding path from the drivable area and the detection targets, and for controlling the movement of the robot body.
6. The orchard environment scene detection terminal based on multi-task deep learning according to claim 5, wherein the specific processing procedure of the improved bneck module is as follows:
Step a: the input features are first up-dimensioned by a 1×1 pointwise convolution, the up-dimensioned features are passed through a 3×3 depthwise convolution, and the result is then processed by an ECA-Net attention module;
Step b: in the ECA-Net attention module, global average pooling is applied to each channel of the feature matrix of the input feature map, local cross-channel information interaction is realized through a one-dimensional convolution, an attention map is formed through a sigmoid activation function, and the Hadamard product of the input feature map and the attention map is computed;
the ECA-Net attention module generates a weight ω for each channel through a one-dimensional convolution of kernel size K:
ω = σ(C1D_K(y))
where C1D_K denotes a one-dimensional convolution with kernel size K, y denotes the channel descriptor obtained by global average pooling, and σ denotes the sigmoid activation function; the mapping between the channel dimension C and K is:
C = Φ(K) ≈ exp(γ×K − b)
that is, given the channel dimension C, the convolution kernel size K is adaptively determined as:
K = |(ln C + b)/γ|_odd
where |t|_odd denotes the odd number nearest to t, and γ and b are constants, with γ set to 2 and b set to 1;
Step c: the features are reduced back to the channel dimension of the input feature map by a 1×1 pointwise convolution, and finally the resulting feature map is added to the original input feature map to obtain the output feature map.
7. The orchard environment scene detection terminal based on multi-task deep learning according to claim 5, wherein the detection loss L_det of the target detection decoding head is:
L_det = L_IoU + L_dis + L_wh, with L_IoU = 1 − IoU, L_dis = ρ²(b, b^(gt))/c², L_wh = ρ²(w, w^(gt))/c_w² + ρ²(h, h^(gt))/c_h²
where L_IoU denotes the overlap loss, L_dis denotes the center-distance loss, and L_wh denotes the width-height loss; IoU is the intersection-over-union of the predicted bounding box and the ground-truth bounding box; b and b^(gt) denote the center-point coordinates of the predicted and ground-truth bounding boxes, respectively; w and w^(gt) denote the predicted and true widths; h and h^(gt) denote the predicted and true heights; ρ(·) denotes the Euclidean distance; c denotes the diagonal length of the smallest rectangle enclosing the predicted and ground-truth bounding boxes; and c_w, c_h denote the width and height of that smallest enclosing rectangle;
the drivable-area loss L_seg of the semantic segmentation decoding head is:
L_seg = −υ_t (1 − p_t)^(υ_ω) log(p_t)
where p_t denotes the probability that the model predicts the correct class, with p_t = p when the prediction is correct and p_t = 1 − p otherwise; υ_t and υ_ω are hyperparameters used to modulate the positive/negative sample weights and to control the easy/hard sample weights, respectively;
the total loss function of the model L_total is:
L_total = γ_1·L_det + γ_2·L_seg
where γ_1 and γ_2 are balance weight parameters.
8. A computer-readable medium having instructions stored thereon which, when executed, cause the method of any of claims 1-4 to be performed.
9. A processing terminal comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the method according to any of claims 1-4 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310901547.8A CN116935296A (en) | 2023-07-21 | 2023-07-21 | Orchard environment scene detection method and terminal based on multitask deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310901547.8A CN116935296A (en) | 2023-07-21 | 2023-07-21 | Orchard environment scene detection method and terminal based on multitask deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116935296A true CN116935296A (en) | 2023-10-24 |
Family
ID=88385817
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310901547.8A Pending CN116935296A (en) | 2023-07-21 | 2023-07-21 | Orchard environment scene detection method and terminal based on multitask deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116935296A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118333825A (en) * | 2024-04-22 | 2024-07-12 | 仲恺农业工程学院 | Pitaya orchard road navigation identification method based on machine vision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |