Article
Improved YOLOv8-Seg Network for Instance Segmentation of
Healthy and Diseased Tomato Plants in the Growth Stage
Xiang Yue , Kai Qi, Xinyi Na , Yang Zhang , Yanhua Liu and Cuihong Liu *
College of Engineering, Shenyang Agricultural University, Shenyang 110866, China; yuexiang@syau.edu.cn (X.Y.);
2022240067@stu.syau.edu.cn (K.Q.); 2022240052@stu.syau.edu.cn (X.N.); 2022240040@stu.syau.edu.cn (Y.Z.);
2022240006@stu.syau.edu.cn (Y.L.)
* Correspondence: cuihongliu77@syau.edu.cn
Abstract: The spread of infections and rot are crucial factors in the decrease in tomato production.
Accurately segmenting the affected tomatoes in real-time can prevent the spread of illnesses. However,
environmental factors and surface features can affect tomato segmentation accuracy. This study
suggests an improved YOLOv8s-Seg network to perform real-time and effective segmentation of
tomato fruit, surface color, and surface features. The feature fusion capability of the algorithm was
improved by replacing the C2f module with the RepBlock module (stacked by RepConv), adding
SimConv convolution (using the ReLU function instead of the SiLU function as the activation function)
before two upsampling in the feature fusion network, and replacing the remaining conventional
convolution with SimConv. The F1 score was 88.7%, which was 1.0%, 2.8%, 0.8%, and 1.1% higher
than that of the YOLOv8s-Seg algorithm, YOLOv5s-Seg algorithm, YOLOv7-Seg algorithm, and Mask
RCNN algorithm, respectively. Meanwhile, the segment mean average precision (segment mAP@0.5 )
was 92.2%, which was 2.4%, 3.2%, 1.8%, and 0.7% higher than that of the YOLOv8s-Seg algorithm,
YOLOv5s-Seg algorithm, YOLOv7-Seg algorithm, and Mask RCNN algorithm. The algorithm can
perform real-time instance segmentation of tomatoes with an inference time of 3.5 ms. This approach
provides technical support for tomato health monitoring and intelligent harvesting.
environmental conditions and variations in fruit surface color and features, especially in
unstructured environments.
The accurate, efficient, and real-time instance segmentation of growing and diseased
tomatoes is essential in the complex environment of tomato greenhouses. This enables the
timely picking of ripe fruits, helps avoid spoilage, and assists in monitoring diseased fruits
to prevent bacterial infections in the planting field. In the past several years, deep learning
technology has widely been employed for tasks such as instance segmentation and object
detection, owing to its high accuracy and efficiency. To achieve object detection and instance
segmentation simultaneously, Mask RCNN [6] adds a branch for generating binary masks on top of
Faster R-CNN (toward real-time object detection with region proposal networks) [7].
Jia et al. [8] used an improved Mask RCNN algorithm
for instance segmentation on overlapping apples. They fused ResNet [9] and DenseNet as
the feature extraction network of the model and achieved an accuracy rate of 97.31% on
120 images in the test set. Huang et al. [10] proposed a fuzzy Mask R-CNN to automatically
identify the maturity of tomato fruits. They distinguished the foreground and background
in the image through the fuzzy c-means model and Hough transform method, located the
edge features for automatic labeling, and achieved a 98.00% accuracy rate in 100 images.
Afonso et al. [11] used the Mask R-CNN model, with ResNet101 as the backbone, to segment
ripe and unripe tomatoes, achieving a segmentation precision of 95% and 94% for each,
respectively. Wang et al. [12] proposed an improved Mask RCNN model that integrates
the attention mechanism for segmenting apple maturity under various conditions, such as
light influence, occlusion, and overlap. The test results showed accuracy and recall rates
of 95.8% and 97.1%, respectively. In addition, the segmentation of tomato fruit [13], the
detection of tomato fruit infection areas [14], the segmentation of tomato maturity [15], and
the segmentation of Soil block [16] based on Mask RCNN have demonstrated that the high
precision and robustness of the Mask RCNN algorithm in object detection and instance
segmentation. Mask RCNN is a conventional two-stage instance segmentation model.
Masks are generated by Mask RCNN through feature positioning. The located features are
then passed to the mask predictor after performing pooling operations on the region of
interest. However, executing these operations sequentially can cause slow segmentation
speed, large model size, and an increased number of computing parameters. In contrast
to conventional instance segmentation algorithms, which rely on feature localization to
generate masks, YOLACT (You Only Look At Coefficients) [17] is a real-time method. It
can rapidly generate high-quality instance masks by parallelizing the tasks of generating
prototype masks and predicting mask coefficients. The task of instance segmentation,
which uses the YOLO framework, builds on the principles of the YOLACT network for
completion. Initially, two parallel sub-tasks are executed: generating prototype masks and
predicting mask coefficients. Subsequently, the prototype is subjected to linear weighting
based on the obtained mask coefficients, which leads to the creation of instance masks.
Mubashiru [18] proposed a lightweight YOLOv5 algorithm for accurately segmenting fruits
from four gourd family plants with similar features. The proposed algorithm achieved a
segmentation accuracy of 88.5%. Although this method attains faster segmentation speed,
there is a need to further optimize the accuracy of the segmentation. In contrast to the
anchor-based detection head of YOLOv5, YOLOv8 adopts a novel anchor-free method. This
method decreases the number of hyperparameters, which improves the model’s scalability
while enhancing segmentation performance.
This paper proposed an improved YOLOv8s-Seg algorithm for segmenting healthy
and diseased tomatoes based on research conducted by scholars worldwide. The research
consisted of the following tasks:
(1) To enhance the edge features of the tomatoes, algorithms such as Gaussian blur,
Sobel operator, and weighted superposition were used to sharpen the 1600 photos in
the original dataset. Further data enhancement operations expanded the dataset to
9600 photos;
(2) The feature fusion capability of the algorithm was improved by adding SimConv
convolution [19] before the two upsampling operations in the feature fusion network,
replacing the remaining regular convolutions with SimConv convolution, and swapping
the C2f module with the RepBlock module [20];
(3) An improved YOLOv8s-Seg algorithm was proposed to address the slow running
time, high parameter count, and large number of calculations of the two-stage instance
segmentation model. This algorithm was designed with the aim of effective, real-time
instance segmentation of healthy and diseased tomatoes.
2. Materials and Methods
2.1. Data Acquisition
The dataset includes photographs of tomatoes at four stages of maturity, including
young fruit, immature, half-ripe, and ripe. It also includes images of six common tomato
diseases: grey mold, umbilical rot, crack, bacterial canker, late blight, and virus disease.
A total of 788 photos were captured from tomato cultivation plots 26 and 29 at Shenyang
Agricultural University (latitude: 41.8° N), with a seedling spacing of 0.3 m. Furthermore,
an additional 812 photos, which demonstrate five of the previously mentioned diseases,
namely grey mold, umbilical rot, bacterial canker, late blight, and virus disease, were
retrieved from Wikipedia. This brings the total number of photos in the dataset to 1600.
The images were taken using an iPhone 13, which captured them in JPG format with a
resolution of 1280 × 720 pixels. Images retrieved from Wikipedia were preserved in the
same format and resolution. The dataset was divided into training and validation sets at
a 7:3 ratio. As a result, there were 1120 photos in the training set and 480 photos in the
validation set. Example images are shown in Figure 1.
Figure 1. Example images. (a) young fruit, (b) immature, (c) half-ripe, (d) ripe, (e,f) ripe, immature,
(g) umbilical rot, (h) grey mold, (i) crack, (j) virus disease, (k) late blight, (l) bacterial canker.
2.2. Image Preprocessing
In this study, the dataset of 1600 photos was enhanced using several techniques,
including the Sobel operator and weighted overlay. These methods aimed to improve the
clarity of tomato fruit edges for better image annotation and feature extraction. Initially,
Gaussian blur was applied to the images to reduce noise. Then, the Sobel operator was
utilized to calculate the image gradients and extract edge features. Finally, the gradient
images were combined with the original images using a weighted overlay technique to
enhance the edge features. Figure 2 compares the photos before and after the image
sharpening process.
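The sharpening pipeline described above (Gaussian blur, Sobel gradients, weighted superposition) can be sketched in plain NumPy; the function names and the overlay weights `alpha`/`beta` here are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def convolve2d(img, kernel):
    """Naive 'same'-size sliding-window filtering with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img.astype(float), ((ph, ph), (pw, pw)))
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def sharpen(img, alpha=1.0, beta=0.5):
    """Gaussian blur -> Sobel gradient magnitude -> weighted overlay."""
    gauss = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0
    blurred = convolve2d(img, gauss)          # noise reduction
    sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    gx = convolve2d(blurred, sobel_x)         # horizontal gradient
    gy = convolve2d(blurred, sobel_x.T)       # vertical gradient
    grad = np.sqrt(gx ** 2 + gy ** 2)         # edge strength
    # Weighted superposition of the original image and its gradient map.
    return np.clip(alpha * img + beta * grad, 0, 255).astype(np.uint8)
```

Because the gradient map is non-negative, the overlay only brightens pixels near edges, which is what makes fruit contours easier to annotate.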
Figure 3. Data enhancement. (a) sharpened image; (b) sharpened image adjusted brightness
(brightened); (c) sharpened image adjusted brightness (darkened); (d) sharpened image subjected to
mirroring operation; (e) sharpened image rotated by 180°; (f) sharpened image rotated by 30°.
Figure 6. Structure of tomato segmentation network based on improved YOLOv8s-Seg. The ability
to fuse features of the network was improved by replacing the C2f module with the RepBlock
module, adding SimConv convolution before two upsampling in the neck module, and replacing the
remaining conventional convolution with SimConv.
The backbone network of YOLOv8s-Seg consists of a 3 × 3 convolution, a C2f module,
and an SPPF (spatial pyramid pooling fusion) module. In contrast to the YOLOv5 network,
YOLOv8s-Seg replaces the initial 6 × 6 convolution with a 3 × 3 convolution in the
backbone network, making the model more lightweight. Additionally, the C3 module
(Figure 7) in YOLOv5 is replaced with the C2f module in YOLOv8s-Seg. The C2f module,
designed with skip connections and additional split operations, enriches the gradient flow
during backpropagation and improves the performance of the model. YOLOv8s-Seg utilizes
two versions of the cross stage partial network (CSP) [26]: the CSP in the backbone
network employs residual connections (as shown in Figure 6), while the head part uses
direct connections. The SPPF structure in YOLOv8s-Seg remains the same as in YOLOv5
(version 6.1), utilizing cascaded 5 × 5 pooling kernels to accelerate network operation speed.
Figure 7. Structure of the C3 module.

The head module is comprised of the neck and segment parts. The neck module
incorporates the path aggregation network (PANet) [27] and feature pyramid network
(FPN) [28] as the feature fusion networks. Unlike YOLOv5 and YOLOv6, YOLOv8s-Seg
removes the 1 × 1 convolution before upsampling and fuses the feature maps directly
from different stages of the backbone network. This study aimed to enhance the network
performance of YOLOv8s-Seg by improving its neck module. Specifically, before each
upsampling operation, two 1 × 1 SimConv convolutions were added, and the remaining
regular convolutions in the neck part were replaced with 3 × 3 SimConv convolutions. The
C2f module (Figure 8) was replaced with the RepBlock module (Figure 6). The RepBlock
module is composed of stacked RepConv convolutions, and the structure of the RepConv
convolution is depicted in Figure 6.
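The RepConv convolutions that make up the RepBlock rely on structural re-parameterization: at training time each RepConv runs parallel 3 × 3 and 1 × 1 branches, which collapse into a single 3 × 3 convolution at inference. A minimal sketch of that branch fusion follows (the function name is ours; BatchNorm folding and the identity branch are omitted for brevity).

```python
import numpy as np

def fuse_repconv_branches(k3, k1):
    """Collapse a parallel 3x3 + 1x1 branch pair into one 3x3 kernel.
    k3: (C_out, C_in, 3, 3); k1: (C_out, C_in, 1, 1).
    Zero-pad the 1x1 kernel to 3x3 (centered) and add it to the 3x3 kernel;
    convolution is linear, so the fused kernel gives identical outputs."""
    return k3 + np.pad(k1, ((0, 0), (0, 0), (1, 1), (1, 1)))
```

This is why RepBlock can offer multi-branch expressiveness during training yet run as plain single-branch 3 × 3 convolutions at inference, keeping latency low.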
The YOLOv5 network employs a static allocation strategy to assign positive and
negative samples based on the intersection over union (IOU) between the predicted boxes
and the ground truth. However, the YOLOv8s-Seg network has improved this aspect by
introducing a superior dynamic allocation strategy. It incorporates the TaskAlignedAssigner
(TOOD), which selects positive samples based on a weighted score that combines the
classification and regression scores. The computation is represented by Formula (1).

t = s^α × u^β    (1)

where s denotes the classification score, u denotes the IoU between the predicted box and
the ground truth, and α and β are weighting coefficients.
Figure 9. Mosaic data enhancement.
In this study, we assess the performance of the improved YOLOv8s-Seg using precision,
recall, F1 score, and segment mAP@0.5 . Tomato locations were assessed using precision,
recall, and F1 score, while segmentation results were evaluated using segment mAP [29].
Equations (2)–(5) are used to calculate the precision, recall, F1 score, and segment mAP
scores. The higher the four parameters are, the better the segmentation results.
precision = TP / (TP + FP) × 100%    (2)

recall = TP / (TP + FN) × 100%    (3)

F1 = 2 × (precision × recall) / (precision + recall)    (4)

segmAP = (1/C) × Σ_{i=1}^{C} AP(i)    (5)
where TP denotes an actual positive sample with a positive prediction, while FP indicates
an actual negative sample with a positive prediction, and FN indicates an actual positive
sample with a negative prediction. AP represents the average precision of segmentation.
The segmentation performance of the model increases with the AP score. C represents the
number of segmentation categories.
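Equations (2)–(5) translate directly into code; the sketch below works with raw counts and returns fractions rather than percentages (function names are ours).

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from Equations (2)-(4), as fractions.
    tp/fp/fn are the true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def segment_map(ap_per_class):
    """Equation (5): mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)
```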
Figure 10. Examples of instance segmentation of tomatoes. (a): ripe tomatoes and immature
tomatoes shaded by leaves, half-ripe tomatoes with intact fruit characteristics; (b): immature
tomatoes with overlapping fruit and ripe tomatoes with intact fruit characteristics; (c): immature
tomatoes affected by changes in light; (d): example segmentation of tomatoes shaded by leaves;
(e): example segmentation of overlapping fruit; (f): example segmentation of tomatoes affected by
changes in light; (g): immature and ripe tomatoes affected by changes in angle; (h): immature
tomatoes, half-ripe tomatoes, and young fruit; (i): cracked tomatoes, ripe tomatoes; (j): example
segmentation of tomatoes affected by changes in angle; (k): example segmentation for immature
tomatoes, half-ripe tomatoes, and young fruit; (l): example segmentation for cracked tomatoes and
ripe tomatoes.
In Table 5, the results show the performance of the improved YOLOv8s-Seg algorithm
compared to other models. The improved YOLOv8s-Seg algorithm achieves a precision,
recall, F1 score, and segment mAP@0.5 of 91.9%, 85.8%, 88.7%, and 92.2%, respectively.
Compared to the YOLOv8s-Seg algorithm, the improvements were 1.6%, 0.4%, 1.0%, and
2.4%, respectively. Compared to the YOLOv5s-Seg algorithm, the improvements were 2.9%,
2.8%, 2.8%, and 3.2%, respectively. Compared to the YOLOv7-Seg algorithm, this algorithm
showed increases of 0.5%, 1.0%, 0.8%, and 1.8%. Compared to the Mask RCNN algorithm,
this algorithm had increments of 2.1%, 0.3%, 1.1%, and 0.7%, respectively. Additionally, the
inference time of 3.5 ms signifies a minor increase over YOLOv5s-Seg and YOLOv8s-Seg
(0.4 ms and 0.6 ms) but a significant reduction over YOLOv7-Seg and Mask RCNN (11.7 ms
and 86.5 ms), supporting real-time instance segmentation. In conclusion, the improved
YOLOv8s-Seg algorithm stands out in precision, recall, F1 score, and segment mAP@0.5 ,
with effective inference time. Figure 11 provides the comparison of Segment mAP@0.5 for
five algorithms.
4. Conclusions
This paper proposed an improved YOLOv8s-Seg network for instance segmentation of
tomato disease and maturity. The feature fusion capability of the
algorithm was improved by replacing the C2f module with the RepBlock module, adding
SimConv convolution before two upsampling in the feature fusion network, and replacing
the remaining conventional convolution with SimConv. The improved YOLOv8s-Seg
network achieved a segment mAP@0.5 of 92.2% on the validation set. This showed an
improvement of 2.4% compared to the original YOLOv8s-Seg network, an improvement of
3.2% over the YOLOv5s-Seg network, an improvement of 1.8% relative to the YOLOv7-Seg
network, and an improvement of 0.7% over the Mask RCNN network. Regarding inference
time, the improved YOLOv8s-Seg network reached a speed of 3.5 ms, an increase of 0.4 ms
and 0.6 ms compared to the YOLOv8s-Seg and YOLOv5s-Seg networks, but a significant
reduction compared to the YOLOv7-Seg and Mask RCNN algorithms, reduced by 11.7 ms
and 86.5 ms respectively. This capability facilitates the real-time segmentation of both
healthy and diseased tomatoes. Overall, the improved YOLOv8s-Seg network exhibits
precise segmentation performance on tomatoes affected by factors such as leaf occlusion,
fruit overlap, lighting variations, and angle changes. Meanwhile, the analysis of instance
segmentation results for tomatoes at different growth stages and diseases shows that the
algorithm effectively reduces the impact of surface color and features on performance.
In conclusion, the algorithm shows notable segmentation performance on tomatoes
affected by environmental factors during growth stages and disease. Future research will
continue to optimize the algorithm to improve the segment mAP@0.5. Efforts will also
be directed toward simplifying the YOLOv8s-Seg network structure to increase
computational efficiency.
Author Contributions: Conceptualization, X.Y.; methodology, K.Q.; software, K.Q.; validation, X.N.
and Y.Z.; formal analysis, Y.L.; investigation, Y.L.; resources, Y.Z.; data curation, X.N.; writing—
original draft preparation, K.Q.; writing—review and editing, X.Y.; visualization, K.Q.; supervision,
C.L.; project administration, X.Y.; funding acquisition, X.Y. All authors have read and agreed to the
published version of the manuscript.
Funding: This work was supported in part by the Youth Program of the Liaoning Education Depart-
ment under Grant LSNQN202025.
Institutional Review Board Statement: Not applicable.
Data Availability Statement: Data will be made available on request.
Conflicts of Interest: We have no affiliations with any organization with a direct or indirect financial
interest in the subject matter discussed in the manuscript.
References
1. Lee, J.; Nazki, H.; Baek, J.; Hong, Y.; Lee, M. Artificial intelligence approach for tomato detection and mass estimation in precision
agriculture. Sustainability 2020, 12, 9138. [CrossRef]
2. Fan, Y.Y.; Zhang, Z.M.; Chen, G.P. Application of vision sensor in the target fruit recognition system of picking robot. Agric. Mech.
Res. 2019, 41, 210–214.
3. Gongal, A.; Amatya, S.; Karkee, M.; Zhang, Q.; Lewis, K. Sensors and systems for fruit detection and localization: A review.
Comput. Electron. Agric. 2015, 116, 8–19. [CrossRef]
4. Si, Y.; Liu, G.; Feng, J. Location of apples in trees using stereoscopic vision. Comput. Electron. Agric. 2015, 112, 68–74. [CrossRef]
5. Yin, H.; Chai, Y.; Yang, S.X.; Mittal, G.S. Ripe tomato recognition and localization for a tomato harvesting robotic system. In
Proceedings of the International Conference of Soft Computing and Pattern Recognition, Malacca, Malaysia, 4–7 December 2009;
pp. 557–562. [CrossRef]
6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision,
Venice, Italy, 22–29 October 2017; pp. 2961–2969. [CrossRef]
7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings
of the Advances in Neural Information Processing Systems 28 (NIPS 2015), NeurIPS, Montreal, QC, Canada, 7–12 December 2015;
p. 28. [CrossRef]
8. Jia, W.; Tian, Y.; Luo, R.; Zhang, Z.; Lian, J.; Zheng, Y. Detection and segmentation of overlapped fruits based on optimized mask
R-CNN application in apple harvesting robot. Comput. Electron. Agric. 2020, 172, 105380. [CrossRef]
9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [CrossRef]
10. Huang, Y.P.; Wang, T.H.; Basanta, H. Using fuzzy mask R-CNN model to automatically identify tomato ripeness. IEEE Access
2020, 8, 207672–207682. [CrossRef]
11. Afonso, M.; Fonteijn, H.; Fiorentin, F.S.; Lensink, D.; Mooij, M.; Faber, N.; Polder, G.; Wehrens, R. Tomato fruit detection and
counting in greenhouses using deep learning. Front. Plant Sci. 2020, 11, 571299. [CrossRef]
12. Wang, D.; He, D. Fusion of Mask RCNN and Attention Mechanism for Instance Segmentation of Apples under Complex
Background. Comput. Electron. Agric. 2022, 196, 106864. [CrossRef]
13. Wang, C.; Yang, G.; Huang, Y.; Liu, Y.; Zhang, Y. A Transformer-based Mask R-CNN for Tomato Detection and Segmentation.
Intell. Fuzzy Syst. 2023, 44, 8585–8595. [CrossRef]
14. Wang, Q.; Qi, F.; Sun, M.; Qu, J.; Xue, J. Identification of Tomato Disease Types and Detection of Infected Areas Based on Deep
Convolutional Neural Networks and Object Detection Techniques. Comput. Intell. Neurosci. 2019, 2019, 9142753. [CrossRef]
15. Hsieh, K.W.; Huang, B.Y.; Hsiao, K.Z.; Tuan, Y.H.; Shih, F.P.; Hsieh, L.C.; Chen, S.; Yang, I.C. Fruit maturity and location
identification of beef tomato using R-CNN and binocular imaging technology. Food Meas. Charact. 2021, 15, 5170–5180. [CrossRef]
16. Liu, L.; Bi, Q.; Liang, J.; Li, Z.; Wang, W.; Zheng, Q. Farmland Soil Block Identification and Distribution Statistics Based on Deep
Learning. Agriculture 2022, 12, 2038. [CrossRef]
17. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, Seoul, South Korea, 27 October–2 November 2019; pp. 9157–9166. [CrossRef]
18. Mubashiru, L.O. YOLOv5-LiNet: A lightweight network for fruits instance segmentation. PLoS ONE 2023, 18, e0282297. [CrossRef]
19. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection
framework for industrial applications. arXiv 2022. [CrossRef]
20. Weng, K.; Chu, X.; Xu, X.; Huang, J.; Wei, X. EfficientRep: An Efficient Repvgg-style ConvNets with Hardware-aware Neural
Network Design. arXiv 2023. [CrossRef]
21. Magalhães, S.A.; Castro, L.; Moreira, G.; Dos Santos, F.N.; Cunha, M.; Dias, J.; Moreira, A.P. Evaluating the single-shot multibox
detector and YOLO deep learning models for the detection of tomatoes in a greenhouse. Sensors 2021, 21, 3569. [CrossRef]
[PubMed]
22. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object
Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada,
18–22 June 2023; pp. 7464–7475. [CrossRef]
23. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021
IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499.
[CrossRef]
24. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed
bounding boxes for dense object detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21002–21012. [CrossRef]
25. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021. [CrossRef]
26. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning
capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA,
13–19 June 2020; pp. 390–391. [CrossRef]
27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [CrossRef]
28. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
[CrossRef]
29. Tian, Y.; Yang, G.; Wang, Z.; Li, E.; Liang, Z. Instance segmentation of apple flowers using the improved mask R–CNN model.
Biosyst. Eng. 2020, 193, 264–278. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.