1 Introduction
Artificial intelligence (AI), especially
computer vision (CV), is finding an ever-broadening range of applications in modern agriculture. The next stage of agricultural technological development, Agriculture 5.0 [
15,
100,
232,
352], will feature AI-driven autonomous decision-making as a central component. The term Agriculture 5.0 stems from a chronology [
352] that begins with Agriculture 1.0, which relied heavily on human labor and animal power, continues with Agriculture 2.0, enabled by synthetic fertilizers, pesticides, and combustion-powered machinery, and develops into Agriculture 3.0 and 4.0, characterized by GPS-enabled precision control and
Internet-of-Things (IoT)-driven data collection [
250]. Building on the rich agricultural data collected, Agriculture 5.0 holds the promise to further increase productivity, meet the food demand of a growing global population, and mitigate the negative environmental impact of existing agricultural practices.
As an integral component of Agriculture 5.0,
controlled-environment agriculture (CEA), a farming practice carried out within urban, indoor, resource-controlled, and sensor-driven factories, is particularly suitable for the application of AI and CV. This is because CEA provides ample infrastructure support for data collection and autonomous execution of algorithmic decisions. In terms of productivity, CEA could produce higher yield per unit area of land [
8,
9] and boost the nutritional content of agricultural products [
159,
304]. In terms of environmental impact, CEA farms are insulated from external environmental influences, reduce the need for fertilizer and pesticides, and efficiently utilize recycled resources such as water; they can therefore be much more environmentally friendly and self-sustaining than traditional farms.
In light of current global challenges, such as disruptions to global supply chains and the threat of climate change, CEA appears especially appealing as a food source for urban population centers. Under pressures of deglobalization brought by geopolitical tensions [
362] and global pandemics [
233,
268], CEA makes it possible to build farms close to large cities, which shortens transportation distances and keeps food supplies secure even when long-distance routes are disrupted. The city-state of Singapore, for example, has pledged to source 30% of its food domestically by 2030 [
1,
306], a goal that is only achievable through suburban farms such as CEA facilities. Furthermore, CEA, as a form of precision agriculture, is itself a viable way to reduce greenhouse gas emissions [
9,
37,
243]. CEA can also shield plants from adverse climate conditions exacerbated by climate change, as its environments are fully controlled [
112], and can make productive use of arable land degraded by climate change [
364].
We argue that AI and CV are critical to the economic viability and long-term sustainability of CEAs, as these technologies can reduce production expenses and improve productivity. Suburban CEAs face high land costs. An analysis in Victoria, Australia [
38], shows that, because of the higher land cost resulting from proximity to cities, even with an estimated 50-fold productivity improvement per unit of land area, it still takes six to seven years for a CEA to break even. Thus, further productivity improvements from AI would act as a strong driver for CEA adoption. Moreover, the vertical or stacked layout of vertical farms makes daily surveillance and operations more difficult for farmers, a problem that automated solutions empowered by computer vision could effectively address. Finally, AI and CV technologies have the potential to fully characterize the complex, individually different, and time-varying conditions of living organisms [
39], which will enable precise and individualized management and further elevate yield. Thus, AI and CV technologies appear to be a natural fit for CEAs.
Most of the recent development of AI can be attributed to the newly discovered capability to train deep neural networks [
172] that can (1) automatically learn multi-level representations of input data that are transferable to diverse downstream tasks [
65,
136], (2) easily scale up to match the growing size of data [
283], and (3) conveniently utilize massively parallel hardware architectures like GPUs [
114,
328]. As function approximators, deep neural networks prove surprisingly effective at generalizing to previously unseen data [
354]. Deep learning has achieved tremendous success in computer vision [
293], natural language processing [
47,
83,
118], multimedia [
23,
88], robotics [
291], game playing [
270], and many other areas.
The AI revolution in agriculture is already underway. State-of-the-art neural network technologies, such as ResNet [
134] and MobileNet [
138] for image recognition, and Faster R-CNN [
239], Mask R-CNN [
133], and YOLO [
235] for object detection, have been applied to the management of crops [
194], livestock [
140,
299], and plants in indoor and vertical farms [
240,
357]. AI has been used to provide decision support in myriad tasks, from DNA analysis [
194] and growth monitoring [
240,
357] to disease detection [
254] and profit prediction [
28].
While several surveys have explored the use of CV techniques in agriculture, none of them specifically focus on CEA applications. Some surveys organize studies by their practical application in agriculture. References [
74,
89,
123,
146,
278] survey pest and disease detection studies. References [
40,
111,
303] discuss fruit and vegetable quality grading and disease detection. Reference [
298] summarizes studies in six sub-fields, including crop growth monitoring, pest and disease detection, automatic harvesting/fruit detection, fruit quality testing, automated management of modern farms, and the monitoring of farmland information with
Unmanned Aerial Vehicles (UAVs). Other surveys organize existing works from a technical perspective, namely, the algorithms used [
237] or formats of data [
56]. Reference [
151], as an exception, introduces the development history of CV and AI in smart agriculture without investigating any individual studies. Our work aims to address this gap and provide insights tailored to CEA-specific contexts.
As the volume of research in smart agriculture grows rapidly, we hope the current review article can bridge researchers from the areas of AI and agriculture and offer a gentle learning curve for those who wish to familiarize themselves with the other area. We believe computer vision has the closest connections with, and is the most immediately applicable in, urban agriculture and CEAs. Hence, in this article, we focus on reviewing deep-learning-based computer vision technologies in urban farming and CEAs. We focus on deep learning because it is the predominant approach in AI and CV research. The contributions of this article are two-fold, with the first targeted at AI researchers and the second at agriculture researchers:
We identify five major CV applications in CEA and analyze their requirements and motivation. Further, we survey the state-of-the-art as reflected in 68 technical papers and 14 vision-based CEA datasets.
We discuss five key subareas of computer vision and how they relate to CEA. In addition, we identify four potential future directions for research in CV for CEA.
In Figure
1, we provide a graphical preview of our content. It illustrates the end-to-end agricultural process of CEAs, from seed planting to harvest and sale, with five major deep-learning-based CV applications—Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification, and Pest and Disease Detection—mapped to the plant growth stages where they apply. We do not survey the autonomous seed planting and harvesting steps, as they relate more to robot functioning and robotic control, i.e., grasping, carrying, and placing objects, than to computer vision (we do cover fruit localization in the fruit and flower detection section, which helps a harvesting robot locate the targeted object and act on it). However, we provide here some literature on agricultural robots and end-effector design for reference: [
36,
57,
92,
231,
353].
We structure the survey following the process in the figure: First, to provide a bird’s-eye view of CV capabilities available to researchers in smart agriculture, we summarize several major CV problems and influential technical solutions in Section
2. Next, we review 68 papers with respect to the application of computer vision in the CEA system in Section
3. The discussion is organized into five subsections: Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification, and Pest and Disease Detection. In the discussion, we focus on fruits and vegetables that are suitable for CEA, including tomato [
10,
13,
127,
351], mango [
7], guava [
269,
333], strawberry [
107,
346], capsicum [
174], banana [
5], lettuce [
359], cucumber [
10,
128,
200], citrus [
4], and blueberry [
2]. Next, we provide a summary of 14 publicly available datasets of plants and fruits in Section
4 to facilitate future studies in controlled-environment agriculture. Finally, we highlight a few research directions that could generate high-impact research in the near future in Section
5.
One thing to note here is that, except for the Leaf Instance Segmentation task under the Growth Monitoring section, all the tasks are performed with models trained on different datasets and evaluated with different metrics. Tables
3, 4,
5, and
6 showcase the variety of datasets and evaluation metrics. This variation makes performance incomparable across studies and further underscores the need for our survey, which summarizes the current progress in the literature and encourages the development of common benchmarks to promote consistency and comparability in future research.
3 Controlled-environment Agriculture
CEA is the farming practice carried out within urban, indoor, resource-controlled factories, often accompanied by stacked growth levels (i.e., vertical farming), renewable energy, and recycling of water and waste. CEA has recently been adopted in regions around the world [
38,
82], such as Singapore [
161], North America [
6], Japan [
9,
264], and the UK [
8].
CEA has economic and environmental benefits. Compared to traditional farming, CEA farms produce higher yield per unit area of land [
8,
9]. Controlled environments shield the plants from seasonality and extreme weather, so plants can grow all year round given suitable lighting, temperature, and irrigation [
38]. The growing conditions can also be further optimized to boost growth and nutritional content [
159,
304]. Rapid turnover gives farmers flexibility in plant choice to follow consumption trends [
35]. Moreover, farm spending on pesticides, herbicides, and transportation can be cut down, thanks to reduced contamination from the outside environment and proximity to urban consumers.
CEA farms, when designed properly, can become much more environmentally friendly and self-sustaining than traditional farms. With optimized growing conditions and limited external interference, the need for fertilizer and pesticides decreases, so we can reduce the amount of chemicals that go into the environment as well as the resulting pollution. Furthermore, CEA farms can save water and energy through the use of renewable energy and aggressive water recycling. For instance, CEA farms from Spread, a Japanese company, recycle 98% of used water and reduce the energy cost per head of lettuce by 30% with LED lighting [
9]. Finally, CEA farms can be situated in urban or suburban areas, thereby reducing transportation and storage costs. A simulation of different farm designs in Lisbon shows that appropriately designed vertical tomato farms emit less greenhouse gas than conventional farms, mainly due to reduced water use and shorter transportation distances [
37].
A significant drawback of CEA, however, lies in its high cost, which may be partially addressed by computer vision technologies. According to Reference [
38], the higher land cost in Victoria, Australia, means that the yield of vertical farms has to be at least 50 times that of traditional farming to break even. Computer vision holds the promise of boosting the level of automation and increasing yield, thereby making CEA farms economically viable. As will be discussed in the following sections, CV techniques can reduce a substantial portion of variable costs, such as the wastage caused by incorrect or delayed harvesting decisions, and provide long-term benefits.
While carrying the potential to reduce costs significantly, setting up a computer vision system in the field costs far less than constructing the CEA building itself. Building a CEA structure involves high upfront costs, including construction, insulation, lighting, and HVAC systems. According to Reference [
264], a 1,300-square-meter CEA building with a production area of 4,536 square meters would require a capital investment of $7.4 million and incur annual operational costs of approximately $3.4 million.
In contrast, setting up the hardware for CV models is relatively inexpensive. The necessary components include servers (CPU, GPU, memory, storage), sensors, cameras, networking, and a cooling system. For example, a server with specifications such as a 32-core 2.80 GHz Intel Xeon Platinum 8462Y+, 128 GB of memory, 4 NVIDIA RTX A6000 “Ada” GPUs, and 2 TB of storage costs around $60,000. Using this server for training, assuming a standard VGG-16 architecture trained on 5,000 images of size 224 \(\times\) 224 pixels, with a batch size of 64 and 50 training epochs on the 4 GPUs, the estimated training time is less than an hour. Such a server is sufficient for the daily training and inference of commonly used CV models. For a camera system, 10 surveillance cameras such as the Hikvision DS-2CD2142FWD-I would cost around $1,400 in total. Additionally, high-speed network infrastructure is required to transfer data between the computer hardware, storage, and camera systems; typically, 4 to 7 routers are needed to cover an area of 1,300 square meters, costing approximately $2,000. Finally, a liquid cooling system could cost between $1,000 and $2,000. In summary, a hardware system with a total cost of around $70,000 is sufficient for the daily operation, training, and inference of CV systems.
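As a rough sanity check of the training-time estimate above, the following back-of-envelope calculation is a sketch only; the per-image FLOP count for VGG-16 and the assumed sustained GPU throughput are approximate assumptions, not measurements.

```python
# Back-of-envelope estimate of VGG-16 training time (rough sketch; all
# constants below are approximate assumptions, not measurements).

FLOPS_PER_IMAGE_FWD = 15.5e9      # ~15.5 GFLOPs per 224x224 forward pass (VGG-16, approx.)
TRAIN_MULTIPLIER = 3.0            # forward + backward pass is roughly 3x a forward pass
NUM_IMAGES = 5_000
EPOCHS = 50
NUM_GPUS = 4
EFFECTIVE_TFLOPS_PER_GPU = 30.0   # assumed sustained throughput per GPU, well below peak

total_flops = FLOPS_PER_IMAGE_FWD * TRAIN_MULTIPLIER * NUM_IMAGES * EPOCHS
effective_flops_per_sec = NUM_GPUS * EFFECTIVE_TFLOPS_PER_GPU * 1e12
seconds = total_flops / effective_flops_per_sec
print(f"Estimated training time: {seconds / 60:.1f} minutes")
# ~1.6 minutes of pure compute; with data loading and other overhead the
# wall-clock time is longer, but comfortably under an hour.
```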
CEA can take diverse form factors [
35], and these form factors may pose different requirements for computer vision technologies. Typical forms of CEA are glasshouses with transparent shells and completely enclosed facilities. Depending on the cultivars being planted, the internal arrangement of the farm can be classified into stacked horizontal systems, vertical growth surfaces, and multi-floor towers. Form factors influence lighting, which is an important consideration in CV applications. For example, glasshouses with transparent shells utilize natural light to reduce energy consumption but may not provide sufficient lighting for CV around the clock. In comparison, a completely enclosed facility allows greater control of lighting conditions. Moreover, the internal arrangement of the farm also affects camera angles. If the cultivars being planted change frequently as a result of the high turnover rate in CEAs, then the arrangement of shelves and plants might change, affecting camera angles and thus inference performance. CV systems therefore need to adapt to environmental changes.
Nevertheless, with the autonomous setup of CEAs, which makes new data easy to collect, training a new CV model or fine-tuning a previous model to adapt to such environmental changes is relatively straightforward. In addition, few-shot learning [
286,
319], weakly supervised learning [
16,
218,
373], and unsupervised learning techniques [
49,
253], which require minimal or zero annotations, can also facilitate model adaptation.
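As a minimal illustration of such model adaptation, the sketch below fine-tunes a pretrained classification backbone on newly collected images; the dataset path, class count, and hyperparameters are illustrative assumptions rather than values from any cited study.

```python
# Minimal fine-tuning sketch (illustrative; paths, class count, and
# hyperparameters are assumptions, not values from the cited studies).
import torch
from torch import nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# Hypothetical folder of newly collected CEA images, one subfolder per class.
data = datasets.ImageFolder("new_cea_images/", transform=transform)
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, len(data.classes))  # new task head

# Freeze the backbone and train only the new head on the small dataset.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("fc.")

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```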
Besides environmental change, other factors also need to be taken into account when applying CV techniques in CEA. Two typical problems are (1) how to cope with sub-optimal data, including label noise and unbalanced class distributions, and (2) how to interpret model predictions or measure their uncertainty so users can apply the models with confidence. A quantitative measure of confidence or uncertainty would allow farmers to understand how decisions are generated and to act on them with greater confidence. Table
1 maps these factors (environmental change, sub-optimal data quality, and human factors) to CV problems and lists the corresponding solutions and the sections that discuss them.
In the following, we investigate the application of autonomous computer vision techniques on Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification, and Pest and Disease Detection to increase production efficiency. In addition to existing applications, we include techniques that can be easily applied to vertical farms even though they have not yet been applied to them.
3.1 Growth Monitoring
Growth monitoring, a critical component of plant phenotyping, aims to understand the life cycle of plants and estimate yield by monitoring various growth indicators such as plant size, number of leaves, leaf size, and the land area covered by the plant. Plant growth monitoring helps quantify the effects of biological and environmental factors on growth and is thus crucial for finding optimal growing conditions and developing high-yield crops [
212,
294].
As early as 1903, Wilhelm Pfeffer recognized the potential of image analysis in monitoring plant growth [
225,
279]. Traditional machine vision techniques such as gray-level pixel thresholding [
220], Bayesian statistics [
45], and shallow learning techniques [
147,
348] have been applied to segment the objects of interest, such as leaves and stems, from the background to analyze plant growth. Compared to traditional methods, deep-learning techniques provide automatic representation learning and are less sensitive to image quality variations. For this reason, deep learning techniques for growth monitoring have recently gained popularity.
Among various growth indicators, leaf size and number of leaves per plant are the most commonly used [
121,
169,
252]. Therefore, in the section below, we first discuss leaf instance segmentation, which can support both indicators at the same time, followed by a discussion of techniques for only leaf counting or for other growth indicators.
3.1.1 Leaf Instance Segmentation.
Due to the popularity of the CVPPP dataset [
204], the segmentation of leaf instances has attracted special attention from the computer vision community and warrants its own section. Leaf instance segmentation methods include recurrent network methods [
238,
245] and pixel embedding methods [
62,
80,
221,
324,
331]. Parallel proposal methods are popular for general-purpose segmentation (see Section
2.3), but they are ill-suited for leaf segmentation. Because most leaves have irregular shapes, the rectangular proposal boxes used in these methods do not fit the leaves well, resulting in many poorly positioned boxes. In addition, the density of leaves causes many proposal boxes to overlap, compounding the fitting problem. As a result, it is difficult to pick out the best proposal box from the large number of parallel proposals. Therefore, we focus on recurrent network-based methods and pixel embedding-based methods in this section. Quality metrics for leaf segmentation include
Symmetric Best Dice (SBD) and
Absolute Difference in Count (|DiC|). SBD measures the average overlap between the predicted and ground-truth masks across all leaves, while |DiC| is the average absolute difference between the predicted and ground-truth leaf counts over the test set.
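For concreteness, the two metrics are commonly defined as follows (the standard formulation used with this benchmark; the notation is ours):
\[
\mathrm{BD}(L^{a}, L^{b}) = \frac{1}{M}\sum_{i=1}^{M}\max_{1 \le j \le N}\frac{2\,|L^{a}_{i} \cap L^{b}_{j}|}{|L^{a}_{i}| + |L^{b}_{j}|}, \qquad
\mathrm{SBD}(L^{\mathrm{pred}}, L^{\mathrm{gt}}) = \min\bigl(\mathrm{BD}(L^{\mathrm{pred}}, L^{\mathrm{gt}}),\, \mathrm{BD}(L^{\mathrm{gt}}, L^{\mathrm{pred}})\bigr),
\]
where \(L^{a}_{i}\) denotes the pixel set of the \(i\)-th leaf mask and \(M, N\) are the numbers of leaves in the two labelings, and
\[
|\mathrm{DiC}| = \frac{1}{K}\sum_{k=1}^{K}\bigl|\,\hat{n}_{k} - n_{k}\,\bigr|,
\]
where \(\hat{n}_{k}\) and \(n_{k}\) are the predicted and ground-truth leaf counts of the \(k\)-th test image.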
Recurrent network-based methods output a mask for a single leaf sequentially. Their decisions are usually informed by the already segmented parts of the image, which are summarized by the recurrent network. Reference [
238] applies LSTM and DeconvNet to segment one leaf at a time. The network first locates a bounding box for the next leaf and performs segmentation within that box. After that, leaves segmented in all previous iterations are aggregated by the recurrent network and passed to the next iteration as contextual information. Reference [
245] employs
convolution-based LSTMs (ConvLSTM) with FCN feature maps as input. At each time step, the network outputs a single-leaf mask and a confidence score. During inference, the segmentation stops when the confidence score drops below 0.5. Reference [
251] proposes a similar method that combines feature maps at different abstraction levels for prediction.
Pixel embedding methods learn vector representations for the pixels so pixels in irregularly shaped leaves can become regularly shaped clusters in the representation space. With that, we can directly cluster the pixels. Reference [
324] performs simultaneous instance segmentation of leaves and plants. The authors propose an encoder-decoder framework, based on ERFNet [
244], with two decoders. One decoder predicts the centroids of plants and leaves. The other decoder predicts the offset of each leaf pixel to the leaf centroid. The pixel location plus the offset vector hence should be very close to the leaf centroid. The dispersion among all pixels of the same leaf can be modeled as a Gaussian distribution, whose covariance matrix is also predicted by the second decoder and whose mean is from the first decoder. The training maximizes the Gaussian likelihood for all pixels of the same leaf. The same process is applied to pixels of the same plant.
References [
62,
221,
331] are three similar pixel embedding methods. They encourage pixels from the same leaf to have similar embeddings and pixels from different neighboring leaves to have different embeddings to enable clustering in the embedding space. Their networks consist of two modules: a distance regression module and a pixel embedding module. References [
221,
331] arrange the two modules in sequence, while Reference [
62] places them in parallel. The distance regression module predicts the distance between each pixel and the closest object boundary. The pixel embedding module generates an embedding vector for each pixel, so pixels from the same leaf have similar embeddings and pixels from different neighboring leaves have different embeddings. During inference, pixels are clustered around leaf centers, which are identified as local maxima in the distance map from the distance regression module.
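As an illustration of this inference procedure, the sketch below clusters foreground pixels around distance-map maxima using their embeddings; the peak-detection settings and the embedding threshold are arbitrary illustrative values, not parameters from the cited papers.

```python
# Sketch of embedding-based clustering at inference time (illustrative only;
# the peak-detection settings and embedding threshold are assumptions).
import numpy as np
from scipy.ndimage import maximum_filter

def cluster_leaves(distance_map, embeddings, fg_mask, peak_min=0.3, emb_thresh=0.5):
    """distance_map: (H, W); embeddings: (H, W, D); fg_mask: (H, W) boolean."""
    # Leaf centers = local maxima of the predicted distance-to-boundary map.
    peaks = (distance_map == maximum_filter(distance_map, size=5))
    peaks &= (distance_map > peak_min) & fg_mask
    centers = np.argwhere(peaks)                             # (K, 2) pixel coordinates

    labels = np.zeros(distance_map.shape, dtype=np.int32)    # 0 = background
    if len(centers) == 0:
        return labels
    center_embs = embeddings[centers[:, 0], centers[:, 1]]   # (K, D)

    for (r, c) in np.argwhere(fg_mask):
        # Assign each foreground pixel to the nearest center in embedding space.
        d = np.linalg.norm(center_embs - embeddings[r, c], axis=1)
        k = int(np.argmin(d))
        if d[k] < emb_thresh:
            labels[r, c] = k + 1                             # instance ids start at 1
    return labels
```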
Last, References [
80,
327] take a large-margin approach. They ensure that embeddings of pixels from the same leaf lie within a circular margin of the leaf center and that the embeddings of different leaf centers are far apart. This removes the need to determine the leaf centroids during inference because the embeddings are already well separated. Reference [
327] builds upon the method in Reference [
80] to perform pixel embedding and clustering of leaves under weak supervision, with annotations on only a subset of instances in each image. In addition, a differentiable instance-level loss for a single leaf is formed to overcome the non-differentiability of assigning pixels to instances, by comparing a Gaussian-shaped soft mask with the corresponding ground-truth mask. Finally, consistency regularization, which encourages agreement between two embedding networks, is applied to improve embeddings for unlabeled pixels.
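A representative loss of this pull-push form, written in our own notation as a sketch rather than the exact objective of References [80, 327], combines an attraction term that pulls pixel embeddings \(e_i\) toward their leaf's mean embedding \(\mu_c\) within a margin \(\delta_v\) and a repulsion term that pushes leaf means at least \(2\delta_d\) apart:
\[
\mathcal{L}_{\mathrm{pull}} = \frac{1}{C}\sum_{c=1}^{C}\frac{1}{N_{c}}\sum_{i \in c}\bigl[\,\lVert e_{i} - \mu_{c}\rVert - \delta_{v}\,\bigr]_{+}^{2}, \qquad
\mathcal{L}_{\mathrm{push}} = \frac{1}{C(C-1)}\sum_{c \ne c'}\bigl[\,2\delta_{d} - \lVert \mu_{c} - \mu_{c'}\rVert\,\bigr]_{+}^{2},
\]
where \([x]_{+} = \max(0, x)\), \(C\) is the number of leaves in an image, and \(N_c\) is the number of pixels in leaf \(c\). With \(\delta_d\) chosen sufficiently larger than \(\delta_v\), pixels of different leaves form well-separated clusters, so instances can be recovered by simple thresholding around any pixel's embedding.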
Comparing different approaches, proposal-free pixel embedding techniques seem to be the best choice for the leaf segmentation problem. As can be seen from Table
2, pixel embedding methods obtain both the highest SBD and the lowest |DiC|. One thing to note here, however, is that the superior results of W-Net [
331] and SPOCO [
327] could be attributed to the inclusion of ground-truth foreground masks during inference. Even though the recurrent approach does not generate a large number of proposal boxes at once, it still uses rectangular proposals and therefore still suffers from the problem of fitting irregular leaf shapes. Moreover, recurrent methods are usually slower than pixel embedding methods, due to the temporal dependence between leaves.
3.1.2 Leaf Count and Other Growth Metrics.
Leaf counts may be estimated without leaf segmentation. Reference [
305] utilizes synthetic data in the leaf counting task. The authors employ the L-system-based plant simulator
lpfg [
3,
228] to generate Arabidopsis rosette images. The authors test a CNN trained only on synthetic data against real data from CVPPP and obtain better results than a model trained on CVPPP data alone. In addition, a CNN trained on the combination of synthetic and real data achieves approximately a 27% reduction in mean absolute count error compared with a CNN using only real data. These results demonstrate the potential of synthetic data in plant phenotyping.
Besides leaf size and leaf count, leaf fresh weight, leaf dry weight, and plant coverage (the area of land covered by the plant) are also used as metrics of growth. Reference [
359] applies CNN to regress leaf fresh weight, leaf dry weight, and leaf area of lettuce on RGB images. Reference [
240] makes use of Mask R-CNN, a parallel proposal method, for lettuce instance segmentation. The authors derive plant attributes such as contour, side-view area, height, and width from the segmentation masks and bounding boxes using preset formulas. They also estimate the growth rate from the change in plant area at each time step and estimate fresh weight by linear regression from the attributes. Reference [
198] leverages a Mask R-CNN pretrained on the COCO dataset, with ResNet-50 as the backbone, to segment lettuce leaves. The daily change in mean leaf area is used to calculate the growth rate.
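As a simple illustration of this kind of pipeline, the sketch below derives a projected area from a binary mask, computes a relative growth rate, and regresses fresh weight from mask-derived attributes; the attribute set, pixel-to-area scale, and toy numbers are illustrative assumptions, not the formulas used in the cited studies.

```python
# Illustrative growth-metric pipeline from segmentation masks (a sketch;
# the attributes, scale factor, and regression choice are assumptions).
import numpy as np
from sklearn.linear_model import LinearRegression

CM2_PER_PIXEL = 0.01  # assumed camera calibration: projected area per pixel

def projected_area(mask: np.ndarray) -> float:
    """Projected plant area (cm^2) from a binary top-view mask."""
    return float(mask.sum()) * CM2_PER_PIXEL

def relative_growth_rate(area_t0: float, area_t1: float, days: float = 1.0) -> float:
    """Classical relative growth rate based on log area change per day."""
    return (np.log(area_t1) - np.log(area_t0)) / days

# Fresh-weight regression from mask-derived attributes (area, height, width).
# X: one row per plant observation; y: measured fresh weight in grams.
X = np.array([[120.0, 8.1, 14.5],
              [180.0, 9.6, 17.2],
              [240.0, 11.0, 19.8]])      # toy numbers for illustration
y = np.array([35.0, 58.0, 83.0])
reg = LinearRegression().fit(X, y)
print("Predicted fresh weight (g):", reg.predict([[200.0, 10.2, 18.0]]))
```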
3.2 Fruit and Flower Detection
Algorithms for fruit and flower detection find the location and spatial distribution of fruits and fruit flowers. This task supports various downstream applications such as fruit count estimation, size estimation, weight estimation, robotic pruning, robotic harvesting, and disease detection [
31,
108,
199,
342]. In addition, fruit or flower detection may help devise plantation management strategies [
108,
127], because fruit or flower statistics such as positions, facing directions (the directions the flowers face), and spatial scatter can reveal the status of the plant and the suitability of environmental conditions. For example, the knowledge of flower distribution may allow pruning strategies that focus on regions of excessive density and achieve even distribution of fruits, which optimizes the delivery of nutrients to the fruits.
Traditional approaches for fruit detection rely on manual feature engineering and feature fusion. As fruits tend to have unique colors and shapes, one natural thought is to apply thresholding on color [
219,
322] and shape information [
192,
217]. Additionally, References [
55,
184,
208] employ a combination of color, shape, and texture features. However, manual feature extraction suffers from brittleness when the image distribution changes with different camera resolutions, camera angles, illumination, and species [
30].
Deep learning methods for fruit detection include object detection and segmentation. Reference [
351] applies SSD for cherry tomato detection. Reference [
139] leverages Faster R-CNN to detect tomatoes. Inside the generated bounding boxes, color thresholding and fuzzy-rule-based morphological processing methods are applied to remove image background and obtain the contours of individual tomatoes. Reference [
249] leverages Faster R-CNN with VGG-16 as the backbone for sweet pepper detection. RGB and
near-infrared (NIR) images are used together for detection. Two fusion approaches, early and late fusion, are proposed. Early fusion alters the first pretrained layer to allow four input channels (RGB and NIR), whereas late fusion aggregates the two modalities by training independent proposal models for each modality and then combining the proposed boxes by averaging the predicted class probabilities. Reference [
356] trains three
multi-task cascaded convolutional networks (MTCNN) [
355] for detecting apples, strawberries, and oranges. MTCNN contains a proposal network, a bounding box refinement network, and an output network in a feature pyramid architecture with gradually increasing input sizes for each network. The model is trained on synthetic images, which are random combinations of cropped negative patches and fruit patches, in addition to real-world images. Reference [
346] proposes R-YOLO with MobileNet-V1 as the backbone to detect ripe strawberries. Unlike the regular horizontal bounding boxes in object detection, the model generates rotated bounding boxes by adding a rotation-angle parameter to the anchors.
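To illustrate the early fusion idea described above for RGB and NIR input [249], the sketch below expands the first convolution of a pretrained torchvision detector to four channels; this is an illustrative adaptation rather than the exact implementation of the cited work, and initializing the NIR filters from the red channel is our assumption.

```python
# Early-fusion sketch: extend a pretrained backbone to RGB + NIR input
# (illustrative; not the exact implementation of the cited work).
import torch
from torch import nn
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
old_conv = model.backbone.body.conv1            # pretrained 3-channel stem conv

new_conv = nn.Conv2d(4, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=False)
with torch.no_grad():
    new_conv.weight[:, :3] = old_conv.weight            # reuse pretrained RGB filters
    new_conv.weight[:, 3:] = old_conv.weight[:, :1]     # init NIR filters from red channel (assumption)
model.backbone.body.conv1 = new_conv

# The internal image transform must also normalize a fourth channel; the NIR
# statistics here are placeholder assumptions.
model.transform.image_mean = [0.485, 0.456, 0.406, 0.5]
model.transform.image_std = [0.229, 0.224, 0.225, 0.25]
```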
Delicate fruits, such as strawberries and tomatoes, are particularly vulnerable to damage during harvesting. Therefore, much research has been devoted to segmenting such fruits from the background to determine the precise picking point. Precise fruit masks are expected to enable robotic fruit picking while avoiding damage to neighboring fruits. Reference [
185] performs semantic segmentation for guava fruits and determines their poses using FCN with RGB-D images as input. The FCN outputs a binary mask for fruits and another binary mask for branches. With the fruit binary mask, the authors employ Euclidean clustering [
248] to cluster individual guava fruits. From the clustering result and the branch binary mask, fruit centroids and the closest branch are located. Finally, the system predicts the vertical axis of the fruit as the direction perpendicular to the closest branch to facilitate robotic harvesting. Similarly, Reference [
13] leverages Mask R-CNN with ResNet as the backbone for semantic segmentation of tomatoes. In addition, the authors filter out false positive detections of tomatoes from non-targeted rows by setting a depth threshold. Reference [
107] utilizes Mask R-CNN with a ResNet101 backbone to perform instance segmentation of ripe strawberries, raw strawberries, straps, and tables. Depth images are aligned with the segmentation mask to project the shape of strawberries into 3D space to facilitate automatic harvesting. Reference [
347] also applies Mask R-CNN with a ResNet101 + FPN backbone to perform instance segmentation and ripeness classification on strawberries. Reference [
141] leverages a similar network for instance segmentation of tomatoes. With the segmentation mask, the system determines the cutting points of the fruits.
Besides accuracy, the processing speed of neural networks is also important for their deployment on mobile devices or agricultural robots. Reference [
262] performs network pruning on YOLOv3-tiny to form a lightweight mango detection network. A YOLOv3-tiny pretrained on the COCO dataset has learned to extract fruit-relevant features, because the COCO dataset contains apple and orange images, but it also has learned irrelevant features. The authors thus use a generalized attribution method [
266] to determine the contribution of each layer to fruit feature extraction and remove convolution kernels responsible for detecting non-fruit classes. They find that lower-level features are shared across all classes and that pruning the higher layers does not harm fruit detection performance. After pruning, the network achieves significantly lower
floating-point operation (FLOP) counts at the same level of accuracy.
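The pruning idea can be sketched as follows; a simplified activation-times-gradient score stands in for the generalized attribution method of the cited work, and the toy network and data are assumptions.

```python
# Sketch of attribution-guided channel pruning (illustrative; a simplified
# Taylor-style score stands in for the cited attribution method, and the
# toy network/data are assumptions).
import torch
from torch import nn

class TinyDetectorBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.head = nn.Conv2d(32, 1, 1)          # single "fruit" objectness map
    def forward(self, x):
        self.a1 = torch.relu(self.conv1(x))
        self.a1.retain_grad()                     # keep gradients of the activation
        a2 = torch.relu(self.conv2(self.a1))
        return self.head(a2)

net = TinyDetectorBackbone()
images = torch.rand(8, 3, 64, 64)                 # stand-in for fruit images

# Score each conv1 channel by |activation * gradient| of the fruit output.
out = net(images).mean()
out.backward()
scores = (net.a1 * net.a1.grad).abs().mean(dim=(0, 2, 3))   # one score per channel

# Zero out the filters with the lowest contribution to fruit responses.
num_to_prune = 4
prune_idx = torch.argsort(scores)[:num_to_prune]
with torch.no_grad():
    net.conv1.weight[prune_idx] = 0.0
    net.conv1.bias[prune_idx] = 0.0
print("Pruned channels:", prune_idx.tolist())
```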
Object detection is also applied for flower detection. Reference [
199] proposes a modified YOLOv4-Tiny with cascade fusion (CFNet) to detect citrus buds, citrus flowers, and gray mold, a disease commonly found on citrus plants. The authors additionally propose a block module with channel shuffle and depthwise separable convolution for YOLOv4-Tiny. Reference [
284] shrinks the anchor boxes of Faster R-CNN to fit small fruits and applies soft non-maximum suppression to retain boxes that may contain occluded objects. As flowers usually have similar morphological characteristics across species, flowers from non-targeted species can potentially be used as training data in a transfer learning scenario. In Reference [
285], the authors fine-tune a DeepLab-ResNet model [
63] for fruit flower detection. The model is trained on an apple flower dataset but achieves high F1 scores on pear and peach flower images (0.777 and 0.854, respectively).
3.3 Fruit Counting
Pre-harvest estimation of yields plays an important role in the planning of harvesting resources and marketing strategies [
135,
338]. As fruits are usually sold to consumers as a pack of uniformly sized fruits or individual fruits, the fruit count also provides an effective yield metric [
157], besides the distribution of fruit sizes. Traditional yield estimation is obtained through manual counting of samples from a few randomly selected areas [
135]. Nonetheless, when production is large-scale, counteracting the effect of plant variability requires a large number of samples from different areas of the field for accurate estimation, resulting in high cost. Thus, researchers resort to CV-based counting methods.
A direct counting method is to regress on the image and output the fruit count. In Reference [
234], the authors apply a modified version of Inception-ResNet for direct tomato counting. The authors train the model on simulated images and test it on real images, which suggests, once again, the viability of using simulated images to circumvent the cost of assembling a large dataset.
Besides direct regression, object detection [
157,
320], semantic segmentation [
154], and instance segmentation [
215] have also been used for fruit counting. These methods provide an intermediate level of results from which the count can be easily gathered. Reference [
157] proposes MangoYOLO based on YOLOv2-tiny and YOLOv3 for mango detection and counting. The authors increase the resolution of the feature map to facilitate detection of small fruits. Reference [
124] proposes a pre-trained Faster R-CNN network, building upon DeepFruits [
249], to estimate the quantity of sweet peppers. The authors design a tracking sub-system for sweet pepper counting. The sub-system identifies new fruits by measuring the IoU between detected and newly appearing fruits and comparing their boundaries. Reference [
154] performs semantic segmentation for mango counting using a modification of FCN. The coordinates of blob-like regions in the semantic segmentation mask are used to generate bounding boxes corresponding to mango fruits. Finally, Reference [
215] applies Mask R-CNN for instance segmentation of blueberries. The model also classifies the maturity of individual blueberries and counts the number of berries from the masks.
Occlusion poses a difficult challenge for counting. Due to this issue, the automatic count from detection or segmentation results is almost always lower than the actual number of fruits. To mitigate this, Reference [
157] calculates and applies the ratio between the actual hand harvest count and the automatic fruit count; it also uses both front and back views of mango trees to mitigate occlusion from one angle. Taking this idea one step further, Reference [
320] uses dual-view videos to detect and track mangoes when the camera moves. Utilizing different views of the same tree in a video, Reference [
320] recognizes around 20% more fruits. However, the detected count is still significantly lower than the actual number, underscoring the research challenge of exhaustive and accurate counting.
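In its simplest form (our notation, sketching the calibration idea rather than the exact procedure of Reference [157]), the correction factor is estimated on a small set of hand-harvested trees and then applied to new machine counts:
\[
r = \frac{\sum_{t \in \mathcal{T}_{\mathrm{cal}}} n_{t}^{\mathrm{hand}}}{\sum_{t \in \mathcal{T}_{\mathrm{cal}}} n_{t}^{\mathrm{CV}}}, \qquad \hat{n}_{t'} = r \cdot n_{t'}^{\mathrm{CV}},
\]
where \(n_{t}^{\mathrm{hand}}\) and \(n_{t}^{\mathrm{CV}}\) are the hand-harvest and automatic counts of calibration tree \(t\), and \(\hat{n}_{t'}\) is the corrected estimate for a new tree \(t'\).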
3.4 Maturity-level Classification
Maturity-level classification aims to determine the ripeness of fruits or vegetables to aid in proper harvesting and food quality assurance. Premature harvesting results in plants that are unpalatable or incapable of ripening, while delayed harvesting can result in overripe plants or food decay [
141].
The optimal maturity level differs across target products and destinations. Fruits and vegetables can be consumed at different growing stages. For example, lettuce can be consumed either as baby lettuce or as fully grown lettuce; the same applies to baby corn and mature corn. Because products are transported to different destinations, the length of transportation and the ripening speed must be considered when deciding the correct maturity level at harvest [
358].
Manually distinguishing the subtle differences in maturity levels is time-consuming, prone to inconsistency, and costly. The labor cost of harvesting accounts for a large percentage of farm operating costs, with 42% of variable production expenses in U.S. fruit and vegetable farms being spent on labor for harvesting [
142]. Automatic maturity-level classification with computer vision, in contrast, can assist automatic harvesting [
20,
107,
358] and reduce cost.
Similar to fruit detection, we can apply thresholding methods on color to detect ripeness. For example, Reference [
25] applies color thresholding on HSI and YIQ color spaces. Reference [
296] applies linear color models. Reference [
176] utilizes the combination of color and texture features. References [
96,
165,
256,
257,
329] apply shallow learning methods based on a multitude of features.
More recently, researchers have evaluated the performance of deep-learning-based computer vision methods on maturity-level classification and attained satisfactory results. For example, Reference [
357] applies a CNN to classify tomato maturity into five levels. However, to further facilitate automatic harvesting, object detection and instance segmentation are more commonly used to obtain the exact shape, location, and maturity level of fruits, as well as the position of the peduncles for robotic end-effectors to cut.
With object detection, Reference [
346] applies the R-YOLO network described in the fruit detection section (Section
3.2) to detect ripe strawberries. Reference [
124], as mentioned in the fruit counting section (Section
3.3), proposes a pre-trained Faster R-CNN network to estimate both the ripeness and quantity of sweet peppers. Two formulations of the model are tested. One treats ripe/unripe as additional classes on top of foreground/background, and the other performs foreground/background classification first and then performs ripeness classification on foreground regions. The second approach generates better ripeness classification results, as the ripe/unripe classes are more balanced when only the foreground regions are considered.
Using the segmentation methods discussed in Section
3.2, Reference [
13] classifies semantic segmentation masks of tomatoes into raw and ripe tomatoes. References [
107,
347] perform instance segmentation and classify instance masks into ripe and raw strawberries. Reference [
141] performs instance segmentation on tomatoes first. After transforming the mask region into HSV color space, the authors employ a fuzzy system to classify tomatoes into four classes: immature (completely green), breaker (green to tannish), preharvest (light red), and harvest (fully colored).
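For intuition, the sketch below classifies a tomato mask region by its mean hue in HSV space; the thresholds are arbitrary illustrative values and a crisp simplification of the fuzzy system used in Reference [141].

```python
# Hue-based maturity classification sketch (illustrative; the thresholds are
# arbitrary assumptions and a crisp simplification of the cited fuzzy system).
import cv2
import numpy as np

def maturity_from_mask(bgr_image: np.ndarray, mask: np.ndarray) -> str:
    """bgr_image: (H, W, 3) uint8; mask: (H, W) boolean for one tomato instance."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    mean_hue = hsv[..., 0][mask].mean()        # OpenCV hue range: 0-179
    if mean_hue > 45:                          # strongly green
        return "immature"
    elif mean_hue > 30:                        # green turning tannish
        return "breaker"
    elif mean_hue > 15:                        # light red
        return "preharvest"
    return "harvest"                           # fully colored (red)
```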
3.5 Pest and Disease Detection
Plants are susceptible to environmental disorders caused by temperature, humidity, nutritional excess or deficiency, and light changes, as well as biotic disorders caused by fungi, bacteria, viruses, or other pests [
103,
272]. Infectious diseases or pest outbreaks cause inferior plant quality or plant death, accounting for at least 10% of global food production losses [
282].
Although controlled vertical farming restricts the entry of pests and diseases, it cannot eliminate them. Pests and diseases can enter the farm through accidental contamination from employees, seeds, irrigation water and nutrient solutions, poorly maintained environments or phytosanitation protocols, unsealed entrances, and ventilation systems [
242]. For this reason, pest and disease detection is still worth studying in the context of CEA.
Manual diagnosis of plants is complex due to the large number of vertically arranged plants in the field and the numerous possible disease symptoms on different species. In addition, plants show different patterns along the infection cycle, and symptoms can vary across different parts of the plant [
43]. Consequently, autonomous computer vision systems that recognize diseases according to the species and plant organs are gaining traction. From a technological perspective, we sort existing techniques into three groups: single- and multi-label classification, handling unbalanced class distributions, and handling label noise and uncertainty estimates.
3.5.1 Single- and Multi-label Classification.
Studies perform single-label, or one-label-per-image, classification of diseases of either one single species [
24,
254,
272,
361] or multiple species [
95]. Reference [
361] creates a lightweight version of AlexNet, replacing the fully connected network with a global pooling layer, to classify six types of cucumber diseases. Reference [
272] leverages CNNs for classifying leaves into mango leaves, diseased mango leaves, and other plant leaves. Reference [
24] utilizes AlexNet and VGG16 to recognize five types of pests and diseases of tomatoes. Reference [
95] applies AlexNet, AlexNetOWTBn [
162], GoogLeNet, Overfeat [
258], and VGG for classifying 25 different healthy or diseased plants.
Having a single label per image can be inaccurate. In the real world, one plant or one leaf can carry multiple diseases or contain multiple diseased regions. By detecting multiple targeted areas or disease classes, the multi-label setting can lead to improved efficiency and accuracy.
To deal with multiple diseases or multiple diseased areas appearing on one plant simultaneously, two types of methods have been proposed. Reference [
201] first segments out different infection areas on cucumber leaves using color thresholding following Reference [
200], then applies a DCNN to the segmented areas to classify four types of cucumber diseases. Nevertheless, the color thresholding technique may not generalize to other plant species and environments. The other type of method leverages object detection or segmentation to locate and classify infected areas. Reference [
254] locates multiple diseased regions of banana plants simultaneously using object detection but assigns only one disease label to each image. Reference [
103] compares Faster R-CNN, R-FCN, and SSD for detecting nine classes of diseases and pests that affect tomato plants. Multiple diseases and pests on one plant are detected simultaneously. Reference [
349] applies an improved DeepLab v3+ for segmentation of multiple black rot spots on grape leaves. The efficient channel attention mechanism [
315] is added to the backbone of DeepLab v3+ for capturing local cross-channel interaction. Feature pyramid network and Atrous Spatial Pyramid Pooling [
64] are utilized for fusing feature maps from the backbone network at different scales to improve segmentation.
3.5.2 Handling Unbalanced Class Distributions.
A common obstacle in disease detection is the unbalanced distribution of disease classes. There are typically far fewer diseased plants than healthy ones; the unequal frequencies make it difficult to find images of rare diseases, and the resulting data imbalance makes model training difficult. To remedy this problem, researchers have proposed weakly supervised learning [
44],
generative adversarial network (GAN) [
116], and few-shot learning [
182,
216].
Specifically, Reference [
44] applies
multiple instance learning (MIL), a type of weakly supervised learning, for multi-class classification of six mite species of citrus. In MIL, the learner receives a set of labeled bags, each containing multiple image instances. We know that at least one instance in a positive bag is associated with the class label but do not know which one. The MIL algorithm tries to identify the common characteristic shared by images in the positively labeled bags. In this work, a CNN is first trained with labeled bags. Next, by calculating saliency maps of the images in the bags, the model identifies salient patches that have a high probability of containing mites. These patches inherit labels from their bags and are used to refine the CNN trained above.
Reference [
116] leverages a GAN to generate realistic image patches of tip-burn lettuce and trains a U-Net for tip-burn segmentation. In the generation stage, lettuce canopy image patches are fed into Wasserstein GANs [
26] to generate stressed (tip-burned) patches so that there is an equal number of stressed and healthy patches. Then, in the segmentation stage, the authors generate a binary label map for the images using a classifier and an edge map. The binary label map labels each mini-patch (super-pixel) as stressed or healthy. The authors then feed the label map, alongside the original images, as input to a U-Net for mask segmentation.
In few-shot meta-learning, we are given a meta-train set and a meta-test set, with the two sets containing mutually exclusive image classes (i.e., classes in the training set do not appear in the testing set). Meta-train or meta-test sets contain a number of episodes, each of which consists of some training (supporting) images and some test (query) images. The rationale of meta-learning is to equip the model with the ability to quickly learn to classify the test images from a small number of training images within each episode. The model acquires this meta-learning capability on the meta-train set and is evaluated on the meta-test set.
As an example, Reference [
216] performs pest and disease classification with few-shot meta-learning. The model framework consists of an embedding module and a distance module. The embedding module first projects support images into an embedding space using ResNet-18, then feeds the embedding vectors into a transformer to incorporate information from other support samples in the same episode. After that, the distance module calculates the Mahalanobis distance [
104] of the query and support samples to classify the query. Similarly, Reference [
182] uses a shallow CNN for embedding and the Euclidean distance for calculating the similarity between the embeddings of the query and support samples.
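The core of such distance-based few-shot classifiers can be sketched as follows; this is a prototypical-network-style example with a toy embedding network, not the exact architecture of References [182, 216].

```python
# Prototypical-network-style episode classification (illustrative sketch;
# the toy embedding network and tensor shapes are assumptions).
import torch
from torch import nn

embed = nn.Sequential(                     # stand-in for the embedding module
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64),
)

def classify_episode(support, support_labels, query, n_classes):
    """support: (N_s, 3, H, W); support_labels: (N_s,); query: (N_q, 3, H, W)."""
    s_emb = embed(support)                                  # (N_s, 64)
    q_emb = embed(query)                                    # (N_q, 64)
    # One prototype per class = mean embedding of its support samples.
    prototypes = torch.stack(
        [s_emb[support_labels == c].mean(dim=0) for c in range(n_classes)])
    # Classify each query by its nearest prototype in Euclidean distance.
    dists = torch.cdist(q_emb, prototypes)                  # (N_q, n_classes)
    return dists.argmin(dim=1)

# 2-way, 3-shot toy episode with 4 query images.
support = torch.rand(6, 3, 64, 64)
labels = torch.tensor([0, 0, 0, 1, 1, 1])
query = torch.rand(4, 3, 64, 64)
print(classify_episode(support, labels, query, n_classes=2))
```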
3.5.3 Label Noise and Uncertainty Estimates.
Reference [
263] is another example of meta-learning, but it is used to improve the network’s robustness against label noise. The model consists of two phrases. The first phrase is the conventional training of a CNN for classification. In the second phrase, the authors generate 10 synthetic mini batches of images, containing real images with the labels taken from similar images. As a result, these mini-batches could contain noisy labels. After one step update on the synthetic instances, the network is trained to output similar predictions with the CNN from the first phrase. The result is a model that is not easily affected by noisy training data.
Finally, associating a confidence score with each model prediction allows farmers to make decisions selectively under different confidence levels and boosts the acceptance of deep learning models in agriculture. As an example, Reference [
99] performs classification of tomato diseases and pairs the prediction with a confidence score following Reference [
79]. The confidence score, calculated using Bayes' rule, is defined as the probability of the true class label conditioned on the class probability predicted by the CNN. In addition, the authors build an ontology of disease classification. For example, the parent node “stressed plant” has as children “bacteria infection” and “virus infection,” and the latter in turn has “mosaic virus” as a child. If the confidence score of a specific terminal disease label is below a certain threshold, then the model switches to its more general parent label in the tree for higher confidence. By the axioms of probability, the predicted probability of the parent label is the sum of the predicted probabilities of its direct descendants. For a general discussion of machine learning techniques that create well-calibrated uncertainty estimates, we refer readers to Section
2.4.
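The ontology backoff step can be sketched as follows; the label tree, threshold, and probabilities below are made-up examples rather than those of Reference [99].

```python
# Ontology backoff sketch (illustrative; the label tree, threshold, and
# probabilities below are made-up examples).
# Parent map for a tiny label tree: terminal labels point to their parents.
PARENT = {"mosaic virus": "virus infection",
          "virus infection": "stressed plant",
          "bacterial infection": "stressed plant"}

def is_descendant(node, ancestor):
    while node in PARENT:
        node = PARENT[node]
        if node == ancestor:
            return True
    return False

def backoff(label, leaf_probs, threshold=0.7):
    """Walk up the tree until the aggregated probability clears the threshold."""
    def prob(node):
        # A node's probability is the sum over all terminal labels beneath it.
        return sum(p for lbl, p in leaf_probs.items()
                   if lbl == node or is_descendant(lbl, node))
    while label in PARENT and prob(label) < threshold:
        label = PARENT[label]
    return label, prob(label)

# Predicted probabilities over terminal disease labels (toy numbers).
leaf_probs = {"mosaic virus": 0.55, "bacterial infection": 0.25}
print(backoff("mosaic virus", leaf_probs))   # backs off to a more general label
```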
6 Conclusions
Smart agriculture, and particularly computer vision for controlled-environment agriculture (CV4CEA), is rapidly emerging as an interdisciplinary area of research that could potentially lead to enormous economic, environmental, and social benefits. In this survey, we first provide brief overviews of existing CV technologies, ranging from image recognition to structured understanding such as segmentation, and from uncertainty quantification to interpretable machine learning. Next, we systematically review existing applications of CV4CEA, including growth monitoring, fruit and flower detection, fruit counting, maturity-level classification, and pest/disease detection. Finally, we highlight a few research directions that could generate high-impact research in the near future.
Like any interdisciplinary area, research progress in CV4CEA requires expertise in both computer vision and agriculture. However, it could take a substantial amount of time for any researcher to acquire an in-depth understanding of both subjects. By reviewing existing applications and available CV technologies and by identifying possible future research directions, we aim to provide a quick introduction to CV4CEA for researchers with expertise in only agriculture or only computer vision. It is our hope that the current survey will serve as a bridge between researchers from diverse backgrounds and contribute to accelerated innovation in the next decade.