survey

Open access

A Survey of Computer Vision Technologies in Urban and Controlled-environment Agriculture

Authors:

Jiayun Luo,

Boyang Li,

Cyril LeungAuthors Info & Claims

ACM Computing Surveys, Volume 56, Issue 5

Article No.: 118, Pages 1 - 39

https://doi.org/10.1145/3626186

Published: 27 November 2023 Publication History

PDF eReader

Abstract

In the evolution of agriculture to its next stage, Agriculture 5.0, artificial intelligence will play a central role. Controlled-environment agriculture, or CEA, is a special form of urban and suburban agricultural practice that offers numerous economic, environmental, and social benefits, including shorter transportation routes to population centers, reduced environmental impact, and increased productivity. Due to its ability to control environmental factors, CEA couples well with computer vision (CV) in the adoption of real-time monitoring of the plant conditions and autonomous cultivation and harvesting. The objective of this article is to familiarize CV researchers with agricultural applications and agricultural practitioners with the solutions offered by CV. We identify five major CV applications in CEA, analyze their requirements and motivation, and survey the state-of-the-art as reflected in 68 technical papers using deep learning methods. In addition, we discuss five key subareas of computer vision and how they related to these CEA problems, as well as 14 vision-based CEA datasets. We hope the survey will help researchers quickly gain a bird’s-eye view of the striving research area and will spark inspiration for new research and development.

1 Introduction

Artificial intelligence (AI), especially computer vision (CV), is finding an ever-broadening range of applications in modern agriculture. The next stage of agricultural technological development, Agriculture 5.0 [15, 100, 232, 352], will constitute AI-driven autonomous decision-making as a central component. The term Agriculture 5.0 stems from a chronology [352] that begins with Agriculture 1.0, which heavily depends on human labor and animal power, and Agriculture 2.0, enabled by synthetic fertilizers, pesticide, and combustion-powered machinery, and develops to Agriculture 3.0 and 4.0, characterized by GPS-enabled precision control, and Internet-of-Things (IoT)-driven data collection [250]. Built upon the rich agricultural data collected, Agriculture 5.0 holds the promise to further increase productivity, satiate the food demand of a growing global population, and mitigate the negative environmental impact of existing agricultural practices.

As an integral component of Agriculture 5.0, controlled-environment agriculture (CEA), a farming practice carried out within urban, indoor, resource-controlled, and sensor-driven factories, is particularly suitable for the application of AI and CV. This is because CEA provides ample infrastructure support for data collection and autonomous execution of algorithmic decisions. In terms of productivity, CEA could produce higher yield per unit area of land [8, 9] and boost the nutritional content of agricultural products [159, 304]. In terms of environmental impact, CEA farms can insulate environmental influences, relieve the need for fertilizer and pesticides, and efficiently utilize recycled resources like water, thereby may be much more environmentally friendly and self-sustainable than traditional farming.

In light of current global challenges, such as disruptions to global supply chains and the threat of climate change, CEA appears especially appealing as a food source for urban population centers. Under pressures of deglobalization brought by geopolitical tensions [362] and global pandemics [233, 268], CEA provides the possibility to build farms close to large cities, which shortens the transportation distance and maintains secure food supplies even when long-distance routes are disrupted. The city-state Singapore, for example, has promised to source 30% of its food domestically by 2030 [1, 306], which is only possible through suburban farms such as CEAs. Furthermore, CEA, as a form of precision agriculture, is by itself a viable solution to the reduction of the emission of greenhouse gasses [9, 37, 243]. CEA can also shield plants from adverse climate conditions exacerbated by climate change, as its environments are fully controlled [112], and is able to effectively reuse the arable land eroded due to climate change [364].

We argue that AI and CV are critical to the economic viability and long-term sustainability of CEAs, as these technologies could save expenses associated with production and improve productivity. Suburban CEAs have high land costs. An analysis in Victoria, Australia [38], shows that, due to the higher land cost resulting from proximity to cities, with an estimated 50-fold productivity improvement per land area, it still takes six to seven years for a CEA to reach the break-even point. Thus, further productivity improvement from AI would act as strong drivers for CEA adoption. Moreover, vertical or stacked setup of vertical farms impose additional difficulty for farmers to perform daily surveillance and operations. Automated solutions empowered by computer vision could effectively solve this problem. Finally, AI and CV technologies have the potential to fully characterize the complex, individually different, time-varying, and dynamic conditions of living organisms [39], which will enable precise and individualized management and further elevate yield. Thus, AI and CV technologies appear to be a natural fit to CEAs.

Most of the recent development of AI can be attributed to the newly discovered capability to train deep neural networks [172] that can (1) automatically learn multi-level representations of input data that are transferable to diverse downstream tasks [65, 136], (2) easily scale up to match the growing size of data [283], and (3) conveniently utilize massively parallel hardware architectures like GPUs [114, 328]. As function approximators, deep learning proves to be surprisingly effective in generalizing to previously unseen data [354]. Deep learning has achieved tremendous success in computer vision [293], natural language processing [47, 83, 118], multimedia [23, 88], robotics [291], game playing [270], and many other areas.

The AI revolution in agriculture is already underway. State-of-the-art neural network technologies, such as ResNet [134] and MobileNet [138] for image recognition, and Faster R-CNN [239], Mask R-CNN [133], and YOLO [235] for object detection, have been applied to the management of crops [194], livestock [140, 299], and plants in indoor and vertical farms [240, 357]. AI has been used to provide decision support in myriad tasks, from DNA analysis [194] and growth monitoring [240, 357] to disease detection [254] and profit prediction [28].

While several surveys have explored the use of CV techniques in agriculture, none of them specifically focus on CEA applications. Some surveys summarize studies based on aspects of practical applications in agriculture. References [74, 89, 123, 146, 278] survey pest and disease detection studies. References [40, 111, 303] discuss fruit and vegetable quality grading and disease detection. Reference [298] summarizes studies in six sub-fields, including crop growth monitoring, pest and disease detection, automatic harvesting/fruit detection, fruit quality testing, automated management of modern farms, and the monitoring of farmland information with Unmanned Aerial Vehicle (UAV). Other surveys organize existing works from a technical perspective, namely, algorithms used [237] or formats of data [56]. Reference [151], as an exception, introduces the development history of CV and AI in smart agriculture without investigating any individual studies. Our work aims to address this gap and provide insights tailored to CEA-specific contexts.

As the volume of research in smart agriculture grows rapidly, we hope the current review article can bridge researchers from both areas of AI and agriculture and create a mild learning curve when they wish to familiarize themselves in the other area. We believe computer vision has the closest connections with, and is the most immediately applicable in, urban agriculture and CEAs. Hence, in this article, we focus on reviewing deep-learning-based computer vision technologies in urban farming and CEAs. We focus on deep learning, because it is the predominant approach in AI and CV research. The contributions of this article are two-fold, with the former targeted at AI researchers and the latter targeted at agriculture researchers:

We identify five major CV applications in CEA and analyze their requirements and motivation. Further, we survey the state-of-the-art as reflected in 68 technical papers and 14 vision-based CEA datasets.

We discuss five key subareas of computer vision and how they relate to CEA. In addition, we identify four potential future directions for research in CV for CEA.

In Figure 1, we provide an graphical preview of our content. It illustrates the end-to-end agriculture process of CEAs, from seed planting to harvest and sales, with five major deep learning-based CV applications—Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification, and Pest and Disease Detection—mapped to the corresponding applicable plant growth stages. We do not survey the autonomous seed planting and harvesting step, as they are more relevant to robot functioning and robotic control, i.e., grasping, carrying, and placing of objects rather than computer vision (we do include the localization of fruit in the fruit and flower detection section that facilitates harvesting robot to locate the targeted object and perform action). However, we provide here some literature related to agriculture robot and end-effector design for reference: [36, 57, 92, 231, 353].

Fig. 1.

We structure the survey following the process in the figure: First, to provide a bird’s-eye view of CV capabilities available to researchers in smart agriculture, we summarize several major CV problems and influential technical solutions in Section 2. Next, we review 68 papers with respect to the application of computer vision in the CEA system in Section 3. The discussion is organized into five subsections: Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification, and Pest and Disease Detection. In the discussion, we focus on fruits and vegetables that are suitable for CEA, including tomato [10, 13, 127, 351], mango [7], guava [269, 333], strawberry [107, 346], capsicum [174], banana [5], lettuce [359], cucumber [10, 128, 200], citrus [4], and blueberry [2]. Next, we provide a summary of 14 publicly available datasets of plants and fruits in Section 4 to facilitate future studies in controlled-environment agriculture. Finally, we highlight a few research directions that could generate high-impact research in the near future in Section 5.

One thing to note here is that, except for the Leaf Instance Segmentation task under the Growth Monitoring section, all the tasks are performed with model trained from different datasets and evaluated on different metrics. Tables 3 4, 5, 6 showcase the variety in datasets and evaluation metrics. This variation results in incomparable performance between studies. Such a phenomenon further indicates the necessity of our survey, which summarizes the current progress in literature and encourages the development of general benchmarks to promote consistency and comparability in future research.

Table 1.

Factors	CV Problems	Example Countermeasures
Environmental Change	OOD Generalization	Collect New Data, Few-shot learning, Weakly-supervised learning, Unsupervised-learning (see Section 3)
Sub-optimal Data Quality	Unbalanced Class Distribution, Label Noise	Multiple-instance Learning, Generate Image of Minority Classes with GANs, Few-shot Meta-learning (see Sections 3.5.2, 3.5.3, and 5.1)
Human Factor	Interpretability, Uncertainty Estimates	Paired Confidence Scores, Meta-learning (see Sections 2.4, 2.5, and 5.2)

Table 1. Factors to Consider when Applying CV Techniques in CEA and Some Corresponding Countermeasures

Table 2.

Category	Technique	SBD ($\uparrow$)	\|DiC\| ($\downarrow$)
Sequential	End-to-end instance segmentation [238]	84.9	0.8
Sequential	RNN-SIS [251]	74.7	1.1
	RIS [245]	66.6	1.1
Pixel Embedding	Semantic Instance Segmentation [80]	84.2	1.0
	Object-aware Embedding [62]	83.1	0.73
	RHN + Cosine Embeddings [221]	84.5	1.5
	Crop Leaf and Plant Instance Segmentation [324]	91.1	1.8
	W-Net (GT-FG) [331]	91.9	-
	SPOCO (GT-FG) [327]	93.2	1.7

Table 2. Performance of Various Leaf Instance Segmentation Techniques on the CVPPP A1 Test Set

Higher SBD and lower |DiC| indicate better performance. (GT-FG) indicates model making use of ground-truth foregrounds.

Table 3.

Category	Technique	Evaluation Metric	Performance	Dataset
FruitObject Detection	[351]	Precision (IoU > 0.5)	94%	1,730 images of cherry tomatoes
	[139]	Accuracy (IoU unspecified)	95.50%	800 images of tomatoes
	[249]	F1 scores (IoU unspecified)	83.80%	122 images of 7 fruits
	[356]	True positive rate and False positive rate (IoU unspecified)	98%, 17%	2,116 self-acquired images of fruits and 511 images of fruits from ImageNet
	[346]	Precision and Recall (IoU > 0.9)	94.4%, 93.5%	2,000 images of strawberries
	[262]	F1 scores (IoU unspecified)	93.5%–95.1%	Mango Image Dataset [157]
FruitSegmentation	[185]	Precision and Recall (IoU unspecified)	98.3%, 94.8%	437 RGB-D images of guavas
	[13]	Precision, Recall, and F1 scores (IoU > 0.5)	96%, 91%, 93%	123 images RGB-D images of tomatoes
	[107]	Precision, Recall, F1 score, and Average Precision (IoU > 0.9)	97%, 92%, 94%, 90%	120 images RGB-D images of strawberries
	[347]	Mean IoU	89.85%	1,900 images of strawberries
	[141]	Accuracy (IoU unspecified)	98%	900 images of strawberries
Flower Object Detection	[199]	Average Precision and F1 scores (IoU > 0.5)	96.2%, 89.0%	1,078 images of citrus buds and flowers
	[284]	Average Precision (IoU > 0.5)	90.50%	5,624 images of tomato flower and fruit
	[285]	IoU, F1 scores, Recall, and Precision (IoU unspecified)	81.1%, 89.6%, 91.9%, 87.3%	Multi-species fruit flower detection [85]

Table 3. Performance of Various Fruit and Flower Detection Techniques

Datasets without reference are unpublished datasets.

Table 4.

Category	Technique	Evaluation Metric	Performance	Dataset
Count Regression	[234]	Accuracy	91.0%–93%	4,800 synthetic tomato images
Count Fruit Bounding Boxes	[157]	F1 scores, Average Precision(IoU > 0.24)	96.8%, 98.3%	MangoYolo Dataset [157]
Count Fruit Bounding Boxes	[320]	$R^{2}$, RMSE	0.66, 2.1	MangoYolo Dataset [157]
Count Fruit Segmentation Masks	[154]	Accuracy, F1 score (IoU > 0.6)	73.6%, 84.4%	12,590 images of mangoes
Count Fruit Segmentation Masks	[215]	Average Precision (IoU > 0.5), RMSE	71.6%, 1.484	724 images of blueberries

Table 4. Performance of Various Fruit Counting Techniques

Datasets without reference are unpublished datasets. Reference [234] uses direct regression method and thus does not need IoU threshold.

Table 5.

Category	Technique	Evaluation Metric	Performance	Dataset
Classification	[357]	Accuracy	91.9%	200 images of tomatoes
Classification on Bounding Boxes	[346]	Precision, Recall (IoU > 0.9)	94.4%, 93.5%	2,000 images of strawberries
Classification on Bounding Boxes	[124]	F1 score (IoU > 0.4)	77.30%	285 images of capsicums
Classification on Segmentation Masks	[13]	Precision, Recall, and F1 scores (IoU > 0.5)	-	123 images RGB-D images of tomatoes
	[107]	Precision, Recall, F1 score, and Average Precision (IoU > 0.9)	-	120 images RGB-D images of strawberries
	[347]	Precision, Recall (IoU > 0.9)	95.78%, 95.41%	1,900 images of strawberries
	[141]	Class frequency weighted precision and recall (IoU Unspecified)	96.1%, 96.0%	900 images of strawberries

Table 5. Performance of Various Maturity-level Classification Techniques

Datasets without reference are unpublished datasets. Performance “-” are papers with unsummarizable metric results. Reference [357] uses direct classification method and thus does not need IoU threshold.

Table 6.

Category	Technique	Evaluation Metric	Performance	Dataset
Single- and Multi-label Classification	[361]	Accuracy$^{*}$	94.65%	700 diseased and normal leaf images
	[272]	Accuracy$^{*}$	97.13%	1,070 self-acquired leaf images, and 1,130 images from the Plant Village dataset [143]
	[24]	Accuracy$^{*}$	93.33%	Images of 643 leaf samples
	[254]	mAP$^{*}$	72.8%–97.9%	12,600 images of bananas
	[95]	Accuracy$^{*}$	99.50%	87,848 images of leaves
	[201]	Accuracy$^{*}$	93.40%	Plant Village dataset [143]
	[103]	mAP (IoU > 0.5)	86%	5,000 images of diseases and pests of tomatoes
	[349]	mIOU, recall, and F1-score (IoU unspecified)	84.8%, 88.1%, 91.8%	Plant Village dataset [143]
Handling Unbalanced Class Distribution	[44]	Accuracy$^{*}$	60.7%–91.8%	IP102 [330], Citrus Pest Benchmark [44]
	[116]	Average Precision	67%–85%	Plant Village dataset [143] and Plant Leaves [267]
	[216]	Accuracy$^{*}$	88.5%–95.5%	Plant Village dataset [143] and Plant and Pest [182]
	[182]	Accuracy$^{*}$	43.9%–81%	Plant Village [143], Crop Pests Recognition [181]
Noise and Uncertainty Estimate	[263]	Accuracy$^{*}$	94.58%	Plant Village [143]
Noise and Uncertainty Estimate	[99]	Accuracy$^{*}$	-	15,892 images of tomatoes from Plant Village [143], extra 8,911 images of corn, 6,635 images of soybeans

Table 6. Performance of Various Pest and Disease Detection Techniques

Datasets without reference are unpublished datasets. Performance “-” are papers with unsummarizable metric results. *$^{*}$Studies perform direct classification on image and thus do not need IoU threshold. Reference [116] uses patch-level segmentation that does not need IoU threshold as well.

2 Computer Vision Capabilities Relevant to Smart Agriculture

2.1 Image Recognition

The classic problem of image recognition is to classify an image containing a single object to the corresponding object class. The success of deep convolutional networks in this area dates (at least) back to LeNet [173] of 1998, which recognizes hand-written digits. The fundamental building block of such networks is the convolution operation. Using the principles of local connections and weight sharing, convolutional networks benefit from an inductive bias of translational invariance. That is, a convolutional network applies (approximately) the same operation to all pixel locations of the image.

The victory of AlexNet [163] in the 2012 ImageNet Large Scale Visual Recognition Challenge [247] is often considered as a landmark event that introduced deep neural networks into the AI mainstream. Subsequently, many variants of convolutional networks [148, 170, 271, 289] have been proposed. Due to space limits, here, we provide a brief review of a few influential works, which is by no means exhaustive. ResNet [134] introduces residual connections that allow the training of networks of more than 100 layers. ResNeXT [336] and MobileNet [138] employ grouped convolution that reduces interaction between channels and improves the efficiency of the network parameters. ShuffleNet [365] utilizes the shuffling of channels, which complements group convolution. EfficientNet [292] shows simultaneous scaling of the network width, height, and image resolution is key to efficient use of parameters.

Recently, the transformer model has proven to be a highly competitive architecture for image recognition and other computer vision tasks [90]. These models cut the input image into a sequence of small image patches and often apply strong regularization such as RandAugment [75]. Variants such as CaiT [302], CeiT [350], Swin Transformer [195], and others [72, 78, 337, 371] achieve outstanding performance on ImageNet.

Despite the maturity of the technology for image classification, the assumption that an image contains only one object may not be easily satisfied in real-world scenarios. Thus, it is often necessary to adopt a problem formulation as object detection or semantic/instance segmentation.

2.2 Object Detection

The object detection task is to identify and locate all objects in the image. It can be understood as the task resulted from relaxing the assumption that the input image contains a single object. This is one natural problem formulation for real-world images and has seen wide adoption in agricultural applications.

In broad strokes, contemporary object detection methods can be categorized into anchor-box-based and point-based/proposal-free approaches. In anchor-box methods [110, 239], the process starts with a number of predefined anchor boxes that are periodically tiled to cover the entire input image. For each anchor box, the network makes two types of predictions. First, it determines if the anchor box contains one of the predefined object classes. Second, if the box contains an object, then the network attempts to move and reshape the box to become closer to the ground-truth location of the object. One-stage anchor-box detectors [77, 101, 187, 193, 236, 367] make these predictions all at once. In comparison, two-stage detectors [110, 132, 186, 239], in the first stage discard anchor boxes that do not contain any object and classify the remaining boxes into finer object categories in the second stage. The location adjustment, known as bounding box regression, can happen in both stages. It is also possible to employ more than two stages [48]. When the objects have diverse shapes and scales, these methods must create a large number of proposal boxes and evaluate them all, which can lead to high computational cost.

While point-based object detectors [91, 158, 171, 300, 372] still need to identify rectangular boxes around the objects, they make predictions at the level of grid locations on the feature maps. The networks predict if a grid location is a corner or the center of an object bounding box. After that, the algorithm assembles the corners and centers into bounding boxes. The point-based approaches can reduce the total number of decisions to be made. A careful comparison and analysis of anchor-box methods and point-based methods can be found in Reference [360].

2.3 Semantic, Instance, and Panoptic Segmentation

Segmentation is a pixel-level classification task, aiming to classify every pixel in the image into a type of object or an object instance. The variations of the task differ by their definitions of the classes. In semantic segmentation [73, 94, 113, 167, 196], each type of object, such as cat, cow, grass, or sky, is its own class, but different instances of the same object type (e.g., two cats) share the same class. In instance segmentation [76, 129, 131, 226], different instances of the same object type become unique classes, so two cats are no longer the same class. However, object types such as sky or grass, which are not easily divided into instances, are ignored. In the recently proposed panoptic segmentation [69, 109, 155, 178, 189, 363], objects are first separated into things and stuff. Things are countable and each instance of things is its own class, whereas stuff is uncountable, impossible to separate into instances, appearing as texture or amorphous regions [12], and remains as one class. We note that the distinction between things and stuff is not rigid and can change, depending on the application. For example, grass is typically considered as stuff, but in the leaf instance segmentation task, each leaf of a plant becomes an instance and is a separate class.

The primary requirement of pixel-level classification is to learn pixel-level representations that consider sufficient context and within reasonable computational budget. A typical solution is to introduce a series of downsampling followed by a series of upsampling operations. Since classic works such as the Fully Convolutional Network (FCN) [196] and U-Net [246], this has been the mainstream strategy for various segmentation strategies.

Due to its use in leaf segmentation, a problem in plant phenotyping, instance segmentation may be the most relevant segmentation formulation for urban farming. Despite the apparent similarity to semantic segmentation, instance segmentation poses challenges due to the variable number of instance classes and possible permutation of class indices [80]. This could be handled by combining proposal-based object detection and segmentation [61, 68, 129, 180, 227]. Mask-RCNN [132] exemplifies this approach. Leveraging its object detection capability, the network associates each object with a bounding box. After that, the network predicts a binary mask for the object within the bounding box. However, such methods may not perform well when there is substantial occlusion among objects or when objects are of irregular shapes [80].

Departing from the detect-then-segment paradigm, recurrent methods [238, 245, 251] that output one segmentation mask at one time may be considered as implicitly modeling occlusion. Pixel embedding methods [62, 80, 213, 221, 326, 331, 344] learn vector representations for every pixel and cluster the vectors. These methods are especially suitable for segmenting plant leaves, and we will discuss them in greater detail in Section 3.1. Taking a page from the proposal-free object detector YOLO [235], SOLO [316] and SOLOv2 [317] divide the image into grids. The grid that the center an object falls into is responsible for predicting the segmentation mask of the object.

2.4 Uncertainty Quantification

Real-world applications often require qualification of the amount of uncertainty in the predictions made by machine learning, especially when the predictions carry serious implications. For example, if the system incorrectly determines that fruits are not mature enough, then it may delay harvesting and cause overripe fruits with diminished values. Thus, users of the ML system are justified to ask how certain we are about the decision. In addition, when facing real-world input, it is desirable for the network to answer “I don’t know” when facing an input that it does not recognize [183]. Well-calibrated uncertainty measurements may enable such a capability.

However, research shows that deep neural networks exhibit severe vulnerability to overconfidence, or under-estimation of the uncertainty in its own decisions [117, 203]. That is, the accuracy of the network decision is frequently lower than the probability that the network assigns to the decision. As a result, proper calibration of the networks should be a concern for systems built for real-world applications.

Calibration of deep neural networks may be performed post hoc (after training) using temperature scaling and histogram binning [87, 117, 312]. Also, regularization during training such as label smoothing [289] and mixup [137] have been shown to improve calibration [211, 224, 297]. Researchers propose new loss functions to replace existing ones that are susceptible to overconfidence [210, 343]. Moreover, ensemble methods such as Vertical Voting [335], Batch Ensemble [323], and Multi-input Multi-output [130] can derive uncertainty estimates.

2.5 Interpretability

Modern AI systems are known for their inability to provide faithful and human-understandable explanations for its own decisions. The unique characteristics of deep learning, such as network over-parameterization, large amount of training data, and stochastic optimization, while being beneficial to the predictive accuracy (e.g., References [27, 179, 274, 281]), all create obstacles toward understanding how and why a neural network reaches its decisions. The lack of human-understandable explanations leads to difficulties in the verification and trust of network decisions [52, 366].

We categorize model interpretation techniques into a few major classes, including visualization, feature attribution, instance attribution, inherently explainable models, and approximation by simple models. Visualization techniques present holistically what the model has learned from the training data by visualizing the model weights for direct visual inspection [34, 93, 97, 202, 209, 290]. In comparison, feature attribution and instance attribution are often considered as local explanations, as they aim to explain model predictions on individual samples. Feature attribution methods [22, 58, 60, 207, 230, 255, 265, 273, 287, 340] generate a saliency map of an image or video frame, which highlights the pixels that contribute the most to its prediction. Instance attribution methods [32, 46, 67, 156, 229, 265, 341] attribute a network decision to training instances that, through the training process, exert positive or negative influence on the particular decision. Moreover, inherently explainable models [33, 59, 175, 259, 345] incorporate explainable components into the network architecture, which reduces the need to apply post hoc interpretation techniques. In contrast, researchers also try to post hoc approximate complex neural networks with simple models such as rule-based models [84, 102, 115, 153, 222, 318] or linear models [14, 105, 106, 160, 241] that are easily understandable.

The most significant benefit of interpretation in the context of CEA lies in its ability to aid with the auditing and debugging of AI systems and datasets. With feature attribution, users can make sure the system captures the robust features, or semantically meaningful features, that generalize to real-world data. As in the well-known case of husky vs. wolf image classification, due to a spurious correlation, the neural network learns to classify all images with white backgrounds as wolf and those with green backgrounds as husky [206]. Such shortcut learning can be identified by feature attribution and subsequently corrected. Moreover, instance attribution allows researchers to pinpoint outliers or incorrectly labeled training data that may lead to misclassification [67].

3 Controlled-environment Agriculture

CEA is the farming practice carried out within urban, indoor, resource-controlled factories, often accompanied by stacked growth levels (i.e., vertical farming), renewable energy, and recycling of water and waste. CEA has recently been adopted in nations around the world [38, 82], such as Singapore [161], North America [6], Japan [9, 264], and the UK [8].

CEA has economic and environmental benefits. Compared to traditional farming, CEA farms produce higher yield per unit area of land [8, 9]. Controlled environments shield the plants from seasonality and extreme weather, so plants can grow all year round given suitable lighting, temperature, and irrigation [38]. The growing conditions can as well be further optimized to boost growth and nutritional content [159, 304]. Rapid turnover increases farmers’ flexibility in plant choice to catch the trend of consumption [35]. Moreover, farms investment on pesticides, herbicides, and transportation can be cut down due to reduced contamination from the outside environment and proximity to urban consumers.

CEA farms, when designed properly, can become much more environmentally friendly and self-sustainable than traditional farming. With optimized growing conditions and limited external interference, the need for fertilizer and pesticides decreases, so we can reduce the amount of chemicals that go into the environment as well as the resulting pollution. Furthermore, CEA farms can save water and energy through the use of renewable energy and aggressive water recycling. For instance, CEA farms from Spread, a Japanese company, recycle 98% of used water and reduce the energy cost per head of lettuce by 30% with LED lighting [9]. Finally, CEA farm can be situated in urban or suburban areas, thereby reducing transportation and storage cost. A simulation for different farm designs in Lisbon shows vertical tomato farms with appropriate designs emit less greenhouse gas than conventional farms, mainly due to reduced water use and transportation distance [37].

A significant drawback of CEA, however, lies in its high cost, which may be partially addressed by computer vision technologies. According to Reference [38], the higher land cost in Victoria, Australia, means that the yield of vertical farms has to be at least 50 times more than traditional farming to break even. Computer vision holds the promise of boosting the level of automation and increasing yield, thereby making CEA farms economically viable. As would be discussed in the following sections, CV techniques can reduce a major amount of variable costs, such as wastage cost induced by incorrect or delayed decisions on harvesting, and provide long-term benefit.

Carrying the potential to reduce a significant amount of cost, setting up computer vision systems in the field costs significantly less than expected when compared to the expenses of constructing a CEA building. Building a CEA structure involves high upfront costs, including construction, insulation, lighting, and HVAC systems. According to Reference [264], a 1,300-square-meter CEA building with a production area of 4,536 square meters would require a capital investment of $7.4 million and incur annual operational costs of approximately $3.4 million.

However, setting up hardware systems for CV models is relatively inexpensive. The necessary components include servers (CPU, GPU, memory, storage), sensors, cameras, networking, as well as cooling system. For example, a server with specifications such as a 32-Core 2.80 GHz Intel Xeon Platinum 8462Y+, 128 G memory, 4 NVIDIA RTX A6000 “Ada” GPUs, and 2 TB storage costs around $60,000. Using this server for training purposes, assuming a standard VGG-16 architecture, training on 5,000 images of size 224 $\times$ 224 pixels, with a batch size of 64 and 50 training epochs, and utilizing 4 NVIDIA A6000 GPUs, the estimated training time is less than an hour. Such a server is sufficient for daily training and inference of commonly used CV models. For a camera system, if we consider 10 surveillance cameras such as the Hikvision DS-2CD2142FWD-I, then the total cost would be around $1,400. Additionally, a high-speed network infrastructure is required to transfer data between the computer hardware, storage, and camera systems. Typically, it necessitates 4 to 7 routers to cover an area of 1,300 square meters, costing approximately $2,000. Finally, a liquid cooling system could cost between $1,000 and $2,000. In summary, a hardware system with a total cost of around $70,000 is sufficient for the daily operation, training, and inference of CV systems.

CEA can take diverse form factors [35], and the form factors may pose different requirements for computer vision technologies. Typical forms for CEA are glasshouses with transparent shells or completely enclosed facilities. Depending on the cultivars being planted, internal arrangement of the farm can be classified into stacked horizontal systems, vertical growth surfaces, and multi-floor towers. Form factors have influence on lighting, which is an important consideration in CV applications. For example, glasshouses with transparent shells utilize natural light to reduce energy consumption but may not provide sufficient lighting for CV around the clock. In comparison, a completely enclosed facility can have greater control of lighting conditions. Moreover, internal arrangement of the farm also affects camera angle. If cultivars being planted change frequently as a result of the high turnover rate in CEAs, then the arrangement of shelves and plants might change. This would affect the camera angles and thus the resulting inference performance. CV systems need adapt to the change of the environment.

Nevertheless, with the autonomous setup of CEAs, which allow easy new data collection, training a new CV model or fine-tuning a previous model to adapt to the above-mentioned changeable environment would be a cinch. Besides, there are also few-shot learning [286, 319], weakly-supervised learning[16, 218, 373], and unsupervised learning techniques [49, 253], which require minimal or zero annotations, that can facilitate the adjustment of the models.

Besides environmental change, there also exist other factors that need to be taken into account when applying CV techniques in CEA. Two typical problems to consider would be (1) How to cope with sub-optimal data with label noise and how to address unbalanced class distribution. (2) How to interpret the prediction from models or measure the uncertainty of prediction so users can use the models with confidence. Quantitative measure of the confidence or uncertainty would allow farmers to understand the decision generation process and make decisions with more confidence. Table 1 maps the above factors - environmental change, sub-optimal data quality, human factor - to CV problems and lists corresponding solutions and the respective sections that discuss the solutions.

In the following, we investigate the application of autonomous computer vision techniques on Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification, and Pest and Disease Detection to increase production efficiency. In addition to existing applications, we include techniques that can be easily applied to vertical farms even though they have not yet been applied to them.

3.1 Growth Monitoring

Growth monitoring, a critical component of plant phenotyping, aims to understand the life cycle of plants and estimate yield by monitoring various growth indicators such as the plant size, number of leaves, leaf sizes, land area covered by the plant, and so on. Plant growth monitoring facilitates in quantifying the effects of biological/environmental factors on growth and thus is crucial for finding the optimal growing condition and developing high-yield crops [212, 294].

As early as 1903, Wilhelm Pfeffer recognized the potential of image analysis in monitoring plant growth [225, 279]. Traditional machine vision techniques such as gray-level pixel thresholding [220], Bayesian statistics [45], and shallow learning techniques [147, 348] have been applied to segment the objects of interest, such as leaves and stems, from the background to analyze plant growth. Compared to traditional methods, deep-learning techniques provide automatic representation learning and are less sensitive to image quality variations. For this reason, deep learning techniques for growth monitoring have recently gained popularity.

Among various growth indicators, leaf size and number of leaves per plant are the most commonly used [121, 169, 252]. Therefore, in the section below, we first discuss leaf instance segmentation, which can support both indicators at the same time, followed by a discussion of techniques for only leaf counting or for other growth indicators.

3.1.1 Leaf Instance Segmentation.

Due to the popularity of the CVPPP dataset [204], the segmentation of leaf instance has attracted special attention from the computer vision community and warrants its own section. leaf instance segmentation methods include recurrent network methods [238, 245] and pixel embedding methods [62, 80, 221, 324, 331]. Parallel proposal methods are popular for general-purpose segmentation (see Section 2.3), but are ill-suited for leaf segmentation. As most leaves have irregular shapes, the rectangle proposal boxes used in these methods do not fit the leaves well, resulting in many poorly positioned boxes. In addition, the density of leaves causes many proposal boxes to overlap and compounds the fitting problem. As a result, it is difficult to pick out the best proposal box from the large number of parallel proposals. Therefore, we focus on recurrent network-based methods and pixel embedding-based methods in this section. Quality metrics for leaf segmentation include Symmetric Best Dice (SBD) and Absolute Difference in Count (|DiC|). SBD calculates the average overlap between the predicted mask and the ground truth for all leaves. DiC calculates the average number of miscalculated leaves over the entire test set.

Recurrent network-based methods output a mask for a single leaf sequentially. Their decisions are usually informed by the already segmented parts of the image, which are summarized by the recurrent network. Reference [238] applies LSTM and DeconvNet to segment one leaf at a time. The network first locates a bounding box for the next leaf and performs segmentation within that box. After that, leaves segmented in all previous iterations are aggregated by the recurrent network and passed to the next iteration as contextual information. Reference [245] employs convolution-based LSTMs (ConvLSTM) with FCN feature maps as input. At each time step, the network outputs a single-leaf mask and a confidence score. During inference, the segmentation stops when the confidence score drops below 0.5. Reference [251] proposes another similar method that combines feature maps with different abstraction levels for prediction.

Pixel embedding methods learn vector representations for the pixels so pixels in irregularly shaped leaves can become regularly shaped clusters in the representation space. With that, we can directly cluster the pixels. Reference [324] performs simultaneous instance segmentation of leaves and plants. The authors propose an encoder-decoder framework, based on ERFNet [244], with two decoders. One decoder predicts the centroids of plants and leaves. The other decoder predicts the offset of each leaf pixel to the leaf centroid. The pixel location plus the offset vector hence should be very close to the leaf centroid. The dispersion among all pixels of the same leaf can be modeled as a Gaussian distribution, whose covariance matrix is also predicted by the second decoder and whose mean is from the first decoder. The training maximizes the Gaussian likelihood for all pixels of the same leaf. The same process is applied to pixels of the same plant.

References [62, 221, 331] are three similar pixel embedding methods. They encourage pixels from the same leaf to have similar embeddings and pixels from different neighboring leaves to have different embeddings to enable clustering in the embedding space. Their network consists of two modules, the distance regression module and pixel embedding module. References [221, 331] arrange the two modules in sequence, while Reference [62] places them in parallel. The distance regression module predicts the distance between the pixel and the closest object boundary. The pixel embedding module generates an embedding vector for each pixel, so pixels from the same leaves have similar embeddings and pixels from different neighboring leaves have different embeddings. During inference, pixels are clustered around leaf centers, which are identified as local maxima in the distance map from the distance regression module.

Last, References [80, 327] take a large-margin approach. They ensure that embeddings of pixels from the same leaf are within a circular margin of the leaf center, and the embedding of leaf centers are far away from each other. This removes the need to determine the leaf centroids during inference because the embeddings are already well separated. Reference [327] built upon the method in Reference [80] to perform pixel embedding and clustering of leaves under weak supervision, with annotation on only a subset of instances in the images. In addition, a differentiable instance-level loss for a single leaf is formed to overcome the non-differentiability of assigning pixels to instances by comparing a Gaussian shape soft mask with the corresponding ground truth mask. Finally, consistency regularization, which encourages accordance of two embedding frameworks, is applied to improve embedding for unlabeled pixels.

Comparing different approaches, proposal-free pixel embedding techniques seem to be the best choice for the leaf segmentation problem. As can be seen from Table 2, pixel embedding methods obtain both the highest SBD and lowest |DiC|. One thing to note here, however, is that superior results of W-Net [331] and SPOCO [327] could be attributed to the inclusion of ground-truth foreground masks during inference. Even though the recurrent approach does not generate a large number of proposal boxes at once, it still uses rectangular proposals, which means that it still suffers from the fitting problem to irregular leaf shapes. Moreover, the recurrent methods are usually slower than pixel embeddings, due to the temporal dependence between the leaves.

3.1.2 Leaf Count and Other Growth Metrics.

Leaf counts may be estimated without leaf segmentation. Reference [305] utilizes synthetic data in the leaf counting task. The authors employ the L-system-based plant simulator lpfg [3, 228] to generate Arabidopsis rosette images. The authors test a CNN, trained with only synthetic data, on real data from CVPPP and obtain superior result than a model trained with CVPPP data only. In addition, CNN trained with the combination of synthetic and real data obtained approximately 27% reduction in the mean absolute count error compared to CNN using only real data. These results demonstrate the potential of synthetic data in plant phenotyping.

Besides leaf size and leaf count, leaf fresh weight, leaf dry weight, and plant coverage (the area of land covered by the plant) are also used as metrics of growth. Reference [359] applies CNN to regress leaf fresh weight, leaf dry weight, and leaf area of lettuce on RGB images. Reference [240] makes use of Mask R-CNN, a parallel proposal method, for lettuce instance segmentation. The authors derive plant attributes such as contour, side view area, height, and width from the segmentation masks and bounding boxes, using preset formulas. They also estimate growth rate from the changes in area of the plant at each time step; they estimate fresh weight by linearly regressing from the attributes. Reference [198] leverages COCO dataset pretrained Mask R-CNN with ResNet-50 as backbone to segment lettuce leaves. The daily change of mean leaf area is used for growth rate calculation.

3.2 Fruit and Flower Detection

Algorithms for fruit and flower detection find the location and spatial distribution of fruits and fruit flowers. This task supports various downstream applications such as fruit count estimation, size estimation, weight estimation, robotic pruning, robotic harvesting, and disease detection [31, 108, 199, 342]. In addition, fruit or flower detection may help devise plantation management strategies [108, 127], because fruit or flower statistics such as positions, facing directions (the directions the flowers face), and spatial scatter can reveal the status of the plant and the suitability of environmental conditions. For example, the knowledge of flower distribution may allow pruning strategies that focus on regions of excessive density and achieve even distribution of fruits, which optimizes the delivery of nutrients to the fruits.

Traditional approaches for fruit detection rely on manual feature engineering and feature fusion. As fruits tend to have unique colors and shapes, one natural thought is to apply thresholding on color [219, 322] and shape information [192, 217]. Additionally, References [55, 184, 208] employ a combination of color, shape, and texture features. However, manual feature extraction suffers from brittleness when the image distribution changes with different camera resolutions, camera angles, illumination, and species [30].

Deep learning methods for fruit detection include object detection and segmentation. Reference [351] applies SSD for cherry tomato detection. Reference [139] leverages Faster R-CNN to detect tomatoes. Inside the generated bounding boxes, color thresholding and fuzzy-rule-based morphological processing methods are applied to remove image background and obtain the contours of individual tomatoes. Reference [249] leverages Faster R-CNN with VGG-16 as the backbone for sweet pepper detection. RGB and near-infrared (NIR) images are used together for detection. Two fusion approaches, early and late fusion, are proposed. Early fusion alters the first pretrained layer to allow four input channels (RGB and NIR), whereas late fusion aggregates the two modalities by training independent proposal models for each modality and then combining the proposed boxes by averaging the predicted class probabilities. Reference [356] trains three multi-task cascaded convolutional networks (MTCNN) [355] for detecting apples, strawberries, and oranges. MTCNN contains a proposal network, a bounding box refinement network, and an output network in a feature pyramid architecture with gradually increased input sizes for each network. The model is trained on synthetic images, which are random combinations of cropped negative patches and fruits patches, in addition to real-world images. Reference [346] proposed R-YOLO with MobileNet-V1 as the backbone to detect ripe strawberries. Different from regular horizontal bounding boxes in object detection, the model generates rotated bounding boxes by adding a rotation-angle parameter to the anchors.

Delicate fruits, such as strawberries and tomatoes, are particularly vulnerable to damage during harvesting. Therefore, much research has been devoted to segmenting such fruits from backgrounds to determine the precise picking point. Precise fruit masks are expected to enable robotic fruit picking while avoiding damages on the neighboring fruits. Reference [185] performs semantic segmentation for guava fruits and determines their poses using FCN with RGB-D images as input. The FCN outputs a binary mask for fruits and another binary mask for branches. With the fruit binary mask, the authors employ Euclidean clustering [248] to cluster single guava fruit. From the clustering result and the branch binary mask, fruit centroids and the closest branch are located. Finally, the system predicts the vertical axis of the fruit as the direction perpendicular to the closest branch to facilitate robotic harvesting. Similarly, Reference [13] leverages Mask R-CNN with ResNet as backbone for semantic segmentation of tomatoes. In addition, the authors filter the false positive detection of tomatoes from the non-targeted rows by setting a depth threshold. Reference [107] utilizes Mask R-CNN with a ResNet101 backbone to perform instance segmentation of ripe strawberries, raw strawberries, straps, and tables. Depth images are aligned with the segmentation mask to project the shape of strawberries into 3D space to facilitate automatic harvesting. Reference [347] also applies Mask R-CNN with a ResNet101 + FPN backbone to perform instance segmentation and ripeness classification on strawberries. Reference [141] leverages a similar network for instance segmentation of tomatoes. With the segmentation mask, the systems determine the cut points of the fruits.

Besides accuracy, the processing speed of neural networks is also important for their deployment on mobile devices or agricultural robots. Reference [262] performs network pruning on YOLOv3-tiny to form a lightweight mango detection network. A YOLOv3-tiny pretrained on the COCO dataset has learned to extract fruit-relevant features, because the COCO dataset contains apple and orange images, but it also has learned irrelevant features. The authors thus use a generalized attribution method [266] to determine the contribution of each layer to fruit features extraction and remove convolution kernels responsible for detecting non-fruit classes. They find that the lower-level features are shared across all classes detection and pruning in the higher layers does not harm fruit detection performance. After pruning, the network achieves significantly lower float-point operations (FLOPs) at the same level of accuracy.

Object detection is also applied for flower detection. Reference [199] proposes a modified YOLOv4-Tiny with cascade fusion (CFNet) to detect citrus buds, citrus flowers, and gray mold, which is a disease commonly found on citrus plants. The authors propose additionally a block module with channel shuffle and depth separable convolution for YOLOv4-Tiny. Reference [284] shrinks the anchor boxes of Faster-RCNN to fit small fruits and applies soft non-maximum suppression to retain boxes that may contain occluded objects. As flowers usually have similar morphological characteristics, flowers from other non-targeted species could possibly be used as training data in a transfer learning scenario. In Reference [285], the authors fine-tune a DeepLab-ResNet model [63] for fruit flower detection. The model is trained on apple flower dataset but achieves high F1 scores on pear and peach flower images (0.777 and 0.854, respectively).

3.3 Fruit Counting

Pre-harvest estimation of yields plays an important role in the planning of harvesting resources and marketing strategies [135, 338]. As fruits are usually sold to consumers as a pack of uniformly sized fruits or individual fruits, the fruit count also provides an effective yield metric [157], besides the distribution of fruit sizes. Traditional yield estimation is obtained through manual counting of samples from a few randomly selected areas [135]. Nonetheless, when the production is large-scale, to counteract the effect of plant variability, accurate estimation would require a large quantity of samples from different areas of the field, resulting in high cost. Thus, researchers resort to CV-based counting methods.

A direct counting method is to regress on the image and output the fruit count. In Reference [234], the authors apply a modified version of Inception-ResNet for direct tomato counting. The authors train the model on simulated images and test on real images, which suggests, once again, the viability of using simulated images to circumvent the cost for formulating a large dataset.

Besides direct regression, object detection [157, 320], semantic segmentation [154], and instance segmentation [215] have also been used for fruit counting. These methods provide an intermediate level of results from which the count can be easily gathered. Reference [157] proposes MangoYOLO based on YOLOv2-tiny and YOLOv3 for mango detection and counting. The authors increase the resolution of the feature map to facilitate detection of small fruits. Reference [124] proposes pre-trained Faster R-CNN network, building upon DeepFruits [249], to estimate the quantity of sweet pepper. The authors design a tracking sub-system for sweet pepper counting. The sub-system identifies new fruits by measuring the IoU between and comparing the boundary of detected and new fruits. Reference [154] performs semantic segmentation for mango counting using a modification of FCN. The coordinates of blob-like regions in the semantic segmentation mask are used to generate bounding boxes corresponding to mango fruits. Finally, Reference [215] applies Mask R-CNN to for instance segmentation of blueberries. The model also classifies the maturity of individual blueberries and counts the number of berries according to the masks.

Occlusion poses a difficult challenge for counting. Due to this issue, automatic count from detection or segmentation results is almost always lower than the actual number of fruits. To solve this, [157] calculates and applies the ratio between the actual hand harvest count and the automatic fruit count; it also uses both front and back views of mango trees to mitigate occlusion from one angle. Taking this idea one step further, Reference [320] uses dual-view videos to detect and track mangoes when the camera moves. Utilizing different views of the same tree in a video, Reference [320] recognizes around 20% more fruits. However, the detected count is still significantly lower than the actual number, underscoring the research challenge of exhaustive and accurate counting.

3.4 Maturity-level Classification

Maturity-level classification aims to determine the ripeness of fruits or vegetables to aid in proper harvesting and food quality assurance. Premature harvesting results in plants that are unpalatable or incapable of ripening, while delayed harvesting can result in overripe plants or food decay [141].

The optimal maturity level differs for different targeted products and destinations. Fruits and vegetables can be consumed at different growing stages. For example, lettuce can be consumed either as baby lettuce or fully grown lettuce. The same situation happens with baby corn and normal corn. Products are to be transported to different destinations, so we must consider the length of transportation and ripening speed when deciding the correct maturity level at harvest [358].

Manually distinguishing the subtle differences in maturity levels is time-consuming, prone to inconsistency, and costly. The labor cost of harvesting accounts for a large percentage of operation cost in farms, with 42% of variable production expenses in U.S. fruit and vegetable farms being spent on labor for harvesting [142]. Automatic maturity-level classification with computer vision, in contrast, can assist automatic harvesting [20, 107, 358] and reduce cost.

Similar to fruit detection, we can apply thresholding methods on color to detect ripeness. For example, Reference [25] applies color thresholding on HSI and YIQ color spaces. Reference [296] applies linear color models. Reference [176] utilizes the combination of color and texture features. References [96, 165, 256, 257, 329] apply shallow learning methods based on a multitude of features.

More recently, researchers evaluate the performance of deep learning-based computer vision methods on maturity level classification and attain satisfactory results. For example, Reference [357] applies CNN to classify tomato maturity into five levels. However, to further facilitate automatic harvesting, object detection and instance segmentation are more commonly used for getting the exact shape, location and maturity level of fruits, and position of peduncles for robotic end-effectors to cut on.

With object detection, Reference [346] applies the R-YOLO network described in the fruit detection section (Section 3.2) to detect ripe strawberries. Reference [124], as mentioned in the fruit counting section (Section 3.3), proposes pre-trained Faster R-CNN network to estimate both the ripeness and quantity of sweet pepper. Two formulations of the model are tested. One treats ripe/unripe as additional classes on top of foreground/background, and the other performs foreground/background classification first and then performs ripeness classification on foreground regions. The second approach generates better ripeness classification results, as the ripe/unripe classes are more balanced when only the foreground regions are considered.

Using the segmentation methods discussed in Section 3.2, Reference [13] classifies semantic segmentation masks of tomatoes into raw and ripe tomatoes. References [107, 347] perform instance segmentation and classify instance masks into ripe and raw strawberries. Reference [141] performs instance segmentation on tomatoes first. After transforming the mask region into HSV color space, the authors employ a fuzzy system to classify tomatoes into four classes: immature (completely green), breaker (green to tannish), preharvest (light red), and harvest (fully colored).

3.5 Pest and Disease Detection

Plants are susceptible to environmental disorders caused by temperature, humidity, nutritional excess/deficiency, light changes, and biotic disorders due to fungi, bacteria, virus or other pests [103, 272]. Infectious diseases or pest pandemic induce inferior plant quality or plant death, resulting in at least 10% of global food production losses [282].

Although controlled vertical farming restricts the entry of pests and diseases, it cannot eliminate them. Pests and diseases can enter the farm from accidental contamination from employees, seeds, irrigation water and nutrient solution, poorly maintained environment or phytosanitation protocols, unsealed entrance, and ventilation systems [242]. For this reason, pest and disease detection is still worth studying in the context of CEA.

Manual diagnosis of plant is complex due to the large quantity of vertically arranged plants in the field and numerous possible symptoms of diseases on different species. In addition, plants show different patterns along infection cycles, and their symptoms can vary in different parts of the plant [43]. Consequently, autonomous computer vision systems that recognize diseases according to the species and plant organs are gaining traction. From a technological perspective, we sort existing techniques into three parts, single- and multi-label classification, handling unbalanced class distributions, as well as label noise and uncertainty estimates.

3.5.1 Single- and Multi-label Classification.

Studies perform single-label, or one-label-per-image, classification of diseases of either one single species [24, 254, 272, 361] or multiple species [95]. Reference [361] creates a lightweight version of AlexNet, replacing the fully connected network with a global pooling layer, to classify six types of cucumber diseases. Reference [272] leverages CNNs for classifying leaves into mango leaves, diseased mango leaves, and other plant leaves. Reference [24] utilizes AlexNet and VGG16 to recognize five types of pests and diseases of tomatoes. Reference [95] applies AlexNet, AlexNetOWTBn [162], GoogLeNet, Overfeat [258], and VGG for classifying 25 different healthy or diseased plants.

Having a single label per image can be inaccurate. In the real world, one plant or one leaf can carry multiple diseases or contain multiple diseased regions. By detecting multiple targeted areas or disease classes, the multi-label setting can lead to improved efficiency and accuracy.

To deal with the possibility of having multiple diseases or multiple areas of diseases on one plant simultaneously, two types of methods are proposed. Reference [201] first segments out different infection areas on cucumber leaves using color thresholding following Reference [200], then applies DCNN on segmented areas to classify four types of cucumber diseases. Nevertheless, the color thresholding technique may not generalize to other plant species and environment. Another type of method leverages object detection or segmentation for locating and classifying infection areas. Reference [254] locates multiple diseased regions of banana plants simultaneously using object detection but assigns only one disease label to each image. Reference [103] compared Faster R-CNN, R-FCN, and SSD for detecting nine classes of diseases and pests that affect tomato plants. Multiple diseases and pests in one plant are detected simultaneously. Reference [349] applies improved DeepLab v3+ for segmentation of multiple black rot spots on grape leaves. The efficient channel attention mechanism [315] is added to the backbone of DeepLab v3+ for capturing local cross-channel interaction. Feature pyramid network and Atrous Spatial Pyramid Pooling [64] are utilized for fusing feature maps from the backbone network at different scales to improve segmentation.

3.5.2 Handling Unbalanced Class Distributions.

A common obstacle encountered in disease detection is unbalanced disease class distributions. There are typically much fewer diseased plants than healthy plants; the unequal frequencies introduce difficulties in finding images of rare diseases; the data unbalance leads to difficulty for model training. To remedy such problem, researchers propose weakly supervised learning [44], generative adversarial network (GAN) [116], and few-shot learning [182, 216].

Specifically, Reference [44] applies multiple instance learning (MIL), a type of weakly supervised learning method, for multi-class classification of six mite species of citrus. In MIL, the learner receives a set of labeled bags, containing multiple image instances. We know that at least one instance is associated with the class label but do not know the exact instance. The MIL algorithm tries to identify the common characteristic shared by images in the positively labeled bags. In this work, a CNN is first trained with labeled bags. Next, by calculating saliency maps of images in bags, the model identifies salient patches that have a high probability of containing mites. These patches inherit labels from their bags and are used to refine the CNN trained above.

Reference [116] leverages GAN to generate realistic image patches of tip-burn lettuce and trains U-net for tip-burn segmentation. For the generation stage, lettuce canopy image patches are inputted into Wasserstein GANs [26] to generate stressed (tip-burned) patches so there are an equal number of stressed and healthy patches. Then, in the segmentation stage, the authors generate a binary label map for the images using a classifier and an edge map. The binary label map labels each mini-patches (super-pixels) as stressed or healthy. The authors then feed the label map, alongside the original images, as input to U-net for mask segmentation.

In few-shot meta-learning, we are given a meta-train set and a meta-test set, with the two sets containing mutually exclusive image classes (i.e., classes in the training set do not appear in the testing set). Meta-train or meta-test sets contain a number of episodes, each of which consists of some training (supporting) images and some test (query) images. The rationale of meta-learning is to equip the model with the ability to quickly learn to classify the test images from a small number of training images within each episode. The model acquires this meta-learning capability on the meta-train set and is evaluated on the meta-test set.

As an example, Reference [216] performs pests and diseases classification with few-shot meta-learning. The model framework consists of an embedding module and a distance module. The embedding module first projects supporting images into an embedding space using ResNet-18, then feeds embedding vectors into a transformer to incorporate information of other support samples in the same episode. After that, the distance module calculates the Mahalanobis distance [104] of the query and support samples to classify the query. Similarly, Reference [182] uses a shallow CNN for embedding and the Euclidean distance for calculating the similarity between the embeddings of the query and support samples.

3.5.3 Label Noise and Uncertainty Estimates.

Reference [263] is another example of meta-learning, but it is used to improve the network’s robustness against label noise. The model consists of two phrases. The first phrase is the conventional training of a CNN for classification. In the second phrase, the authors generate 10 synthetic mini batches of images, containing real images with the labels taken from similar images. As a result, these mini-batches could contain noisy labels. After one step update on the synthetic instances, the network is trained to output similar predictions with the CNN from the first phrase. The result is a model that is not easily affected by noisy training data.

Finally, having a confidence score associated with the model prediction allows farmers to make decisions selectively under different confidence levels and boost the acceptance of deep learning models in agriculture. As an example, Reference [99] performs classification of tomato diseases and pairs the prediction with a confidence score following Reference [79]. The confidence score, calculated using Bayes’ rule, is defined as the probability of the true class label conditioned on the class probability predicted by the CNN. In addition, the authors build an ontology of disease classification. For example, the parent node “stressed plant” has as children “bacteria infection” and “virus infection,” which in turn has “mosaic virus” as a child. If the confidence score of a specific terminal disease label is below a certain threshold, then the model switches to its more general parent label in the tree for higher confidence. By the axiom of probability, the predicted probability of the parent label is the summation of all the predicted probability of its direct descendants. For a general discussion of machine learning techniques that create well-calibrated uncertainty estimates, we refer readers to Section 2.4.

4 Datasets

High-quality datasets with human annotations are one of the most important factors in the success of a machine learning project [205, 325]. In this section, we review established datasets that enable training of CV models. We exclude datasets of plants for which we have not found literature regarding their suitability in CEA, such as apples [41, 126], broccoli [166], and dates [21]. We have manually checked every dataset listed and assure that they are available for downloading at the time of writing. By summarizing the dataset related to CEA, we aim to facilitate interested researchers on their future studies. In the meantime, we would like to encourage scholars to publish more datasets dedicated to CEA.

As listed in Tables 7 and 8, we discover 14 datasets in CEA, with 3 for Growth Monitoring, 5 for Fruit Detection, and 6 for Pest and Disease Detection. Each targeted task contains at least one dataset that covers multiple species to facilitate training of generalizable and transferable models. The largest dataset is CVPPP with 6,287 and 165,120 RGB images for Arabidopsis and Tobacco, respectively, aiming for growth monitoring-related tasks. All the available datasets are composed of real images. While real images provide realistic data, we also want to encourage publication of synthetic datasets, which usually feature balanced class distribution and accurate labeling. Another point noteworthy is that many real images are collected under simplified laboratory environments, which may bias the data toward specific lighting conditions, backgrounds, plant orientation, or camera positions. For real-world application, practitioners may need to further fine-tune the trained models on more realistic data.

Table 7.

Target Task	Dataset	Release Year	Data Description	URL
Growth Monitoring	CVPPP dataset [204]	2014	6,287 and 165,120 RGB images (resolution 72 $\times$ 72) of Arabidopsis and Tobacco, respectively. Annotations include bounding boxes and segmentation masks for every plant and every leaf, and the leaf centers.	https://www.plant-phenotyping.org/datasets-download
	Oil Radish dataset [164]	2019	129 RGB images (resolution 1 $\times$ 1) of oil radish with binary semantic segmentation mask and respective plant fresh and dry weight, as well as nutrient content.	https://competitions.codalab.org/competitions/20981#learn_the_details
	Leaf Counting dataset [295]	2018	9,372 RGB images (resolution 72 $\times$ 72) of weeds with the number of leaves counted.	https://vision.eng.au.dk/leaf-counting-dataset/
Fruit and Flower Detection	DeepFruits [249]	2016	RGB images (resolution 72 $\times$ 72 to 400 $\times$ 400) of sweet pepper, rock melon, apple, mango, orange, and strawberry images annotated with rectangular bounding boxes. Each fruit has 42–170 images.	https://drive.google.com/drive/folders/1CmsZb1caggLRN7ANfika8WuPiywo4mBb
	Orchard Fruit [30]	2016	1,120, 1,964, and 620 RGB images (resolution 72 $\times$ 72) of apple, mango, and almond, respectively. Apples are annotated with bounding circles; mango and almond are annotated with rectangular bounding boxes.	http://data.acfr.usyd.edu.au/ag/treecrops/2016-multifruit/
	MangoYOLO [157]	2019	1,730 RGB images of mango (resolution 72 $\times$ 72 and 300 $\times$ 300), annotated with rectangular bounding boxes; photos are under artificial lighting.	https://figshare.com/articles/dataset/MangoYOLO_data_set/13450661
	MangoNet Semantic Dataset[154]	2019	45 training images and 4 test images (resolution 180 $\times$ 180) of mango. Each image is annotated with semantic segmentation mask that is colored green in regions of mangoes and black in non-mango regions.	https://github.com/avadesh02/MangoNet-Semantic-Dataset
	Fruit Flower detection [86]	2018	162, 20, and 15 images (resolution 72 $\times$ 72) of apple, peach, and pear flowers annotated with binary semantic segmentation mask with white representing flower pixels.	https://data.nal.usda.gov/dataset/data-multi-species-fruit-flower-detection-using-refined-semantic-segmentation-network

Table 7. Dataset for CV tasks in CEA

Table 8.

Target Task	Dataset	Release Year	Data Description	URL
Pest and Disease Detection	Plant Village [143]	2019	61,486 RGB images (resolution 72 $\times$ 72) of plant leaves, with 39 different classes of diseased and healthy plant leaves.	https://data.mendeley.com/datasets/tywbtsjrjv/1
	Crop Pests Recognition [181]	2020	5,629 RGB images (resolution 72 $\times$ 72) of 10 pest classes, each class containing over 400 images.	https://bit.ly/2DdUFza
	Plant and Pest [182]	2021	6,000 RGB images (resolution 72 $\times$ 72) of 20 different classes of plant leaves and pests from Plant Village [143] and Crop Pests Recognition [181].	https://zenodo.org/record/4529076#.YupE_-xBzlw
	Citrus Pest Benchmark [44]	2022	10,816 multi-class RGB images (resolution 1,200 $\times$ 1,200) categorized into seven classes of pests.	https://github.com/edsonbollis/Citrus-Pest-Benchmark
	IP102 [330]	2019	75,000 images (resolution 400 $\times$ 300) of 102 insect classes and among these 19,000 are annotated with bounding boxes.	https://github.com/xpwu95/IP102
	Plant Leaves [267]	2022	4,503 images (resolution 6,000 $\times$ 4,000), which includes 2,278 images of healthy leaves and 2,225 images of the diseased leaves.	https://data.mendeley.com/datasets/hb74ynkjcn/1

Table 8. Dataset for CV Tasks in CEA

5 Future Research Directions

So far, we have discussed the objectives, benefits, and realizations of Growth Monitoring, Fruit and Flower Detection, Fruit Counting, Maturity Level Classification, and Pest and Disease Detection in CEA precision farming. Based on the current research status and existing technical capabilities of computer vision, we would like to point out several areas where computer vision technologies could provide short- to mid-term benefits to urban and suburban CEA. We identify three such areas, including realistic datasets that are unbalanced and noisy, uncertainty quantification, and multi-task learning/system integration.

5.1 Handling Realistic Data

The ability to handle realistic data is a critical competence that has not received sufficient research attention (with a few notable exceptions [44, 116, 182, 216, 263]). Unlike well-curated datasets that have accurate and abundant labels and relatively balanced label distributions, real-world data exhibit skewed label distribution as well as substantial noise in the labels. For effective real-world application, it is important that the CV algorithms can maintain good predictive performance under these conditions. In addition, the algorithmic tolerance of data imperfection can lower annotation cost and enable wider applications of CV. There has been substantial research on these topics in the computer vision community, such as long-tail recognition [81, 190, 261, 313, 369, 370], few-shot and zero-shot learning [177, 275, 276, 277, 334], as well as noise-resistant classification [17, 70, 150, 321, 368] and metric learning [145, 188, 310]. We believe that research on smart agriculture could benefit from the existing body of literature.

5.2 Quantifying Uncertainty and Interpretability

Real-world applications call for reliable estimation of the quality of automated decisions. An incorrect prediction made by an AI system may have profound implications. For example, if the system incorrectly determines that fruits are not mature enough, then it may delay harvesting and cause overripe fruits with diminished values. However, it is impossible to eliminate incorrect or uncertain predictions, as they originate from factors difficult to control and precisely measure, including model assumptions, test data shift, incomplete training data, and so on [11, 144]. Thus, we argue that uncertainty quantification is another crucial factor for real-world deployment. Such quantification would allow farmers to make informed decisions on whether to follow the machine recommendation or not. For the convenience of readers, we provide a brief review of such deep learning techniques in Section 2.4.

Besides uncertainty quantification, pairing the model with explanation on its decisions could enhance user confidence and assist auditing and debugging of the AI system. Specifically, instance attribution methods, as discussed in Section 2.5, enable detection of the biased or low-quality data points with extreme influence on prediction [67]. For example, if the model is trained with an image of dry leaves with dust that resembles a certain disease of the plant, then in the inference process, the model might misclassify diseased leaves as normal dry leaves or vice versa and induce plant death or unnecessary treatments. With instance attribution interpretation, researchers can identify misleading data points and perform adversarial training to improve model accuracy.

5.3 Multi-task Learning and System Integration

Real-world deployment usually requires the coordination of multiple CV capabilities provided by different networks. When the system is designed well, these networks could facilitate each other and achieve synergistic effects. For example, instance segmentation can be used for fruit and flower localization (Section 3.2), growth monitoring (Section 3.1), and fruit maturity level detection (Section 3.4). However, academic research tends to study these problems in isolation, thereby unable to reap benefits of multi-task learning.

Multi-task learning [29, 51, 191] focuses on leveraging mutually beneficial supervisory signals from multiple correlated tasks. Recently, CV researchers have built large-scale networks [66, 71, 120, 149, 152, 197, 314, 374] that perform a wide range of tasks and achieve state-of-the-art results on most tasks. This demonstrates the benefits of multi-task learning and could inspire similar work dedicated to smart farming in CEAs.

Another motivation for considering multi-task learning and system integration is that errors can propagate in a pipeline architecture. For example, a network could first incorrectly detect a leaf occluding a mature fruit as the fruit and then classify it as an immature fruit. As a result, simply concatenating multiple techniques will result in inferior overall performance than what practitioners may expect. Thus, we encourage system designers to consider end-to-end training or other innovative techniques [119, 332, 339] for aligning and interfacing different components within a system.

Finally, multi-task learning handles multiple tasks simultaneously, which saves computation power, enhances data efficiency, and alleviates the necessity to maintain and iterate multiple models. Such benefits are crucial for popularizing CEAs, as they facilitate the efficient use of energy, computation power, and human resources. Consequently, both the initial setup and ongoing maintenance investments for CEA farms can be reduced, expediting the emergence of economically viable CEAs. Furthermore, mindful selection and combination of targeted tasks have the potential to further improve overall efficiency [280].

5.4 Effective Use of Multimodality

Fusion of multi-modal data enhances inference ability of models by incorporating complementary view of data [168]. In the context of CEA, thermal or depth images capture the depth or temperature differences between foreground and background and enable filtering of non-target objects (e.g., fruits or leaves). Abnormal temperature changes during growth cycle can also indicate disease infection before visual symptoms appear [53, 54]. Furthermore, as different materials absorb, reflect, and transmit light in different ways and at different wavelengths, multi-spectral imaging (MSI) and hyper-spectral imaging (HSI), which capture images at multiple wavelengths of light, can be used to perform more specific internal inspection of leaves, fruits, and plants as compared to thermal and depth images. Finally, LiDAR and RGB-D systems allow the generation of high-density 3D point clouds of plants, fruits [107, 185], or environment [309], which facilitate 3D volume measurement or cut-point detection during harvesting.

Existing works have demonstrated the efficacy of MSI and HSI [13, 42, 311]. MSI has been utilized for yield prediction [301] and early disease detection [223, 307]. However, current literature explored majorly the power of MSI with shallow machine learning. We found only one work that leverages deep learning on MSI input [301], which applies a pruned VGG-16 for wheat yield estimation. HSI provides finer-grained resolution and divides the range of wavelength into many more spectral bands than MSI, typically ranging from tens to hundreds of bands, though at a higher cost. Hyper-spectral images have been used as the sole modality in early disease detection with both shallow machine learning methods [18, 19, 288] and deep learning methods [98, 122, 214, 311]. Due to relevancy and space limit, we will only talk about the deep learning methods here. Specifically, with a GAN-based data augmentation method, Reference [311] performs early detection of tomato spotted wilt virus before visible symptoms appear using hyper-spectral images. Reference [214] performs early detection of grapevine vein-clearing virus and shows the discriminative power of HSI in combination with CNN and shallow machine learning algorithms. Reference [98] attains early barley disease detection through generating future prediction of hyper-spectral barley leaf images using GAN. Moreover, HSI has also been utilized for yield prediction through fruit counting. Reference [122] leverages CNN and HSI to segment semantic mango masks and count the number of fruits.

However, systematic exploration of fusion techniques for multimodal inputs remains relatively rare in CEA applications. Many existing approaches adopt pipeline-based multimodal integration techniques that do not exhaust the potential of deep learning due to the lack of end-to-end training. For example, in Reference [13], the authors set a depth threshold to filter false positive tomato detection from the background. Reference [42] first performs broccoli segmentation on the RGB image. Within the segmentation mask, the authors find the mode of the depth value distribution, which is used to calculate the diameter of the broccoli head. Reference [185] conducts semantic segmentation for guava fruits using RGB images and reconstructs their 3D positions from the depth input. Reference [107] utilizes Mask R-CNN to perform instance segmentation of strawberries and align depth image with the segmentation mask to obtain 3D shape of strawberries. These methods use the two modalities separately and do not apply end-to-end training of the pipeline. As exceptions, Reference [249] proposes late fusion of RGB and near-infrared images in sweet pepper detection. Reference [308] incorporates depth information by replacing the blue channel with depth channel and applies masked R-CNN to locate tomatoes.

In computer vision research, numerous techniques for fusing and joint utilization of multimodal information have been proposed over the years, which we believe could contribute to CV applications in CEA. Due to space limits, we list only a few examples here. Reference [260] proposes two different ways to combine multiple modalities in object detection, Concatenation and Element-wise Cross Product. The former combines feature maps from different modalities along the channel dimension and lets the network discover the best way to combine them from data. The latter technique, Element-wise Cross Product, applies element-wise multiplication to every possible pair of feature maps from the two modalities. Reference [50] experiments with a variety of fusion techniques for RGB and optical flow and discovers a high-performing late-fusion strategy in action recognition. In self-supervised learning, Reference [125] identifies similar data points using one modality and treats them as positive pairs in another modality. This technique provides another paradigm to leverage the complementary nature of multimodality.

6 Conclusions

Smart agriculture, and particularly computer vision for controlled-environment agriculture (CV4CEA), are rapidly emerging as an interdisciplinary area of research that could potentially lead to enormous economic, environmental, and social benefits. In this survey, we first provide brief overviews of existing CV technologies that range from image recognition to structured understanding such as segmentation; from uncertain quantification to interpretable machine learning. Next, we systematically review existing applications of CV4CEA, including growth monitoring, fruit and flower detection, fruit counting, maturity-level classification, and pest/disease detection. Finally, we highlight a few research directions that could generate high-impact research in the near future.

Like any interdisciplinary area, research progress in CV4CEA requires expertise in both computer vision and agriculture. However, it could take a substantial amount of time for any researcher to acquire in-depth understanding of both subjects. By reviewing existing applications, available CV technologies, and identifying possible future research directions, we aim to provide a quick introduction of CV4CEA to researchers with expertise in agriculture or computer vision alone. It is our hope that the current survey will serve as a bridge between researchers from diverse backgrounds and contribute to accelerated innovation in the next decade.

References

[1]

Singapore Food Agency. 2023. 30 by 30. Retrieved 15 August 2023 from https://www.ourfoodfuture.gov.sg/30by30/

Category	Technique	Evaluation Metric	Performance	Dataset
Single- and Multi-label Classification	[361]	Accuracy\(^{*}\)	94.65%	700 diseased and normal leaf images
	[272]	Accuracy\(^{*}\)	97.13%	1,070 self-acquired leaf images, and 1,130 images from the Plant Village dataset [143]
	[24]	Accuracy\(^{*}\)	93.33%	Images of 643 leaf samples
	[254]	mAP\(^{*}\)	72.8%–97.9%	12,600 images of bananas
	[95]	Accuracy\(^{*}\)	99.50%	87,848 images of leaves
	[201]	Accuracy\(^{*}\)	93.40%	Plant Village dataset [143]
	[103]	mAP (IoU > 0.5)	86%	5,000 images of diseases and pests of tomatoes
	[349]	mIOU, recall, and F1-score (IoU unspecified)	84.8%, 88.1%, 91.8%	Plant Village dataset [143]
Handling Unbalanced Class Distribution	[44]	Accuracy\(^{*}\)	60.7%–91.8%	IP102 [330], Citrus Pest Benchmark [44]
	[116]	Average Precision	67%–85%	Plant Village dataset [143] and Plant Leaves [267]
	[216]	Accuracy\(^{*}\)	88.5%–95.5%	Plant Village dataset [143] and Plant and Pest [182]
	[182]	Accuracy\(^{*}\)	43.9%–81%	Plant Village [143], Crop Pests Recognition [181]
Noise and Uncertainty Estimate	[263]	Accuracy\(^{*}\)	94.58%	Plant Village [143]
Noise and Uncertainty Estimate	[99]	Accuracy\(^{*}\)	-	15,892 images of tomatoes from Plant Village [143], extra 8,911 images of corn, 6,635 images of soybeans

Abstract

1 Introduction

2 Computer Vision Capabilities Relevant to Smart Agriculture

2.1 Image Recognition

2.2 Object Detection

2.3 Semantic, Instance, and Panoptic Segmentation

2.4 Uncertainty Quantification

2.5 Interpretability

3 Controlled-environment Agriculture

3.1 Growth Monitoring

3.1.1 Leaf Instance Segmentation.

3.1.2 Leaf Count and Other Growth Metrics.

3.2 Fruit and Flower Detection

3.3 Fruit Counting

3.4 Maturity-level Classification

3.5 Pest and Disease Detection

3.5.1 Single- and Multi-label Classification.

3.5.2 Handling Unbalanced Class Distributions.

3.5.3 Label Noise and Uncertainty Estimates.

4 Datasets

5 Future Research Directions

5.1 Handling Realistic Data

5.2 Quantifying Uncertainty and Interpretability

5.3 Multi-task Learning and System Integration

5.4 Effective Use of Multimodality

6 Conclusions

References

Cited By

Index Terms

Recommendations

LiDAR applications in precision agriculture for cultivating crops: A review of recent advances

Climate change impacts on crop yields: A review of empirical findings, statistical crop models, and machine learning methods

Agriculture 4.0 and beyond: Evaluating cyber threat intelligence sources and techniques in smart farming ecosystems

Comments

Information

Published In

Publisher

Publication History

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations