1. Introduction
Conservation and monitoring of forest ecosystems are highly important for economic, ecological, and recreational reasons. As a major site of carbon sequestration, forests play a crucial role in mitigating global climate change [1]. They are an important habitat for biodiversity [2] and serve many functions for humans, such as recreation, hunting, timber production, material and energy utilization, tourism, education, and science, as well as soil, water, and air protection. Worldwide, forest ecosystems are endangered. In Germany, for example, the Federal Forest Inventory of 2011/2012 found that approximately one-third of the country was covered by forests [3]. Since 2018, there has been a significant decrease in German forest area due to climate change and insect infestations such as the bark beetle: from January 2018 to April 2021, a loss of 5% of all forest area was detected using satellite data [4]. The federal forest condition report 2023 stated that, owing to the drought period between 2018 and 2022, four out of five trees in Germany show visible needle or leaf loss [5]. Thus, monitoring, conservation, and sustainable management of forests are critical issues that need to be addressed [4]. To obtain an overview of the condition of these forests, forest inventories are carried out throughout the entire German territory, as in other countries. These inventories gather different types of information, such as tree species, age, and diameter at breast height. The fourth forest inventory in Germany was launched in 2021 and will take more than three years from data collection to publication of the results [6].
Consequently, the development of automated airborne data acquisition is the logical next step, as satellite data do not yet provide the required level of detail. The lack of data between 2012 and 2021 on the one hand, and the limits of their spatial reliability on the other, underline the need for up-to-date, highly accurate data. The ForestCare project, funded by the German Federal Ministry of Education and Research under the program Digital Green Tech, is one example of research aiming at detecting and classifying individual tree features, such as tree species, defoliation, stem form, or diseases, using machine learning techniques on drone or satellite imagery. Its goals were the evaluation of forest growth and the optimization of reforestation measures in response to stress factors such as the bark beetle. The project addressed three main aspects: first, the aerial imaging of trees using drones and mobile data collection of the features on the ground; second, the use of these training data to develop a model for tree segmentation, which is the focus of this paper; and finally, based on this, the development of a model that can classify the features [7].
The necessity for fast and precise tree crown segmentation for forest monitoring arises from the constant change in crown geometries in a forest stand. Thinning, calamities, and even annual growth alter the canopy structure recorded in the preceding year; segmentation shapes therefore cannot be reused in subsequent years. For reliable forest monitoring on a single-tree basis, an annual segmentation of large forest stands is necessary, and many research groups have undertaken large efforts in recent decades to automate segmentation: In the 20th century, aerial imagery was interpreted manually for tasks such as detecting diseased trees or assessing tree heights [8,9,10]. Initial attempts to automate tree detection were made in the 1990s [11,12]. During the 2000s, approaches from the airborne Light Detection and Ranging (LiDAR) domain dominated tree segmentation [13,14,15]. Currently, in the field of LiDAR segmentation, Deep Learning methods are state of the art [16,17,18]. The emergence of deep neural networks has also allowed for the segmentation of individual trees in aerial images. Such models are applied to data from Unmanned Aerial Vehicles (UAVs) [19,20,21], as well as to high-resolution multispectral satellite imagery covering larger geographic areas [22,23,24].
Tree segmentation studies can generally be grouped into two categories: (1) studies using three-dimensional point cloud data in combination with orthophotos and (2) studies using only two-dimensional imagery. Furthermore, segmentation studies investigate forests of different complexity, increasing from agglomerations of solitary trees of one or a few species (e.g., parks, plantations) to dense multispecies deciduous forests with heterogeneous tree heights and a closed canopy (e.g., tropical deciduous forests). Coniferous stands can be segmented with higher accuracy than deciduous or mixed forests because of the similar crown geometry of the trees, which also produces more contrasted margins between single crowns than in deciduous trees. The ground sampling distance (GSD) of the analyzed data is another important factor determining segmentation accuracy. GSDs below 1 cm improve segmentation accuracy, but are time consuming with respect to data acquisition (drone flight time) and orthophoto generation. In contrast, satellite data (e.g., Sentinel-2 with 10 m GSD in RGB-NIR) do not allow for single-tree segmentation because of the comparably poor image resolution. A convenient GSD range used in segmentation studies is 2–10 cm, though commercial satellite data with, e.g., 30 cm GSD are used as well. Due to the differences in forest complexity and GSD, segmentation quality varies widely within and between these studies. An Intersection over Union (IoU) of, e.g., 0.6 may be considered a sufficient segmentation result, yet a tree health assessment based on this segmentation is affected by an error of up to 40%. When this 40% includes, e.g., soil pixels or dry grasses on the forest floor, it will lead to false positive identifications when searching for trees with low vitality. The effect of a neighbouring tree depends on the vitality difference between the two trees. Because of these implications, the IoU of the segmentation has to be optimized to obtain reliable results on tree properties.
The following paragraph gives an overview of nine recent studies that investigated AI-based single-tree segmentation, in order of increasing complexity. Where several forest types were present in one study, only the one with the highest complexity is listed.
Three-dimensional segmentation can be considered easier than two-dimensional segmentation, as the height information allows tree tops to be processed as local maxima and tree margins as local minima. However, it requires collecting and processing a larger amount of data. A study based on LiDAR data segmentation was performed by Zaforemska et al. in 2019 with a GSD of approximately 14 cm [25]. The authors applied five different segmentation algorithms and achieved an F1-Score of 0.87 for non-optimized generalized parameters with a continuously adaptive mean shift algorithm on data from a mixed-species forest (sycamore and oak). The data acquisition was performed after the vegetation period, which decreased the complexity of the tree crown structures due to leaf loss and the resulting visibility of the organized branch geometries. Wang et al. investigated an urban forest of low complexity and achieved an F1-Score of 0.9 with a watershed-based algorithm at a GSD of 1.7 cm [26]. Liu et al. used three-dimensional data to segment more than five different tree species in a forest stand of comparably low complexity [27]. They achieved an F1-Score of 0.9 with a PointNet++ algorithm and, on average, 8.5 cm GSD for LiDAR and spectral data. Qin et al. segmented trees in a subtropical broadleaf forest with a watershed-spectral-texture-controlled normalized cut (WST-Ncut) algorithm [28]. They achieved a maximum F1-Score of 0.93 for data with a GSD between 10 and 30 cm. Two-dimensional segmentation is more complex, as the information of local minima and maxima cannot be processed. Some studies use effects like sharpening or brightness adjustment to highlight illuminated tree tops and shadowed crown margins, thereby creating a surrogate for three-dimensional information. Yu et al. [29] achieved an F1-Score of 94% for RGB data analysis of a plantation with low complexity and a downsampled GSD of 2 cm, but they did not state the IoU except that it was larger than 50%, which is insufficient for a reliable analysis of tree features. Lv et al. used RGB spectral data with a GSD of 10 cm from an urban forest of medium complexity [30]. The application of an MCAN algorithm, a modification of the Mask R-CNN algorithm, yielded an F1-Score of 88%, while the authors do not state concise IoU values. Safonova et al. developed a three-step tree crown delineation, which they applied to multispectral forest images with a GSD of 3.7 cm [31]. The investigated Bulgarian forest sites were not dense and of intermediate complexity. The IoU was very high for mixed forest, at approximately 95%, while the F1-Score was not calculated. Dersch et al. investigated segmentation with, among others, CIR sensors in three forest types (coniferous, deciduous, mixed) with a GSD of 5.5 cm [32]. The best results for the mixed stands were achieved with the single CIR sensor and Mask R-CNN (83% F1-Score, 77% IoU). Ball et al. investigated segmentation in a tropical forest of very high complexity. They used RGB data with a GSD of 8–10 cm and applied the detectree2 algorithm, which is based on Mask R-CNN. The F1-Score was 0.6 for trees in the understory and 0.74 for dominating trees, while a prediction was accepted when its IoU was larger than 0.62 [33].
The present study focuses on individual two-dimensional tree segmentation from UAV multispectral imagery and on providing an automated workflow for others to use and/or improve on new data. The main goal is to compare the performance of different algorithms on RGB imagery as well as on different vegetation indices (VIs). Additionally, we evaluate the impact of image enhancement on segmentation performance and discuss the impact of label quality on accuracy metrics.
2. Materials and Methods
The overall workflow and data used in this study are shown in Figure 1. Data and processing are described in detail in the following subsections.
2.1. Study Area and Data
The study area is located in the Black Forest region east of Freiburg im Breisgau (for details, refer to Figure 2). The forests are dominated by Picea abies (European spruce), with sporadic occurrences of Abies alba (European silver fir), Larix decidua (European larch), Acer pseudoplatanus (sycamore maple), Acer platanoides (Norway maple), Fagus sylvatica (European beech), Fraxinus excelsior (European ash), and Sorbus aucuparia (mountain ash). The trees show different health conditions, which can affect their crown shape, e.g., due to needle loss. The multispectral UAV data were collected using a DJI Matrice 300 RTK drone system (DJI, Shenzhen, China) with the RedEdge-MX™ Dual Camera sensor (AgEagle, Neodesha, KS, USA). The resulting raster data, with a resolution of approximately 1.3 to 2.3 cm per pixel, consist of five bands: blue (475 nm), green (560 nm), red (668 nm), near-infrared (840 nm), and red edge (717 nm). The segmentation labels were created manually at a very high level of detail based on the multispectral images in a GIS (ArcGIS Pro 2.8, Environmental Systems Research Institute (ESRI), Redlands, CA, USA). In total, there are 1096 labeled trees in the study area (see Table 1 for details about tree species).
2.2. Preprocessing for Deep Learning
Preprocessing of the data included creating tiles and masks for Deep Learning as well as the calculation of vegetation indices: (1) We converted the shapefiles into binary masks with the extent of their corresponding multispectral images. For Faster R-CNN, Mask R-CNN, and TensorMask, the raster images were divided into 1200 × 1200 pixel tiles with a 4% overlap between tiles. Additionally, the dataset was transformed into the COCO-JSON format [34].
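As an illustration of the tiling in step (1), the following minimal sketch cuts an orthomosaic into 1200 × 1200 pixel tiles with 4% overlap using rasterio; the input path is hypothetical, and details such as edge handling and the mask/COCO-JSON export are omitted.

```python
# Minimal tiling sketch (hypothetical paths; edge tiles and the corresponding
# binary-mask/COCO-JSON export are omitted for brevity).
import os
import rasterio
from rasterio.windows import Window

TILE = 1200
STRIDE = int(TILE * 0.96)  # 4% overlap between neighbouring tiles

os.makedirs("tiles", exist_ok=True)
with rasterio.open("orthomosaic.tif") as src:
    profile = src.profile.copy()
    profile.update(width=TILE, height=TILE)
    for row in range(0, src.height - TILE + 1, STRIDE):
        for col in range(0, src.width - TILE + 1, STRIDE):
            window = Window(col, row, TILE, TILE)
            profile.update(transform=src.window_transform(window))
            with rasterio.open(f"tiles/tile_{row}_{col}.tif", "w", **profile) as dst:
                dst.write(src.read(window=window))
```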
(2) Subsequently, all data were randomly split into training and test datasets (approximately 10:1); validation data were automatically set apart by the Deep Learning frameworks. The training dataset contains approximately 800 trees over an area of approximately 4.3 hectares; the test dataset contains approximately 80 trees over a total area of approximately 0.4 hectares.
(3) In addition, the data were processed using three image-enhancement methods from OpenCV, an open-source library for image processing, to improve their quality: (A) denoising, which plays an important role in digital image processing because it removes unwanted noise and improves image quality [35,36]; (B) histogram stretching, which enhances image contrast by adjusting pixel values to cover the full brightness range, making image details more visible [37]; and (C) deblurring, which optimizes image sharpness [38]. We use this second, enhanced dataset to evaluate the impact of image enhancement on model performance.
(4) Additionally, different VIs were calculated to assess their importance for the segmentation task. VIs use combinations of spectral bands to indicate the health, density, and other important properties of vegetation [39,40]. A total of eleven indices were evaluated; they are listed in Table 2.
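As an example of step (4), the sketch below derives two representative indices, NDVI and NDRE, from the five-band raster; the band order is assumed to follow the listing in Section 2.1, and the remaining indices in Table 2 are computed analogously.

```python
# Example VI computation from the five-band orthomosaic; the band order is
# assumed to match Section 2.1 (blue, green, red, NIR, red edge).
import rasterio

with rasterio.open("orthomosaic.tif") as src:
    blue, green, red, nir, rededge = src.read().astype("float32")
    profile = src.profile.copy()

eps = 1e-6  # guards against division by zero over no-data areas
ndvi = (nir - red) / (nir + red + eps)          # Normalized Difference Vegetation Index
ndre = (nir - rededge) / (nir + rededge + eps)  # Normalized Difference Red Edge index

profile.update(count=1, dtype="float32")
with rasterio.open("ndvi.tif", "w", **profile) as dst:
    dst.write(ndvi, 1)
```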
2.3. Experiments and Deep Learning Models
For our experiments, we used Faster R-CNN, Mask R-CNN, TensorMask, and SAM as instance-segmentation models. A brief description of each model's characteristics is given in the following. Table 3 lists all conducted experiments, and Table A1 and Table A2 in Appendix A show the optimal hyperparameter settings for the respective models found during fine-tuning.
For Faster R-CNN, Mask R-CNN, TensorMask, and SAM, certain hyperparameters, such as the Bias Learning Rate Factor, have a significant impact on performance metrics like IoU and F1-Score; the best values were chosen for the final models. Other important hyperparameters include the Learning Rate, Warm-up Factor, Number of Decays, and Weight Decay. Too many Maximal Iterations led to overfitting, while a smaller Batch Size and a higher Learning Rate yielded better results for our specific dataset.
As TensorMask is not an R-CNN, somewhat different hyperparameters were tested. The best results were achieved by tuning parameters such as the Number of Decays, the Learning Rate Gamma, and the Mask Loss Weight, which are crucial for finding the global minimum and influence the quality of the mask predictions. Varying the Number of Convolutions showed that increased complexity leads to overfitting, while good regularization at higher values improved performance. Other important hyperparameters include Maximum Iterations, Batch Size, Learning Rate, Warm-up Factor, and Momentum.
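Assuming a Detectron2-based implementation (TensorMask is distributed as a Detectron2 project), the hyperparameters named above map onto solver config keys as in the following sketch; the values shown are illustrative placeholders, not the tuned optima reported in Tables A1 and A2.

```python
# Sketch of a Detectron2-style training configuration; values are
# placeholders, not the tuned optima from the Appendix tables.
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.SOLVER.BASE_LR = 1e-3          # Learning Rate
cfg.SOLVER.BIAS_LR_FACTOR = 2.0    # Bias Learning Rate Factor
cfg.SOLVER.WARMUP_FACTOR = 1e-3    # Warm-up Factor
cfg.SOLVER.WEIGHT_DECAY = 1e-4     # Weight Decay
cfg.SOLVER.MOMENTUM = 0.9          # Momentum
cfg.SOLVER.MAX_ITER = 5000         # Maximal Iterations (too many -> overfitting)
cfg.SOLVER.IMS_PER_BATCH = 2       # Batch Size
cfg.SOLVER.GAMMA = 0.1             # Learning Rate Gamma
cfg.SOLVER.STEPS = (3000, 4000)    # decay points ("Number of Decays")
```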
Faster R-CNN, short for Faster Region-Based Convolutional Neural Network, is a two-stage model developed for object detection. A Feature Pyramid Network (FPN) is utilized to extract features at various scales. The core element of the architecture is the Region Proposal Network (RPN), which suggests regions likely to contain objects based on the extracted features. These regions are standardized in size using the RoI (Regions of Interest) pooling layer and then processed within an R-CNN. The results are further refined by a classification and object detection head. For details, refer to [52,53]. We use ResNet-101 as the backbone in our experiments.
The structure of Mask R-CNN closely resembles that of Faster R-CNN, with ResNet-50 serving as the backbone network in combination with an FPN. From the extracted features, RoIs are proposed via the RPN. To standardize the sizes of these regions, Mask R-CNN employs an RoIAlign layer. Next, these regions are refined in an R-CNN module. A key difference from the original Faster R-CNN is that Mask R-CNN additionally refines its results with an instance-segmentation head [52,54].
TensorMask employs a dense sliding-window method for feature extraction. In this approach, a window slides over the input image, and at each step a convolutional backbone is executed to extract features and generate mask proposals. These mask proposals are represented as 4D tensors, whose dimensions correspond to the height, width, number of class channels, and pixel values of the mask. Similar to Mask R-CNN, TensorMask utilizes ResNet-50 as the backbone in combination with an FPN. The mask proposals are then processed in three branches: one for classification, one for object detection through bounding boxes, and a third for predicting segmentation masks [52,55].
SAM (Segment Anything Model) was pre-trained on 11 million images and 1.1 billion masks, using techniques from both computer vision and natural language processing (NLP). SAM utilizes an image encoder to generate an image embedding for each input image, representing its visual features in a numerical vector format. Subsequently, a prompt encoder is employed to encode various types of prompts, such as points, masks, bounding boxes, or text, into embedding vectors. Finally, a mask decoder uses the embedding vectors from the image encoder and the prompt encoder to generate accurate segmentation masks [56,57].
In our first set of experiments, we conduct extensive hyperparameter tuning using grid search (see Table A1 and Table A2 for the best settings) for all architectures and compare model performance on RGB data (compare Table 3).
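For SAM, automatic mask generation could look like the following sketch, which uses the publicly released segment_anything package and ViT-H checkpoint; the checkpoint file and the absence of task-specific prompts are assumptions, and the prompting and post-processing actually used in our experiments may differ.

```python
# Automatic mask generation with the released segment_anything package;
# checkpoint file name and lack of task-specific prompts are assumptions.
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("tiles/tile_0_0.tif"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # dicts with 'segmentation', 'area', 'bbox', ...
crown_masks = [m["segmentation"] for m in masks]
```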
Model performance was assessed using the F1-Score and IoU on our test dataset. The F1-Score is the harmonic mean of Precision and Recall (Equation (1)). Precision is the ratio of True Positives (TPs) to the total number of positive predictions, i.e., True Positives plus False Positives (FPs) (Equation (2)). Recall, on the other hand, is the ratio of True Positives to the total number of actual positive instances, i.e., True Positives plus False Negatives (FNs) (Equation (3)) [58]:

\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (1)

\text{Precision} = \frac{TP}{TP + FP} \quad (2)

\text{Recall} = \frac{TP}{TP + FN} \quad (3)

The Intersection over Union (IoU) is defined as the ratio between the area of overlap and the area of union of the ground truth (A) and the bounding boxes or segmentation masks predicted by the model (B) (Equation (4)) [59]:

\text{IoU} = \frac{|A \cap B|}{|A \cup B|} \quad (4)
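On binary masks, both metrics can be computed as in the following minimal sketch; the matching of predicted to ground-truth instances, which determines the TP, FP, and FN counts, is omitted here for brevity.

```python
# Both metrics on binary masks; instance matching (which yields the
# TP/FP/FN counts) is omitted for brevity.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over Union of two boolean masks (Equation (4))."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1-Score from instance counts (Equations (1)-(3))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```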
2.4. TreeSeg Toolbox
Based on the optimal results obtained, a tool was developed that automatically performs instance segmentation of individual trees in multispectral images. The output of the tool is a shapefile containing individual tree crowns. The toolbox was implemented in Python (3.9.11 for Faster R-CNN, Mask R-CNN, and TensorMask; 3.8.18 for SAM) and can be added to ArcGIS Pro (v3.1).
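As an illustration of the export step, the sketch below polygonizes a predicted binary crown mask into a shapefile using rasterio and geopandas; these libraries and the file paths are assumptions, as the toolbox itself may instead rely on ArcGIS (arcpy) functionality.

```python
# Polygonizing a predicted binary crown mask into a shapefile; rasterio and
# geopandas are assumed here, and the file paths are hypothetical.
import geopandas as gpd
import numpy as np
import rasterio
from rasterio import features
from shapely.geometry import shape

with rasterio.open("tiles/tile_0_0.tif") as src:  # provides georeferencing
    transform, crs = src.transform, src.crs

mask = np.load("predicted_mask.npy").astype("uint8")  # hypothetical model output
crowns = [shape(geom) for geom, value in
          features.shapes(mask, transform=transform) if value == 1]

gpd.GeoDataFrame(geometry=crowns, crs=crs).to_file("tree_crowns.shp")
```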