1. Introduction
Forest ecosystems hold significant environmental, economic, and cultural value for humanity [1]. However, the area of forest coverage continues to decline annually. Apart from commercial deforestation activities, one of the primary factors contributing to forest reduction is forest fires. Fires have become a leading cause of forest loss, with over 4 million hectares of land being burned each year, according to Global Forest Watch statistics [2]. Fires not only destroy forest vegetation and trees but also expose the forest soil to direct sunlight, accelerating erosion and the loss of topsoil. Burned forest areas are regions within forests or woodlands that have been damaged or destroyed by fire, resulting in severe degradation of the ecosystem and environment.
Ongoing issues such as global warming have exacerbated the destruction of forests through fire, diminishing tree coverage [3,4]. Consequently, researchers are increasingly recognizing the importance of forest conservation as a key measure to halt this vicious cycle. Post-fire restoration has thus become an integral part of forest protection and management. Monitoring the evolution of burned forest areas is crucial for understanding post-fire recovery, making this type of oversight an essential task. Over the past few decades, advances in sensor technology have made remote sensing an indispensable tool for monitoring forest ecosystems, due to its ability to rapidly capture extensive information over large areas. Remote sensing has been widely used in estimating burn severity [5], mapping affected regions [6], and monitoring vegetation recovery [7].
In the past, forest disaster conditions and evolutionary characteristics were assessed manually. However, with the advancement of remote sensing technology and the increasing recognition of forest ecosystems’ importance, relevant research began to emerge in the 1970s, with significant progress made over the past two decades.
Vegetation indices provide a simple and effective way to qualitatively and quantitatively evaluate vegetation cover and growth dynamics. Some of the earliest indices, such as the Ratio Vegetation Index (RVI) proposed by Jordan before the 1970s, allowed green biomass estimation but were significantly influenced by atmospheric effects when vegetation cover was low [8]. Subsequent developments included the Difference Vegetation Index (DVI) and the Perpendicular Vegetation Index (PVI), which estimate vegetation density using Euclidean distance [9]. The Normalized Difference Vegetation Index (NDVI), introduced by Rouse Jr. et al., remains the most widely used index due to its simple calculation and strong correlation with key metrics such as biomass and plant growth [10]. Visible-band indices, such as the Excess Green Index (EXG) and Normalized Green–Red Difference Index (NGRDI), have more limited applications, often struggling to differentiate between various non-vegetative elements [11,12]. The Visible-band Difference Vegetation Index (VDVI), compared to NDVI, performs better with high-resolution UAV images but requires stricter threshold settings [13]. Vegetation indices are still widely used in forestry and agricultural research, assisting in post-fire forest planning and management, as well as in assessing fire risk factors in forest ecosystems [14,15].
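For reference, a minimal sketch of how two of these indices could be computed from band reflectances with NumPy is given below; the array names, band values, and small stabilizing constant are illustrative assumptions, not part of the cited studies.

```python
import numpy as np

def rvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Ratio Vegetation Index: NIR / Red."""
    return nir / (red + 1e-8)

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + 1e-8)

# Example with synthetic reflectance values in [0, 1]
nir = np.array([[0.60, 0.55], [0.20, 0.10]])
red = np.array([[0.10, 0.12], [0.15, 0.09]])
print(ndvi(nir, red))  # dense vegetation yields values close to 1
```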
Since the beginning of the 21st century, artificial intelligence-related technologies have developed rapidly, and neural networks (NN) have seen widespread use, including increasing applications in post-forest fire recovery efforts. Domingo utilized Support Vector Machine (SVM) and Random Forest (RF) algorithms to model fire risks in forests, significantly improving classification accuracy, which plays a crucial role in forest fire prevention and the management of post-fire re-ignition risks [16]. Overall, SVM has become an effective and efficient tool both in complex mountainous forest classification tasks [17] and in categorizing and analyzing areas affected by forest fires [18]. Attarchi et al. used SVM and neural networks to classify different regions of complex forests, achieving an accuracy of 86%, notably higher than the results of the traditional Maximum Likelihood Classification (MLC) method [17]. Change Vector Analysis (CVA) and Decision Trees (DT) also offer high accuracy in forest fire risk assessment and the classification and analysis of burned areas [19]. Combinations of methods such as Bootstrap Aggregation and region-growing algorithms with neural networks have also shown outstanding performance in tasks such as mapping burn scars [20] and predicting fire break locations [21]. Compared to traditional methods, neural networks provide greater research value for high-resolution forest imagery by segmenting and classifying objects within images rather than merely processing pixels, thereby enhancing the depth and breadth of research. At the same time, traditional pixel-based methods, such as those relying solely on vegetation indices, retain the advantages of simplicity and reduced processing time and continue to have practical applications.
This study presents a segmentation method for burned forest areas using remote sensing imagery based on deep learning. Deep learning, which emerged in the 21st century as computational power advanced rapidly, uses multi-layer neural networks to extract deep and abstract features from data layer by layer. In our deep learning approach, we use semantic segmentation to distinguish between different regions in post-fire forests, thereby helping to analyze the evolutionary characteristics of the forest after a fire. As a semantic segmentation model, we used Mask2Former [22], which has become commonplace in recent years. We improved upon the Mask2Former model by making it more adaptable to the complex environment of post-fire forests.
2. Materials and Methods
2.1. The Study Area
On 21 August 2022, a wildfire occurred near Hutou Village in Beibei District, Chongqing, spreading rapidly towards Jinyun Mountain Nature Reserve. Firefighters and local residents spent five arduous days bringing the blaze under control. This large-scale wildfire led to significant losses of manpower and resources while also inflicting considerable damage to the forest ecosystem of the Jinyun Mountain range.
The study area of this paper is the Jinyun Mountain range in Beibei District, Chongqing, China, extending from Yangjiawan to Bajiaochi, and located at 29°45′26.064″ to 29°48′38.9448″ N and 106°18′49.3632″ to 106°22′2.4564″ E. The burned area covers approximately 14 square kilometers, with an elevation ranging from about 321.2 m to 845.76 m. This region features a subtropical monsoon humid climate with a subtropical evergreen broadleaf zonal forest type. Before the fire, the forest coverage rate exceeded 95%, reaching up to 98.6% in some areas (Figure 1).
2.2. Data Acquisition
Our data were collected from Beibei District, Chongqing, China, between October 2022 and February 2024, using a DJI Mavic 3 drone produced by Shenzhen Dajiang Innovation Technology Co., Ltd., Shenzhen, China. The drone has an effective resolution of 20 million pixels, an ISO sensitivity range of 100 to 6400, and an image size of 5280 × 3956 pixels in JPEG format. The data collection area had sufficient natural lighting conditions (>15 lux), and the diffuse reflectance of the materials in the scene was greater than 20%. The data were collected four times in total, with each acquisition conducted after two seasonal changes, resulting in an interval of approximately 5–6 months between captures, as detailed in Table 1.
The dataset was divided into a training set and a validation set in a 3:1 ratio. For data preprocessing, each image was divided into quarters to balance training effectiveness against hardware limitations, cropping the original 5280 × 3956 pixel images into 2640 × 1978 pixel tiles. Additionally, the training set was augmented using techniques such as random Gaussian noise addition, median blurring, flipping, and alpha-blended vertical linear gradient mixing, thereby enriching the variety of training images. After processing, the total number of images was 3776.
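A minimal sketch of the quartering step and one of the augmentations is shown below; the tile sizes follow the text, while the noise level and helper names are illustrative assumptions rather than the exact preprocessing pipeline used.

```python
import numpy as np
from PIL import Image

def quarter_image(img: Image.Image) -> list:
    """Split a 5280 x 3956 capture into four 2640 x 1978 tiles."""
    w, h = img.size          # (5280, 3956)
    hw, hh = w // 2, h // 2  # (2640, 1978)
    boxes = [(0, 0, hw, hh), (hw, 0, w, hh), (0, hh, hw, h), (hw, hh, w, h)]
    return [img.crop(b) for b in boxes]

def add_gaussian_noise(img: Image.Image, sigma: float = 10.0) -> Image.Image:
    """One of the augmentations: additive Gaussian noise (sigma is an assumed value)."""
    arr = np.asarray(img).astype(np.float32)
    noisy = np.clip(arr + np.random.normal(0.0, sigma, arr.shape), 0, 255)
    return Image.fromarray(noisy.astype(np.uint8))
```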
2.3. Methodology
In forest fire disaster severity and recovery monitoring cases primarily utilizing satellite remote sensing imagery, traditional vegetation indices are widely applicable and effective for two main reasons. First, satellite remote sensing images typically contain multispectral information, allowing more detailed data to be extracted. Second, the high upper limit of the displayable range in satellite images facilitates the elimination of local disturbances, enabling comprehensive global analysis. However, the revisit cycle of satellites often means that image data are not fully up-to-date. In addition, various natural and human factors beyond our control, such as extreme weather conditions or satellite mission resource allocation, can affect image quality or revisit frequency. Furthermore, more comprehensive data require more complex processing techniques: without radiometric calibration and atmospheric correction, it is often difficult to achieve good results using vegetation indices. As a result, visible-light UAV remote sensing images are frequently required in practical applications. Nevertheless, simple vegetation indices applied to such images, such as VDVI, which relies solely on visible-light bands, exhibit clear limitations.
The images in Figure 2 represent the same unburned forest area captured at different time intervals, arranged chronologically from top to bottom. Since the area was unaffected by the wildfire, the forest’s scale and structure did not change significantly with the seasons. However, when VDVI with a consistent threshold (0.1) is used to assess vegetation coverage, the values from top to bottom are 27.93%, 98.03%, and 79.74%, respectively; except for the middle image, the estimated vegetation coverage is considerably lower than the actual value. Furthermore, in the more complex scenarios found in burned areas, the vegetation coverage derived from vegetation indices often fails to fully capture the extent of forest recovery or to reflect changes in the forest’s scale and structure. Therefore, in tasks involving visible-light imagery, and particularly in post-fire areas with more complex data, vegetation indices suffer from limitations such as sensitivity to acquisition time and environmental conditions, as well as poor flexibility, making them less suitable for observing the evolutionary characteristics of burned areas.
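As an illustration of the thresholding step described above, the following sketch computes VDVI from an RGB image and the resulting coverage fraction; the 0.1 threshold follows the text, the formula is the commonly used VDVI definition (2G − R − B)/(2G + R + B), and the array names are assumptions.

```python
import numpy as np

def vdvi(rgb: np.ndarray) -> np.ndarray:
    """Visible-band Difference Vegetation Index from an RGB array scaled to [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (2 * g - r - b) / (2 * g + r + b + 1e-8)

def vegetation_coverage(rgb: np.ndarray, threshold: float = 0.1) -> float:
    """Fraction of pixels whose VDVI exceeds the threshold."""
    return float((vdvi(rgb) > threshold).mean())
```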
In contrast, as long as a network model is trained on data collected at different times and in various environments, it can learn to recognize and differentiate between regions under complex conditions, allowing for better observation of their evolutionary characteristics.
In this study, we employed a deep learning-based semantic segmentation method to observe the evolutionary characteristics of burned areas. Remote sensing images contain not only forests and burned areas but also other objects such as roads, buildings, ponds, construction sites, and regions that are either not completely destroyed or are still recovering but do not yet constitute forest. Therefore, we could not simply add the forest and burned areas together to represent the entire region, and a singular focus on either the forested or the burned areas would also have been insufficient. Thus, we segmented both the forested and burned portions of the entire region for analysis. Additionally, since roads have a similar color to burned areas and occupy a considerable portion of some images, we also segmented the road sections separately to improve the accuracy of the model.
2.4. Model
2.4.1. Semantic Segmentation Model Mask2Former
Semantic segmentation is the task of annotating and classifying objects of interest within an image. Through deep learning models, different regions of the image can be identified, and corresponding semantic labels can be assigned to the pixels within these regions.
Mask2Former is an efficient and highly flexible segmentation model [22] whose multi-scale features significantly enhance segmentation accuracy. This model is particularly suitable for segmentation tasks targeting specific objects. The overall structure of the model is illustrated in Figure 3.
Mask2Former optimizes the Decoder by introducing a multi-scale strategy. Specifically, the image first undergoes feature extraction through the Backbone, forming a feature pyramid composed of both low- and high-resolution features. These features are then fed from the Pixel decoder to the Transformer Decoder, enabling the parallel processing of different resolution features. This optimization strategy not only enhances the model’s comprehensiveness in information extraction but also significantly improves overall performance. This approach also increases the model’s adaptability, providing greater robustness in handling a variety of tasks.
2.4.2. Backbone Swin-Transformer
Commonly used Backbones for Mask2Former include ViT, Swin-Transformer, and ResNet. Given the high resolution of our dataset, we selected Swin-Transformer [23] as the Backbone and applied the Shifted Window strategy to maintain computational efficiency when processing high-resolution images. Swin-Transformer builds on the strong attention mechanism of the Transformer by employing hierarchical feature representation, which not only preserves detailed information but also captures global context more effectively, adapting to image features at various scales. The Shifted Window mechanism applies local window attention to the feature map, reducing computational complexity while improving the model’s ability to capture long-range dependencies. This method enhances performance in processing high-resolution images while maintaining computational efficiency. The architecture of the Swin-Transformer and the Swin-Transformer Block are shown in Figure 4.
The traditional Multi-Head Self-Attention (MSA) mechanism facilitates feature interactions at different spatial locations and captures global features, but computing correlations between all tokens requires significant computation. In contrast, Window Multi-Head Self-Attention (W-MSA), which is unique to Swin-Transformer, performs self-attention computations within each window by dividing the feature map into multiple non-overlapping windows. Against this background, the complexity of MSA and W-MSA can be compared as follows:

Ω(MSA) = 4hwC² + 2(hw)²C,
Ω(W-MSA) = 4hwC² + 2M²hwC,

where Ω denotes the computational complexity, h and w denote the height and width of the feature map, C denotes the depth of the feature map, and M denotes the window size. It can be seen that the computational complexity of MSA is quadratically related to the size of the feature map, while that of W-MSA is linearly related to it, since the window area M² is certain to be smaller than the feature map size h × w, as shown in Figure 5a.
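To make the comparison concrete, a small numeric sketch of the two complexity formulas is shown below; the feature-map size, channel depth, and window size are arbitrary example values, not taken from the paper.

```python
def msa_flops(h: int, w: int, c: int) -> int:
    """Global MSA complexity: 4hwC^2 + 2(hw)^2 C."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c

def wmsa_flops(h: int, w: int, c: int, m: int) -> int:
    """Window MSA complexity: 4hwC^2 + 2 M^2 hw C."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c

h = w = 56; c = 96; m = 7                # example sizes
print(msa_flops(h, w, c))                # grows quadratically with hw
print(wmsa_flops(h, w, c, m))            # grows only linearly with hw
```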
Since W-MSA provides only the content inside each window, there is no effective information transfer between windows. The solution to this problem is Shifted Window Multi-Head Self-Attention (SW-MSA). SW-MSA shifts the original windows to the left and upward by a distance of M/2, forming multiple new windows of different sizes, as shown in Figure 5b. The shifted windows thus integrate patches that were not previously included in the same W-MSA window, so information can be exchanged between the originally separate windows, enabling the network to better extract global features. Thus, W-MSA and SW-MSA always appear in pairs in the Swin-Transformer Block.
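A minimal sketch of the window shift, implemented as a cyclic roll of the feature map as in common Swin-Transformer implementations, is shown below; the tensor shape and window size are illustrative.

```python
import torch

def shift_windows(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Cyclically shift a (B, H, W, C) feature map by window_size // 2 along H and W."""
    s = window_size // 2
    return torch.roll(x, shifts=(-s, -s), dims=(1, 2))

x = torch.randn(1, 56, 56, 96)            # example feature map
x_shifted = shift_windows(x, window_size=7)
```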
2.4.3. Efficiently Adaptive Block
In the process of secondary succession, forests gradually recovering from mountain fires face intense competition among different species, with constantly alternating dominant species. At the same time, forests are in different successional stages due to the different degrees of fire damage in each region, resulting in a complex and variable situation. To achieve better segmentation results in this complex environment, we introduced the Contextual Transformer Block (CoT Block) [24]. This Block integrates contextual information mining and a self-attention mechanism, which is able to better explore the contextual information in complex environments and help distinguish between different neighboring objects.
The CoT Block first applies a k × k convolution over the neighboring keys within each k × k grid to obtain the static contextual representation of the input, K1. K1 is then merged with the input query, and two consecutive 1 × 1 convolutions produce the dynamic multi-head attention matrix A. The attention matrix A is multiplied by the input values V to obtain the dynamic contextual representation of the input, K2, i.e., K2 = V × A. The final fusion of the static and dynamic contextual representations is used as the output.
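The following PyTorch sketch follows the steps just described (static context K1 via a k × k convolution over keys, fusion with the query, two 1 × 1 convolutions producing the attention matrix A, and dynamic context K2 = V × A); the layer sizes, the sigmoid normalization of A, and the additive fusion are simplified assumptions rather than the exact published implementation.

```python
import torch
import torch.nn as nn

class CoTBlockSketch(nn.Module):
    """Simplified Contextual Transformer block."""
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.key_embed = nn.Conv2d(dim, dim, k, padding=k // 2, bias=False)  # static context K1
        self.value_embed = nn.Conv2d(dim, dim, 1, bias=False)                # values V
        self.attn = nn.Sequential(                                           # two 1x1 convs -> A
            nn.Conv2d(2 * dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k1 = self.key_embed(x)                       # static contextual representation K1
        v = self.value_embed(x)                      # values V
        a = self.attn(torch.cat([k1, x], dim=1))     # attention matrix A from K1 and the query
        k2 = v * torch.sigmoid(a)                    # dynamic contextual representation K2 = V x A
        return k1 + k2                               # fuse static and dynamic context

# Example: out = CoTBlockSketch(64)(torch.randn(1, 64, 56, 56))  # output keeps the input shape
```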
To further enhance the feature extraction ability of the network in classification tasks, we introduced the idea of Efficient Multi-Scale Attention (EMA) [25] into the CoT Block. EMA enables spatial semantic features to be distributed within each feature group by reshaping some of the channels into the batch dimension and dividing the channel dimension into sub-features. At the same time, EMA recalibrates the channel weights of each parallel branch by encoding global information, improving the learning efficiency of the channel content without reducing the channel dimensions and producing better pixel-level attention in advanced feature mapping.
First, EMA divides the input feature map x into g groups of sub-features along the channel dimension to learn different semantics (usually g ≪ c, where c is the number of channels), enabling the learned attention weight descriptors to enhance the feature representation of the region of interest in each sub-feature. EMA then extracts the attention weight descriptors of the grouped feature map through three parallel routes: two routes on a 1 × 1 branch and a third route on a 3 × 3 branch. In the 1 × 1 branch, global average pooling is applied along each of the two spatial directions, while the 3 × 3 branch uses only one 3 × 3 convolution kernel to capture a multi-scale feature representation, alleviating the computational requirements. The two encoded features are then concatenated along the image height direction so that they share the same 1 × 1 convolution without reducing the dimension of the 1 × 1 branch. After the output of the 1 × 1 convolution is decomposed into two vectors, a nonlinear Sigmoid function is used to fit the two-dimensional binomial distribution after the linear convolution. Cross-channel feature interactions are realized by fusing the attention maps of the two channels within each group via multiplication, and the 3 × 3 branch enlarges the feature space and acquires cross-channel communication information through convolution. In this way, EMA not only pairwise adjusts the importance of different channels but also retains accurate spatial structure information within the channels.
In realizing cross-channel feature interaction, the outputs of the 1 × 1 branch and the 3 × 3 branch are combined, and the output of the 1 × 1 branch is encoded into the corresponding dimensional shape using two-dimensional global average pooling, which operates as follows:

z_c = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j),

where z_c is the average pooling result on channel c, representing the global feature description of channel c; H and W are the height and width of the feature map, respectively; and x_c(i, j) denotes the value of the feature map on channel c at position (i, j). To improve computational efficiency, the nonlinear function Softmax is applied to the output of the 2D global average pooling to fit the above linear transformation. Multiplying the outputs of this parallel processing via a matrix dot product yields the first spatial attention map, which contains spatial location information. The same procedure on the 3 × 3 branch yields the second spatial attention map, after which the two sets of generated spatial attention weights are aggregated and passed through the Sigmoid function.
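The PyTorch sketch below condenses the EMA steps described above (channel grouping into the batch dimension, a 1 × 1 branch with directional average pooling, a 3 × 3 branch, and cross-branch softmax weighting followed by a Sigmoid); the group count, normalization choices, and exact fusion are simplified assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

class EMASketch(nn.Module):
    """Simplified Efficient Multi-Scale Attention module."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.g = groups
        cg = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width  -> (.., H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height -> (.., 1, W)
        self.conv1x1 = nn.Conv2d(cg, cg, kernel_size=1)
        self.conv3x3 = nn.Conv2d(cg, cg, kernel_size=3, padding=1)
        self.gn = nn.GroupNorm(cg, cg)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xg = x.reshape(b * self.g, c // self.g, h, w)        # channels reshaped into the batch dim
        # 1x1 branch: directional pooling joined along height, shared 1x1 convolution
        xh = self.pool_h(xg)                                  # (b*g, c/g, h, 1)
        xw = self.pool_w(xg).permute(0, 1, 3, 2)              # (b*g, c/g, w, 1)
        y = self.conv1x1(torch.cat([xh, xw], dim=2))
        yh, yw = torch.split(y, [h, w], dim=2)
        branch1 = self.gn(xg * yh.sigmoid() * yw.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: multi-scale spatial context
        branch2 = self.conv3x3(xg)
        # cross-branch interaction: channel-softmax weights of one branch applied to the other
        w1 = torch.softmax(branch1.mean(dim=(2, 3)).reshape(b * self.g, 1, -1), dim=-1)
        w2 = torch.softmax(branch2.mean(dim=(2, 3)).reshape(b * self.g, 1, -1), dim=-1)
        attn = (w1 @ branch2.flatten(2) + w2 @ branch1.flatten(2)).reshape(b * self.g, 1, h, w)
        return (xg * attn.sigmoid()).reshape(b, c, h, w)

# Example: out = EMASketch(64)(torch.randn(2, 64, 32, 32))  # output keeps the input shape
```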
The mechanism of applying EMA within the CoT Block enhances the capture of contextual information while utilizing that information to better highlight advanced and in-depth features, enabling the network to adapt to complex environments while still efficiently learning the advanced features of the object of interest. We call this module the Efficiently Adaptive Block (EA Block), as shown in Figure 6.
The optimized Swin-Transformer with an integrated EA Block is used here as the Backbone of the Mask2Former model to improve segmentation accuracy, as well as the model’s ability to capture contextual information and learn key features. We call this segmentation network EAswin-Mask2former; its overall network structure is shown in Figure 7.
3. Experiment
3.1. Experimental Conditions
The models were implemented using Pytorch and run on an NVIDIA GeForce RTX 3090. For consistency in our experiments, all models used the same dataset, with a learning rate of 0.0001 and batch size of 2. In addition, models were trained for a total of 250 epochs. If losses did not decrease after 20 consecutive epochs, the training process was stopped.
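A minimal sketch of this training configuration (learning rate 0.0001, batch size 2, up to 250 epochs, early stopping after 20 epochs without a loss decrease) is given below; the model, dataset, and the choice of AdamW as optimizer are placeholders and assumptions, not the exact training script.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=250, lr=1e-4, batch_size=2, patience=20, device="cuda"):
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimizer choice is an assumption
    criterion = torch.nn.CrossEntropyLoss()
    best_loss, stale = float("inf"), 0
    model.to(device)
    for epoch in range(epochs):
        epoch_loss = 0.0
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # early stopping: halt if the loss has not decreased for `patience` consecutive epochs
        if epoch_loss < best_loss:
            best_loss, stale = epoch_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
```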
We used the Cross-Entropy Loss function in the segmentation task, as follows:

L_CE = −(1/b) Σ_{i=1}^{b} y_i log(ŷ_i),

where b is the batch size during training, and y_i and ŷ_i are the ground-truth and predicted values, respectively.
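As a small sanity check of this formulation, the sketch below compares a manual batch-averaged cross entropy with PyTorch’s built-in cross-entropy on toy logits; the numeric values are arbitrary.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])  # b = 2 samples, 3 classes
targets = torch.tensor([0, 1])                              # ground-truth class indices

# Manual: -(1/b) * sum_i log(softmax(logits)_i at the true class)
probs = F.softmax(logits, dim=1)
manual = -torch.log(probs[torch.arange(2), targets]).mean()

builtin = F.cross_entropy(logits, targets)
print(manual.item(), builtin.item())  # the two values agree
```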
3.2. Performance Index
To evaluate the performance of the model, we used four common evaluation metrics: mAcc, mIou, mDice, and mPrecision, each ranging from 0 to 1. The closer the value is to 1, the better the model’s prediction. Acc, Iou, Dice, and Precision are defined in terms of TP (the truth value belongs to a certain class and the segmentation result is also in that class), TN (the truth value does not belong to a certain class and the segmentation result is not in that class), FP (the truth value does not belong to a certain class but the segmentation result is in that class), and FN (the truth value belongs to a certain class but the segmentation result is not in that class), as follows:

Acc = (TP + TN)/(TP + TN + FP + FN),
Iou = TP/(TP + FP + FN),
Dice = 2TP/(2TP + FP + FN),
Precision = TP/(TP + FP),

where mAcc, mIou, mDice, and mPrecision denote the mean values of these metrics across all classes.
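A minimal sketch of these per-class metrics computed from boolean prediction and ground-truth masks is shown below; averaging the per-class values gives the m-prefixed metrics. Function and array names are illustrative.

```python
import numpy as np

def class_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Acc, IoU, Dice, and Precision for one class, given boolean masks."""
    tp = np.logical_and(pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    eps = 1e-8
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn + eps),
        "IoU": tp / (tp + fp + fn + eps),
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
        "Precision": tp / (tp + fp + eps),
    }

# mAcc, mIoU, mDice, and mPrecision are the means of these values over all classes
```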
3.3. Comparative Experiments and Ablation Experiment
To verify the effectiveness and universal applicability of the improved EAswin-Mask2former, we conducted a comparative experiment with classic image segmentation models and Mask2former variants using different backbones on an open-source dataset. The models compared against EAswin-Mask2former are referred to as baselines. This experiment used the open-source image segmentation dataset Cityscapes to segment 19 target categories: “roads”, “sidewalks”, “buildings”, “walls”, “fences”, “poles”, “traffic lights”, “road signs”, “vegetation”, “terrain”, “sky”, “people”, “riders”, “cars”, “trucks”, “buses”, “trains”, “motorcycles”, and “bicycles”.
Cityscapes is a high-quality dataset used for computer vision tasks such as semantic segmentation, instance segmentation, and object detection in urban environments. We used this dataset because its scenes share characteristics with our dataset collected from Jinyun Mountain, featuring frequently occurring major elements, such as roads and vehicles, along with other changing objects.
Several image segmentation models were compared with EAswin-Mask2former, including Mask2former, Segformer [26], and Deeplabv3+ [27], with ResNet [28] as a backbone, among others. Due to the limited size of our dataset, we used the lightweight Swin-Tiny version of Swin-Transformer in the Mask2former approach. Similarly, the ResNet network used for comparison was ResNet50, which has similar computational complexity. For Segformer and DeepLabv3+, we used MixVIT and ResNet101 as the backbone networks, respectively, as these offered the best experimental results; using ResNet50 would have resulted in lower performance.
Next, to verify the effectiveness of the improvement of the EAswin–Mask2former in the complex environment of post-fire forests, we conducted another comparative experiment with the aforementioned baselines on the dataset we obtained from Jinyun Mountain.
Finally, to verify the effectiveness of the improvements made to EAswin–Mask2former based on Swin–Mask2former, we conducted an ablation experiment. This experiment compared the basic Swin–Mask2former, Swin–Mask2former modified with CoT, Swin–Mask2former modified with EMA, and EAswin–Mask2former modified with both CoT and EMA on the Jinyun Mountain dataset.
5. Conclusions
This study addressed the problem of monitoring the evolutionary characteristics of burned areas in forests using visible light remote sensing imagery and proposed a semantic segmentation model, EAswin–Mask2former, based on deep learning.
Forest fires pose a significant threat to forest ecosystems, making it crucial to understand post-fire recovery and observe the evolution of post-fire forests. The real-time visible light remote sensing images captured by UAVs are clearer than the satellite imagery commonly used in the past. Using deep learning-based semantic segmentation on high-resolution UAV imagery over time can lead to good performance, making this method more suitable for observing the evolutionary characteristics of post-fire forests compared with traditional vegetation index methods. We employed the Mask2Former model with Swin-Transformer as the backbone, which performed well in this segmentation task. On this basis, we proposed an improved EAswin-Mask2Former segmentation model. The experimental results demonstrated that this model is better adapted to the complex conditions of post-fire forests, achieving superior performance.
Within less than a year after the fire, some areas of Jinyun Mountain exhibited accelerated recovery, which may have been influenced by human intervention.