1. Introduction
Forest ecosystems hold significant environmental, economic, and cultural value for humanity [1]. However, the area of forest coverage continues to decline annually. Apart from commercial deforestation activities, one of the primary factors contributing to forest reduction is forest fires. Fires have become a leading cause of forest loss, with over 4 million hectares of land being burned each year, according to Global Forest Watch statistics [2]. Fires not only destroy forest vegetation and trees but also expose the forest soil to direct sunlight, accelerating erosion and the loss of topsoil. Burned forest areas are regions within forests or woodlands that have been damaged or destroyed by fire, resulting in severe degradation of the ecosystem and environment.
Ongoing issues such as global warming have exacerbated the destruction of forests through fire, diminishing tree coverage [3,4]. Consequently, researchers are increasingly recognizing the importance of forest conservation as a key measure to halt this vicious cycle. Post-fire restoration has thus become an integral part of forest protection and management. Monitoring the evolution of burned forest areas is crucial for understanding post-fire recovery, making this type of oversight an essential task. Over the past few decades, advances in sensor technology have made remote sensing an indispensable tool for monitoring forest ecosystems, due to its ability to rapidly capture extensive information over large areas. Remote sensing has been widely used in estimating burn severity [5], mapping affected regions [6], and monitoring vegetation recovery [7].
In the past, forest disaster conditions and evolutionary characteristics were assessed manually. However, with the advancement of remote sensing technology and the increasing recognition of forest ecosystems’ importance, relevant research began to emerge in the 1970s, with significant progress made over the past two decades.
Vegetation indices provide a simple and effective way to qualitatively and quantitatively evaluate vegetation cover and growth dynamics. Some of the earliest indices, such as the Ratio Vegetation Index (RVI) proposed by Jordan before the 1970s, allowed green biomass estimation but were significantly influenced by atmospheric effects when vegetation cover was low [8]. Subsequent developments included the Difference Vegetation Index (DVI) and the Perpendicular Vegetation Index (PVI), which estimate vegetation density using Euclidean distance [9]. The Normalized Difference Vegetation Index (NDVI), introduced by Rouse Jr. et al., remains the most widely used index due to its simple calculation and strong correlation with key metrics such as biomass and plant growth [10]. Visible-band indices, such as the Excess Green Index (EXG) and Normalized Green–Red Difference Index (NGRDI), have more limited applications, often struggling to differentiate between various non-vegetative elements [11,12]. The Visible-band Difference Vegetation Index (VDVI), compared to NDVI, performs better with high-resolution UAV images but requires stricter threshold settings [13]. Vegetation indices are still widely used in forestry and agricultural research, assisting in post-fire forest planning and management, as well as in assessing fire risk factors in forest ecosystems [14,15].
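For reference, a minimal sketch of how two of these indices could be computed from band reflectances with NumPy is given below; the array names, band values, and small stabilizing constant are illustrative assumptions, not part of the cited studies.

```python
import numpy as np

def rvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Ratio Vegetation Index: NIR / Red."""
    return nir / (red + 1e-8)

def ndvi(nir: np.ndarray, red: np.ndarray) -> np.ndarray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + 1e-8)

# Example with synthetic reflectance values in [0, 1]
nir = np.array([[0.60, 0.55], [0.20, 0.10]])
red = np.array([[0.10, 0.12], [0.15, 0.09]])
print(ndvi(nir, red))  # dense vegetation yields values close to 1
```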
Since the beginning of the 21st century, artificial intelligence-related technologies have developed rapidly, and neural networks (NN) have seen widespread use, including increasing applications in post-forest fire recovery efforts. Domingo utilized Support Vector Machine (SVM) and Random Forest (RF) algorithms to model fire risks in forests, significantly improving classification accuracy, which plays a crucial role in forest fire prevention and the management of post-fire re-ignition risks [16]. Overall, SVM has become an effective and efficient tool both in complex mountainous forest classification tasks [17] and in categorizing and analyzing areas affected by forest fires [18]. Attarchi et al. used SVM and neural networks to classify different regions of complex forests, achieving an accuracy of 86%, notably higher than the results of the traditional Maximum Likelihood Classification (MLC) method [17]. Change Vector Analysis (CVA) and Decision Trees (DT) also offer high accuracy in forest fire risk assessment and the classification and analysis of burned areas [19]. Combinations of methods such as Bootstrap Aggregation and region-growing algorithms with neural networks have also shown outstanding performance in tasks such as mapping burn scars [20] and predicting fire break locations [21]. Compared to traditional methods, neural networks provide greater research value for high-resolution forest imagery by segmenting and classifying objects within images rather than merely processing pixels, thereby enhancing the depth and breadth of research. At the same time, traditional pixel-based methods, such as those relying solely on vegetation indices, retain the advantages of simplicity and reduced processing time and continue to have practical applications.
This study presents a segmentation method for burned forest areas using remote sensing imagery based on deep learning. Deep learning, which emerged in the 21st century as computational power advanced rapidly, uses multi-layer neural networks to extract deep and abstract features from data layer by layer. In our deep learning approach, we use semantic segmentation to distinguish between different regions in post-fire forests, thereby helping to analyze the evolutionary characteristics of the forest after a fire. As a semantic segmentation model, we used Mask2Former [22], which has become commonplace in recent years. We improved upon the Mask2Former model by making it more adaptable to the complex environment of post-fire forests.
2. Materials and Methods
2.1. The Study Area
On 21 August 2022, a wildfire occurred near Hutou Village in Beibei District, Chongqing, spreading rapidly towards Jinyun Mountain Nature Reserve. Firefighters and local residents spent five arduous days bringing the blaze under control. This large-scale wildfire led to significant losses of manpower and resources while also inflicting considerable damage to the forest ecosystem of the Jinyun Mountain range.
The study area of this paper is the Jinyun Mountain range in Beibei District, Chongqing, China, extending from Yangjiawan to Bajiaochi, and located at 29°45′26.064″ to 29°48′38.9448″ N and 106°18′49.3632″ to 106°22′2.4564″ E. The burned area covers approximately 14 square kilometers, with an elevation ranging from about 321.2 m to 845.76 m. This region features a subtropical monsoon humid climate with a subtropical evergreen broadleaf zonal forest type. Before the fire, the forest coverage rate exceeded 95%, reaching up to 98.6% in some areas (Figure 1).
2.2. Data Acquisition
Our data were collected from Beibei District, Chongqing, China, between October 2022 and February 2024, using a DJI Mavic 3 drone produced by Shenzhen Dajiang Innovation Technology Co., Ltd., Shenzhen, China. The drone has an effective resolution of 20 million pixels, an ISO sensitivity range of 100 to 6400, and an image size of 5280 × 3956 pixels in JPEG format. The data collection area had sufficient natural lighting conditions (>15 lux), and the diffuse reflectance of the materials in the scene was greater than 20%. The data were collected four times in total, with each acquisition conducted after two seasonal changes, resulting in an interval of approximately 5–6 months between captures, as detailed in Table 1.
The dataset was divided into a training set and a validation set in a 3:1 ratio. For data preprocessing, each image was divided into quarters to balance training effectiveness against hardware limitations, cropping the original 5280 × 3956 pixel images into 2640 × 1978 pixel tiles. Additionally, the training set was augmented using techniques such as random Gaussian noise addition, median blurring, flipping, and alpha-blended vertical linear gradient mixing, thereby enriching the variety of training images. After processing, the total number of images was 3776.
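A minimal sketch of the quartering step and one of the augmentations is shown below; the tile sizes follow the text, while the noise level and helper names are illustrative assumptions rather than the exact preprocessing pipeline used.

```python
import numpy as np
from PIL import Image

def quarter_image(img: Image.Image) -> list:
    """Split a 5280 x 3956 capture into four 2640 x 1978 tiles."""
    w, h = img.size          # (5280, 3956)
    hw, hh = w // 2, h // 2  # (2640, 1978)
    boxes = [(0, 0, hw, hh), (hw, 0, w, hh), (0, hh, hw, h), (hw, hh, w, h)]
    return [img.crop(b) for b in boxes]

def add_gaussian_noise(img: Image.Image, sigma: float = 10.0) -> Image.Image:
    """One of the augmentations: additive Gaussian noise (sigma is an assumed value)."""
    arr = np.asarray(img).astype(np.float32)
    noisy = np.clip(arr + np.random.normal(0.0, sigma, arr.shape), 0, 255)
    return Image.fromarray(noisy.astype(np.uint8))
```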
2.3. Methodology
In forest fire disaster severity and recovery monitoring cases primarily utilizing satellite remote sensing imagery, traditional vegetation indices are widely applicable and effective for two main reasons. First, satellite remote sensing images typically contain multispectral information, allowing more detailed data to be extracted. Second, the high upper limit of the displayable range in satellite images facilitates the elimination of local disturbances, enabling comprehensive global analysis. However, the revisit cycle of satellites often means that image data are not fully up-to-date. In addition, various natural and human factors beyond our control, such as extreme weather conditions or satellite mission resource allocation, can affect image quality or revisit frequency. Furthermore, more comprehensive data require more complex processing techniques: without radiometric calibration and atmospheric correction, it is often difficult to achieve good results using vegetation indices. As a result, visible-light UAV remote sensing images are frequently required in practical applications. Nevertheless, simple vegetation indices applied to such images, such as VDVI, which relies solely on visible-light bands, exhibit clear limitations.
The images in Figure 2 represent the same unburned forest area captured at different time intervals, arranged chronologically from top to bottom. Since the area was unaffected by the wildfire, the forest’s scale and structure did not change significantly with the seasons. However, when VDVI with a consistent threshold (0.1) is used to assess vegetation coverage, the values from top to bottom are 27.93%, 98.03%, and 79.74%, respectively; except for the middle image, the estimated vegetation coverage is considerably lower than the actual value. Furthermore, in the more complex scenarios found in burned areas, the vegetation coverage derived from vegetation indices often fails to fully capture the extent of forest recovery or to reflect changes in the forest’s scale and structure. Therefore, in tasks involving visible-light imagery, and particularly in post-fire areas with more complex data, vegetation indices suffer from limitations such as sensitivity to acquisition time and environmental conditions, as well as poor flexibility, making them less suitable for observing the evolutionary characteristics of burned areas.
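As an illustration of the thresholding step described above, the following sketch computes VDVI from an RGB image and the resulting coverage fraction; the 0.1 threshold follows the text, the formula is the commonly used VDVI definition (2G − R − B)/(2G + R + B), and the array names are assumptions.

```python
import numpy as np

def vdvi(rgb: np.ndarray) -> np.ndarray:
    """Visible-band Difference Vegetation Index from an RGB array scaled to [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (2 * g - r - b) / (2 * g + r + b + 1e-8)

def vegetation_coverage(rgb: np.ndarray, threshold: float = 0.1) -> float:
    """Fraction of pixels whose VDVI exceeds the threshold."""
    return float((vdvi(rgb) > threshold).mean())
```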
In contrast, as long as a network model is trained on data collected at different times and in various environments, it can learn to recognize and differentiate between regions under complex conditions, allowing for better observation of their evolutionary characteristics.
In this study, we employed a deep learning-based semantic segmentation method to observe the evolutionary characteristics of burned areas. Remote sensing images contain not only forests and burned areas but also other objects such as roads, buildings, ponds, construction sites, and regions that are either not completely destroyed or are still recovering but do not yet constitute forest. Therefore, we could not simply add the forest and burned areas together to represent the entire region, and a singular focus on either the forested or the burned areas would also have been insufficient. Thus, we segmented both the forested and burned portions of the entire region for analysis. Additionally, since roads have a similar color to burned areas and occupy a considerable portion of some images, we also segmented the road sections separately to improve the accuracy of the model.
2.4. Model
2.4.1. Semantic Segmentation Model Mask2Former
Semantic segmentation is the task of annotating and classifying objects of interest within an image. Through deep learning models, different regions of the image can be identified, and corresponding semantic labels can be assigned to the pixels within these regions.
Mask2Former is an efficient and highly flexible segmentation model [22] whose multi-scale features significantly enhance segmentation accuracy. This model is particularly suitable for segmentation tasks targeting specific objects. The overall structure of the model is illustrated in Figure 3.
Mask2Former optimizes the Decoder by introducing a multi-scale strategy. Specifically, the image first undergoes feature extraction through the Backbone, forming a feature pyramid composed of both low- and high-resolution features. These features are then fed from the Pixel decoder to the Transformer Decoder, enabling the parallel processing of different resolution features. This optimization strategy not only enhances the model’s comprehensiveness in information extraction but also significantly improves overall performance. This approach also increases the model’s adaptability, providing greater robustness in handling a variety of tasks.
2.4.2. Backbone Swin-Transformer
Commonly used Backbones for Mask2Former include ViT, Swin-Transformer, and ResNet. Given the high resolution of our dataset, we selected Swin-Transformer [23] as the Backbone and applied the Shifted Window strategy to maintain computational efficiency when processing high-resolution images. Swin-Transformer builds on the strong attention mechanism of the Transformer by employing hierarchical feature representation, which not only preserves detailed information but also captures global context more effectively, adapting to image features at various scales. The Shifted Window mechanism applies local window attention to the feature map, reducing computational complexity while improving the model’s ability to capture long-range dependencies. This method enhances performance in processing high-resolution images while maintaining computational efficiency. The architecture of the Swin-Transformer and the Swin-Transformer Block are shown in Figure 4.
The traditional Multi-Head Self-Attention (MSA) mechanism facilitates feature interactions at different spatial locations and captures global features, but computing correlations between all tokens requires significant computation. In contrast, Window Multi-Head Self-Attention (W-MSA), which is unique to Swin-Transformer, performs self-attention computations within each window by dividing the feature map into multiple non-overlapping windows. Against this background, the complexity of MSA and W-MSA can be compared as follows:

Ω(MSA) = 4hwC² + 2(hw)²C,
Ω(W-MSA) = 4hwC² + 2M²hwC,

where Ω denotes the computational complexity, h and w denote the height and width of the feature map, C denotes the depth of the feature map, and M denotes the window size. It can be seen that the computational complexity of MSA is quadratically related to the size of the feature map, while that of W-MSA is linearly related to it, since the window area M² is certain to be smaller than the feature map size h × w, as shown in Figure 5a.
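To make the comparison concrete, a small numeric sketch of the two complexity formulas is shown below; the feature-map size, channel depth, and window size are arbitrary example values, not taken from the paper.

```python
def msa_flops(h: int, w: int, c: int) -> int:
    """Global MSA complexity: 4hwC^2 + 2(hw)^2 C."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c

def wmsa_flops(h: int, w: int, c: int, m: int) -> int:
    """Window MSA complexity: 4hwC^2 + 2 M^2 hw C."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c

h = w = 56; c = 96; m = 7                # example sizes
print(msa_flops(h, w, c))                # grows quadratically with hw
print(wmsa_flops(h, w, c, m))            # grows only linearly with hw
```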
Since W-MSA provides only the content inside each window, there is no effective information transfer between windows. The solution to this problem is Shifted Window Multi-Head Self-Attention (SW-MSA). SW-MSA shifts the original windows to the left and upward by a distance of M/2, forming multiple new windows of different sizes, as shown in Figure 5b. The shifted windows thus integrate patches that were not previously included in the same W-MSA window, so information can be exchanged between the originally separate windows, enabling the network to better extract global features. Thus, W-MSA and SW-MSA always appear in pairs in the Swin-Transformer Block.
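A minimal sketch of the window shift, implemented as a cyclic roll of the feature map as in common Swin-Transformer implementations, is shown below; the tensor shape and window size are illustrative.

```python
import torch

def shift_windows(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Cyclically shift a (B, H, W, C) feature map by window_size // 2 along H and W."""
    s = window_size // 2
    return torch.roll(x, shifts=(-s, -s), dims=(1, 2))

x = torch.randn(1, 56, 56, 96)            # example feature map
x_shifted = shift_windows(x, window_size=7)
```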
2.4.3. Efficiently Adaptive Block
In the process of secondary succession, forests gradually recovering from mountain fires face intense competition among different species, with constantly alternating dominant species. At the same time, forests are in different successional stages due to the different degrees of fire damage in each region, resulting in a complex and variable situation. To achieve better segmentation results in this complex environment, we introduced the Contextual Transformer Block (CoT Block) [24]. This Block integrates contextual information mining and a self-attention mechanism, which is able to better explore the contextual information in complex environments and help distinguish between different neighboring objects.
The CoT Block first applies a k × k convolution over the neighboring keys within each k × k grid to obtain the static contextual representation of the input, K1. K1 is then merged with the input query, and two consecutive 1 × 1 convolutions produce the dynamic multi-head attention matrix A. The attention matrix A is multiplied by the input values V to obtain the dynamic contextual representation of the input, K2, i.e., K2 = V × A. The final fusion of the static and dynamic contextual representations is used as the output.
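The following PyTorch sketch follows the steps just described (static context K1 via a k × k convolution over keys, fusion with the query, two 1 × 1 convolutions producing the attention matrix A, and dynamic context K2 = V × A); the layer sizes, the sigmoid normalization of A, and the additive fusion are simplified assumptions rather than the exact published implementation.

```python
import torch
import torch.nn as nn

class CoTBlockSketch(nn.Module):
    """Simplified Contextual Transformer block."""
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.key_embed = nn.Conv2d(dim, dim, k, padding=k // 2, bias=False)  # static context K1
        self.value_embed = nn.Conv2d(dim, dim, 1, bias=False)                # values V
        self.attn = nn.Sequential(                                           # two 1x1 convs -> A
            nn.Conv2d(2 * dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k1 = self.key_embed(x)                       # static contextual representation K1
        v = self.value_embed(x)                      # values V
        a = self.attn(torch.cat([k1, x], dim=1))     # attention matrix A from K1 and the query
        k2 = v * torch.sigmoid(a)                    # dynamic contextual representation K2 = V x A
        return k1 + k2                               # fuse static and dynamic context

# Example: out = CoTBlockSketch(64)(torch.randn(1, 64, 56, 56))  # output keeps the input shape
```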
To further enhance the feature extraction ability of the network in classification tasks, we introduced the idea of Efficient Multi-Scale Attention (EMA) [25] into the CoT Block. EMA enables spatial semantic features to be distributed within each feature group by reshaping some of the channels into the batch dimension and dividing the channel dimension into sub-features. At the same time, EMA recalibrates the channel weights of each parallel branch by encoding global information, improving the learning efficiency of the channel content without reducing the channel dimensions and producing better pixel-level attention in advanced feature mapping.
First, EMA divides the input feature map x into g groups of sub-features along the channel dimension to learn different semantics (usually g ≪ c, where c is the number of channels), enabling the learned attention weight descriptors to enhance the feature representation of the region of interest in each sub-feature. EMA then extracts the attention weight descriptors of the grouped feature map through three parallel routes: two routes on a 1 × 1 branch and a third route on a 3 × 3 branch. In the 1 × 1 branch, global average pooling is applied along each of the two spatial directions, while the 3 × 3 branch uses only one 3 × 3 convolution kernel to capture a multi-scale feature representation, alleviating the computational requirements. The two encoded features are then concatenated along the image height direction so that they share the same 1 × 1 convolution without reducing the dimension of the 1 × 1 branch. After the output of the 1 × 1 convolution is decomposed into two vectors, a nonlinear Sigmoid function is used to fit the two-dimensional binomial distribution after the linear convolution. Cross-channel feature interactions are realized by fusing the attention maps of the two channels within each group via multiplication, and the 3 × 3 branch enlarges the feature space and acquires cross-channel communication information through convolution. In this way, EMA not only pairwise adjusts the importance of different channels but also retains accurate spatial structure information within the channels.
In realizing cross-channel feature interaction, the outputs of the 1 × 1 branch and the 3 × 3 branch are combined, and the output of the 1 × 1 branch is encoded into the corresponding dimensional shape using two-dimensional global average pooling, which operates as follows:

z_c = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j),

where z_c is the average pooling result on channel c, representing the global feature description of channel c; H and W are the height and width of the feature map, respectively; and x_c(i, j) denotes the value of the feature map on channel c at position (i, j). To improve computational efficiency, the nonlinear function Softmax is applied to the output of the 2D global average pooling to fit the above linear transformation. Multiplying the outputs of this parallel processing via a matrix dot product yields the first spatial attention map, which contains spatial location information. The same procedure on the 3 × 3 branch yields the second spatial attention map, after which the two sets of generated spatial attention weights are aggregated and passed through the Sigmoid function.
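The PyTorch sketch below condenses the EMA steps described above (channel grouping into the batch dimension, a 1 × 1 branch with directional average pooling, a 3 × 3 branch, and cross-branch softmax weighting followed by a Sigmoid); the group count, normalization choices, and exact fusion are simplified assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

class EMASketch(nn.Module):
    """Simplified Efficient Multi-Scale Attention module."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.g = groups
        cg = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width  -> (.., H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height -> (.., 1, W)
        self.conv1x1 = nn.Conv2d(cg, cg, kernel_size=1)
        self.conv3x3 = nn.Conv2d(cg, cg, kernel_size=3, padding=1)
        self.gn = nn.GroupNorm(cg, cg)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xg = x.reshape(b * self.g, c // self.g, h, w)        # channels reshaped into the batch dim
        # 1x1 branch: directional pooling joined along height, shared 1x1 convolution
        xh = self.pool_h(xg)                                  # (b*g, c/g, h, 1)
        xw = self.pool_w(xg).permute(0, 1, 3, 2)              # (b*g, c/g, w, 1)
        y = self.conv1x1(torch.cat([xh, xw], dim=2))
        yh, yw = torch.split(y, [h, w], dim=2)
        branch1 = self.gn(xg * yh.sigmoid() * yw.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: multi-scale spatial context
        branch2 = self.conv3x3(xg)
        # cross-branch interaction: channel-softmax weights of one branch applied to the other
        w1 = torch.softmax(branch1.mean(dim=(2, 3)).reshape(b * self.g, 1, -1), dim=-1)
        w2 = torch.softmax(branch2.mean(dim=(2, 3)).reshape(b * self.g, 1, -1), dim=-1)
        attn = (w1 @ branch2.flatten(2) + w2 @ branch1.flatten(2)).reshape(b * self.g, 1, h, w)
        return (xg * attn.sigmoid()).reshape(b, c, h, w)

# Example: out = EMASketch(64)(torch.randn(2, 64, 32, 32))  # output keeps the input shape
```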
The mechanism of applying EMA within the CoT Block enhances the capture of contextual information while utilizing that information to better highlight advanced and in-depth features, enabling the network to adapt to complex environments while still efficiently learning the advanced features of the object of interest. We call this module the Efficiently Adaptive Block (EA Block), as shown in Figure 6.
The optimized Swin-Transformer with an integrated EA Block is used here as the Backbone of the Mask2Former model to improve segmentation accuracy, as well as the model’s ability to capture contextual information and learn key features. We call this segmentation network EAswin-Mask2former; its overall network structure is shown in Figure 7.
3. Experiment
3.1. Experimental Conditions
The models were implemented using Pytorch and run on an NVIDIA GeForce RTX 3090. For consistency in our experiments, all models used the same dataset, with a learning rate of 0.0001 and batch size of 2. In addition, models were trained for a total of 250 epochs. If losses did not decrease after 20 consecutive epochs, the training process was stopped.
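A minimal sketch of this training configuration (learning rate 0.0001, batch size 2, up to 250 epochs, early stopping after 20 epochs without a loss decrease) is given below; the model, dataset, and the choice of AdamW as optimizer are placeholders and assumptions, not the exact training script.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=250, lr=1e-4, batch_size=2, patience=20, device="cuda"):
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimizer choice is an assumption
    criterion = torch.nn.CrossEntropyLoss()
    best_loss, stale = float("inf"), 0
    model.to(device)
    for epoch in range(epochs):
        epoch_loss = 0.0
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        # early stopping: halt if the loss has not decreased for `patience` consecutive epochs
        if epoch_loss < best_loss:
            best_loss, stale = epoch_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
```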
We used the Cross-Entropy Loss function in the segmentation task, as follows:

L_CE = −(1/b) Σ_{i=1}^{b} y_i log(ŷ_i),

where b is the batch size during training, and y_i and ŷ_i are the ground-truth and predicted values, respectively.
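As a small sanity check of this formulation, the sketch below compares a manual batch-averaged cross entropy with PyTorch’s built-in cross-entropy on toy logits; the numeric values are arbitrary.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]])  # b = 2 samples, 3 classes
targets = torch.tensor([0, 1])                              # ground-truth class indices

# Manual: -(1/b) * sum_i log(softmax(logits)_i at the true class)
probs = F.softmax(logits, dim=1)
manual = -torch.log(probs[torch.arange(2), targets]).mean()

builtin = F.cross_entropy(logits, targets)
print(manual.item(), builtin.item())  # the two values agree
```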
3.2. Performance Index
To evaluate the performance of the model, we used four common evaluation metrics: mAcc, mIou, mDice, and mPrecision, each ranging from 0 to 1. The closer the value is to 1, the better the model’s prediction. Acc, Iou, Dice, and Precision are defined in terms of TP (the truth value belongs to a certain class and the segmentation result is also in that class), TN (the truth value does not belong to a certain class and the segmentation result is not in that class), FP (the truth value does not belong to a certain class but the segmentation result is in that class), and FN (the truth value belongs to a certain class but the segmentation result is not in that class), as follows:

Acc = (TP + TN)/(TP + TN + FP + FN),
Iou = TP/(TP + FP + FN),
Dice = 2TP/(2TP + FP + FN),
Precision = TP/(TP + FP),

where mAcc, mIou, mDice, and mPrecision denote the mean values of these metrics across all classes.
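A minimal sketch of these per-class metrics computed from boolean prediction and ground-truth masks is shown below; averaging the per-class values gives the m-prefixed metrics. Function and array names are illustrative.

```python
import numpy as np

def class_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Acc, IoU, Dice, and Precision for one class, given boolean masks."""
    tp = np.logical_and(pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    eps = 1e-8
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn + eps),
        "IoU": tp / (tp + fp + fn + eps),
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
        "Precision": tp / (tp + fp + eps),
    }

# mAcc, mIoU, mDice, and mPrecision are the means of these values over all classes
```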
3.3. Comparative Experiments and Ablation Experiment
To verify the effectiveness and universal applicability of the improved EAswin-Mask2former, we conducted a comparative experiment with classic image segmentation models and Mask2former variants using different backbones on an open-source dataset. The models compared against EAswin-Mask2former are referred to as baselines. This experiment used the open-source image segmentation dataset Cityscapes to segment 19 target categories: “roads”, “sidewalks”, “buildings”, “walls”, “fences”, “poles”, “traffic lights”, “road signs”, “vegetation”, “terrain”, “sky”, “people”, “riders”, “cars”, “trucks”, “buses”, “trains”, “motorcycles”, and “bicycles”.
Cityscapes is a high-quality dataset used for computer vision tasks such as semantic segmentation, instance segmentation, and object detection in urban environments. We used this dataset because its scenes share characteristics with our dataset collected from Jinyun Mountain, featuring frequently occurring major elements, such as roads and vehicles, along with other changing objects.
Several image segmentation models were compared with EAswin-Mask2former, including Mask2former, Segformer [26], and Deeplabv3+ [27], with ResNet [28] as a backbone, among others. Due to the limited size of our dataset, we used the lightweight Swin-Tiny version of Swin-Transformer in the Mask2former approach. Similarly, the ResNet network used for comparison was ResNet50, which has similar computational complexity. For Segformer and DeepLabv3+, we used MixVIT and ResNet101 as the backbone networks, respectively, as these offered the best experimental results; using ResNet50 would have resulted in lower performance.
Next, to verify the effectiveness of the improvement of the EAswin–Mask2former in the complex environment of post-fire forests, we conducted another comparative experiment with the aforementioned baselines on the dataset we obtained from Jinyun Mountain.
Finally, to verify the effectiveness of the improvements made to EAswin–Mask2former based on Swin–Mask2former, we conducted an ablation experiment. This experiment compared the basic Swin–Mask2former, Swin–Mask2former modified with CoT, Swin–Mask2former modified with EMA, and EAswin–Mask2former modified with both CoT and EMA on the Jinyun Mountain dataset.
5. Conclusions
This study addressed the problem of monitoring the evolutionary characteristics of burned areas in forests using visible light remote sensing imagery and proposed a semantic segmentation model, EAswin–Mask2former, based on deep learning.
Forest fires pose a significant threat to forest ecosystems, making it crucial to understand post-fire recovery and observe the evolution of post-fire forests. The real-time visible light remote sensing images captured by UAVs are clearer than the satellite imagery commonly used in the past. Using deep learning-based semantic segmentation on high-resolution UAV imagery over time can lead to good performance, making this method more suitable for observing the evolutionary characteristics of post-fire forests compared with traditional vegetation index methods. We employed the Mask2Former model with Swin-Transformer as the backbone, which performed well in this segmentation task. On this basis, we proposed an improved EAswin-Mask2Former segmentation model. The experimental results demonstrated that this model is better adapted to the complex conditions of post-fire forests, achieving superior performance.
Within less than a year after the fire, some areas of Jinyun Mountain exhibited accelerated recovery, which may have been influenced by human intervention.