Search Results (8)

Search Parameters:
Keywords = multiclass road segmentation

25 pages, 115458 KiB  
Article
RSAM-Seg: A SAM-Based Model with Prior Knowledge Integration for Remote Sensing Image Semantic Segmentation
by Jie Zhang, Yunxin Li, Xubing Yang, Rui Jiang and Li Zhang
Remote Sens. 2025, 17(4), 590; https://doi.org/10.3390/rs17040590 - 8 Feb 2025
Viewed by 815
Abstract
High-resolution remote sensing satellites have revolutionized remote sensing research, yet accurately segmenting specific targets from complex satellite imagery remains challenging. While the Segment Anything Model (SAM) has emerged as a promising universal segmentation model, its direct application to remote sensing imagery yields suboptimal results. To address these limitations, we propose RSAM-Seg, a novel deep learning model adapted from SAM and designed specifically for remote sensing applications. Our model incorporates two key components: Adapter-Scale and Adapter-Feature modules. The Adapter-Scale modules, integrated within Vision Transformer (ViT) blocks, enhance model adaptability through learnable transformations, while the Adapter-Feature modules, positioned between ViT blocks, generate image-informed prompts by incorporating task-specific information. Extensive experiments across four binary and two multi-class segmentation scenarios demonstrate the superior performance of RSAM-Seg, which achieves an F1 score of 0.815 in cloud detection, 0.834 in building segmentation, and 0.755 in road extraction, consistently outperforming established architectures such as U-Net, DeepLabV3+, and Segformer. Moreover, RSAM-Seg shows improvements of up to 56.5% in F1 score compared to the original SAM. In addition, RSAM-Seg maintains robust performance in few-shot learning scenarios, achieving an F1 score of 0.656 with only 1% of the training data and increasing to 0.815 with full data availability. Furthermore, RSAM-Seg can detect missing areas within the ground truth of certain datasets, highlighting its potential for completion.
(This article belongs to the Special Issue Advanced AI Technology for Remote Sensing Analysis)
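The abstract describes Adapter-Scale modules inserted inside the ViT blocks and Adapter-Feature modules between them that generate image-informed prompts. The following is only a minimal PyTorch sketch of the general adapter idea (a bottleneck branch added residually inside a transformer block, with an optional additive prompt); all module names, dimensions, and the fusion scheme here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an adapter-augmented ViT block (not the authors' code).
from typing import Optional

import torch
import torch.nn as nn


class AdapterScale(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual add."""

    def __init__(self, dim: int, bottleneck: int = 64, scale: float = 0.5):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale  # assumed scaling factor for the adapter branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.scale * self.up(self.act(self.down(x)))


class AdaptedViTBlock(nn.Module):
    """Pre-norm ViT block with an adapter on the MLP branch and an optional additive prompt."""

    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.adapter = AdapterScale(dim)

    def forward(self, x: torch.Tensor, prompt: Optional[torch.Tensor] = None) -> torch.Tensor:
        if prompt is not None:  # image-informed prompt fused additively (an assumption)
            x = x + prompt
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.adapter(self.mlp(self.norm2(x)))
        return x


tokens = torch.randn(2, 196, 768)        # (batch, patch tokens, embedding dim)
print(AdaptedViTBlock()(tokens).shape)   # torch.Size([2, 196, 768])
```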
Figure 1: Comparison of segmentation results on various scenarios using point and box prompt modes, as well as the segment everything mode. Green squares indicate box prompt positions, green points show point prompt coordinates, and red areas represent predicted segmentation results. In Segment Everything results, each color represents a distinct segmented instance.
Figure 2: The architecture and key components of RSAM-Seg. (A) The overall structure of RSAM-Seg. The encoder's ViT blocks are modified to include the internal Adapter-Scale components and interleaved Adapter-Feature layers for enhanced image information extraction (⊕ denotes feature fusion), while the mask decoder remains unchanged and promptless. Each stage's dimensional changes are marked in the figure, where "B" represents batch size. (B) The structure of the modified transformer block and Adapter-Scale, where "Learned Embedding from (i − 1)-th ViT Block" represents features processed by the previous transformer layer, and "Reshaped Prompts" are dimension-adjusted prompts generated by Adapter-Feature for the ViT block. (C) The structure of Adapter-Feature, which extracts HFC features through Fourier transform for prompt generation.
Figure 3: Visual examples from different datasets depicting various scenes, including clouds, buildings, fields, and roads. Each remote sensing image is shown with its corresponding mask; GT denotes ground truth.
Figure 4: Visual examples from the CSWV and GID datasets, illustrating cloud-like scenes in mountainous regions and diverse land-cover scenarios.
Figure 5: Precision vs. recall across various datasets, showing the performance of different segmentation models (U-Net, DeepLabV3+, Segformer, SAM(center +), SAM(center −), SAM(manual), and RSAM-Seg) in different types of scenarios.
Figure 6: Comparison of cloud segmentation results on the 38-Cloud dataset for RSAM-Seg, SAM, U-Net, DeepLabV3+, and Segformer. For SAM(manual), green points indicate manually annotated positive samples, blue points represent negative samples, and red masks show the segmentation results. For SAM(center), + and − denote the segmentation results where the center points are marked as positive and negative classes, respectively.
Figure 7: Comparison of field segmentation results on the Sentinel-2 dataset with RSAM-Seg, SAM, U-Net, DeepLabV3+, and Segformer. Annotation conventions are the same as in Figure 6.
Figure 8: Comparison of building segmentation results on the Inria dataset with RSAM-Seg, SAM, U-Net, DeepLabV3+, and Segformer. Annotation conventions are the same as in Figure 6.
Figure 9: Comparison of road segmentation results on the DG-Road dataset using RSAM-Seg, SAM, U-Net, DeepLabV3+, and Segformer. Annotation conventions are the same as in Figure 6.
Figure 10: Comparison of cloud and snow segmentation results on the CSWV dataset with the original imagery, ground truth, U-Net, SAM (in the point and segment everything modes), and RSAM-Seg.
Figure 11: Comparison of land-cover multi-class segmentation results on the GID dataset with ground truth, U-Net, SAM, and RSAM-Seg.
Figure 12: Visualization results of the ablation study for RSAM-Seg on the 38-Cloud dataset, where the complete RSAM-Seg model shows the best performance compared to its ablated versions.
Figure 13: Comparison of images, ground truth, HFC features, and RSAM-Seg results across four distinct scenarios of clouds, fields, buildings, and roads, demonstrating segmentation through HFC feature extraction.
Figure 14: Examples of few-shot segmentation results on the 38-Cloud dataset.
Figure 15: Examples of completion results on the DG-Road dataset.
14 pages, 4833 KiB  
Article
Automatic Road Extraction from Historical Maps Using Transformer-Based SegFormers
by Elif Sertel, Can Michael Hucko and Mustafa Erdem Kabadayı
ISPRS Int. J. Geo-Inf. 2024, 13(12), 464; https://doi.org/10.3390/ijgi13120464 - 21 Dec 2024
Viewed by 1403
Abstract
Historical maps are valuable sources of geospatial data for various geography-related applications, providing insightful information about historical land use, transportation infrastructure, and settlements. While transformer-based segmentation methods have been widely applied to image segmentation tasks, they have mostly focused on satellite images. There is a growing need to explore transformer-based approaches for geospatial object extraction from historical maps, given their superior performance over traditional convolutional neural network (CNN)-based architectures. In this research, we aim to automatically extract five different road types from historical maps, using a road dataset digitized from the scanned Deutsche Heereskarte 1:200,000 Türkei (DHK 200 Turkey) maps. We applied variants of the transformer-based SegFormer model and evaluated the effects of different encoders, batch sizes, loss functions, optimizers, and augmentation techniques on road extraction performance. Our best results, with an intersection over union (IoU) of 0.5411 and an F1 score of 0.7017, were achieved using the SegFormer-B2 model, the Adam optimizer, and the focal loss function. All SegFormer-based experiments outperformed previously reported CNN-based segmentation models on the same dataset. In general, increasing the batch size and using larger SegFormer variants (from B0 to B2) improved the accuracy metrics. Additionally, the choice of augmentation techniques significantly influenced the outcomes. Our results demonstrate that SegFormer models substantially enhance true positive predictions and result in higher precision values. These findings suggest that the output weights could be directly applied to transfer learning for similar historical maps and to the inference of additional DHK maps, while offering a promising architecture for future road extraction studies.
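The best configuration reported above pairs SegFormer-B2 with the Adam optimizer and focal loss. As a small, hedged illustration of the loss side of that recipe, the sketch below implements a standard multi-class focal loss in PyTorch; the class count, gamma value, and ignore index are assumptions, and the authors' exact formulation may differ.

```python
# Minimal multi-class focal loss sketch (gamma and class count are assumptions).
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, target: torch.Tensor,
               gamma: float = 2.0, ignore_index: int = 255) -> torch.Tensor:
    """Down-weights easy pixels so rare road classes dominate the gradient.

    logits: (B, C, H, W) raw class scores; target: (B, H, W) integer labels.
    """
    ce = F.cross_entropy(logits, target, reduction="none", ignore_index=ignore_index)
    pt = torch.exp(-ce)                       # model's probability for the true class
    return ((1.0 - pt) ** gamma * ce).mean()  # ignored pixels contribute zero loss


# Toy usage: six labels (background plus five road types, assumed from the abstract).
logits = torch.randn(2, 6, 64, 64, requires_grad=True)
labels = torch.randint(0, 6, (2, 64, 64))
loss = focal_loss(logits, labels)
loss.backward()
print(float(loss))
```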
Figure 1: (a) An example of the DHK 200 Map Sheet D-V for İzmit; (b) part of the legend focusing on road types; (c) selected road type explanations with our English translations (Source: https://digitalarchive.mcmaster.ca/, accessed on 17 December 2024).
Figure 2: Predictions on the test set of the DHK 200 Turkey map. (a) Input image. (b) Ground truth. (c) Experiment 1 (B0-Adam-Dice-Old). (d) Experiment 6 (B0-AdamW-Focal-New). (e) Experiment 8 (B0-Adam-Focal-New). (f) Experiment 10 (B1-AdamW-Focal-New). (g) Experiment 11 (B1-Adam-Focal-New). (h) Experiment 13 (B1-AdamW-Focal-New). (i) Experiment 14 (B1-Adam-Focal-New). (j) Experiment 15 (U-Net++-TIMM-Adam-Dice-Old) [11].
Figure 3: Class-wise F1 scores of the best five experiments.
Figure 4: Class-wise IoUs of the best five experiments.
16 pages, 11666 KiB  
Article
A Lightweight Building Extraction Approach for Contour Recovery in Complex Urban Environments
by Jiaxin He, Yong Cheng, Wei Wang, Zhoupeng Ren, Ce Zhang and Wenjie Zhang
Remote Sens. 2024, 16(5), 740; https://doi.org/10.3390/rs16050740 - 20 Feb 2024
Cited by 3 | Viewed by 1614
Abstract
High-spatial-resolution maps of urban buildings play a crucial role in urban planning, emergency response, and disaster management. However, challenges such as missing building contours due to occlusion (occlusion between buildings of different heights and buildings obscured by trees), uneven contour extraction due to the mixing of building edges with other feature elements (roads, vehicles, and trees), and slow training on high-resolution image data hinder efficient and accurate building extraction. To address these issues, we propose a semantic segmentation model composed of a lightweight backbone, a coordinate attention module, and a pooling fusion module, which achieves lightweight building extraction and adaptive recovery of spatial contours. Comparative experiments were conducted on a dataset featuring typical urban building instances in China and on the Mapchallenge dataset, comparing our method with several classical and mainstream semantic segmentation algorithms. The results demonstrate the effectiveness of our approach, which achieves excellent mean intersection over union (mIoU) and frames per second (FPS) scores on both datasets (China dataset: 85.11% and 110.67 FPS; Mapchallenge dataset: 90.27% and 117.68 FPS). Quantitative evaluations indicate that our model not only significantly improves computational speed but also ensures high accuracy in the extraction of urban buildings from high-resolution imagery. Specifically, on a typical urban building dataset from China, our model shows an accuracy improvement of 0.64% and a speed increase of 70.03 FPS compared to the baseline model. On the Mapchallenge dataset, our model achieves an accuracy improvement of 0.54% and a speed increase of 42.39 FPS compared to the baseline model. Our research indicates that lightweight networks show significant potential in urban building extraction tasks. In the future, segmentation accuracy and prediction speed can be further balanced by adjusting the deep learning model or introducing remote sensing indices, which can be applied to research scenarios such as greenfield extraction or multi-class target extraction.
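The model described above combines a lightweight backbone with a coordinate attention module and a pooling fusion module. The sketch below shows one common way a coordinate-attention-style block can be written in PyTorch (pooling along height and width separately, then re-weighting the feature map); the channel counts and reduction ratio are illustrative assumptions, not the authors' exact module.

```python
# Illustrative coordinate-attention-style block (channel sizes are assumptions).
import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    """Pools along height and width separately, then re-weights the feature map."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xh = self.pool_h(x)                            # direction-aware descriptors
        xw = self.pool_w(x).permute(0, 1, 3, 2)
        y = self.act(self.bn(self.conv1(torch.cat([xh, xw], dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                      # (B, C, H, 1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * ah * aw                                       # re-weighted features


feat = torch.randn(2, 64, 128, 128)
print(CoordinateAttention(64)(feat).shape)  # torch.Size([2, 64, 128, 128])
```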
Figure 1: Improved network structure.
Figure 2: Coordinate attention module.
Figure 3: SP-ASPP module.
Figure 4: Examples of typical urban buildings in China.
Figure 5: Mapchallenge building dataset.
Figure 6: Model visualization results on a dataset of buildings in typical Chinese cities.
Figure 7: Comparative experimental results for the Mapchallenge building dataset.
Figure 8: Comparative experimental results of the Typical Urban Buildings in China dataset.
12 pages, 10744 KiB  
Communication
Application of C-LSTM Networks to Automatic Labeling of Vehicle Dynamic Response Data for Bridges
by Ryota Shin, Yukihiko Okada and Kyosuke Yamamoto
Sensors 2022, 22(9), 3486; https://doi.org/10.3390/s22093486 - 3 May 2022
Cited by 6 | Viewed by 2589
Abstract
Maintaining the bridges that support road infrastructure is critical to the economy and human life. Structural health monitoring of bridges using vibration includes direct monitoring and drive-by monitoring. Drive-by monitoring uses a vehicle equipped with accelerometers to drive over bridges and estimates the bridge's health from the vehicle vibration obtained. In this study, we attempt to identify the on-bridge driving segments in vehicle vibration data for the practical application of drive-by monitoring. We developed an in-vehicle sensor system that can measure three-dimensional behavior, and we propose a new problem of identifying the driving segment of vehicle vibration on a bridge from data measured in a field experiment. The "on a bridge" label was assigned based on the peaks in the vehicle vibration when running over joints. A supervised binary classification model using C-LSTM (Convolutional Long Short-Term Memory) networks was constructed and applied to the measured data, achieving high accuracy. The remaining challenge is to build a model that can be applied to bridges without joints; therefore, future work is needed to derive an on-bridge label from bridge vibration and to extend the model to a multi-class model.
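To make the C-LSTM idea concrete, here is a minimal PyTorch sketch of a convolution-plus-LSTM binary classifier over fixed-length acceleration windows; the window length, channel count (three axes), and layer sizes are assumptions, not the configuration used in the paper.

```python
# Minimal C-LSTM sketch for "on a bridge" window classification (sizes are assumptions).
import torch
import torch.nn as nn


class CLSTMClassifier(nn.Module):
    """1-D convolutions extract local vibration patterns; an LSTM models their order."""

    def __init__(self, channels: int = 3, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, samples), e.g. windows of three-axis acceleration
        f = self.conv(x).transpose(1, 2)       # (batch, time, features)
        _, (h_n, _) = self.lstm(f)             # last hidden state summarizes the window
        return self.head(h_n[-1]).squeeze(-1)  # raw logit; apply sigmoid for a probability


windows = torch.randn(8, 3, 512)               # 8 windows of 512 samples (assumed shape)
print(CLSTMClassifier()(windows).shape)        # torch.Size([8])
```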
Figure 1: Vehicle and accelerometer installation points.
Figure 2: Vehicle measurement sensor system used for measurement.
Figure 3: The GPS receiver installed on the bridge.
Figure 4: The measured vehicle vibration and position data.
Figure 5: Photo of Bridge A: (a) rubber joints for bridges; (b) general view.
Figure 6: Data measured on the left side of the front wheels (unsprung weight) when running over Bridge A.
Figure 7: Data measured when driving on Bridge A, with bridge labels.
18 pages, 3934 KiB  
Article
Multi-Class Strategies for Joint Building Footprint and Road Detection in Remote Sensing
by Christian Ayala, Carlos Aranda and Mikel Galar
Appl. Sci. 2021, 11(18), 8340; https://doi.org/10.3390/app11188340 - 8 Sep 2021
Cited by 4 | Viewed by 2960
Abstract
Building footprints and road networks are important inputs for a great number of services. For instance, building maps are useful for urban planning, whereas road maps are essential for disaster response services. Traditionally, building and road maps are manually generated by remote sensing experts or land surveying, occasionally assisted by semi-automatic tools. In the last decade, deep learning-based approaches have demonstrated their capabilities to extract these elements automatically and accurately from remote sensing imagery. The building footprint and road network detection problem can be considered a multi-class semantic segmentation task, that is, a single model performs a pixel-wise classification on multiple classes, optimizing the overall performance. However, depending on the spatial resolution of the imagery used, both classes may coexist within the same pixel, drastically reducing their separability. In this regard, binary decomposition techniques, which have been widely studied in the machine learning literature, have proved useful for addressing multi-class problems. Accordingly, the multi-class problem can be split into multiple binary semantic segmentation sub-problems, specializing different models for each class. Nevertheless, in these cases, an aggregation step is required to obtain the final output labels. Additionally, other novel approaches, such as multi-task learning, may come in handy to further increase the performance of the binary semantic segmentation models. Since there is no certainty as to which strategy should be adopted to accurately tackle a multi-class remote sensing semantic segmentation problem, this paper performs an in-depth study to shed light on the issue. For this purpose, open-access Sentinel-1 and Sentinel-2 imagery (at 10 m) is considered for extracting buildings and roads, making use of the well-known U-Net convolutional neural network. It is worth stressing that building and road classes may coexist within the same pixel when working at such a low spatial resolution, posing a challenging problem. Accordingly, a robust experimental study is developed to assess the benefits of the decomposition strategies and their combination with a multi-task learning scheme. The obtained results demonstrate that decomposing the considered multi-class remote sensing semantic segmentation problem into multiple binary ones using a One-vs.-All binary decomposition technique leads to better results than the standard direct multi-class approach. Additionally, the benefits of using a multi-task learning scheme to push the performance of binary segmentation models are also shown.
(This article belongs to the Special Issue Computer Vision in the Era of Deep Learning)
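A key point of the abstract is that binary decomposition needs an aggregation step to recover a single label map. The sketch below shows one simple aggregation rule for a One-vs.-All setup with separate building and road models (take the strongest positive class per pixel, fall back to background below a threshold); the threshold and class encoding are assumptions, not necessarily the aggregation used in the paper.

```python
# Hypothetical aggregation of One-vs.-All binary segmentation outputs.
import torch


def aggregate_ova(building_prob: torch.Tensor, road_prob: torch.Tensor,
                  threshold: float = 0.5) -> torch.Tensor:
    """Combine per-class probabilities from independent binary models into one label map.

    Inputs are (H, W) probabilities; output labels: 0 = background, 1 = building, 2 = road.
    """
    stacked = torch.stack([building_prob, road_prob])    # (2, H, W)
    best_prob, best_cls = stacked.max(dim=0)             # strongest positive class per pixel
    return torch.where(best_prob >= threshold, best_cls + 1, torch.zeros_like(best_cls))


# Toy usage with random stand-ins for the two binary model outputs.
building = torch.rand(4, 4)
road = torch.rand(4, 4)
print(aggregate_ova(building, road))
```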
Figure 1: U-Net architecture [40] adapted to a multi-class semantic segmentation problem, including building footprints and road networks. Note that the final 1×1 convolutional layer performs a classification into 3 classes since the background is also considered.
Figure 2: OVO decomposition scheme adapted to a multi-class semantic segmentation problem, including building footprints and road networks.
Figure 3: OVA decomposition scheme adapted to a multi-class semantic segmentation problem, including building footprints and road networks.
Figure 4: Multi-task architectures depending on how the parameters of the hidden layers are shared.
Figure 5: Dataset generation pipeline for a generic area of interest.
Figure 6: Geographical distribution of the dataset (green: training set; red: testing set).
Figure 7: Qualitative results for the benefits of using decomposition strategies. Visual comparison of the results obtained with the direct multi-class approach vs. OVA and OVO binary decomposition strategies for a zone taken from Pamplona city in the test set. TP are presented in green, FP in blue, FN in red and TN in white.
Figure 8: Qualitative results for the benefits of using decomposition strategies. Visual comparison of the results obtained with the direct multi-class approach vs. OVA and OVO binary decomposition strategies for a zone taken from Barcelona North city in the test set. TP are presented in green, FP in blue, FN in red and TN in white.
Figure 9: Qualitative results for the benefits of using a multi-task learning scheme. Visual comparison of the results obtained with the OVA binary decomposition strategy and its multi-task variant, for a zone taken from Pamplona city in the test set. TP are presented in green, FP in blue, FN in red and TN in white.
Figure 10: Qualitative results for the benefits of using a multi-task learning scheme. Visual comparison of the results obtained with the OVA binary decomposition strategy and its multi-task variant, for a zone taken from Barcelona North city in the test set. TP are presented in green, FP in blue, FN in red and TN in white.
15 pages, 5623 KiB  
Article
Automatic Road Extraction from Historical Maps Using Deep Learning Techniques: A Regional Case Study of Turkey in a German World War II Map
by Burak Ekim, Elif Sertel and M. Erdem Kabadayı
ISPRS Int. J. Geo-Inf. 2021, 10(8), 492; https://doi.org/10.3390/ijgi10080492 - 21 Jul 2021
Cited by 30 | Viewed by 5891
Abstract
Scanned historical maps are available from different sources in various scales and contents. Automatic geographical feature extraction from these historical maps is an essential task for deriving valuable spatial information on the characteristics and distribution of transportation infrastructures and settlements and for conducting quantitative and geometrical analysis. In this research, we used the Deutsche Heereskarte 1:200,000 Türkei (DHK 200 Turkey) maps as the base geoinformation source to reconstruct the past transportation networks using a deep learning approach. Five different road types were digitized and labeled to be used as inputs for the proposed deep learning-based segmentation approach. We adapted the U-Net++ and ResNeXt50_32×4d architectures to produce multi-class segmentation masks and perform feature extraction to determine the various road types accurately. We achieved remarkable results, with 98.73% overall accuracy, a 41.99% intersection over union, and a 46.61% F1 score. The proposed method can be applied to DHK maps of other countries to automatically extract different road types and can be used for transfer learning on other historical maps.
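The reported 98.73% overall accuracy alongside 41.99% IoU and 46.61% F1 illustrates how strongly class imbalance separates these metrics in multi-class road extraction. As a small aid, the sketch below computes per-class IoU and F1 from a confusion matrix with NumPy; the six-label setup (background plus five road types) is an assumption drawn from the abstract.

```python
# Per-class IoU and F1 from a confusion matrix (six labels assumed from the abstract).
import numpy as np


def per_class_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    """Return per-class IoU and F1 for flattened prediction / ground-truth label maps."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)  # rows = truth, columns = prediction
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return iou, f1


# Toy usage with random label maps; real inputs would be predicted and digitized road masks.
rng = np.random.default_rng(0)
gt = rng.integers(0, 6, size=(256, 256))
pred = rng.integers(0, 6, size=(256, 256))
iou, f1 = per_class_metrics(pred, gt, num_classes=6)
print(np.round(iou, 3), np.round(f1, 3))
```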
Figure 1: DHK 200 Turkey, 3rd special edition coverage of versions 6612, 6606, and 6613 from 1943.
Figure 2: The 27 DHK 200 Turkey images or sheets used in the study.
Figure 3: An example legend illustrating the road types in a DHK 200 Turkey map.
Figure 4: Dataset samples of historical maps and corresponding ground truth masks. (a,c,e,g) are examples of original historical map patches; (b,d,f,h) are respective ground truth masks.
Figure 5: Color representation of road types used in the classification scheme.
Figure 6: Road type distribution of the ground truth mask.
Figure 7: Training, validation, and test set road type distributions of the ground truth mask.
Figure 8: Overall road type classification workflow. Given an input image, the U-Net++ architecture creates predicted road types.
Figure 9: Qualitative assessment results. From left to right: input patch, ground truth mask, and road type prediction produced by the DNN model.
Figure 10: Class-wise F1, precision, recall, and IoU score comparison.
Figure 11: Normalized confusion matrix.
17 pages, 5020 KiB  
Article
An Innovative Intelligent System with Integrated CNN and SVM: Considering Various Crops through Hyperspectral Image Data
by Shiuan Wan, Mei-Ling Yeh and Hong-Lin Ma
ISPRS Int. J. Geo-Inf. 2021, 10(4), 242; https://doi.org/10.3390/ijgi10040242 - 7 Apr 2021
Cited by 16 | Viewed by 3186
Abstract
Generation of a thematic map is important for scientists and agriculture engineers analyzing different crops in a given field. Remote sensing data are well accepted for image classification over vast areas of crop investigation. However, most current research has focused on the classification of pixel-based image data. This study was carried out to develop a multi-category crop hyperspectral image classification system to identify the major crops in the Chiayi Golden Corridor. Hyperspectral image data from CASI (Compact Airborne Spectrographic Imager) were used as the experimental data. A two-stage classification was designed to evaluate the performance of the image classification. More specifically, the study used a multi-class classification combining a support vector machine (SVM) and a convolutional neural network (CNN) for image classification analysis. SVM is a supervised learning model that analyzes data used for classification. CNN is a class of deep neural networks applied to analyzing visual imagery. The image classification comparison was made among four crops (paddy rice, potatoes, cabbages, and peanuts), roads, and structures. In the first stage, the support vector machine handled the hyperspectral image classification through pixel-based analysis. Then, the convolutional neural network improved the classification of image details through various blocks (cells) of segmentation in the second stage. A series of discussions and analyses of the results is presented. A repair module was also designed to link the CNN and SVM stages and remove classification errors.
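To illustrate the first, pixel-based stage of the two-stage scheme, the sketch below trains a multi-class RBF SVM on synthetic stand-in spectra with scikit-learn; the band count, class labels, and hyperparameters are assumptions, and the real pipeline would feed CASI pixel spectra and pass the result to the cell-based CNN stage.

```python
# Toy first-stage pixel classifier with an RBF SVM (synthetic data, assumed hyperparameters).
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in for hyperspectral pixels: rows = pixels, columns = spectral bands (72 is assumed).
# The six labels mirror the classes named in the abstract (four crops, roads, structures).
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 72))
y = rng.integers(0, 6, size=3000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
svm = SVC(kernel="rbf", C=10.0, gamma="scale")  # pixel-wise multi-class SVM (first stage)
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))
# A second stage would regroup neighbouring pixels into cells and let a small CNN
# re-label each cell, repairing isolated pixel-level errors.
```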
Figure 1: Study area. (a) Hyperspectral image (April 2016). (b) Ground truth data.
Figure 2: The convolutional neural network (CNN) model for prediction.
Figure 3: Research steps of the study.
Figure 4: The progress of the Cell program derived from the regional object classification (ROC) program. The line represents the margin of the linear regression function fitted to the coordinates of the blue pixels: (a) the ROC program finds seeds of area and similarity; (b) removal of the tiny parts to be integrated.
Figure 5: Thematic map of the first stage (SVM).
Figure 6: Resized window of the CNN selection by the Cell program and the prepared inputs for the CNN.
Figure 7: Accuracy of feature selection with PCA 8, epoch 30.
Figure 8: Accuracy of feature selection with PCA 24, epoch 30.
37 pages, 41216 KiB  
Article
Pixel-Wise Classification of High-Resolution Ground-Based Urban Hyperspectral Images with Convolutional Neural Networks
by Farid Qamar and Gregory Dobler
Remote Sens. 2020, 12(16), 2540; https://doi.org/10.3390/rs12162540 - 7 Aug 2020
Cited by 14 | Viewed by 5180
Abstract
Using ground-based, remote hyperspectral images from 0.4–1.0 micron in ∼850 spectral channels—acquired with the Urban Observatory facility in New York City—we evaluate the use of one-dimensional Convolutional Neural Networks (CNNs) for pixel-level classification and segmentation of built and natural materials in urban environments. We find that a multi-class model trained on hand-labeled pixels containing Sky, Clouds, Vegetation, Water, Building facades, Windows, Roads, Cars, and Metal structures yields an accuracy of 90–97% for three different scenes. We assess the transferability of this model by training on one scene and testing on another with significantly different illumination conditions and/or different content. This results in a significant (∼45%) decrease in the model precision and recall, as does training on all scenes at once and testing on the individual scenes. These results suggest that while CNNs are powerful tools for pixel-level classification of very high-resolution spectral data of urban environments, retraining between scenes may be necessary. Furthermore, we test the dependence of the model on several instrument- and data-specific parameters, including reduced spectral resolution (down to 15 spectral channels) and the number of available training instances. The results are strongly class-dependent; however, we find that the classification of natural materials is particularly robust, especially the Vegetation class, with a precision and recall >94% for all scenes and model transfers and >90% with only a single training instance.
(This article belongs to the Special Issue Feature Extraction and Data Classification in Hyperspectral Imaging)
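The abstract above and the figure captions below describe one-dimensional CNNs over ~848-channel pixel spectra, nine material classes, and a best-performing kernel size of 50. The sketch below is a minimal PyTorch rendition of that kind of per-pixel spectral classifier; the layer widths and pooling choice are assumptions rather than the authors' Model 1 or Model 2.

```python
# Minimal per-pixel spectral classifier sketch (layer widths are assumptions).
import torch
import torch.nn as nn


class SpectralCNN(nn.Module):
    """1-D CNN that classifies a single pixel from its 848-channel spectrum."""

    def __init__(self, n_classes: int = 9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=50, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=50, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, spectra: torch.Tensor) -> torch.Tensor:
        # spectra: (batch, n_channels) standardized pixel spectra
        x = self.features(spectra.unsqueeze(1))  # add a single input channel dimension
        return self.classifier(x.flatten(1))     # (batch, n_classes) class logits


batch = torch.randn(16, 848)                     # 16 pixel spectra with 848 channels
print(SpectralCNN()(batch).shape)                # torch.Size([16, 9])
```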
Figure 1: (top left) Map of New York City showing the location of the Urban Observatory (green dot), coverage of the south-facing scenes (red triangle) and north-facing scene (blue triangle). (bottom left) A graphic representation of our hyperspectral datacube, where each spatial pixel has 848 wavelength channels that can be used to capture a full spectrum as seen in the two examples provided. (right) RGB (610 nm, 540 nm, and 475 nm) representations of the scenes imaged by the Urban Observatory's hyperspectral imaging systems. Scenes 1-a and 1-b (top and middle) are south-facing pointings covering Downtown and North Brooklyn at 2 pm and 6 pm, respectively. Scene 2 (bottom) is a north-facing pointing acquired with a different instrument covering northern Brooklyn and Manhattan. The scenes include a broad range of material compositions including Vegetation, Water, Concrete, Glass, Sky, Clouds, etc.
Figure 2: A false-color image of Scene 1-a from Figure 1 for which the color mappings of the RGB channels are chosen to differentiate several natural and human-built materials in the scene. In particular, the blue channel is mapped to ∼740 nm, a wavelength for which vegetation has a strong peak, and the vegetation pixels appear predominantly blue in this false-color representation. Built structures with peaks at other wavelengths that are mapped to the red and green channels are differentiated by the red and yellow tints of buildings in the scene.
Figure 3: (left) Locations of manually classified pixels superimposed over a grayscale image of the scenes in Figure 1. (center) The color code and number of pixels for each class. Manually labeled pixel locations were chosen to be broadly representative of the spatial distribution of each class in the scene. (right) Mean of the standardized pixel spectra for each class of manually classified pixels in each scene (with a constant offset added to each spectrum for visual clarity).
Figure 4: Architecture of CNN Model 1 (top) and Model 2 (bottom) used for the classification and segmentation of the hyperspectral data shown in Figure 1 and trained on the hand-labeled examples shown in Figure 3. The two models differ only in the size and number of convolutional filters.
Figure 5: Accuracy of training and testing instances (left), and weighted average precision, recall, and F1 measure (right) of testing instances from the CNN model trained and tested on Scene 1-a with filters of different kernel sizes in the two convolutional layers. Maximum performance is obtained from kernel size 50, indicating that, in general, it is the spectral features of size ∼35 nm that optimally discriminate between classes in urban scenes.
Figure 6: Image segmentation results for Model 1 and Model 2 (see Figure 4) for each of the three scenes in Figure 1. For each model and each scene, the training/testing set consisted of the manually labeled pixels in Figure 3, with Model 2 generating a characteristically larger mean F1 score for a given scene. In addition, qualitative inspection of unlabeled pixels shows that Model 2 outperforms Model 1 on the segmentation task for each scene. These results are quantified in Table 1.
Figure 7: Image segmentation results from CNN Model 2 for each of the three scenes in Figure 1, both with (left) and without (right) spatial information included and used to classify pixels in each scene. Model 2 with spatial information included marginally outperformed the variant without when segmenting each scene. The per-class results are quantified in Table 2.
Figure 8: Image segmentation results from Model 2 trained and tested on Scene 1-a (top left), Scene 1-b (middle left), and Scene 2 (bottom left), compared to training Model 2 on all scenes at once and testing on each individually (right). The model architectures and hyperparameters are shown in Figure 4. Qualitative inspection shows the models trained on each image separately marginally outperformed the single model trained on all images at once. These results are quantified in Table 3 on a by-class basis.
Figure 9: Image segmentation results from Models 1 (left) and 2 (right) trained on Scene 1-a (top) and applied to Scene 1-b (middle) and Scene 2 (bottom). The model architectures and hyperparameters are shown in Figure 4 (although spatial information is not used when transferring models between scenes). Qualitative inspection shows that Models 1 and 2 performed comparably to one another when transferred, though transferring either model results in a significant reduction in performance relative to training on each image separately. However, Vegetation is the only class of pixels that was correctly and accurately classified by both models in all transfer testing scenarios. These results are quantified in Table 4.
Figure 10: Accuracy and the weighted average of the F1 measure, precision, and recall (left to right, respectively) of CNN Model 2 trained on labeled data for each scene with varying spectral resolutions. The performance shows a decreasing trend with decreasing spectral resolution in all three scenes, with the model trained and tested on Scene 1-a outperforming Scenes 1-b and 2 at all spectral resolutions.
Figure 11: Image segmentation results from Model 2 trained and tested on Scene 1-a (top), Scene 1-b (center), and Scene 2 (bottom), with the HSIs at the full spectral resolution of 848 channels (left) as opposed to those at the minimum spectral resolution of 15 channels (right). Qualitative inspection shows that the model with full spectral resolution outperformed that with reduced spectral resolution in all three scenes. The difference in performance is more pronounced in Scenes 1-b and 2 than in Scene 1-a. Identification of Vegetation pixels stands out as the only class in all images to be unaffected by the decrease in spectral resolution.
Figure 12: Per-class precision (blue), recall (yellow), and F1 measure (red) of CNN Model 2 trained and tested on the hand-labeled spectra from Scene 1-a with different pixel spectral resolutions. In general, reducing the spectral resolution results in decreased performance metrics for human-built classes, while all metrics for the classification of natural material classes are relatively unaffected by the change in spectral resolution, highlighting the uniqueness of their spectral features even at low resolution.
Figure 13: Overall training and testing accuracy (left) and weighted mean precision, recall, and F1 score (right) of CNN Model 2 trained on a variable number of hand-labeled examples from Scene 1-a and tested on the testing set of the same scene. As the training set decreases in size, the overall accuracy, mean precision, and mean recall decrease by as much as ∼40%.
Figure 14: The same as Figure 13 but for per-class performance. Clouds and Vegetation pixels maintain a high precision and recall regardless of the number of training instances used, while the remaining classes show reduced precision, recall, or both as the number of training pixels is reduced.
Figure 15: The spectra of each class of pixels in Scene 1-a as classified by Model 2, together with the mean spectra of the training and testing instances used in creating the model. Both Sky and Clouds pixels contain significant saturation for the sample from 450–650 nm, while Vegetation pixels show peaks at ∼550 and ∼725 nm, consistent with enhanced chlorophyll reflectivity at those wavelengths. The remaining classes show very subtle differences in spectral shape.
Figure A1: Confusion matrices used to derive the metrics in Table 1 for the classification of hand-labeled pixel spectra using Model 1 (left) and Model 2 (right) trained and tested on each scene separately.
Figure A2: Confusion matrices used to derive the metrics in Table 2 for the classification of hand-labeled pixel spectra using Model 2 with (left) and without (right) the spatial features included for each scene separately.
Figure A3: Confusion matrices used to derive the metrics in Table 3 for the classification of hand-labeled pixel spectra using Model 2 trained and tested on each scene separately (left) as opposed to a single model trained on all scenes simultaneously (right).
Figure A4: Confusion matrices used to derive the metrics in Table 4 for the classification of hand-labeled pixel spectra using Model 1 (left) and Model 2 (right) trained on Scene 1-a and transferred to Scenes 1-b and 2.
Figure A5: (left) Segmentation maps of Scene 1-a (top), Scene 1-b (middle), and Scene 2 (bottom) using GBDT. (center) Feature importance of the GBDT model on each scene. (right) Confusion matrices from the testing sets of each scene.
Figure A6: (left) Segmentation maps from applying PCA to Scene 1-a (top), Scene 1-b (middle), and Scene 2 (bottom), then using SVM to classify the pixels. (right) Confusion matrices from the testing sets of each scene.