1. Introduction
Hyperspectral imagery (HSI) takes the form of a data cube with two spatial dimensions and one spectral dimension. Unlike RGB images, which contain only three bands, each pixel in HSI carries rich spectral information, typically spanning hundreds of contiguous bands. Generally, the reflection spectrum of a material is called its spectral signature, which can be used to distinguish that material from others. Thus, HSI is widely exploited in various fields, such as agricultural management, resource exploration, and military detection. Research on HSI usually includes dimensionality reduction [1], target detection [2,3] and classification [4], among which hyperspectral target detection (HTD) is a significant part. HTD can be divided into object-level and pixel-level detection. Object-level detection [5] detects a specific target object as a whole, and existing methods tend to extract spatial–spectral features and apply RGB object-detection algorithms. Pixel-level detection, however, is more mainstream. It is usually regarded as a binary classification that determines whether each pixel belongs to the target or the background [6]. More precisely, unlike a generic binary classifier, the detector used to perform HTD finds an appropriate threshold, which corresponds to a boundary between the class of target absence and the class of target presence [4]. Generally, detectors fall into three categories according to whether background information and target spectral information are known. In the first category, neither is available; for example, RX [7], which merely detects anomalies with limited performance. In the second category, the target information is known while the background information is unknown, as in CEM [8]. In the third category, such as OSP [9], both the target and background information are known.
In general, the spectral signature can directly reflect the type of substance; there would seem to be no need to learn and classify it with a network. However, when the target is submerged, the situation changes: the signature of the target to be detected is distorted by interference from the water background. Compared with land-based HTD, less research has been conducted on underwater target detection (UTD) from HSI. Directly applying land-based methods to UTD is challenging, mainly because the observed target spectra, i.e., the water-leaving radiance, depend strongly on the inherent optical properties (IOPs) of the water and the depth of the target, making it difficult to distinguish the target from the background. In response, research [6,10,11] may provide partial solutions. According to the implementation scene, these solutions fall into one of two categories.
In the first category, which we shall call the underwater scene, a spectral imager is attached to an autonomous underwater vehicle, which dives underwater to acquire HSI directly. A rapid target detection technique employing band selection (BS) was proposed in [10]. It first uses a constrained target optimal refraction factor band selection (CTOIFBS) method to choose bands with high information content and low band correlation. Then, an underwater spectral imaging system incorporating the optimal subset of wavebands is built for underwater target image capture. Finally, the CEM algorithm is employed to detect the desired underwater targets. This approach overcomes the problems of a complex background and a poor imaging environment. However, the expenses associated with underwater vehicle photography are considerable, the range of movement is limited, and BS discards a great deal of spectral information in exchange for speed.
In the second category, denoted the satellite remote sensing (SRS) scene, the pipeline generally starts with IOP estimation, owing to the water body's influence on the off-water irradiance. IOPs are calculated using mathematical optimization methods (such as SOMA [12]); a target space is then constructed based on a bathymetric model to compensate for the missing depth information; and finally, detection is carried out using a manifold learning method [11]. However, manifold learning has the drawback of being time-consuming. Deep learning has also been applied to this issue. A network named IOPE-Net [13], in the form of an unsupervised autoencoder, was proposed to estimate IOPs and achieved good performance. In light of this, a self-improving framework [6] based on deep learning and EM theory was proposed. Firstly, anomalies are detected by a joint anomaly detector with a high threshold to obtain a guide training set, which is used to train a depth estimation network. Then, shallower pixels are screened as targets and used to train a detection network. After that, the guide training set is updated based on the detection results. The two networks are alternately iterated to boost the final detection performance. Although this framework shows a large improvement in detection performance compared with several land-based methods, it requires multiple iterations, so it remains time-consuming, and the initial guide training set often leads to few-shot problems in practice due to its small size.
To facilitate the experiment, this paper focuses on unmanned aerial vehicle remote sensing (UAVRS) scenarios, which offer low cost, broad applicability, high spatial resolution, and good real-time performance. In both UAVRS and SRS scenarios, underwater targets are detected from the air, so the proposed method can also be extended to SRS scenarios.
Based on the above scenarios, we propose a new framework for underwater target detection that requires neither repeated iterations nor time-consuming mathematical optimization. After training a model with the a priori on-land information of the target and the HSI of the specific water area, detection can be performed quickly. Deep learning methods are used throughout the pipeline. There is no publicly accessible HSI dataset containing underwater targets, and the received spectra are strongly influenced by the water body, which varies from one body of water to another, so it would be difficult to generalize a model even if such a dataset were available. Therefore, we used synthetic data to train the model for each specific water area. To solve the adaptation problem from the source domain to the target domain, we adopted domain randomization and domain adaptation strategies. Meanwhile, a spatial–spectral process was used to exploit spatial information.
Generally speaking, the proposed framework consists of three main phases: data synthesis, domain selection, and target detection. Firstly, we used a specific network to estimate the water quality parameters, combined them with the target information to plot the curve of spectral distance against depth, and found the maximum depth beyond which the spectral distance shows almost no change. Then, we selected certain target pixels, water background pixels, and depths to synthesize the underwater target observation spectra based on the bathymetric model and divided them into different domains according to depth intervals. Next, we used the synthetic data and appropriate water background data to train the target detection network and the domain prediction network. After that, we fine-tuned the detection network's fully connected layers, i.e., the domain adaptive sub-network, for each domain. Finally, the trained model can be used directly for detection.
Aiming to detect underwater targets quickly, our main idea is to train on synthetic data and apply the resulting models directly to real-world tasks. The closer the real-world data are to a synthetic domain, the better the detection results produced by that domain's model.
In summary, the major contributions of our work are listed as follows:
We propose a novel framework named TUTDF to address the underwater detection problem, based on deep learning and synthetic data. To the best of our knowledge, this is the first time that models trained on synthetic data have been directly applied to underwater target detection tasks in real environments.
We propose a synthetic data approach based on depth interval partitioning to solve the transfer problem from the synthetic-data source domains to the real-data target domain. We also develop domain prediction and domain adaptive networks to select the corresponding source domain model, yielding more accurate detection results.
Due to the lack of hyperspectral underwater target datasets, we conducted extensive experiments to obtain a realistic underwater target dataset. As far as we know, this is the first hyperspectral underwater target dataset with accurate depth information.
The remainder of this paper is structured as follows. Section 2 first introduces the relevant model, followed by a detailed description of the experimental procedure and some theory, before presenting the proposed method in more detail. The experimental results are shown in Section 3. Some of the components are discussed in Section 4. In Section 5, we summarize the entire work and draw a comprehensive conclusion.
2. Materials and Methods
In this section, a brief review of the bathymetric model of ocean hyperspectral data is given, providing the fundamental background of the field. The experimental implementation is then presented, describing the data acquisition and processing in more detail. Next, some theory is introduced, and our proposed method is described in detail. Finally, the evaluation criteria, data partitioning, and other experimental details are described.
2.1. Bathymetric Model
Generally, the observed spectrum for a submerged target received by a sensor can be described by a specific equation called the bathymetric model [14], formulated as Equation (1). The process described by the formula is illustrated in Figure 1. Sunlight enters the water body and is reflected by the water column or the underwater targets. The reflected rays are transmitted through the water column and received by the sensor. In addition to atmospheric attenuation, light is strongly attenuated in the water as depth increases. The water surface may also weakly polarize the reflected sunlight. For convenience, as other researchers have done, we ignored the effects of the water–air interface, which can be represented by the empirical relation of Lee et al. [15]. Therefore, the target spectrum received by the sensor is affected by the following: the upper surface of the target, water quality, depth, and sunlight intensity and angle.
$$
r(\lambda) = r_{\infty}(\lambda)\left[1 - e^{-\left(K_d(\lambda)+K_u^{C}(\lambda)\right)H}\right] + \frac{r_B(\lambda)}{\pi}\, e^{-\left(K_d(\lambda)+K_u^{B}(\lambda)\right)H} \tag{1}
$$

where $\lambda$ is the wavelength, $r(\lambda)$ is the off-water reflectance, $r_{\infty}(\lambda)$ is the reflectance of the "optically deep water" [16], $r_B(\lambda)$ is the reflectance of the target to be detected, available as a priori information, $K_d(\lambda)$ is the downward attenuation coefficient of the light in the water, $K_u^{C}(\lambda)$ and $K_u^{B}(\lambda)$ are the upward attenuation coefficients of the water column and target, respectively, and $H$ is the depth of the underwater target.
Further, $K_d(\lambda)$, $K_u^{C}(\lambda)$, and $K_u^{B}(\lambda)$ are determined by the IOPs of the water: the absorption coefficient $a(\lambda)$ and the backscattering coefficient $b_b(\lambda)$. In addition, in remote sensing, $K_d(\lambda)$ is also related to the solar zenith angle $\theta$. Water property estimation commonly refers to estimating $a(\lambda)$ and $b_b(\lambda)$ from the hyperspectral image of the water surface. In this paper, we used IOPE-Net, an unsupervised water property estimation network in the form of an autoencoder, to estimate $a(\lambda)$ and $b_b(\lambda)$. The relationships between the above parameters are as follows:

$$ u(\lambda) = \frac{b_b(\lambda)}{a(\lambda)+b_b(\lambda)}, \qquad \kappa(\lambda) = a(\lambda)+b_b(\lambda) \tag{2} $$
$$ K_d(\lambda) = \frac{\kappa(\lambda)}{\cos\theta} \tag{3} $$
$$ K_u^{C}(\lambda) = 1.03\,\kappa(\lambda)\left(1+2.4\,u(\lambda)\right)^{0.5} \tag{4} $$
$$ K_u^{B}(\lambda) = 1.04\,\kappa(\lambda)\left(1+5.4\,u(\lambda)\right)^{0.5} \tag{5} $$
$$ r_{\infty}(\lambda) = \left(0.084 + 0.17\,u(\lambda)\right)u(\lambda) \tag{6} $$
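To make these relations concrete, the following Python sketch simulates the forward bathymetric model of Equations (1)–(6) for a single pixel. This is our own illustrative code, not the authors' implementation; the function name and default arguments are assumptions.

```python
import numpy as np

def bathymetric_model(r_B, a, b_b, H, theta=0.0):
    """Observed off-water reflectance for a target at depth H (meters).

    r_B   : target reflectance spectrum, shape (bands,)
    a, b_b: absorption / backscattering coefficient spectra, shape (bands,)
    theta : solar zenith angle in radians
    """
    kappa = a + b_b                                 # Eq. (2), total attenuation
    u = b_b / kappa                                 # Eq. (2)
    r_inf = (0.084 + 0.17 * u) * u                  # Eq. (6), deep-water reflectance
    K_d = kappa / np.cos(theta)                     # Eq. (3), downward attenuation
    K_u_C = 1.03 * kappa * np.sqrt(1 + 2.4 * u)     # Eq. (4), water column
    K_u_B = 1.04 * kappa * np.sqrt(1 + 5.4 * u)     # Eq. (5), target
    column = r_inf * (1 - np.exp(-(K_d + K_u_C) * H))
    target = (r_B / np.pi) * np.exp(-(K_d + K_u_B) * H)
    return column + target                          # Eq. (1)
```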
2.2. Experimental Implementation
Given the novelty of hyperspectral underwater target detection, there is no publicly available dataset containing underwater targets. Much of the existing work is therefore based on simulated data, which is a poor way to check the effectiveness of a method in real environments, where there are often more issues to consider. Therefore, we conducted extensive experiments to obtain a real underwater target dataset and selected some of the data for validation.
In this section, we describe an outdoor UAV experiment and an indoor pool experiment. Some preliminary conclusions were drawn from the UAV experiment. After simplifying some influencing parameters, the indoor pool experiment was used to obtain a dataset with accurate depths.
2.2.1. Outdoor UAV Experiment
The UAV experiment was conducted in the outdoor swimming pool of Northwestern Polytechnical University (34.01°N, 108.45°E, 11:00–12:00, 17 January 2023). We first imaged the underwater targets with a UAV to form a basis for the subsequent indoor experiments with precise depths. A brief description of the outdoor UAV experiment and its preliminary conclusions follows.
(1) Equipment and materials: The Gaia Sky-mini3 unmanned airborne hyperspectrometer is mainly designed for outdoor or long-range imaging of larger objects and for applications requiring portable operation. The Gaia Sky-mini3 has a spectral range of 400–1000 nm, a spectral resolution of 5.5 nm, and an image size of 1024 × 448 pixels. We used a DJI M600 PRO UAV, which has a vertical hovering accuracy of 0.5 m, to carry the spectrometer for the pool experiment, as shown in Figure 2a.
As exhibited in Figure 2b, we selected iron, rubber, and stone as the underwater target materials of interest. Iron is the most common material for underwater vehicles, rubber is an anechoic material commonly used by submarines to avoid sonar detection, and stone simulates underwater reefs. Detailed information on the targets is presented later for the indoor pool experiment (Figure 5).
(2) Data Acquisition: The UAV hovered and shot vertically over the water surface. As shown in Figure 3b, when shooting at a tilted angle, especially an angle that catches the direct reflection of sunlight from the water surface, polarization effects at the water surface degrade the image quality. As shown in Figure 3c, a slight spatial distortion is visible in the pseudo-color image, which is tolerable and does not affect the spectral profiles. The detailed steps are as follows.
step1: The reference board, usually called the 'white board', was imaged on the ground for reflectance correction and to fix the exposure time, as shown in Figure 3a. The shooting time, latitude, and longitude were recorded for the solar zenith angle calculation. The 'gray cloth' was laid flat on the ground. The lens cap was closed to obtain the 'black board' data.
step2: The UAV flew to altitudes of 20, 50, and 100 m to acquire images of the targets on land. The 'gray cloth' was also imaged at each altitude for atmospheric correction.
step3: The targets were placed at a depth of 0.89 m in the pool. The 'white board' was imaged on the ground again and the exposure time was re-fixed. The UAV flew to altitudes of 20, 50, and 100 m to capture images of the targets underwater. The 'gray cloth' was imaged at each altitude.
step4: The data taken by the spectrometer were copied to a computer for processing. Using the SpecView software, atmospheric calibration was performed with the 'gray cloth' data, and automatic reflectance calibration was performed by selecting the 'white board', 'black board', and raw data.
(3) Preliminary conclusions: Through an analysis of the obtained curves, we reached the following preliminary conclusions. The spectral curve of an underwater target with a flat surface and uniform material does not change significantly with shooting height; however, for a target with an uneven surface, pixel mixing occurs as the shooting height increases, and the curve of a point in the center of the surface may change significantly. An outdoor UAV needs to account for atmospheric attenuation, shooting angle, solar zenith angle, and distortion. Atmospheric attenuation was corrected by imaging the gray cloth at each altitude. Shooting was performed while hovering vertically over the water, avoiding the direct reflection of sunlight from the water surface. The solar zenith angle can be calculated from the latitude, longitude, and time.
Due to the limited depth of the pool, the water background is not ‘optically deep’, and the water quality cannot be effectively estimated. Moreover, it is difficult to obtain data for targets at different accurate depths. Therefore, we conducted the following indoor pool experiments.
2.2.2. Indoor Pool Experiment
In the anechoic pool laboratory of the College of Navigation, Northwestern Polytechnical University, there is a hydraulic lift rod, which can achieve depth variations from 0 to 3.1 m with an accuracy of 0.01 m. In addition, the pool water is deep and the indoor light is relatively weak, so the water background is optically deep. Here, we aimed to obtain underwater target data with an accurate depth.
(1) Environment and materials: The experimental site is shown in Figure 4. Atmospheric attenuation can be ignored due to the short shooting distance. There was no significant direct sunlight indoors and, unlike outdoors, the 'white board' and the target were placed horizontally and captured within the same image; that is, the light incident on the 'white board' and on the water surface was the same. We used the 'white board' data for area calibration in the SpecView software. The solar zenith angle was therefore not considered here and was set to 0. The hyperspectral camera still shot vertically over the water surface. Depth, water quality, and target material were considered. The spatial resolution of the pool experiment scene was 1.72 mm, so the pixel mixing problem can be ignored. The light sources for the outdoor and indoor pools differed in sunlight intensity, which affects the transmission depth of light in the water body and hence the detectable depth; the water quality also affects these. Although there are some differences in light intensity and water quality, the transmission mechanism is the same in both cases; after calibration, both fit the same bathymetric model (1).
The experimental targets were fixed on a fabricated bracket, and the bracket near the targets was wrapped with black adhesive tape to reduce the interference of reflected light. As exhibited in Figure 5, the iron and rubber plates had dimensions of 1 × 0.5 m, the disc's diameter was 0.66 m, and the cylinder was 0.9 m long with a diameter of 0.2 m.
(2) Data Acquisition: As shown in Figure 6, we used a bracket to hold the target horizontally, and the bracket was connected to a hydraulic lift rod to achieve depth variations from 0 to 3.1 m with an accuracy of 0.01 m. The Gaia Field-V10 is a portable push-broom spectral imager with a spectral resolution of 2.8 nm, a band range of 400–1000 nm, and an image size of 658 × 696 pixels. The spectral imager was positioned 4.2 m directly above the target for vertical photography, at which point the spatial resolution was 1.72 mm. The 'white board', with a side length of 0.15 m, was placed on the water surface next to the target and included in all measurements. The reflectance calibration was performed according to Equation (7):
$$ \rho = \rho_{white} \times \frac{DN_{object} - DN_{dark}}{DN_{white} - DN_{dark}} \tag{7} $$

where $DN$ (digital number) is the brightness value of an original HSI pixel, i.e., the recorded grayscale value of an object. $DN_{object}$ refers to the image brightness value of the object to be measured; $DN_{dark}$ refers to the noise generated by the spectral imager itself, captured with the lens cover closed; $DN_{white}$ refers to the brightness value of the reference plate (commonly known as the white plate); and $\rho_{white}$ refers to the standard reflectance value of the 'white board' (a 99% reference plate was used indoors). Subtracting $DN_{dark}$ eliminates the noise of the spectral imager system. The implementation site is shown in Figure 6.
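As a minimal illustration of Equation (7), the following sketch applies the calibration per band with NumPy; the array names are our own and the inputs are assumed to be dark-frame-aligned raw digital numbers.

```python
import numpy as np

def calibrate_reflectance(dn_object, dn_dark, dn_white, rho_white=0.99):
    """Equation (7): per-band reflectance from raw digital numbers (DN).

    dn_object, dn_dark, dn_white: spectra of shape (bands,) or full cubes;
    rho_white: standard reflectance of the reference plate (99% indoors).
    """
    return rho_white * (dn_object - dn_dark) / (dn_white - dn_dark)
```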
(3) Selection of Dataset: Since the sunlight in the pool laboratory is weaker than that outdoors, we selected for testing a portion of the data at depths near the limit of human visual detection and without shadow occlusion. As shown in Figure 7, the data possess 120 bands, from 400 to 780 nm. The HSI of the targets on land was cropped to 10 × 10 pixels; the water and the underwater targets had a shape of 100 × 100 pixels after cropping. The pseudo-color pictures of the aforementioned data are as follows.
2.3. Related Theory
Many present-day artificial intelligence problems can be attributed to insufficient data or to available datasets being too small; even when unlabeled data are relatively easy to obtain, the cost of manual labeling is often remarkably high. For applying deep learning to hyperspectral underwater target detection tasks, especially low-altitude, fast detection tasks, there are unfortunately no corresponding publicly available datasets at present. Acquiring a large number of suitable datasets is costly, which poses a significant challenge when training neural networks to perform detection tasks.
Synthetic data are a fundamental approach to solving the data problem [17], i.e., generating artificial data from scratch or using advanced data manipulation techniques to produce novel and diverse training examples; they have played a significant role in computer vision and medical image processing. This paper generates synthetic data by partially "cutting" real data samples and model-based "pasting". Generally, the aim is to make the synthetic data as realistic as possible.
Synthetic data solve the problem of a lack of training sets. However, other problems arise. We refer to the training set generated from synthetic data as the source domain and the testing set obtained from the real world as the target domain. When we use the model trained from the source domain for target domain detection, the problem of domain transfer must be considered, i.e., how to ensure that the model from the source domain achieves good results in the target domain.
Inspired by the use of multi-path CNNs for the fine-grained recognition of cars [18] and the use of an attribute-guided feature-learning network for vehicle re-identification [19], we decided to divide the data into multiple source domains based on fine-grained depth.
A multi-source domain adaptive task has multiple source domains and one target domain. The model learns from the source domains and applies what it has learned to predict the target domain. In general, this task seeks to identify the most useful source domain. Inspired by the domain-aware controller [20], we designed a domain prediction network to select the source domain model to be applied to the target domain.
2.4. Proposed Method
The Transfer-Based Underwater Target Detection Framework (TUTDF) is a deep-learning-based framework for hyperspectral underwater target detection, dedicated to fast detection given a priori information about the target on land. We address the lack of underwater target datasets by synthesizing training sets following the bathymetric model. For the backbone detection network, we used an end-to-end network that does not require rigorous upfront noise reduction or dimensionality reduction. To solve the source-to-target domain adaptation problem, we set multiple source domains according to depth ranges in the synthetic data module and designed a domain prediction network to select the corresponding source domain model for detection based on the depth information. To exploit more spatial information, we perform a spatial–spectral process before detection to improve performance. The overall structure is shown in Figure 8. TUTDF consists of two modules: a Synthetic Dataset Module and an Underwater Target Detection Module.
The Synthetic Dataset Module is the initial part of TUTDF and creates the training set for the target detection module. We set the depth range and divided different source domains according to the bathymetric model to generate a large amount of training data, which were used to sample and train the subsequent networks.
The Underwater Target Detection Module is the most essential part of TUTDF and consists of the target detection network and the domain prediction network. The overall flow of this module is as follows. The HSI to be detected is input to the target detection network after the spatial–spectral process. Trained on the synthetic datasets, the domain prediction network can estimate the depth of each input pixel. According to the average depth of the shallowest pixels, the corresponding domain adaptive sub-network is selected and combined with the feature extraction sub-network to form a complete target detection network. The module is self-sufficient because even an under-fitted domain prediction network suffices to estimate the depth information: we only need an approximate depth range to determine the corresponding source domain model.
2.5. Synthetic Dataset Module
Synthetic data should resemble real data as closely as possible and cover all potential circumstances so that samples from the target domain are included in the source domains. Given the depth $H$, we may use Equation (8) to yield the underwater target reflectance $r(\lambda)$. Here, we adopted the simplified model [14], in which the two upward attenuation coefficients are merged, $K_u^{C}(\lambda) = K_u^{B}(\lambda) = K_u(\lambda)$:

$$ r(\lambda) = r_{\infty}(\lambda)\left[1 - e^{-\left(K_d(\lambda)+K_u(\lambda)\right)H}\right] + \frac{r_B(\lambda)}{\pi}\, e^{-\left(K_d(\lambda)+K_u(\lambda)\right)H} \tag{8} $$

Since the real target and water pixels used for synthesis already carry the noise of the HSI from which they were taken, we do not need to add extra noise during the synthesis process.
We also need to consider $H$. The concept of spectral distance [21] is relevant here. Spectral distance describes the difference between two spectral vectors, treating each spectral curve as a vector. The spectral distance $d$ is defined in Equation (9) and represents the difference between $r(\lambda)$ and $r_{\infty}(\lambda)$:

$$ d(H) = \left\| r(\lambda; H) - r_{\infty}(\lambda) \right\|_2 \tag{9} $$
As shown in Figure 9, the spectral distance increases with depth, and its growth rate gradually decreases to zero. After a particular depth is reached, $r(\lambda)$ becomes the optically deep water spectrum $r_{\infty}(\lambda)$, and the influence of the spectral imager's resolution and noise renders the observed target spectrum indistinguishable from the water background. Therefore, the spectral distance curves of different materials were constructed under the specific water conditions, with a maximum depth of $H_{max}$.
A brief argument for the validity of using synthetic data is as follows. Firstly, the synthetic data are based on the model, which is reproducible and stable. Secondly, the spectral distance curves of the synthetic data match the real situation well in Figure 9. Finally, the results on the real data in Section 4 prove the validity of the synthetic data more effectively.
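To make the depth bound concrete, here is a sketch of how the spectral distance curve and $H_{max}$ could be computed; it reuses the hypothetical bathymetric_model() helper from Section 2.1, and the tolerance eps is an assumed stopping criterion.

```python
import numpy as np

def spectral_distance_curve(r_B, a, b_b, depths, eps=1e-4):
    """d(H) from Equation (9) on a grid of depths, plus the depth H_max
    beyond which the curve effectively stops growing (growth below eps)."""
    u = b_b / (a + b_b)
    r_inf = (0.084 + 0.17 * u) * u                  # optically deep water spectrum
    d = np.array([np.linalg.norm(bathymetric_model(r_B, a, b_b, H) - r_inf)
                  for H in depths])
    growth = np.diff(d)
    stalled = np.flatnonzero(growth < eps)          # first depth with ~zero growth
    H_max = depths[stalled[0]] if stalled.size else depths[-1]
    return d, H_max
```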
As there are still differences between the real and synthetic spectra, they cannot be completely identical at all depths. It is therefore expected that the distribution of the synthetic domains encompasses the real distribution. To this end, the depth interval $[0, H_{max}]$ was divided into $N$ overlapping sub-intervals, with the interval length set to increase gradually from shallow to deep, as shown in Figure 10. Each sub-interval contains $M$ depths taken at equal spacing, and the data synthesized in different intervals belong to different domains $D_i$ $(i = 1, \ldots, N)$.
Figure 11 illustrates the whole process of synthesizing data, which comprises the following steps; a sketch follows the list.
step1: Train a water quality estimation model unsupervised with IOPE-Net on the HSI of the water area.
step2: Select $P$ pixels from the target, denoted $r_B(\lambda)$, and $Q$ pixels from the water, denoted $r_{\infty}(\lambda)$. Calculate the water quality parameters $a(\lambda)$ and $b_b(\lambda)$.
step3: Use $r_B(\lambda)$ and $r_{\infty}(\lambda)$ to plot the spectral distance curve, combine the resolution conditions, set the maximum depth $H_{max}$, divide $[0, H_{max}]$ into $N$ overlapping sub-intervals, and select $M$ depths in each sub-interval.
step4: According to Equation (8), $P \times Q \times M$ underwater target samples are generated for each interval and assigned to domain $D_i$.
step5: The pixels in domain $D_i$ and an equal number of optically deep water pixels are used as the training set for the $i$-th domain adaptive sub-network. The data of all $N$ domains are used as the training set for the domain prediction network.
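The following sketch mirrors the interval partitioning and step 4 under the simplified model of Equation (8); all names, the overlap fraction, and the interval growth factor are our assumptions, not values from the paper.

```python
import numpy as np

def make_domains(H_max, N, M, overlap=0.1, growth=1.3):
    """Split [0, H_max] into N overlapping sub-intervals whose lengths grow
    from shallow to deep, then take M equally spaced depths in each."""
    lengths = growth ** np.arange(N).astype(float)
    lengths *= H_max / lengths.sum()
    domains, start = [], 0.0
    for length in lengths:
        lo = max(0.0, start - overlap * length)     # overlap with previous interval
        hi = min(H_max, start + length)
        domains.append(np.linspace(lo, hi, M))
        start += length
    return domains                                  # N depth grids, one per domain D_i

def synthesize_domain(targets, waters, a, b_b, depths, theta=0.0):
    """Step 4: P * Q * M samples for one domain, using real water pixels as
    the optically deep term and K_u^C = K_u^B = K_u, as in Equation (8)."""
    kappa = a + b_b
    u = b_b / kappa
    K_d = kappa / np.cos(theta)
    K_u = 1.03 * kappa * np.sqrt(1 + 2.4 * u)
    samples = [r_inf * (1 - np.exp(-(K_d + K_u) * H))
               + (r_B / np.pi) * np.exp(-(K_d + K_u) * H)
               for r_B in targets for r_inf in waters for H in depths]
    return np.stack(samples)
```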
Figure 12 shows some of the synthesized data. In this case, we selected one pixel from the iron target and one from the water. The reflectance decreases with increasing depth until $H_{max}$, and there is an overlap between adjacent domains.
2.6. Target Detection Module
The module consists of two networks, the target detection network and the domain prediction network, described below.
2.6.1. Domain Prediction Network
As can be seen from Figure 13, the first half of the network is a depth estimation network. During the training phase, the network is trained unsupervised in the form of an autoencoder. The input pixel is denoted by $x$. Using a two-layer 1D-CNN encoder, the estimated depth $\hat{H}$ is obtained. Then, the decoder reconstructs the pixel $\hat{x}$ from $\hat{H}$ and the bathymetric model. Finally, the network parameters are updated by back-propagating the loss function $L$:
$$ L = L_{MSE} + \alpha L_{SA} + \beta L_{depth} $$
$$ L_{MSE} = \left\| x - \hat{x} \right\|_2^2, \qquad L_{SA} = \arccos\left( \frac{\langle x, \hat{x} \rangle}{\left\| x \right\|_2 \left\| \hat{x} \right\|_2} \right) $$

where $\|\cdot\|_2$ is the $\ell_2$ norm of a given vector. $L_{MSE}$ is the mean square error loss, describing the Euclidean distance between $x$ and $\hat{x}$. $L_{SA}$ is the spectral angle loss, which describes the similarity in shape between $x$ and $\hat{x}$. $L_{depth}$ is the depth-constraint loss, a very important term that keeps the estimated depth from becoming too large. $\alpha$ and $\beta$ are the weighting coefficients of $L_{SA}$ and $L_{depth}$, respectively.
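A minimal PyTorch sketch of the depth estimation branch and its composite loss follows; the layer sizes, weighting defaults, and the form of the depth penalty are our assumptions, and the physics decoder expects precomputed $K_d$, $K_u$, $r_{\infty}$, and prior $r_B$ tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthEncoder(nn.Module):
    """Two-layer 1D-CNN encoder mapping a pixel spectrum to a depth estimate."""
    def __init__(self, bands):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, 7, padding=3), nn.ReLU(),
            nn.Conv1d(16, 1, 7, padding=3), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(bands, 1), nn.Softplus())  # H_hat >= 0

    def forward(self, x):                        # x: (batch, bands)
        z = self.conv(x.unsqueeze(1)).squeeze(1)
        return self.head(z).squeeze(-1)          # estimated depth H_hat, (batch,)

def decode(H_hat, r_B, r_inf, K_d, K_u):
    """Physics decoder: rebuild the pixel from H_hat via Equation (8)."""
    att = torch.exp(-(K_d + K_u) * H_hat.unsqueeze(-1))
    return r_inf * (1 - att) + (r_B / torch.pi) * att

def depth_loss(x, x_hat, H_hat, alpha=0.1, beta=0.01):
    l_mse = F.mse_loss(x_hat, x)                                  # L_MSE
    cos = F.cosine_similarity(x, x_hat, dim=-1)
    l_sa = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7)).mean()      # L_SA
    l_depth = H_hat.mean()                  # L_depth: discourage large depths
    return l_mse + alpha * l_sa + beta * l_depth
```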
Since the depth estimates over the target are always small, we selected the smallest depth values as representative of the overall target depth. During the detection phase, the HSI to be detected is first fed into the depth estimation network; the $S$ smallest depth values $\{\hat{H}_1, \ldots, \hat{H}_S\}$ are selected and then processed by the domain prediction operation to select the model used in the domain adaptive network. The domain prediction operation averages the selected depths, $\bar{H} = \frac{1}{S}\sum_{s=1}^{S} \hat{H}_s$, and chooses the domain whose depth interval contains $\bar{H}$.
The value of $S$ was determined by considering the size of the target in the camera's field of view. If the field of view is large and the target is small, an $S$ larger than the number of target pixels means that the estimated target depth will be too large, so an inappropriate domain will be selected. If $S$ is set too small, it may be disturbed by individual anomalies in the water background, again yielding an inappropriate estimate of the overall target depth. Therefore, the setting of $S$ can be expressed as:

$$ S = \left\lfloor N_{pixel} \cdot \frac{A_{target}}{A_{FOV}} \right\rfloor $$

where $N_{pixel}$ is the number of spatial pixels in the hyperspectral image captured by the camera, $A_{target}$ is the area determined by the target being detected (set to 1/4 of the area of the target's upper surface, as only part of the target may be captured), and $A_{FOV}$ is the area of the field of view at a fixed shooting height.
Because the depth intervals of adjacent domains overlap, this setting ensures that an estimated depth falling near an interval boundary still lies within some domain.
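A sketch of the domain prediction operation with our own helper names; domains is the list of depth grids from the earlier make_domains() sketch.

```python
import numpy as np

def select_domain(depth_map, n_pixel, a_target, a_fov, domains):
    """Average the S shallowest depth estimates and return the index of the
    domain whose depth interval contains that average."""
    S = max(1, int(n_pixel * a_target / a_fov))
    h_bar = np.sort(depth_map.ravel())[:S].mean()
    for i, grid in enumerate(domains):
        if grid[0] <= h_bar <= grid[-1]:
            return i
    return len(domains) - 1                     # deeper than every interval
```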
2.6.2. Target Detection Network
The network input is a pixel, i.e., a hyperspectral curve regarded as a sequence, and the output is the probability that the pixel belongs to the target. The target detection network comprises two sub-networks: a feature extraction sub-network and a domain adaptive sub-network, as seen in Figure 14.
The feature extraction network is a generic backbone network. Recently, a deep network [22] was successfully applied to the detection and classification of long sequence data (ambulatory electrocardiograms). Inspired by this, we use a fully convolutional deep neural network as the backbone feature extraction network to achieve high performance without extra preprocessing (noise reduction, dimensionality reduction). The network simply takes the original spectral sequence as input and generates the feature map through a multilayer 1-D CNN with residual connectivity.
For a pixel $x$ containing $B$ bands, the output of the first block of the feature extraction network is

$$ y_1 = h\left( W_1^{2} * h\left( W_1^{1} * x + b_1^{1} \right) + b_1^{2} \right), $$

the output of the second block is

$$ y_2 = h\left( W_2^{2} * h\left( W_2^{1} * y_1 + b_2^{1} \right) + b_2^{2} \right) \oplus y_1, $$

and the outputs of the successor blocks are

$$ y_k = h\left( W_k^{2} * h\left( W_k^{1} * y_{k-1} + b_k^{1} \right) + b_k^{2} \right) \oplus y_{k-1}, \quad k = 3, 4, \ldots $$

where $W_k^{1}$, $W_k^{2}$ and $b_k^{1}$, $b_k^{2}$ denote the weight and bias parameters of the first and second convolutional layers of the $k$-th block, respectively; $*$ represents the convolution operation, $h$ is the nonlinear activation function, and $\oplus$ denotes the residual connection. To effectively address the vanishing gradient problem, the ReLU [23] activation function is utilized, and batch normalization and dropout are applied to accelerate network convergence. The residual connection alleviates the network degradation caused by vanishing or exploding gradients, and redundant information is removed by max pooling when the residual connection is formed. After every two residual blocks, the number of network channels is halved. When the residual connection does not match the output dimension, the residual channel dimension is zero-padded. The domain adaptive network is a fully connected network. The softmax function yields the probabilities that the curve belongs to the target and to the water background. The network parameters are updated by back-propagating the cross-entropy loss.
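To illustrate the residual block structure described above, here is a hedged PyTorch sketch; the kernel size, dropout rate, and pooling placement are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock1D(nn.Module):
    """One residual block of the feature extraction sub-network (a sketch)."""
    def __init__(self, c_in, c_out, k=15, p_drop=0.2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(c_in, c_out, k, padding=k // 2), nn.BatchNorm1d(c_out),
            nn.ReLU(), nn.Dropout(p_drop),
            nn.Conv1d(c_out, c_out, k, padding=k // 2), nn.BatchNorm1d(c_out),
            nn.ReLU())
        self.pool = nn.MaxPool1d(2)             # removes redundant information

    def forward(self, x):                       # x: (batch, c_in, length)
        y = self.pool(self.body(x))
        skip = self.pool(x)                     # residual path, same length as y
        if skip.shape[1] < y.shape[1]:          # zero-pad the channel dimension
            skip = F.pad(skip, (0, 0, 0, y.shape[1] - skip.shape[1]))
        elif skip.shape[1] > y.shape[1]:        # channels halved: truncate skip
            skip = skip[:, :y.shape[1]]
        return y + skip                         # the residual connection
```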
Fine-tuning can train a model quickly with a relatively small amount of data and still achieve good results. The target detection network consists of the feature extraction sub-network and the domain adaptive sub-network. To train the models quickly, we trained the entire network on the first domain's dataset, then froze the feature extraction sub-network parameters and retrained the network on each of the remaining domains' datasets separately. In this way, we obtained the corresponding domain adaptive sub-network parameters for every domain, as sketched below.
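A sketch of this fine-tuning schedule; model.features, model.head, and the train_one() helper are hypothetical names, not the paper's code.

```python
import copy

def finetune_heads(model, domain_loaders, train_one):
    """Train fully on domain D_1, then freeze the feature extraction
    sub-network and retrain only the head for each remaining domain."""
    train_one(model, domain_loaders[0], model.parameters())
    for p in model.features.parameters():       # freeze the backbone
        p.requires_grad = False
    heads = [copy.deepcopy(model.head.state_dict())]
    for loader in domain_loaders[1:]:
        train_one(model, loader, model.head.parameters())
        heads.append(copy.deepcopy(model.head.state_dict()))
    return heads        # one domain adaptive sub-network per domain D_i
```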
When detecting in a real scene, the spatial–spectral process is performed before the network input. Spatial consistency is a characteristic of feature distribution in hyperspectral images: features of the same category are spatially concentrated in the actual environment, and the smaller the distance between features, the greater the probability that they belong to the same category.
For the pixel $x_{(i,j)}$ with spatial location $(i,j)$ in the HSI, the region of size $w \times w$ centered on the pixel but not containing the pixel itself is called the near-neighbor space of $x_{(i,j)}$, denoted $N_w(x_{(i,j)})$:

$$ N_w(x_{(i,j)}) = \left\{ x_{(p,q)} : |p-i| \le \tfrac{w-1}{2},\; |q-j| \le \tfrac{w-1}{2},\; (p,q) \neq (i,j) \right\} $$

Usually, $w$ is taken as a positive odd number indicating the size of the region, so that the region has a center; $w$ can be considered the filter kernel size or the perceptual field size. If there is a vacancy in the near-neighbor space of a pixel, such as at a corner or an edge, the central pixel is used to fill the vacant position. Figure 15 shows the near-neighbor space when $w = 3$; the blue area is $N_w(x_{(i,j)})$.
According to the principle of spatial consistency, the central pixel $x_{(i,j)}$ is replaced by the weighted sum of the pixels in its near-neighbor space; since each pixel contains spectral information, this processing combines the spatial and spectral information of the image, which can effectively reduce the phenomenon of different objects sharing the same spectrum and eliminate some anomalies. The recalculated pixel $\hat{x}_{(i,j)}$ can be expressed as:

$$ \hat{x}_{(i,j)} = \sum_{x_{(p,q)} \in N_w(x_{(i,j)})} \omega_{(p,q)}\, x_{(p,q)} $$

where $\omega_{(p,q)}$ is the weight of each pixel in the near-neighbor space, computed from the squared spectral distance between the central pixel and that pixel. The physical meaning of $\omega_{(p,q)}$ is that the more similar the pixel features, the greater the degree of interaction between them; the greater the difference between the features, the smaller the degree of interaction.
In our work, since the water background and the targets are relatively simple, we set $w = 3$ for simplicity and used near-neighbor spatial averaging (equal weights) to replace the central pixel.
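A minimal NumPy sketch of this spatial–spectral averaging with $w = 3$; edge replication stands in for the vacancy-filling rule, and the function name is our own.

```python
import numpy as np

def spatial_spectral_filter(cube, w=3):
    """Replace each pixel with the mean of its w x w near-neighbor space,
    excluding the pixel itself; edges are filled by replicating border pixels."""
    rows, cols, bands = cube.shape
    r = w // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="edge")
    mask = np.ones(w * w, dtype=bool)
    mask[(w * w) // 2] = False                  # drop the central pixel
    out = np.empty_like(cube)
    for i in range(rows):
        for j in range(cols):
            block = padded[i:i + w, j:j + w, :].reshape(w * w, bands)
            out[i, j] = block[mask].mean(axis=0)
    return out
```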
2.7. Experimental Details
(1) Evaluation Criteria: The traditional receiver operating characteristic (ROC) curve depicts the correlation between the false alarm rate (FAR) $P_F$ and the target detection probability $P_D$, formulated as Equations (21) and (22). $P_F$ represents the proportion of falsely detected pixels in the entire image:

$$ P_F = \frac{N_F}{N} \tag{21} $$

where $N_F$ stands for the number of falsely detected pixels and $N$ refers to the total number of pixels in the image. The target detection probability $P_D$ denotes the ratio of correctly detected pixels to target pixels in the whole image:

$$ P_D = \frac{N_D}{N_T} \tag{22} $$

where $N_D$ is the number of correctly detected pixels and $N_T$ refers to the number of target pixels in the entire image. Once the detection result maps are given, we can apply a series of thresholds $\tau$ to obtain many $(P_D, P_F)$ pairs and generate the ROC curve. The closer the ROC curve of $(P_D, P_F)$ is to the upper left corner, the better the detection effect. Meanwhile, the area under the curve (AUC) of $(P_D, P_F)$ is used for quantitative evaluation; the larger the AUC value, the better the detection effect.
However, a traditional ROC curve of $(P_D, P_F)$ can only be used to evaluate the effectiveness of a detector, not its target detectability (TD) or background suppressibility (BS). For a more comprehensive evaluation, we adopted the 3D ROC curve [24,25]. Compared with the traditional ROC curve, the 3D ROC criterion, through its three derived 2D ROC curves of $(P_D, P_F)$, $(P_D, \tau)$, and $(P_F, \tau)$, can further evaluate the effectiveness of a detector as well as its TD and BS, which can be quantified as:

$$ \mathrm{AUC}_{TD} = \mathrm{AUC}_{(D,F)} + \mathrm{AUC}_{(D,\tau)} $$
$$ \mathrm{AUC}_{BS} = \mathrm{AUC}_{(D,F)} - \mathrm{AUC}_{(F,\tau)} $$

where $\mathrm{AUC}_{(D,F)}$, $\mathrm{AUC}_{(D,\tau)}$ and $\mathrm{AUC}_{(F,\tau)}$ represent the AUC values of the ROC curves of $(P_D, P_F)$, $(P_D, \tau)$ and $(P_F, \tau)$, respectively. The calculated $\mathrm{AUC}_{TD}$ and $\mathrm{AUC}_{BS}$ are employed to quantitatively characterize the detector's TD and BS, respectively.
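The following sketch computes the three 2D AUCs and the derived $\mathrm{AUC}_{TD}$ and $\mathrm{AUC}_{BS}$ from a detection map and a ground-truth mask; the threshold count and function name are our assumptions.

```python
import numpy as np

def roc_aucs(scores, gt, n_thresh=500):
    """AUC_(D,F), AUC_TD and AUC_BS from Equations (21)-(22) and the 3D ROC.

    scores: detection map scaled to [0, 1]; gt: binary ground-truth mask."""
    taus = np.linspace(0.0, 1.0, n_thresh)
    n, n_t = gt.size, int(gt.sum())
    pf = np.array([np.count_nonzero(scores[gt == 0] >= t) / n for t in taus])
    pd = np.array([np.count_nonzero(scores[gt == 1] >= t) / n_t for t in taus])
    order = np.argsort(pf)
    auc_df = np.trapz(pd[order], pf[order])     # effectiveness: (P_D, P_F)
    auc_dt = np.trapz(pd, taus)                 # (P_D, tau)
    auc_ft = np.trapz(pf, taus)                 # (P_F, tau)
    return {"AUC_DF": auc_df,
            "AUC_TD": auc_df + auc_dt,          # target detectability
            "AUC_BS": auc_df - auc_ft}          # background suppressibility
```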
(2) Experimental Settings: We used the spectral imager with the SpecView software to calibrate the reflectance of the raw images and utilized ENVI 5.3 to perform HSI cropping and waveband selection. Subsequent experiments were run on an Intel(R) Core(TM) i9-12900KF CPU, a GeForce RTX 3080 Ti GPU, 64 GB RAM, and Windows 11, with code based on Python 3.9 and the deep learning framework PyTorch 1.12.1. The relevant parameters of the experimental datasets are shown in Table 1, Table 2 and Table 3.
5. Conclusions
This paper used a deep learning method to detect underwater targets from HSI. For low-altitude remote sensing scenes, a framework named TUTDF is proposed. The framework consists of two main modules: the synthetic data module and the target detection module. Due to the difficulty of obtaining hyperspectral underwater target datasets, we proposed a transfer-based approach that trains the networks on synthetic data and then transfers them to real data scenarios. The proposed framework first uses an unsupervised network to obtain the water quality parameters, combines them with the a priori information of the target to be detected, and sets different depth intervals to synthesize underwater target datasets for multiple source domains, which are used to train the networks. A domain prediction network is then designed to select the source domain closest to the target domain based on the depth information. Finally, the trained networks are transferred to the real scenario to perform the detection task directly.
To verify the effectiveness of the method in real scenarios, we conducted extensive quantitative indoor and outdoor experiments and obtained hyperspectral underwater target datasets with accurate depth labels. The detection results for real data with different materials and depths show that the framework trained with synthetic data is valid and reliable and that the transfer from synthetic data domains to real-world domains is feasible. The proposed framework has a definite advantage over traditional land-based detection methods in terms of TD, BS, and detection speed.
In addition, we carried out preliminary work on UAV underwater data acquisition; optimizing the data acquisition and calibration methods to achieve real-time UAV underwater target detection will be the subject of future work.