1. Introduction
Remote sensing image fusion has been a major focus in remote sensing research. Due to the physical limitations of sensors and data transmission capacity, current remote sensing satellites typically provide high spatial resolution panchromatic (PAN) images and low spatial resolution hyperspectral images (LR-HSI). However, applications such as scene classification [1] and target identification [2] often require images with both high spatial and spectral resolution. To meet this demand, a variety of image processing techniques have been developed.
Specifically, hyperspectral (HS) imaging sensors can simultaneously capture numerous bands within a narrow spectral range, leading to relatively low spatial resolution to ensure a sufficient signal-to-noise ratio (SNR) [3], which limits the potential applications of HSIs. In contrast, the PAN image, with a single band acquired by the PAN imaging system, typically offers much higher spatial resolution. Hyperspectral pansharpening extends traditional pansharpening techniques, which typically fuse a multispectral image (MSI) with a PAN image [4]. An effective approach to generating a high-resolution hyperspectral image (HR-HSI) is to fuse an LR-HSI with a high-resolution panchromatic (HR-PAN) image of the same scene.
Current techniques for HS pansharpening can be classified into two categories: (1) traditional HS pansharpening techniques and (2) deep learning-based methods. Traditional HS pansharpening techniques can be roughly grouped into three major branches: (1) component substitution (CS) methods; (2) multi-resolution analysis (MRA) methods; and (3) variational optimization (VO) methods. With the development of deep learning, numerous network architectures have also been employed for HS pansharpening.
The CS methods rely on substituting the spatial component of the HSI with the PAN image and are provided in most professional remote sensing software packages owing to their fast and easy implementation. Intensity-hue-saturation (IHS) [5], Gram–Schmidt (GS) [6], Gram–Schmidt adaptive (GSA) [7], and principal component analysis (PCA) [8] all belong to this branch. MRA methods inject the spatial details extracted by a multiresolution decomposition of the PAN image into the resampled HSI, represented by the modulation transfer function generalized Laplacian pyramid (MTF-GLP) [9] and the MTF-GLP with high-pass modulation (MTF-GLP-HPM) [10]. The VO methods formulate an optimization problem, assuming image smoothness and imposing image fidelity constraints; the final image is obtained by minimizing the constructed energy functional. However, traditional HS pansharpening techniques often introduce significant spectral distortions, causing the fused image to differ considerably from the ground truth (GT) of the sample data.
With the development of deep learning (DL), several prominent frameworks have emerged, including convolutional neural networks (CNNs), deep residual networks, and encoder–decoder models. These networks have been applied to HS pansharpening and have demonstrated impressive results.
The HS pansharpening process is essentially a form of super-resolution, in which the simultaneously acquired PAN image is used to enhance the resolution of the HS image. He et al. [11] introduced HyperPNN, a convolutional neural network designed for HS pansharpening. HyperPNN comprises two sub-networks: one for channel (i.e., spectral) prediction and another for spatial–spectral inference, which leverages both channel and spatial contextual information. Subsequently, attention mechanisms were incorporated to enhance both channel and spatial fidelity [12,13,14]. Zheng et al. [12] proposed a two-step fusion method in which the original LR-HSI is first upsampled using a super-resolution technique, and the model is then trained with the upsampled image. Zhuo et al. [13] proposed a deep–shallow fusion network with a detail extractor and spectral attention to extract more information from the input images.
Although many CNN-based methods are lightweight and efficient, they exhibit notable limitations. These include insufficient attention to global information and inadequate modeling of multi-band dependencies, particularly between the single-band PAN and the multiple spectral bands of HSI. Consequently, these issues lead to image distortion and loss of detail after fusion. Therefore, it is essential to design a model that not only preserves a lightweight architecture but also effectively captures global information and models cross-modality dependencies, ensuring that the fused image closely aligns with the GT.
In addition, another emerging DL model, the Transformer architecture, has shown great promise in the computer vision community. A Transformer relies exclusively on the self-attention mechanism, which enables the adjustment of each pixel based on the long-range dependencies of the input features. Chaminda Bandara et al. [15] developed HyperTransformer for spatial–channel fusion of low-resolution hyperspectral images, using enhanced self-attention to identify similar channel and spatially rich features. Hu et al. [16] proposed Fusformer, which focuses on estimating residuals to improve learning efficiency by operating in a smaller mapping space while capturing global image relationships. Liu et al. [17] proposed InteractFormer for hyperspectral image super-resolution, which extracts features and enables interaction among all of them.
While Transformer models can operate at the pixel level and mitigate spectral distortion and detail loss common in CNN-based fusion methods, they often stack layers sequentially without adopting a more integrated hybrid approach. Additionally, Transformer-based models face challenges in modifying the self-attention mechanism and balancing local and global information effectively. Thus, it is crucial to develop strategies that effectively balance local and global information while refining the self-attention mechanism.
In recent years, generative adversarial networks (GAN) have been employed as a fundamental framework for fusing hyperspectral or multispectral images with PAN images. GAN, an unsupervised learning approach, consists of a generator and a discriminator that engage in adversarial learning. In the basic GAN framework for HS and PAN image fusion, the generator combines low-resolution spectral information from HS images with high-resolution spatial details from PAN images, creating realistic and high-fidelity fused outputs. The discriminator evaluates the authenticity of these generated images by comparing them to real high-resolution images and provides feedback to the generator for continuous enhancement. This adversarial learning mechanism ensures that the generator and discriminator improve collaboratively, effectively modeling complex nonlinear relationships and capturing essential features from both sources.
Liu et al. [18] proposed PSGAN to achieve pansharpening within the GAN framework. They employed a dual-stream CNN architecture as the generator, designed to produce high-quality pansharpened images from the PAN and multispectral inputs, while a fully convolutional discriminator was used to learn an adaptive loss function, enhancing the fidelity and overall quality of the pansharpened images. Ma et al. [19] developed Pan–GAN, which uses a residual encoder–decoder network in the generator to produce fused images in an unsupervised manner. The model interpolates multispectral images to match the resolution of the PAN images and stacks the channels as input. A conditional discriminator network handles two tasks, preserving spectral information from the multispectral images and maintaining spatial details from the panchromatic image, using separate discriminators to enhance image fidelity.
Xie et al. [20] developed HPGAN, the first model to introduce 3D GAN into hyperspectral pansharpening. They proposed a 3D channel–spatial generator network that reconstructs the HR-HSI by fusing a newly constructed 3D PAN cube with the LR-HSI, achieving a nonlinear mapping between different dimensions. The model also includes a PAN discriminator based on the hyperspectral imaging spectral response function to effectively distinguish simulated PAN images from real ones. Xu et al. [21] introduced an iterative GAN-based network for pansharpening, employing spectral and spatial loss constraints. The approach generates mean difference images rather than direct fusions, using a coarse-to-fine framework and optimized discriminators to ensure high-resolution image fidelity.
GAN’s ability to dynamically adjust to the differences between HS and PAN images makes it highly beneficial for fusion tasks. By learning and preserving the nonlinear relationships between spatial and spectral features, GAN ensures the fused image retains both high spatial resolution and complete spectral integrity. Additionally, its multi-scale design captures features at different levels, improving the overall quality and consistency. The feedback loop from the discriminator pushes the generator to produce more natural and authentic fused images, demonstrating the robust potential of GAN-based approaches in preserving the spectral details of HS images and the spatial details of PAN images, as well as achieving high-quality image synthesis.
Currently, GAN-based methods are increasingly utilized in pansharpening with notable success. However, their application in HS pansharpening remains limited. The significant differences in spatial information between HS and PAN images, combined with the low signal-to-noise ratio of HS images and the resolution disparity between the two modalities, introduce considerable spatial distortion in the fused output and cause a substantial deviation from the ground truth. Consequently, the potential of GANs to capture and reconstruct the complex relationships between different spectral channels is not fully realized, nor can they effectively model the dependencies between the high-resolution PAN image and the multiple spectral bands of the HS image.
This work makes three contributions:
- (1) We present a specialized GAN architecture for hyperspectral pansharpening that ensures fast and stable convergence during training, enhancing robustness and efficiency.
- (2) We introduce a novel method that modifies input images and their transformations at the generator's input stage, along with a new module integrating Attention and Inception components to effectively balance local and global information.
- (3) We propose a novel discriminator designed to analyze the residual distribution between the fused images and the GT, improving the evaluation of fusion quality.
2. Related Work
Generative Adversarial Networks (GAN), introduced by Goodfellow et al. [22], were designed for generating realistic images through unsupervised learning and have since achieved success in various computer vision tasks [23,24,25]. A GAN comprises a generator and a discriminator in a competitive framework where the generator strives to create samples that mimic real data to deceive the discriminator, while the discriminator attempts to distinguish between real and synthetic samples. Through iterative training, both models adjust parameters to improve performance, eventually reaching a balanced state, or Nash equilibrium, where the discriminator can no longer easily differentiate generated samples from real ones.
From a mathematical perspective, the objective of the generator ($G$) is to produce samples that the discriminator ($D$) cannot distinguish from real samples. The distribution of the generated samples ($p_g$) aims to approximate the true distribution of the training data ($p_{\mathrm{data}}$). $G$ and $D$ engage in a minimax adversarial game, as described below:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big],$$

where $z$ denotes the input noise drawn from a prior distribution $p_z$.
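To make the adversarial game concrete, the following minimal PyTorch sketch shows one alternating update of the classic (non-saturating) minimax objective. The function and variable names are illustrative assumptions, and this generic loop is not yet the energy-based formulation used by HyperGAN, which is described in Section 3.2.

```python
import torch

def gan_step(G, D, x_real, z, opt_g, opt_d):
    """One alternating update of the minimax GAN game.
    Assumes D outputs probabilities in (0, 1), e.g., via a final sigmoid."""
    bce = torch.nn.BCELoss()

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    d_real = D(x_real)
    d_fake = D(G(z).detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()

    # Generator step: non-saturating form, maximize log D(G(z))
    opt_g.zero_grad()
    d_fake = D(G(z))
    loss_g = bce(d_fake, torch.ones_like(d_fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```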
The original GAN was developed to generate digit images. However, the generated results contained substantial noise, and the model exhibited slow convergence, which was especially pronounced for high-resolution images. To address these issues, researchers have explored several new directions. Energy-Based GAN (EBGAN) [25] introduced an energy-based formulation of adversarial training, demonstrating that under a simple hinge loss, the generator's output points follow the underlying data distribution at convergence. Building on EBGAN, Boundary Equilibrium GAN (BEGAN) [26] proposed a convergence approximation metric to improve convergence speed and achieve a better balance between the discriminator and generator. The Laplacian pyramid GAN (LAPGAN) [27] utilized a Laplacian pyramid to generate high-resolution images guided by low-resolution supervision, though it encountered challenges with images containing jittery objects. Additionally, to improve GAN stability during training, Deep Convolutional GAN (DCGAN) [28] successfully applied deep convolutional neural networks to GANs and proposed guidelines for designing CNN architectures for generators and discriminators, leading to more stable training.
3. Methodology
This section begins by introducing the framework of HyperGAN, outlining the architectures of both the generator and discriminator. It also explains the loss functions in the generator and discriminator.
3.1. Overview of the Framework
The hyperspectral pansharpening problem is addressed as a task-oriented challenge to retain the spectral information of the LR-HSI and the spatial details of the PAN image. This is accomplished using a generative adversarial strategy to find the optimal solution, as shown in Figure 1.
Initially, the low-resolution hyperspectral image (LR-HSI) is upsampled using bicubic interpolation [29], a technique that uses the values of surrounding pixels to estimate new pixel values, resulting in a smoother and more natural image. In our specific case, the LR-HSI has a resolution that is six times lower than that of the panchromatic (PAN) image; we therefore apply bicubic interpolation to upscale the LR-HSI, bringing its resolution in line with that of the PAN image for further processing. Both the PAN image and the upsampled LR-HSI are processed through an attention mechanism, and the resulting images are combined through multiplication. Simultaneously, the PAN image is replicated across multiple channels and merged with the upsampled LR-HSI and the attention-processed images. The combined image is fed into the generator, producing the fused image. The fused image and the GT are then evaluated by the discriminator to achieve a better balance of spectral and spatial details, resulting in a fused image that more closely resembles the GT.
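A minimal PyTorch sketch of this input stage is given below. The attention module is left abstract, and the channel ordering of the concatenation is our assumption based on the description above, not the released implementation.

```python
import torch
import torch.nn.functional as F

def build_generator_input(lr_hsi, pan, attention):
    """Upsample the LR-HSI to the PAN size (bicubic), attend to both inputs and
    multiply the results, replicate the single-band PAN across all bands, and
    concatenate everything as the generator input."""
    b, c, h, w = lr_hsi.shape
    H, W = pan.shape[-2:]
    up_hsi = F.interpolate(lr_hsi, size=(H, W), mode="bicubic", align_corners=False)
    attended = attention(up_hsi) * attention(pan)   # combine attended images by multiplication
    pan_rep = pan.expand(-1, c, -1, -1)             # replicate the PAN band C times
    combined = torch.cat([up_hsi, pan_rep, attended], dim=1)
    return combined, up_hsi                         # up_hsi is reused for the global skip
```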
3.2. Loss Function
3.2.1. Loss Function of Discriminator
We first introduce $\mathcal{L}$, the loss for training a pixel-wise reconstruction-based discriminator, as:

$$\mathcal{L}(v) = \big|\, v - D(v) \,\big|,$$

where $v$ is an input sample and $D(v)$ denotes the discriminator's reconstruction of $v$.
Let $\mu_1$ be the distribution of the loss $\mathcal{L}(x)$, where $x$ are real samples. Let $\mu_2$ be the distribution of the loss $\mathcal{L}(G(z))$, let $\Gamma(\mu_1, \mu_2)$ represent the set containing all couplings of $\mu_1$ and $\mu_2$, and let $m_1, m_2 \in \mathbb{R}$ be their respective means. The Wasserstein distance [30] can be expressed as:

$$W_1(\mu_1, \mu_2) = \inf_{\gamma \in \Gamma(\mu_1, \mu_2)} \mathbb{E}_{(x_1, x_2) \sim \gamma}\big[\,|x_1 - x_2|\,\big],$$

where $\gamma$ represents a joint distribution (or coupling) between the two probability distributions $\mu_1$ and $\mu_2$, and $x_1$ and $x_2$ represent random variables sampled from the marginal distributions $\mu_1$ and $\mu_2$, respectively, under the joint distribution $\gamma$. $\mathbb{E}_{(x_1, x_2) \sim \gamma}\big[|x_1 - x_2|\big]$ denotes the expected distance between samples $x_1$ and $x_2$ drawn from the joint distribution $\gamma$, and $\inf$ refers to the infimum, or the minimum value over all possible joint distributions in the set $\Gamma(\mu_1, \mu_2)$.
Using Jensen's inequality, a lower bound for $W_1(\mu_1, \mu_2)$ can be derived as follows:

$$\inf \mathbb{E}\big[\,|x_1 - x_2|\,\big] \;\geq\; \inf \big|\,\mathbb{E}[x_1 - x_2]\,\big| \;=\; |m_1 - m_2|.$$
It is important to clarify that the objective is to optimize a lower bound of the Wasserstein distance between the loss distributions, rather than between the sample distributions.
Given the discriminator parameters $\theta_D$, each updated by minimizing the loss $\mathcal{L}_D$, we express the problem as the GAN objective, where $z_D$ are samples from $z$:

$$\mathcal{L}_D = \mathcal{L}(x) - \mathcal{L}\big(G(z_D)\big),$$

where $x$ represents the input data samples.
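As a concrete illustration, a PyTorch-style sketch of this discriminator objective is shown below; treating the energy $\mathcal{L}(v)$ as a mean absolute reconstruction error, as well as the function names, are our assumptions.

```python
import torch

def recon_energy(v, d_v):
    """Pixel-wise reconstruction energy L(v) = |v - D(v)|, reduced to a mean."""
    return torch.mean(torch.abs(v - d_v))

def discriminator_loss(gt, d_gt, fused, d_fused):
    """L_D = L(x) - L(G(z_D)): lower the energy of real (GT) samples while
    raising the energy of generated (fused) samples."""
    return recon_energy(gt, d_gt) - recon_energy(fused, d_fused)
```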
3.2.2. Loss Function of Generator
A generator loss similar to that of the discriminator is used. The loss for training the generator, denoted as $\mathcal{L}_G$, is introduced as follows:

$$\mathcal{L}_G = \mathcal{L}\big(G(z_G)\big).$$
Given the generator parameters $\theta_G$, each updated by minimizing the loss $\mathcal{L}_G$, the problem is expressed as the GAN objective, where $z_G$ and $z_D$ are samples from $z$. To better preserve structural information, an SSIM (Structural Similarity Index Measure) term is incorporated into the generator loss:

$$\mathcal{L}_G = \mathcal{L}\big(G(z_G)\big) + \lambda\,\Big(1 - \mathrm{SSIM}\big(G(z_G),\, y\big)\Big),$$

where $y$ denotes the GT, $\lambda$ and the SSIM stabilization constants $C_1$, $C_2$ are regularization constants, and $\sigma$ represents the standard deviation of the original sample distribution used in computing SSIM. By incorporating both the discriminator's loss and the SSIM into the generator's loss function, significant advantages are gained over the traditional binary logistic loss. Binary loss provides only fixed targets (0 or 1), which can lead to similar gradients across different samples, thereby reducing training efficiency. In contrast, integrating the discriminator's loss introduces richer and more continuous gradient signals to guide the generator, enhancing gradient diversity and accelerating convergence during training.
Furthermore, the inclusion of the SSIM term helps preserve spatial information between the fused image and the GT by explicitly focusing on structural content. The use of SSIM ensures that the fused images not only resemble the GT visually but also retain detailed spatial information. Consequently, the combination of the discriminator's loss and SSIM improves the quality of the fused images, making the model more robust in capturing both global and local spatial information, while enabling more efficient learning in complex tasks.
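The sketch below illustrates one way such a generator loss can be assembled in PyTorch, using a simplified global (non-windowed) SSIM; the weighting and the SSIM variant are assumptions for illustration rather than the exact formulation used in HyperGAN.

```python
import torch

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM computed from global statistics (no sliding window)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def generator_loss(fused, gt, d_fused, lam=1.0):
    """Energy term |G(z) - D(G(z))| plus an SSIM penalty against the GT."""
    energy = torch.mean(torch.abs(fused - d_fused))
    return energy + lam * (1.0 - global_ssim(fused, gt))
```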
3.3. Network Architecture of Generator
HyperGAN explores a novel generator architecture based on PanNet [31], as illustrated in Figure 2. The ResBlock (Figure 3a), serving as the foundational module of the generator, primarily performs identity mapping, alleviates gradient vanishing, improves training efficiency, and enhances feature learning.
The network structure is broadened through a specially designed module that enhances its ability to capture features at multiple scales. By processing different convolutional kernel sizes, such as 1 × 1, 3 × 3, and 5 × 5, in parallel, the module efficiently gathers both fine-grained local information and broader global information. This approach allows the model to capture diverse patterns while maintaining computational efficiency. Additionally, reducing the dimensionality of the feature maps can negatively impact the ability to model dependencies between channels, which led to the development of the Wide Block module (Figure 3b), designed to address this challenge and improve feature representation.
The combined image obtained after applying attention in the input stage is fed into the generator. It first passes through a 3 × 3 convolution, followed by two ResBlock modules. Next, it goes through two designed Wide Block modules, then through another two ResBlock modules, and finally passes through a 1 × 1 convolution. The resulting output image is then added to the previously upsampled image to produce the fused image.
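A compact PyTorch sketch of this layout is given below; the feature width, the ResBlock internals, and the Wide Block factory are assumptions used only to make the data flow explicit (the Wide Block itself is detailed in the following paragraphs).

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with an identity skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """3x3 conv -> 2 ResBlocks -> 2 Wide Blocks -> 2 ResBlocks -> 1x1 conv,
    with a global skip that adds the upsampled LR-HSI to the output."""
    def __init__(self, in_ch, out_ch, feat=64, make_wide_block=None):
        super().__init__()
        wide = make_wide_block or (lambda c: nn.Identity())  # placeholder Wide Block
        self.head = nn.Conv2d(in_ch, feat, 3, padding=1)
        self.res_in = nn.Sequential(ResBlock(feat), ResBlock(feat))
        self.wide = nn.Sequential(wide(feat), wide(feat))
        self.res_out = nn.Sequential(ResBlock(feat), ResBlock(feat))
        self.tail = nn.Conv2d(feat, out_ch, 1)

    def forward(self, combined, up_hsi):
        x = self.res_out(self.wide(self.res_in(self.head(combined))))
        return self.tail(x) + up_hsi  # add the previously upsampled image
```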
The Wide Block module incorporates the Inception [32] module (Figure 3c) and the Efficient Spatial and Channel Attention (ESCA) module (Figure 3d). The Inception module can be repeatedly stacked to form larger networks, enabling the efficient expansion of network width while improving accuracy and preventing overfitting. Its advantages include the ability to perform dimensionality reduction on larger matrices while aggregating visual information at different scales, allowing neurons to dynamically adjust their receptive field size and extract features from multiple scales through selective convolution with kernels of varying sizes. The ESCA module captures spatial and spectral information from images without dimensionality reduction, ensuring the efficient cross-channel interactions essential for high-performance channel attention. This design adds minimal complexity while significantly enhancing model performance. The detailed construction is illustrated in Figure 4.
In this experiment, convolutional layers with kernel sizes of 1, 3, 5, and 7 are used. For a given feature map $X$, the following convolutions are applied in parallel: $U_1 = \mathrm{Conv}_{1\times1}(X)$, $U_3 = \mathrm{Conv}_{3\times3}(X)$, $U_5 = \mathrm{Conv}_{5\times5}(X)$, and $U_7 = \mathrm{Conv}_{7\times7}(X)$.
The results of the 1 × 1 convolutional layers are retained as skip connections within the Wide Block module. This significantly preserves the original information and passes more details to subsequent layers, thereby fully leveraging the effective information and enhancing training efficiency. The 1 × 1 convolutional layer functions more like a gate, controlling the flow of information between layers. The primary objective is to enable neurons to dynamically adjust their receptive field sizes based on the stimulus content. This is achieved by employing gates to control the flow of information from multiple branches, each carrying information at a different scale, into the neurons of the next layer. To implement this, the gates integrate information from all branches. Initially, the outputs from the various branches (Figure 3b) are fused through element-wise summation:

$$U = U_3 + U_5 + U_7,$$

where $U$ represents the combined feature map obtained by summing the branch feature maps $U_3$, $U_5$, and $U_7$.
ESCA includes channel attention and spatial attention. The channel attention module compresses the feature map along the spatial dimensions to produce a channel attention map; for a given feature map, channel attention focuses on what is significant in that map. Compression is performed using a global average pooling (GAP) operation. Taking the feature map $U$ as an example, the $c$th element of the channel descriptor $z$ is computed by reducing $U$ over the spatial dimensions $H \times W$:

$$z_c = F_{\mathrm{gap}}(U_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j),$$

where $U$ represents the input feature map, $z_c$ denotes the GAP operation applied to channel $c$ of the feature map $U$, $F_{\mathrm{gap}}$ is the global pooling function, and $U_c(i, j)$ represents the value of the input feature map at position $(i, j)$ in channel $c$.
Subsequently, adaptive tuning is applied to selectively emphasize informative features and suppress less useful ones while leveraging global information. This is typically achieved using a simple fully connected (FC) layer, which improves efficiency by reducing dimensionality; however, such reduction degrades channel attention. The Efficient Channel Attention (ECA) [33] module for CNNs was therefore proposed: it avoids dimensionality reduction and efficiently facilitates cross-channel interactions. The ECA module learns channel attention by adaptively determining the kernel size $k$, aggregating convolutional features using GAP without reducing dimensionality, and then performing a one-dimensional convolution followed by a sigmoid activation function. The construction of the channel attention weight $\omega$ is expressed as follows:

$$\omega = \sigma\Big(\delta\big(\mathrm{BN}\big(\mathrm{C1D}_k(z)\big)\big)\Big),$$

where $\sigma$ denotes the sigmoid function; $\delta$ stands for LeakyReLU, an activation function that allows a small, non-zero gradient for negative inputs to prevent dead neurons during training; $\mathrm{BN}$ stands for batch normalization, which normalizes the inputs to the next layer, stabilizing and accelerating the training process; and $\mathrm{C1D}_k$ denotes a one-dimensional convolutional operation with kernel size $k$ applied to the input $z$.
Similarly, the spatial attention module performs channel compression and focuses on the most informative regions of the feature map, complementing the channel attention module. Taking $U$ as an example, the $(i, j)$th element of the spatial descriptor $s$ is computed by reducing $U$ over the channel dimension $C$:

$$s_{i,j} = F_{\mathrm{gap}}(U_{i,j}) = \frac{1}{C} \sum_{c=1}^{C} U_c(i, j),$$

where $U_c(i, j)$ represents the feature map value at spatial location $(i, j)$ in channel $c$, and $s_{i,j}$ denotes the GAP operation applied across the channel dimension at that location.
Finally, the two attention results are multiplied with the feature map:

$$V = U \otimes \omega \otimes s,$$

where $V$ represents the output of the Wide Block and $\otimes$ denotes element-wise multiplication (with broadcasting over the channel and spatial dimensions). Beyond its capability for enhanced global and local information learning, the Wide Block architecture (incorporating Inception and ESCA) demonstrates a remarkable ability to capture diverse feature patterns through multiple branches, thereby significantly enhancing the network's expressiveness and feature diversity. The architecture provides multiple feature fusion pathways, effectively reducing the dependency on singular features. Moreover, the network adaptively selects optimal feature combinations for spatial and spectral information extraction, while simultaneously addressing the vanishing or exploding gradient problem commonly encountered in deep neural networks.
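Putting the pieces together, the following PyTorch sketch shows one plausible realization of the Wide Block with ESCA; the branch widths, the ECA kernel size, and the exact placement of the 1 × 1 skip are our assumptions consistent with the description above, not the released architecture.

```python
import torch
import torch.nn as nn

class ESCA(nn.Module):
    """Efficient spatial-and-channel attention: ECA-style 1-D convolution over the
    GAP channel descriptor, plus a spatial map from channel-wise averaging."""
    def __init__(self, k=3):
        super().__init__()
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                                              # GAP over H x W -> (B, C)
        w_ch = torch.sigmoid(self.conv1d(z.unsqueeze(1))).view(b, c, 1, 1)  # channel weights
        s = torch.sigmoid(x.mean(dim=1, keepdim=True))                      # spatial map, (B, 1, H, W)
        return x * w_ch * s

class WideBlock(nn.Module):
    """Parallel 3x3/5x5/7x7 branches summed element-wise, ESCA attention,
    and a 1x1 convolution retained as a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.skip = nn.Conv2d(ch, ch, 1)
        self.b3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.b5 = nn.Conv2d(ch, ch, 5, padding=2)
        self.b7 = nn.Conv2d(ch, ch, 7, padding=3)
        self.esca = ESCA()

    def forward(self, x):
        u = self.b3(x) + self.b5(x) + self.b7(x)  # element-wise summation of branches
        return self.esca(u) + self.skip(x)        # attention output plus 1x1 skip
```

Under these assumptions, this `WideBlock` could be passed to the generator sketch above through its `make_wide_block` argument.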
3.4. Network Architecture of Discriminator
HyperGAN incorporates an energy-based discriminator [25], designed to maintain the spectral integrity of the LR-HSI while preserving the spatial details of the PAN image. Instead of having a direct probabilistic interpretation, the discriminator is conceptualized as an energy (or contrast) function. Energy values are assigned to both the GT and the fused images, with the aim of achieving similar energy levels. The energy function computed by the discriminator serves as a trainable cost function for the generator, guiding it during training. The discriminator assigns low energy values to regions with high data density and higher energy values to regions with lower data density, enabling a clear distinction between the fused image and the GT. Through this energy feedback mechanism, the generator adjusts its parameters during training, progressively improving the generated image in regions with higher energy values (i.e., regions deviating from the GT), resulting in more realistic images.
As training progresses, the residual distributions of the fused image and the GT converge, indicating similar energy states. Once equilibrium is reached, the fused image becomes indistinguishable from the GT, resulting in the desired HR-HSI. The architecture of the discriminator is shown in Figure 4.
The discriminator $D: \mathbb{R}^{H \times W \times C} \mapsto \mathbb{R}^{H \times W \times C}$ is a CNN-based architecture, where $H$, $W$, and $C$ represent the height, width, and number of channels, respectively. In the discriminator, the processed data tensor is mapped through fully connected layers, without introducing nonlinearity, to and from an embedding state $h \in \mathbb{R}^{N_h}$, where $N_h$ denotes the dimension of the discriminator's hidden state.
In the initial part of the discriminator, the features of the fused image and the GT are concatenated. The SPLIT operation is applied after the convolution layers, where the output generated by the network, following the concatenation, is split into two parts. These two components are then used for separate error calculations: one part is compared to the fused image, and the other part to the GT.
Different energies are assigned to both the fused image and the GT. They are then concatenated and passed through a series of processing steps: first, a 3 × 3 convolution is applied, followed by two ResBlock modules, and another 3 × 3 convolution is performed to separate the concatenated image. This series of convolutions and ResBlock modules ensures that the energy of the fused image aligns with that of the GT. Finally, the difference between the input image and the separated components is calculated to determine the deviation. Through this process, the two parts are effectively merged into one coherent output, allowing the model to align with the energy and detail of the GT, achieving results closer to the high-resolution target.
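The sketch below illustrates this reconstruction-based discriminator in PyTorch; the channel widths and the placement of the SPLIT operation are assumptions that follow the description above (the ResBlock is the same as in the generator sketch and is repeated here so the snippet is self-contained).

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with an identity skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class EnergyDiscriminator(nn.Module):
    """Concatenate the fused image and the GT, reconstruct both through
    conv/ResBlock layers, split the output, and use the residuals against
    the inputs as energies."""
    def __init__(self, bands, feat=64):
        super().__init__()
        self.head = nn.Conv2d(2 * bands, feat, 3, padding=1)
        self.body = nn.Sequential(ResBlock(feat), ResBlock(feat))
        self.tail = nn.Conv2d(feat, 2 * bands, 3, padding=1)

    def forward(self, fused, gt):
        x = torch.cat([fused, gt], dim=1)                          # concatenate along channels
        y = self.tail(self.body(self.head(x)))
        rec_fused, rec_gt = torch.split(y, fused.shape[1], dim=1)  # SPLIT into two parts
        e_fused = (fused - rec_fused).abs().mean()                 # energy of the fused image
        e_gt = (gt - rec_gt).abs().mean()                          # energy of the GT
        return e_fused, e_gt
```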
The discriminator offers distinct advantages over a binary logistic network by providing varied targets through reconstruction-based outputs. Unlike binary logistic loss, which limits gradients to two targets (0 or 1) and can lead to similar gradients across samples, the discriminator's variability enhances training efficiency within small batches. Its structure generates diverse gradient orientations, enabling the use of larger batch sizes without loss of efficiency, a significant benefit given current hardware constraints.
5. Discussion
5.1. Application Scenarios Analysis of HyperGAN
To assess HyperGAN’s performance, three distinct datasets were utilized: the Pavia Center dataset, the Eastern Tianshan dataset, and the Chikusei dataset. The Pavia Center dataset includes nine categories, such as Bitumen, Water, Asphalt, and Tiles, representing an urban environment. The Eastern Tianshan dataset, located in southeast Hami, Xinjiang, China, covers various lithologies, water bodies, ravines, and snow. The Chikusei dataset, from Chikusei, Ibaraki, Japan, features 19 categories, including bare soil, parks, farmland, rice fields, forests, and plastic houses, representing agricultural and urban landscapes.
The quantitative results shown in Table 1, Table 2 and Table 3 demonstrate that the model's performance varies significantly across different scenarios. To identify potential causes, an analysis of the original input images was conducted. SNR [44] was used to evaluate the images, as higher SNR values indicate that the useful information (signal) in the image greatly outweighs the noise, suggesting higher image quality, cleaner data, and clearer details. The results are presented in Table 4.
Furthermore, statistical analyses of the original data were conducted through visualizations of the means [45] and standard deviations [45], as well as scatter plots generated from PCA [8] dimensionality reduction. The smoothness of the channel means and standard deviations is important for evaluating image quality. A generally smooth mean curve with significant drops in specific channels may indicate absorption bands or physical features in the spectrum; this does not necessarily imply poor quality but could instead point to data anomalies. The standard deviation indicates the variation in pixel values; a sudden increase in standard deviation in certain channels without a reasonable explanation may signal a higher presence of noise in those channels.
The scatter plot generated from PCA dimensionality reduction primarily clusters in a relatively tight region, indicating that the spectral characteristics of most pixels vary consistently across the three principal components. This suggests that the data follow a certain pattern, and different pixels in the hyperspectral image may exhibit similar spectral features. A few scattered points at the periphery may represent outliers. If these outliers are limited in number, they are likely to be isolated anomalous pixels that do not significantly impact the overall image quality. However, a larger number of outliers could indicate the presence of noise in the image or issues within certain channels. Typically, the first few principal components after PCA can explain a substantial portion of the data variance. If the first three principal components capture most of the information, it implies that the hyperspectral data contain considerable redundancy and exhibit good image quality. Conversely, if the PCA-transformed data are widely dispersed or lack clear structure among the principal components, it may suggest that the image contains a significant amount of noise or lacks substantial useful information.
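For reference, a small Python sketch of this statistical inspection is given below; the sampling size, plotting choices, and the assumed (H, W, C) array layout are illustrative assumptions rather than the exact analysis pipeline used here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def inspect_hsi(hsi, n_components=3, sample=5000, seed=0):
    """Plot per-band mean/std curves and a 3-D PCA scatter of randomly sampled pixels."""
    pixels = hsi.reshape(-1, hsi.shape[-1]).astype(np.float64)   # (H*W, C)

    # Per-band mean and standard deviation curves
    plt.figure()
    plt.plot(pixels.mean(axis=0), label="mean")
    plt.plot(pixels.std(axis=0), label="std")
    plt.xlabel("band index")
    plt.legend()

    # PCA scatter of a random pixel subset
    rng = np.random.default_rng(seed)
    idx = rng.choice(pixels.shape[0], min(sample, pixels.shape[0]), replace=False)
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(pixels[idx])
    print("explained variance ratio:", pca.explained_variance_ratio_)

    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(scores[:, 0], scores[:, 1], scores[:, 2], s=2)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
    plt.show()
```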
The visualization results for the Pavia Center dataset, the Eastern Tianshan dataset, and the Chikusei dataset are presented in Figure 8, Figure 9, and Figure 10, respectively.
As shown in Table 4, the Eastern Tianshan dataset has the highest signal-to-noise ratio (SNR). Figure 9 illustrates that the scatter points are primarily concentrated in a relatively tight region, indicating that the spectral characteristics of most pixels vary consistently across the three principal components. This suggests that the data follow a certain pattern, and different pixels in the hyperspectral image may exhibit similar spectral features. The figure also reveals a few scattered points at the periphery, which may represent outliers. Since the number of these outliers is limited, they are likely isolated anomalous pixels that do not significantly affect the overall image quality.
In the results presented in Table 2, HyperGAN achieved the highest PSNR value on the Eastern Tianshan dataset, as well as the best results for the SSIM, SAM, and ERGAS metrics. However, due to the diverse categories in the Eastern Tianshan dataset, including various lithologies and ravines, it can be challenging to achieve optimal fusion results when a single pixel contains two or more similar spectral features. Consequently, the CC between the fused image and the GT is only 0.958.
In the Chikusei dataset, the SNR is 0.51 dB; however, the fused PSNR and ERGAS metrics are better than those of the Pavia Center dataset, and the CC is the highest among the three datasets. This indicates that HyperGAN is better suited to the agricultural and urban areas represented in these scenarios. However, as shown in Figure 10, there is significant fluctuation in the image mean, a wide range of residuals, and a higher number of outliers, which may suggest the presence of noise in the image. Additionally, the limited quality of the original images results in suboptimal SSIM and SAM performance.
In the Pavia Center dataset, the SNR is 1.83 dB. The curve in Figure 8a is smooth, while Figure 8b shows scattered outliers. As indicated in Table 1 and Figure 5, the results obtained through fusion with HyperGAN demonstrate superior performance from both visual and quantitative perspectives.
5.2. Resolution Ratio Analysis of HyperGAN
To further assess the fusion capability of the model at higher resolution ratios, we doubled the resolution ratio to 1:12 and conducted experiments on the three datasets.
The average quantitative results of the proposed and competing methods on the Pavia Center dataset, the Eastern Tianshan dataset, and the Chikusei dataset for a resolution ratio of 1:12 are shown in Table 5, Table 6, and Table 7, respectively. As shown in Table 5, Table 6 and Table 7, HyperGAN has the highest CC, PSNR, and SSIM values and the lowest SAM and ERGAS values.
Figure 11, Figure 12 and Figure 13 show the visual effects and residual images on the Pavia Center dataset, the Eastern Tianshan dataset, and the Chikusei dataset for a resolution ratio of 1:12, respectively.
The results clearly demonstrate that HyperGAN consistently outperforms both traditional and deep learning-based methods across all reference metrics. It effectively minimizes residuals, retains clearer spatial details, and experiences less information loss. In terms of both spectral and spatial information, the fusion results produced by HyperGAN are closest to the GT. Whether at a resolution ratio of 1:6 or 1:12, HyperGAN successfully preserves the spectral integrity of LR-HSI and the spatial details of PAN images, making it a highly efficient method for image fusion.
However, based on the comparison of results across the three datasets at resolution ratios of 1:6 and 1:12, it is evident that when the resolution ratio increases to 1:12, all evaluation metrics and visual results are inferior to those at 1:6. This is primarily due to information loss and the increased difficulty of detail recovery. At the 1:12 resolution ratio, the gap between the LR-HSI and the PAN image becomes significantly larger, meaning that spatial details in the LR-HSI are sparser and the generator must reconstruct high-resolution details from much less spatial information. As the resolution disparity increases, the generator faces greater challenges in effectively extracting spatial information. The rich details present in the high-resolution PAN image become more difficult to retain or transfer to the LR-HSI, resulting in missing or blurred spatial information in the fused image. Furthermore, the larger resolution gap complicates the balance between spectral and spatial features, leading to a decline in fusion performance and, consequently, lower evaluation metrics.
5.3. Gaussian Kernel Size Analysis of HyperGAN
In developing the HyperGAN dataset, the original HS dataset is used as the GT. The HS data are blurred using Gaussian kernels and downsampled into non-overlapping cubes to form the modified dataset. The impact of applying different Gaussian kernels varies, leading to potential differences in data quality and characteristics. To investigate these effects, we processed the original data from three distinct datasets using various Gaussian kernels, followed by downsampling with a factor of 6. The downsampled data were then upsampled using bicubic interpolation. The upsampled data were compared both visually and quantitatively to the original data. For the quantitative evaluation, we utilized metrics such as PSNR, CC, SAM, SSIM, and ERGAS.
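A simple NumPy/OpenCV sketch of this degradation-and-evaluation protocol is shown below; the odd-size adjustment required by OpenCV's Gaussian blur, the PSNR peak value, and the assumed (H, W, C) layout are our assumptions for illustration.

```python
import numpy as np
import cv2

def simulate_lr_hsi(gt, kernel=12, factor=6):
    """Blur each band with a Gaussian kernel, downsample by `factor`, then upsample
    back to the GT size with bicubic interpolation for comparison.
    `gt` is assumed to be an (H, W, C) array with H and W divisible by `factor`."""
    h, w, c = gt.shape
    k = kernel if kernel % 2 == 1 else kernel + 1   # OpenCV requires odd kernel sizes
    blurred = np.stack(
        [cv2.GaussianBlur(gt[..., b].astype(np.float64), (k, k), 0) for b in range(c)], axis=-1)
    lr = blurred[::factor, ::factor, :]
    up = np.stack(
        [cv2.resize(lr[..., b], (w, h), interpolation=cv2.INTER_CUBIC) for b in range(c)], axis=-1)
    return lr, up

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio for images scaled to [0, peak]."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```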
We applied Gaussian kernels of sizes 3 × 3, 6 × 6, 9 × 9, 12 × 12, 15 × 15, and 18 × 18 to process the original data, and the quantitative comparison results between the upsampled data and the GT for the Pavia Center dataset, the Eastern Tianshan dataset, and the Chikusei dataset are presented in Table 8, Table 9, and Table 10, respectively. As demonstrated in these tables, the data processed with a 12 × 12 Gaussian kernel achieved the highest CC, PSNR, and SSIM values while having the lowest SAM and ERGAS values.
Figure 14, Figure 15 and Figure 16 depict the visual comparisons of the data processed using different Gaussian kernels with the GT for the Pavia Center, Eastern Tianshan, and Chikusei datasets, respectively.
Based on the visual results and quantitative analysis of the three datasets, it can be seen that the 12 × 12 Gaussian kernel generates smoother images closer to the GT, effectively reducing the impact of high-frequency noise. Consequently, the model can better focus on capturing critical spatial structures and details, learning essential features rather than being distracted by noise or irrelevant details.
Moreover, the 12 × 12 Gaussian kernel strikes a balance between blurring and preserving image details during downsampling, providing a clearer representation of global information compared to smaller kernels like 6 × 6 or 9 × 9. Excessive blurring leads to significant information loss, potentially resulting in uncertainty or underfitting during training, thereby hindering the model’s ability to learn genuine inter-image relationships. In contrast, the 12 × 12 Gaussian kernel provides relatively stable and consistent image quality, ensuring a more efficient learning process and enhancing the model’s feature extraction capabilities.
5.4. Ablation Experiment
To evaluate the contributions of Inception and ESCA to HyperGAN, we conducted ablation experiments by removing Inception and/or ESCA. Apart from the network structure, all parameter settings and experimental configurations remained identical to ensure a fair comparison. The results are summarized in Table 11. These experiments were performed on the Eastern Tianshan dataset at a resolution ratio of 1:6.
Specifically, the model without the ESCA module performs better than the model without the Inception module, indicating that widening the network to capture more global information has a greater impact on the model. Additionally, the ERGAS shows a more significant decline when the model lacks ESCA, suggesting that the Inception module has a stronger influence on reducing the spectral and spatial distortions between the fused image and the GT. In summary, the ablation experiment demonstrates that both the Inception and ESCA modules contribute to achieving satisfactory performance.
5.5. Parameters
To evaluate the computational complexity and time cost of HyperGAN, Table 12 provides the parameters, convergence speed (epochs), and inference time. The HyperGAN method demonstrates remarkable performance with a sophisticated model architecture, evident in its substantial parameter count of 3.766 M. This large number of parameters indicates the model's strong ability to learn and capture complex patterns, which is critical for achieving high performance in challenging tasks. Although HyperGAN requires longer inference times (42.499 ms), this trade-off is warranted by its parameter complexity, which enables richer feature representations and potentially higher accuracy for tasks that demand a deeper level of understanding.
While the model's computationally intensive inference process results in longer inference times, it provides significant performance improvements over simpler models such as HyperPNN1 and HyperPNN2, which have faster inference times (approximately 20 ms) but less complex architectures. The increased model size and associated inference time of HyperGAN highlight its potential to achieve state-of-the-art results in demanding applications, where accuracy and robustness are prioritized over inference speed.
6. Conclusions
This paper introduces HyperGAN, a novel GAN-based network architecture specifically designed for the fusion of hyperspectral and panchromatic images. HyperGAN frames pansharpening as a task-specific problem, leveraging a generative adversarial framework to achieve optimized results. The model features an energy-based discriminator that significantly enhances the generator’s ability to capture both spatial and spectral information, ensuring that the fused images closely match the GT.
The HyperGAN architecture incorporates a Wide Block module within the generator, which combines spectral and spatial information extraction to improve image quality. This module includes Inception and ESCA components, enabling the generator to effectively learn and integrate detailed spectral and spatial features while expanding the network width. The energy-based discriminator balances the energy between the fused images and the GT, ensuring superior fusion results.
In experiments conducted on the Pavia Center, Eastern Tianshan, and Chikusei datasets with an input image resolution ratio of 1:6, both quantitative evaluation metrics and visual quality assessments consistently demonstrated that HyperGAN outperformed all other methods across all reference metrics. Additionally, ablation experiments confirmed the effectiveness of both the Inception and ESCA modules.
In the future, we will focus on improving the generator and the energy-based discriminator to enable the model to effectively fuse imagery from different scenes with similar imaging resolutions. We also plan to extend the current framework to support the fusion of multispectral and hyperspectral imagery. To address the challenge of relying on labeled data, we aim to explore unsupervised learning techniques to generate reliable pseudo-ground truth labels. This approach will improve the robustness and adaptability of the framework, making it more suitable for real-world applications where labeled data are often limited.