1. Introduction
Remote sensing image fusion has been a major focus in remote sensing research. Due to the physical limitations of sensors and data transmission capacity, current remote sensing satellites typically provide high spatial resolution panchromatic (PAN) images and low spatial resolution hyperspectral images (LR-HSI). However, applications such as scene classification [1] and target identification [2] often require images with both high spatial and spectral resolution. To meet this demand, a variety of image processing techniques have been developed.
Specifically, hyperspectral (HS) imaging sensors can simultaneously capture numerous bands within a narrow spectral range, leading to relatively low spatial resolution to ensure a sufficient signal-to-noise ratio (SNR) [3], which limits the potential applications of HSIs. In contrast, the PAN image, with a single band acquired by the PAN imaging system, typically offers much higher spatial resolution. Hyperspectral pansharpening extends traditional pansharpening techniques, which typically fuse a multispectral image (MSI) with a PAN image [4]. An effective approach to generating a high-resolution hyperspectral image (HR-HSI) is to fuse an LR-HSI with a high-resolution panchromatic (HR-PAN) image of the same scene.
Current techniques for HS pansharpening can be classified into two categories: (1) traditional HS pansharpening techniques and (2) deep learning-based methods. Traditional HS pansharpening techniques can be roughly grouped into three major branches: (1) component substitution (CS) methods; (2) multi-resolution analysis (MRA) methods; and (3) variational optimization (VO) methods. With the development of deep learning, numerous network architectures have also been employed for HS pansharpening.
The CS methods rely on substituting the spatial component of the HSI with the PAN image and are provided in most professional remote sensing software packages owing to their fast and easy implementation. Intensity-hue-saturation (IHS) [5], Gram–Schmidt (GS) [6], Gram–Schmidt adaptive (GSA) [7], and principal component analysis (PCA) [8] all belong to this branch. MRA methods inject the spatial details extracted by a multiresolution decomposition of the PAN image into the resampled HSI, represented by the modulation transfer function generalized Laplacian pyramid (MTF-GLP) [9] and the MTF-GLP with high-pass modulation (MTF-GLP-HPM) [10]. The VO methods formulate an optimization problem, assuming image smoothness and imposing image fidelity constraints; the final image is obtained by minimizing the constructed energy functional. However, traditional HS pansharpening techniques often introduce significant spectral distortions, causing the fused image to differ considerably from the ground truth (GT) of the sample data.
With the development of deep learning (DL), several prominent frameworks have emerged, including convolutional neural networks (CNNs), deep residual networks, and encoder–decoder models. These networks have been applied to HS pansharpening and have demonstrated impressive results.
The HS pansharpening process is essentially a form of super-resolution, in which the simultaneously acquired PAN image is used to enhance the resolution of the HS image. He et al. [11] introduced HyperPNN, a convolutional neural network designed for HS pansharpening. HyperPNN comprises two sub-networks: one for channel (i.e., spectral) prediction and another for spatial–spectral inference, which leverages both channel and spatial contextual information. Subsequently, attention mechanisms were incorporated to enhance both channel and spatial fidelity [12,13,14]. Zheng et al. [12] proposed a two-step fusion method in which the original LR-HSI is first upsampled using a super-resolution technique, and the model is then trained with the upsampled image. Zhuo et al. [13] proposed a deep–shallow fusion network with a detail extractor and spectral attention to extract more information from the input images.
Although many CNN-based methods are lightweight and efficient, they exhibit notable limitations. These include insufficient attention to global information and inadequate modeling of multi-band dependencies, particularly between the single-band PAN and the multiple spectral bands of HSI. Consequently, these issues lead to image distortion and loss of detail after fusion. Therefore, it is essential to design a model that not only preserves a lightweight architecture but also effectively captures global information and models cross-modality dependencies, ensuring that the fused image closely aligns with the GT.
In addition, another emerging DL model, the Transformer architecture, has shown great promise in the computer vision community. A Transformer relies exclusively on the self-attention mechanism, which enables the adjustment of each pixel based on the long-range dependencies of the input features. Chaminda Bandara et al. [15] developed HyperTransformer for spatial–channel fusion of low-resolution hyperspectral images, using enhanced self-attention to identify similar channel and spatially rich features. Hu et al. [16] proposed Fusformer, which focuses on estimating residuals to improve learning efficiency by operating in a smaller mapping space while capturing global image relationships. Liu et al. [17] proposed InteractFormer for hyperspectral image super-resolution, which extracts features and enables interaction among all of them.
While Transformer models can operate at the pixel level and mitigate spectral distortion and detail loss common in CNN-based fusion methods, they often stack layers sequentially without adopting a more integrated hybrid approach. Additionally, Transformer-based models face challenges in modifying the self-attention mechanism and balancing local and global information effectively. Thus, it is crucial to develop strategies that effectively balance local and global information while refining the self-attention mechanism.
In recent years, generative adversarial networks (GAN) have been employed as a fundamental framework for fusing hyperspectral or multispectral images with PAN images. GAN, an unsupervised learning approach, consists of a generator and a discriminator that engage in adversarial learning. In the basic GAN framework for HS and PAN image fusion, the generator combines low-resolution spectral information from HS images with high-resolution spatial details from PAN images, creating realistic and high-fidelity fused outputs. The discriminator evaluates the authenticity of these generated images by comparing them to real high-resolution images and provides feedback to the generator for continuous enhancement. This adversarial learning mechanism ensures that the generator and discriminator improve collaboratively, effectively modeling complex nonlinear relationships and capturing essential features from both sources.
Liu et al. [18] proposed PSGAN to achieve pansharpening within the GAN framework. They employed a dual-stream CNN architecture as the generator, designed to produce high-quality pansharpened images from the PAN and multispectral inputs, while a fully convolutional discriminator was used to learn an adaptive loss function, enhancing the fidelity and overall quality of the pansharpened images. Ma et al. [19] developed Pan–GAN, which uses a residual encoder–decoder network in the generator to produce fused images in an unsupervised manner. The model interpolates multispectral images to match the resolution of the PAN images and stacks the channels as input. A conditional discriminator network handles two tasks, preserving spectral information from the multispectral images and maintaining spatial details from the panchromatic image, using separate discriminators to enhance image fidelity.
Xie et al. [20] developed HPGAN, the first model to introduce 3D GAN into hyperspectral pansharpening. They proposed a 3D channel–spatial generator network that reconstructs the HR-HSI by fusing a newly constructed 3D PAN cube with the LR-HSI, achieving a nonlinear mapping between different dimensions. The model also includes a PAN discriminator based on the hyperspectral imaging spectral response function to effectively distinguish simulated PAN images from real ones. Xu et al. [21] introduced an iterative GAN-based network for pansharpening, employing spectral and spatial loss constraints. The approach generates mean difference images rather than direct fusions, using a coarse-to-fine framework and optimized discriminators to ensure high-resolution image fidelity.
GAN’s ability to dynamically adjust to the differences between HS and PAN images makes it highly beneficial for fusion tasks. By learning and preserving the nonlinear relationships between spatial and spectral features, GAN ensures the fused image retains both high spatial resolution and complete spectral integrity. Additionally, its multi-scale design captures features at different levels, improving the overall quality and consistency. The feedback loop from the discriminator pushes the generator to produce more natural and authentic fused images, demonstrating the robust potential of GAN-based approaches in preserving the spectral details of HS images and the spatial details of PAN images, as well as achieving high-quality image synthesis.
Currently, GAN-based methods are increasingly utilized in pansharpening with notable success. However, their application in HS pansharpening remains limited. The significant differences in spatial information between HS and PAN images, combined with the low signal-to-noise ratio of HS images and the resolution disparity between the two modalities, introduce considerable spatial distortion in the fused output and cause a substantial deviation from the ground truth. Consequently, the potential of GANs to capture and reconstruct the complex relationships between different spectral channels is not fully realized, nor can they effectively model the dependencies between the high-resolution PAN image and the multiple spectral bands of the HS image.
This work makes three contributions:
- (1) We present a specialized GAN architecture for hyperspectral pansharpening that ensures fast and stable convergence during training, enhancing robustness and efficiency.
- (2) We introduce a novel method that modifies input images and their transformations at the generator's input stage, along with a new module integrating Attention and Inception components to effectively balance local and global information.
- (3) We propose a novel discriminator designed to analyze the residual distribution between the fused images and the GT, improving the evaluation of fusion quality.
2. Related Work
Generative Adversarial Networks (GAN), introduced by Goodfellow et al. [22], were designed for generating realistic images through unsupervised learning and have since achieved success in various computer vision tasks [23,24,25]. A GAN comprises a generator and a discriminator in a competitive framework where the generator strives to create samples that mimic real data to deceive the discriminator, while the discriminator attempts to distinguish between real and synthetic samples. Through iterative training, both models adjust parameters to improve performance, eventually reaching a balanced state, or Nash equilibrium, where the discriminator can no longer easily differentiate generated samples from real ones.
From a mathematical perspective, the objective of the generator ($G$) is to produce samples that the discriminator ($D$) cannot distinguish from real samples. The distribution of the generated samples ($p_g$) aims to approximate the true distribution of the training data ($p_{\mathrm{data}}$). $G$ and $D$ engage in a minimax adversarial game, as described below:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big],$$

where $z$ denotes the input noise drawn from a prior distribution $p_z$.
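To make the adversarial game concrete, the following minimal PyTorch sketch shows one alternating update of the classic (non-saturating) minimax objective. The function and variable names are illustrative assumptions, and this generic loop is not yet the energy-based formulation used by HyperGAN, which is described in Section 3.2.

```python
import torch

def gan_step(G, D, x_real, z, opt_g, opt_d):
    """One alternating update of the minimax GAN game.
    Assumes D outputs probabilities in (0, 1), e.g., via a final sigmoid."""
    bce = torch.nn.BCELoss()

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    d_real = D(x_real)
    d_fake = D(G(z).detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    opt_d.step()

    # Generator step: non-saturating form, maximize log D(G(z))
    opt_g.zero_grad()
    d_fake = D(G(z))
    loss_g = bce(d_fake, torch.ones_like(d_fake))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```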
The original GAN was developed to generate digit images. However, the generated results contained substantial noise, and the model exhibited slow convergence, which was especially pronounced for high-resolution images. To address these issues, researchers have explored several new directions. Energy-Based GAN (EBGAN) [25] introduced an energy-based formulation of adversarial training, demonstrating that under a simple hinge loss, the generator's output points follow the underlying data distribution at convergence. Building on EBGAN, Boundary Equilibrium GAN (BEGAN) [26] proposed a convergence approximation metric to improve convergence speed and achieve a better balance between the discriminator and generator. The Laplacian pyramid GAN (LAPGAN) [27] utilized a Laplacian pyramid to generate high-resolution images guided by low-resolution supervision, though it encountered challenges with images containing jittery objects. Additionally, to improve GAN stability during training, Deep Convolutional GAN (DCGAN) [28] successfully applied deep convolutional neural networks to GANs and proposed guidelines for designing CNN architectures for generators and discriminators, leading to more stable training.
3. Methodology
This section begins by introducing the framework of HyperGAN, outlining the architectures of both the generator and discriminator. It also explains the loss functions in the generator and discriminator.
3.1. Overview of the Framework
The hyperspectral pansharpening problem is addressed as a task-oriented challenge to retain the spectral information of the LR-HSI and the spatial details of the PAN image. This is accomplished using a generative adversarial strategy to find the optimal solution, as shown in Figure 1.
Initially, the low-resolution hyperspectral image (LR-HSI) is upsampled using bicubic interpolation [29], a technique that uses the values of surrounding pixels to estimate new pixel values, resulting in a smoother and more natural image. In our specific case, the LR-HSI has a resolution that is six times lower than that of the panchromatic (PAN) image; we therefore apply bicubic interpolation to upscale the LR-HSI, bringing its resolution in line with that of the PAN image for further processing. Both the PAN image and the upsampled LR-HSI are processed through an attention mechanism, and the resulting images are combined through multiplication. Simultaneously, the PAN image is replicated across multiple channels and merged with the upsampled LR-HSI and the attention-processed images. The combined image is fed into the generator, producing the fused image. The fused image and the GT are then evaluated by the discriminator to achieve a better balance of spectral and spatial details, resulting in a fused image that more closely resembles the GT.
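A minimal PyTorch sketch of this input stage is given below. The attention module is left abstract, and the channel ordering of the concatenation is our assumption based on the description above, not the released implementation.

```python
import torch
import torch.nn.functional as F

def build_generator_input(lr_hsi, pan, attention):
    """Upsample the LR-HSI to the PAN size (bicubic), attend to both inputs and
    multiply the results, replicate the single-band PAN across all bands, and
    concatenate everything as the generator input."""
    b, c, h, w = lr_hsi.shape
    H, W = pan.shape[-2:]
    up_hsi = F.interpolate(lr_hsi, size=(H, W), mode="bicubic", align_corners=False)
    attended = attention(up_hsi) * attention(pan)   # combine attended images by multiplication
    pan_rep = pan.expand(-1, c, -1, -1)             # replicate the PAN band C times
    combined = torch.cat([up_hsi, pan_rep, attended], dim=1)
    return combined, up_hsi                         # up_hsi is reused for the global skip
```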
3.2. Loss Function
3.2.1. Loss Function of Discriminator
We first introduce $\mathcal{L}$, the loss for training a pixel-wise reconstruction-based discriminator, as:

$$\mathcal{L}(v) = \big|\, v - D(v) \,\big|,$$

where $v$ is an input sample and $D(v)$ denotes the discriminator's reconstruction of $v$.
Let $\mu_1$ be the distribution of the loss $\mathcal{L}(x)$, where $x$ are real samples. Let $\mu_2$ be the distribution of the loss $\mathcal{L}(G(z))$, let $\Gamma(\mu_1, \mu_2)$ represent the set containing all couplings of $\mu_1$ and $\mu_2$, and let $m_1, m_2 \in \mathbb{R}$ be their respective means. The Wasserstein distance [30] can be expressed as:

$$W_1(\mu_1, \mu_2) = \inf_{\gamma \in \Gamma(\mu_1, \mu_2)} \mathbb{E}_{(x_1, x_2) \sim \gamma}\big[\,|x_1 - x_2|\,\big],$$

where $\gamma$ represents a joint distribution (or coupling) between the two probability distributions $\mu_1$ and $\mu_2$, and $x_1$ and $x_2$ represent random variables sampled from the marginal distributions $\mu_1$ and $\mu_2$, respectively, under the joint distribution $\gamma$. $\mathbb{E}_{(x_1, x_2) \sim \gamma}\big[|x_1 - x_2|\big]$ denotes the expected distance between samples $x_1$ and $x_2$ drawn from the joint distribution $\gamma$, and $\inf$ refers to the infimum, or the minimum value over all possible joint distributions in the set $\Gamma(\mu_1, \mu_2)$.
Using Jensen's inequality, a lower bound for $W_1(\mu_1, \mu_2)$ can be derived as follows:

$$\inf \mathbb{E}\big[\,|x_1 - x_2|\,\big] \;\geq\; \inf \big|\,\mathbb{E}[x_1 - x_2]\,\big| \;=\; |m_1 - m_2|.$$
It is important to clarify that the objective is to optimize a lower bound of the Wasserstein distance between the loss distributions, rather than between the sample distributions.
Given the discriminator parameters $\theta_D$, each updated by minimizing the loss $\mathcal{L}_D$, we express the problem as the GAN objective, where $z_D$ are samples from $z$:

$$\mathcal{L}_D = \mathcal{L}(x) - \mathcal{L}\big(G(z_D)\big),$$

where $x$ represents the input data samples.
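As a concrete illustration, a PyTorch-style sketch of this discriminator objective is shown below; treating the energy $\mathcal{L}(v)$ as a mean absolute reconstruction error, as well as the function names, are our assumptions.

```python
import torch

def recon_energy(v, d_v):
    """Pixel-wise reconstruction energy L(v) = |v - D(v)|, reduced to a mean."""
    return torch.mean(torch.abs(v - d_v))

def discriminator_loss(gt, d_gt, fused, d_fused):
    """L_D = L(x) - L(G(z_D)): lower the energy of real (GT) samples while
    raising the energy of generated (fused) samples."""
    return recon_energy(gt, d_gt) - recon_energy(fused, d_fused)
```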
3.2.2. Loss Function of Generator
A generator loss similar to that of the discriminator is used. The loss for training the generator, denoted as $\mathcal{L}_G$, is introduced as follows:

$$\mathcal{L}_G = \mathcal{L}\big(G(z_G)\big).$$
Given the generator parameters $\theta_G$, each updated by minimizing the loss $\mathcal{L}_G$, the problem is expressed as the GAN objective, where $z_G$ and $z_D$ are samples from $z$. To better preserve structural information, an SSIM (Structural Similarity Index Measure) term is incorporated into the generator loss:

$$\mathcal{L}_G = \mathcal{L}\big(G(z_G)\big) + \lambda\,\Big(1 - \mathrm{SSIM}\big(G(z_G),\, y\big)\Big),$$

where $y$ denotes the GT, $\lambda$ and the SSIM stabilization constants $C_1$, $C_2$ are regularization constants, and $\sigma$ represents the standard deviation of the original sample distribution used in computing SSIM. By incorporating both the discriminator's loss and the SSIM into the generator's loss function, significant advantages are gained over the traditional binary logistic loss. Binary loss provides only fixed targets (0 or 1), which can lead to similar gradients across different samples, thereby reducing training efficiency. In contrast, integrating the discriminator's loss introduces richer and more continuous gradient signals to guide the generator, enhancing gradient diversity and accelerating convergence during training.
Furthermore, the inclusion of the SSIM term helps preserve spatial information between the fused image and the GT by explicitly focusing on structural content. The use of SSIM ensures that the fused images not only resemble the GT visually but also retain detailed spatial information. Consequently, the combination of the discriminator's loss and SSIM improves the quality of the fused images, making the model more robust in capturing both global and local spatial information, while enabling more efficient learning in complex tasks.
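The sketch below illustrates one way such a generator loss can be assembled in PyTorch, using a simplified global (non-windowed) SSIM; the weighting and the SSIM variant are assumptions for illustration rather than the exact formulation used in HyperGAN.

```python
import torch

def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM computed from global statistics (no sliding window)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def generator_loss(fused, gt, d_fused, lam=1.0):
    """Energy term |G(z) - D(G(z))| plus an SSIM penalty against the GT."""
    energy = torch.mean(torch.abs(fused - d_fused))
    return energy + lam * (1.0 - global_ssim(fused, gt))
```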
3.3. Network Architecture of Generator
HyperGAN explores a novel generator architecture based on PanNet [31], as illustrated in Figure 2. The ResBlock (Figure 3a), serving as the foundational module of the generator, primarily performs identity mapping, alleviates gradient vanishing, improves training efficiency, and enhances feature learning.
The network structure is broadened through a specially designed module that enhances its ability to capture features at multiple scales. By processing different convolutional kernel sizes, such as 1 × 1, 3 × 3, and 5 × 5, in parallel, the module efficiently gathers both fine-grained local information and broader global information. This approach allows the model to capture diverse patterns while maintaining computational efficiency. Additionally, reducing the dimensionality of the feature maps can negatively impact the ability to model dependencies between channels, which led to the development of the Wide Block module (Figure 3b), designed to address this challenge and improve feature representation.
The combined image obtained after applying attention in the input stage is fed into the generator. It first passes through a 3 × 3 convolution, followed by two ResBlock modules. Next, it goes through two designed Wide Block modules, then through another two ResBlock modules, and finally passes through a 1 × 1 convolution. The resulting output image is then added to the previously upsampled image to produce the fused image.
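A compact PyTorch sketch of this layout is given below; the feature width, the ResBlock internals, and the Wide Block factory are assumptions used only to make the data flow explicit (the Wide Block itself is detailed in the following paragraphs).

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with an identity skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class Generator(nn.Module):
    """3x3 conv -> 2 ResBlocks -> 2 Wide Blocks -> 2 ResBlocks -> 1x1 conv,
    with a global skip that adds the upsampled LR-HSI to the output."""
    def __init__(self, in_ch, out_ch, feat=64, make_wide_block=None):
        super().__init__()
        wide = make_wide_block or (lambda c: nn.Identity())  # placeholder Wide Block
        self.head = nn.Conv2d(in_ch, feat, 3, padding=1)
        self.res_in = nn.Sequential(ResBlock(feat), ResBlock(feat))
        self.wide = nn.Sequential(wide(feat), wide(feat))
        self.res_out = nn.Sequential(ResBlock(feat), ResBlock(feat))
        self.tail = nn.Conv2d(feat, out_ch, 1)

    def forward(self, combined, up_hsi):
        x = self.res_out(self.wide(self.res_in(self.head(combined))))
        return self.tail(x) + up_hsi  # add the previously upsampled image
```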
The Wide Block module incorporates the Inception [32] module (Figure 3c) and the Efficient Spatial and Channel Attention (ESCA) module (Figure 3d). The Inception module can be repeatedly stacked to form larger networks, enabling the efficient expansion of network width while improving accuracy and preventing overfitting. Its advantages include the ability to perform dimensionality reduction on larger matrices while aggregating visual information at different scales, allowing neurons to dynamically adjust their receptive field size and extract features from multiple scales through selective convolution with kernels of varying sizes. The ESCA module captures spatial and spectral information from images without dimensionality reduction, ensuring the efficient cross-channel interactions essential for high-performance channel attention. This design adds minimal complexity while significantly enhancing model performance. The detailed construction is illustrated in Figure 4.
In this experiment, convolutional layers with kernel sizes of 1, 3, 5, and 7 are used. For a given feature map $X$, the following convolutions are applied in parallel: $U_1 = \mathrm{Conv}_{1\times1}(X)$, $U_3 = \mathrm{Conv}_{3\times3}(X)$, $U_5 = \mathrm{Conv}_{5\times5}(X)$, and $U_7 = \mathrm{Conv}_{7\times7}(X)$.
The results of the 1 × 1 convolutional layers are retained as skip connections within the Wide Block module. This significantly preserves the original information and passes more details to subsequent layers, thereby fully leveraging the effective information and enhancing training efficiency. The 1 × 1 convolutional layer functions more like a gate, controlling the flow of information between layers. The primary objective is to enable neurons to dynamically adjust their receptive field sizes based on the stimulus content. This is achieved by employing gates to control the flow of information from multiple branches, each carrying information at a different scale, into the neurons of the next layer. To implement this, the gates integrate information from all branches. Initially, the outputs from the various branches (Figure 3b) are fused through element-wise summation:

$$U = U_3 + U_5 + U_7,$$

where $U$ represents the combined feature map obtained by summing the branch feature maps $U_3$, $U_5$, and $U_7$.
ESCA includes channel attention and spatial attention. The channel attention module compresses the feature map along the spatial dimensions to produce a channel attention map; for a given feature map, channel attention focuses on what is significant in that map. Compression is performed using a global average pooling (GAP) operation. Taking the feature map $U$ as an example, the $c$th element of the channel descriptor $z$ is computed by reducing $U$ over the spatial dimensions $H \times W$:

$$z_c = F_{\mathrm{gap}}(U_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j),$$

where $U$ represents the input feature map, $z_c$ denotes the GAP operation applied to channel $c$ of the feature map $U$, $F_{\mathrm{gap}}$ is the global pooling function, and $U_c(i, j)$ represents the value of the input feature map at position $(i, j)$ in channel $c$.
Subsequently, adaptive tuning is applied to selectively emphasize informative features and suppress less useful ones while leveraging global information. This is typically achieved using a simple fully connected (FC) layer, which improves efficiency by reducing dimensionality; however, such reduction degrades channel attention. The Efficient Channel Attention (ECA) [33] module for CNNs was therefore proposed: it avoids dimensionality reduction and efficiently facilitates cross-channel interactions. The ECA module learns channel attention by adaptively determining the kernel size $k$, aggregating convolutional features using GAP without reducing dimensionality, and then performing a one-dimensional convolution followed by a sigmoid activation function. The construction of the channel attention weight $\omega$ is expressed as follows:

$$\omega = \sigma\Big(\delta\big(\mathrm{BN}\big(\mathrm{C1D}_k(z)\big)\big)\Big),$$

where $\sigma$ denotes the sigmoid function; $\delta$ stands for LeakyReLU, an activation function that allows a small, non-zero gradient for negative inputs to prevent dead neurons during training; $\mathrm{BN}$ stands for batch normalization, which normalizes the inputs to the next layer, stabilizing and accelerating the training process; and $\mathrm{C1D}_k$ denotes a one-dimensional convolutional operation with kernel size $k$ applied to the input $z$.
Similarly, the spatial attention module performs channel compression and focuses on the most informative regions of the feature map, complementing the channel attention module. Taking $U$ as an example, the $(i, j)$th element of the spatial descriptor $s$ is computed by reducing $U$ over the channel dimension $C$:

$$s_{i,j} = F_{\mathrm{gap}}(U_{i,j}) = \frac{1}{C} \sum_{c=1}^{C} U_c(i, j),$$

where $U_c(i, j)$ represents the feature map value at spatial location $(i, j)$ in channel $c$, and $s_{i,j}$ denotes the GAP operation applied across the channel dimension at that location.
Finally, the two attention results are multiplied with the feature map:

$$V = U \otimes \omega \otimes s,$$

where $V$ represents the output of the Wide Block and $\otimes$ denotes element-wise multiplication (with broadcasting over the channel and spatial dimensions). Beyond its capability for enhanced global and local information learning, the Wide Block architecture (incorporating Inception and ESCA) demonstrates a remarkable ability to capture diverse feature patterns through multiple branches, thereby significantly enhancing the network's expressiveness and feature diversity. The architecture provides multiple feature fusion pathways, effectively reducing the dependency on singular features. Moreover, the network adaptively selects optimal feature combinations for spatial and spectral information extraction, while simultaneously addressing the vanishing or exploding gradient problem commonly encountered in deep neural networks.
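Putting the pieces together, the following PyTorch sketch shows one plausible realization of the Wide Block with ESCA; the branch widths, the ECA kernel size, and the exact placement of the 1 × 1 skip are our assumptions consistent with the description above, not the released architecture.

```python
import torch
import torch.nn as nn

class ESCA(nn.Module):
    """Efficient spatial-and-channel attention: ECA-style 1-D convolution over the
    GAP channel descriptor, plus a spatial map from channel-wise averaging."""
    def __init__(self, k=3):
        super().__init__()
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                                              # GAP over H x W -> (B, C)
        w_ch = torch.sigmoid(self.conv1d(z.unsqueeze(1))).view(b, c, 1, 1)  # channel weights
        s = torch.sigmoid(x.mean(dim=1, keepdim=True))                      # spatial map, (B, 1, H, W)
        return x * w_ch * s

class WideBlock(nn.Module):
    """Parallel 3x3/5x5/7x7 branches summed element-wise, ESCA attention,
    and a 1x1 convolution retained as a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.skip = nn.Conv2d(ch, ch, 1)
        self.b3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.b5 = nn.Conv2d(ch, ch, 5, padding=2)
        self.b7 = nn.Conv2d(ch, ch, 7, padding=3)
        self.esca = ESCA()

    def forward(self, x):
        u = self.b3(x) + self.b5(x) + self.b7(x)  # element-wise summation of branches
        return self.esca(u) + self.skip(x)        # attention output plus 1x1 skip
```

Under these assumptions, this `WideBlock` could be passed to the generator sketch above through its `make_wide_block` argument.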
3.4. Network Architecture of Discriminator
HyperGAN incorporates an energy-based discriminator [25], designed to maintain the spectral integrity of the LR-HSI while preserving the spatial details of the PAN image. Instead of having a direct probabilistic interpretation, the discriminator is conceptualized as an energy (or contrast) function. Energy values are assigned to both the GT and the fused images, with the aim of achieving similar energy levels. The energy function computed by the discriminator serves as a trainable cost function for the generator, guiding it during training. The discriminator assigns low energy values to regions with high data density and higher energy values to regions with lower data density, enabling a clear distinction between the fused image and the GT. Through this energy feedback mechanism, the generator adjusts its parameters during training, progressively improving the generated image in regions with higher energy values (i.e., regions deviating from the GT), resulting in more realistic images.
As training progresses, the residual distributions of the fused image and the GT converge, indicating similar energy states. Once equilibrium is reached, the fused image becomes indistinguishable from the GT, resulting in the desired HR-HSI. The architecture of the discriminator is shown in Figure 4.
The discriminator $D: \mathbb{R}^{H \times W \times C} \mapsto \mathbb{R}^{H \times W \times C}$ is a CNN-based architecture, where $H$, $W$, and $C$ represent the height, width, and number of channels, respectively. In the discriminator, the processed data tensor is mapped through fully connected layers, without introducing nonlinearity, to and from an embedding state $h \in \mathbb{R}^{N_h}$, where $N_h$ denotes the dimension of the discriminator's hidden state.
In the initial part of the discriminator, the features of the fused image and the GT are concatenated. The SPLIT operation is applied after the convolution layers, where the output generated by the network, following the concatenation, is split into two parts. These two components are then used for separate error calculations: one part is compared to the fused image, and the other part to the GT.
Different energies are assigned to both the fused image and the GT. They are then concatenated and passed through a series of processing steps: first, a 3 × 3 convolution is applied, followed by two ResBlock modules, and another 3 × 3 convolution is performed to separate the concatenated image. This series of convolutions and ResBlock modules ensures that the energy of the fused image aligns with that of the GT. Finally, the difference between the input image and the separated components is calculated to determine the deviation. Through this process, the two parts are effectively merged into one coherent output, allowing the model to align with the energy and detail of the GT, achieving results closer to the high-resolution target.
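The sketch below illustrates this reconstruction-based discriminator in PyTorch; the channel widths and the placement of the SPLIT operation are assumptions that follow the description above (the ResBlock is the same as in the generator sketch and is repeated here so the snippet is self-contained).

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with an identity skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class EnergyDiscriminator(nn.Module):
    """Concatenate the fused image and the GT, reconstruct both through
    conv/ResBlock layers, split the output, and use the residuals against
    the inputs as energies."""
    def __init__(self, bands, feat=64):
        super().__init__()
        self.head = nn.Conv2d(2 * bands, feat, 3, padding=1)
        self.body = nn.Sequential(ResBlock(feat), ResBlock(feat))
        self.tail = nn.Conv2d(feat, 2 * bands, 3, padding=1)

    def forward(self, fused, gt):
        x = torch.cat([fused, gt], dim=1)                          # concatenate along channels
        y = self.tail(self.body(self.head(x)))
        rec_fused, rec_gt = torch.split(y, fused.shape[1], dim=1)  # SPLIT into two parts
        e_fused = (fused - rec_fused).abs().mean()                 # energy of the fused image
        e_gt = (gt - rec_gt).abs().mean()                          # energy of the GT
        return e_fused, e_gt
```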
The discriminator offers distinct advantages over a binary logistic network by providing varied targets through reconstruction-based outputs. Unlike binary logistic loss, which limits gradients to two targets (0 or 1) and can lead to similar gradients across samples, the discriminator's variability enhances training efficiency within small batches. Its structure generates diverse gradient orientations, enabling the use of larger batch sizes without loss of efficiency, a significant benefit given current hardware constraints.
5. Discussion
5.1. Application Scenarios Analysis of HyperGAN
To assess HyperGAN’s performance, three distinct datasets were utilized: the Pavia Center dataset, the Eastern Tianshan dataset, and the Chikusei dataset. The Pavia Center dataset includes nine categories, such as Bitumen, Water, Asphalt, and Tiles, representing an urban environment. The Eastern Tianshan dataset, located in southeast Hami, Xinjiang, China, covers various lithologies, water bodies, ravines, and snow. The Chikusei dataset, from Chikusei, Ibaraki, Japan, features 19 categories, including bare soil, parks, farmland, rice fields, forests, and plastic houses, representing agricultural and urban landscapes.
The quantitative results shown in Table 1, Table 2 and Table 3 demonstrate that the model's performance varies significantly across different scenarios. To identify potential causes, an analysis of the original input images was conducted. SNR [44] was used to evaluate the images, as higher SNR values indicate that the useful information (signal) in the image greatly outweighs the noise, suggesting higher image quality, cleaner data, and clearer details. The results are presented in Table 4.
Furthermore, statistical analyses of the original data were conducted through visualizations of the means [45] and standard deviations [45], as well as scatter plots generated from PCA [8] dimensionality reduction. The smoothness of the channel means and standard deviations is important for evaluating image quality. A generally smooth mean curve with significant drops in specific channels may indicate absorption bands or physical features in the spectrum; this does not necessarily imply poor quality but could instead point to data anomalies. The standard deviation indicates the variation in pixel values; a sudden increase in standard deviation in certain channels without a reasonable explanation may signal a higher presence of noise in those channels.
The scatter plot generated from PCA dimensionality reduction primarily clusters in a relatively tight region, indicating that the spectral characteristics of most pixels vary consistently across the three principal components. This suggests that the data follow a certain pattern, and different pixels in the hyperspectral image may exhibit similar spectral features. A few scattered points at the periphery may represent outliers. If these outliers are limited in number, they are likely to be isolated anomalous pixels that do not significantly impact the overall image quality. However, a larger number of outliers could indicate the presence of noise in the image or issues within certain channels. Typically, the first few principal components after PCA can explain a substantial portion of the data variance. If the first three principal components capture most of the information, it implies that the hyperspectral data contain considerable redundancy and exhibit good image quality. Conversely, if the PCA-transformed data are widely dispersed or lack clear structure among the principal components, it may suggest that the image contains a significant amount of noise or lacks substantial useful information.
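For reference, a small Python sketch of this statistical inspection is given below; the sampling size, plotting choices, and the assumed (H, W, C) array layout are illustrative assumptions rather than the exact analysis pipeline used here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def inspect_hsi(hsi, n_components=3, sample=5000, seed=0):
    """Plot per-band mean/std curves and a 3-D PCA scatter of randomly sampled pixels."""
    pixels = hsi.reshape(-1, hsi.shape[-1]).astype(np.float64)   # (H*W, C)

    # Per-band mean and standard deviation curves
    plt.figure()
    plt.plot(pixels.mean(axis=0), label="mean")
    plt.plot(pixels.std(axis=0), label="std")
    plt.xlabel("band index")
    plt.legend()

    # PCA scatter of a random pixel subset
    rng = np.random.default_rng(seed)
    idx = rng.choice(pixels.shape[0], min(sample, pixels.shape[0]), replace=False)
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(pixels[idx])
    print("explained variance ratio:", pca.explained_variance_ratio_)

    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(scores[:, 0], scores[:, 1], scores[:, 2], s=2)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_zlabel("PC3")
    plt.show()
```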
The visualization results for the Pavia Center dataset, the Eastern Tianshan dataset, and the Chikusei dataset are presented in Figure 8, Figure 9, and Figure 10, respectively.
As shown in Table 4, the Eastern Tianshan dataset has the highest signal-to-noise ratio (SNR). Figure 9 illustrates that the scatter points are primarily concentrated in a relatively tight region, indicating that the spectral characteristics of most pixels vary consistently across the three principal components. This suggests that the data follow a certain pattern, and different pixels in the hyperspectral image may exhibit similar spectral features. The figure also reveals a few scattered points at the periphery, which may represent outliers. Since the number of these outliers is limited, they are likely isolated anomalous pixels that do not significantly affect the overall image quality.
In the results presented in Table 2, HyperGAN achieved the highest PSNR value on the Eastern Tianshan dataset, as well as the best results for the SSIM, SAM, and ERGAS metrics. However, due to the diverse categories in the Eastern Tianshan dataset, including various lithologies and ravines, it can be challenging to achieve optimal fusion results when a single pixel contains two or more similar spectral features. Consequently, the CC between the fused image and the GT is only 0.958.
In the Chikusei dataset, the SNR is 0.51 dB; however, the fused PSNR and ERGAS metrics are better than those of the Pavia Center dataset, and the CC is the highest among the three datasets. This indicates that HyperGAN is better suited to the agricultural and urban areas represented in these scenarios. However, as shown in Figure 10, there is significant fluctuation in the image mean, a wide range of residuals, and a higher number of outliers, which may suggest the presence of noise in the image. Additionally, the limited quality of the original images results in suboptimal SSIM and SAM performance.
In the Pavia Center dataset, the SNR is 1.83 dB. The curve in Figure 8a is smooth, while Figure 8b shows scattered outliers. As indicated in Table 1 and Figure 5, the results obtained through fusion with HyperGAN demonstrate superior performance from both visual and quantitative perspectives.
5.2. Resolution Ratio Analysis of HyperGAN
To further assess the fusion capability of the model at higher resolution ratios, we doubled the resolution ratio to 1:12 and conducted experiments on the three datasets.
The average quantitative results of the proposed and competing methods on the Pavia Center dataset, the Eastern Tianshan dataset, and the Chikusei dataset for a resolution ratio of 1:12 are shown in Table 5, Table 6, and Table 7, respectively. As shown in Table 5, Table 6 and Table 7, HyperGAN has the highest CC, PSNR, and SSIM values and the lowest SAM and ERGAS values.
Figure 11, Figure 12 and Figure 13 show the visual effects and residual images on the Pavia Center dataset, the Eastern Tianshan dataset, and the Chikusei dataset for a resolution ratio of 1:12, respectively.
The results clearly demonstrate that HyperGAN consistently outperforms both traditional and deep learning-based methods across all reference metrics. It effectively minimizes residuals, retains clearer spatial details, and experiences less information loss. In terms of both spectral and spatial information, the fusion results produced by HyperGAN are closest to the GT. Whether at a resolution ratio of 1:6 or 1:12, HyperGAN successfully preserves the spectral integrity of LR-HSI and the spatial details of PAN images, making it a highly efficient method for image fusion.
However, based on the comparison of results across the three datasets at resolution ratios of 1:6 and 1:12, it is evident that when the resolution ratio increases to 1:12, all evaluation metrics and visual results are inferior to those at 1:6. This is primarily due to information loss and the increased difficulty of detail recovery. At the 1:12 resolution ratio, the gap between the LR-HSI and the PAN image becomes significantly larger, meaning that spatial details in the LR-HSI are sparser and the generator must reconstruct high-resolution details from much less spatial information. As the resolution disparity increases, the generator faces greater challenges in effectively extracting spatial information. The rich details present in the high-resolution PAN image become more difficult to retain or transfer to the LR-HSI, resulting in missing or blurred spatial information in the fused image. Furthermore, the larger resolution gap complicates the balance between spectral and spatial features, leading to a decline in fusion performance and, consequently, lower evaluation metrics.
5.3. Gaussian Kernel Size Analysis of HyperGAN
In developing the HyperGAN dataset, the original HS dataset is used as the GT. The HS data are blurred using Gaussian kernels and downsampled into non-overlapping cubes to form the modified dataset. The impact of applying different Gaussian kernels varies, leading to potential differences in data quality and characteristics. To investigate these effects, we processed the original data from three distinct datasets using various Gaussian kernels, followed by downsampling with a factor of 6. The downsampled data were then upsampled using bicubic interpolation. The upsampled data were compared both visually and quantitatively to the original data. For the quantitative evaluation, we utilized metrics such as PSNR, CC, SAM, SSIM, and ERGAS.
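A simple NumPy/OpenCV sketch of this degradation-and-evaluation protocol is shown below; the odd-size adjustment required by OpenCV's Gaussian blur, the PSNR peak value, and the assumed (H, W, C) layout are our assumptions for illustration.

```python
import numpy as np
import cv2

def simulate_lr_hsi(gt, kernel=12, factor=6):
    """Blur each band with a Gaussian kernel, downsample by `factor`, then upsample
    back to the GT size with bicubic interpolation for comparison.
    `gt` is assumed to be an (H, W, C) array with H and W divisible by `factor`."""
    h, w, c = gt.shape
    k = kernel if kernel % 2 == 1 else kernel + 1   # OpenCV requires odd kernel sizes
    blurred = np.stack(
        [cv2.GaussianBlur(gt[..., b].astype(np.float64), (k, k), 0) for b in range(c)], axis=-1)
    lr = blurred[::factor, ::factor, :]
    up = np.stack(
        [cv2.resize(lr[..., b], (w, h), interpolation=cv2.INTER_CUBIC) for b in range(c)], axis=-1)
    return lr, up

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio for images scaled to [0, peak]."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```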
We applied Gaussian kernels of sizes 3 × 3, 6 × 6, 9 × 9, 12 × 12, 15 × 15, and 18 × 18 to process the original data, and the quantitative comparison results between the upsampled data and the GT for the Pavia Center dataset, the Eastern Tianshan dataset, and the Chikusei dataset are presented in Table 8, Table 9, and Table 10, respectively. As demonstrated in these tables, the data processed with a 12 × 12 Gaussian kernel achieved the highest CC, PSNR, and SSIM values while having the lowest SAM and ERGAS values.
Figure 14, Figure 15 and Figure 16 depict the visual comparisons of the data processed using different Gaussian kernels with the GT for the Pavia Center, Eastern Tianshan, and Chikusei datasets, respectively.
Based on the visual results and quantitative analysis of the three datasets, it can be seen that the 12 × 12 Gaussian kernel generates smoother images closer to the GT, effectively reducing the impact of high-frequency noise. Consequently, the model can better focus on capturing critical spatial structures and details, learning essential features rather than being distracted by noise or irrelevant details.
Moreover, the 12 × 12 Gaussian kernel strikes a balance between blurring and preserving image details during downsampling, providing a clearer representation of global information compared to smaller kernels like 6 × 6 or 9 × 9. Excessive blurring leads to significant information loss, potentially resulting in uncertainty or underfitting during training, thereby hindering the model’s ability to learn genuine inter-image relationships. In contrast, the 12 × 12 Gaussian kernel provides relatively stable and consistent image quality, ensuring a more efficient learning process and enhancing the model’s feature extraction capabilities.
5.4. Ablation Experiment
To evaluate the contributions of Inception and ESCA to HyperGAN, we conducted ablation experiments by removing Inception and/or ESCA. Apart from the network structure, all parameter settings and experimental configurations remained identical to ensure a fair comparison. The results are summarized in Table 11. These experiments were performed on the Eastern Tianshan dataset at a resolution ratio of 1:6.
Specifically, the model without the ESCA module performs better than the model without the Inception module, indicating that widening the network to capture more global information has a greater impact on the model. Additionally, the ERGAS shows a more significant decline when the model lacks ESCA, suggesting that the Inception module has a stronger influence on reducing the spectral and spatial distortions between the fused image and the GT. In summary, the ablation experiment demonstrates that both the Inception and ESCA modules contribute to achieving satisfactory performance.
5.5. Parameters
To evaluate the computational complexity and time cost of HyperGAN, Table 12 provides the parameters, convergence speed (epochs), and inference time. The HyperGAN method demonstrates remarkable performance with a sophisticated model architecture, evident in its substantial parameter count of 3.766 M. This large number of parameters indicates the model's strong ability to learn and capture complex patterns, which is critical for achieving high performance in challenging tasks. Although HyperGAN requires longer inference times (42.499 ms), this trade-off is warranted by its parameter complexity, which enables richer feature representations and potentially higher accuracy for tasks that demand a deeper level of understanding.
While the model's computationally intensive inference process results in longer inference times, it provides significant performance improvements over simpler models such as HyperPNN1 and HyperPNN2, which have faster inference times (approximately 20 ms) but less complex architectures. The increased model size and associated inference time of HyperGAN highlight its potential to achieve state-of-the-art results in demanding applications, where accuracy and robustness are prioritized over inference speed.
6. Conclusions
This paper introduces HyperGAN, a novel GAN-based network architecture specifically designed for the fusion of hyperspectral and panchromatic images. HyperGAN frames pansharpening as a task-specific problem, leveraging a generative adversarial framework to achieve optimized results. The model features an energy-based discriminator that significantly enhances the generator’s ability to capture both spatial and spectral information, ensuring that the fused images closely match the GT.
The HyperGAN architecture incorporates a Wide Block module within the generator, which combines spectral and spatial information extraction to improve image quality. This module includes Inception and ESCA components, enabling the generator to effectively learn and integrate detailed spectral and spatial features while expanding the network width. The energy-based discriminator balances the energy between the fused images and the GT, ensuring superior fusion results.
In experiments conducted on the Pavia Center, Eastern Tianshan, and Chikusei datasets with an input image resolution ratio of 1:6, both quantitative evaluation metrics and visual quality assessments consistently demonstrated that HyperGAN outperformed all other methods across all reference metrics. Additionally, ablation experiments confirmed the effectiveness of both the Inception and ESCA modules.
In the future, we will focus on improving the generator and the energy-based discriminator to enable the model to effectively fuse imagery from different scenes with similar imaging resolutions. We also plan to extend the current framework to support the fusion of multispectral and hyperspectral imagery. To address the challenge of relying on labeled data, we aim to explore unsupervised learning techniques to generate reliable pseudo-ground truth labels. This approach will improve the robustness and adaptability of the framework, making it more suitable for real-world applications where labeled data are often limited.