Article

ISAR Image Quality Assessment Based on Visual Attention Model

East China Research Institute of Electronic Engineering, Hefei 230088, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(4), 1996; https://doi.org/10.3390/app15041996
Submission received: 6 January 2025 / Revised: 6 February 2025 / Accepted: 10 February 2025 / Published: 14 February 2025

Abstract

The quality of ISAR (Inverse Synthetic Aperture Radar) images has a significant impact on the detection and recognition of targets. Therefore, ISAR image quality assessment is a fundamental prerequisite and primary step in the utilization of ISAR images. Previous ISAR image quality assessment methods typically extract hand-crafted features or use simple multi-layer networks to extract local features. Hand-crafted features and local network features usually lack the global information of ISAR images. Furthermore, most deep neural networks obtain feature representations solely by minimizing the gap between the predicted quality score and the ground truth, neglecting the strong correlations between features and quality scores during feature extraction. This study proposes a Gramian Transformer to explore the similarity and diversity of features extracted from different images, thus obtaining features containing quality-related information. The Gramian matrix of the features is computed to obtain the score token through a self-attention layer. This prompts the network to learn more discriminative features that are closely associated with quality scores. While the Transformer architecture extracts global information, the Channel Attention Block (CAB) captures complementary information from different channels in an image, aggregating and mining information from these channels to provide a more comprehensive evaluation of ISAR images. ISAR images are formed from target scattering points against a background containing substantial noise, and the Inter-Region Attention Block (IRAB) is utilized to extract local scattering point features, which determine the clarity of the target. In addition, extensive experiments are conducted on an ISAR image dataset (including space stations, ships, aircraft, etc.). The evaluation results of our method on this dataset are significantly superior to those of traditional feature extraction methods and existing image quality assessment methods.

1. Introduction

The development of radar imaging technology has led to an increased application of ISAR imaging in target recognition [1]. However, the quality of ISAR images significantly impacts the detection and recognition of targets [2]. Consequently, assessing ISAR image quality becomes a fundamental task for ISAR image applications.
The field of general image quality assessment [3,4] has advanced significantly; however, assessing the quality of ISAR images [5] remains a considerable challenge. Traditional methods for ISAR quality assessment [6,7] typically rely on manually defined features, such as mean values, equivalent number of looks, and contrast. These approaches can be labor-intensive and prone to subjective evaluation errors. Additionally, traditional assessment methods primarily focus on image resolution, noise, and contrast, with limited attention given to other critical image features such as detail retention and target contour clarity. These limitations represent a significant bottleneck in the quality evaluation of ISAR images, which hinders their broader application. To enable a more comprehensive and objective evaluation of ISAR images, further research and development in ISAR image quality assessment is essential.
In recent years, general image quality assessment, as a popular topic in image content understanding, has attracted much attention from academia and industry. Deep learning methods for ISAR image quality assessment imitate the human visual system and automatically predict the quality of ISAR images [8]. Deep models extract visual information from various receptive fields through stacked neural layers, which provides a powerful representation of ISAR images. The adoption of deep learning methods contributes to the collection of comprehensive and objective features during the training phase, thereby facilitating a more accurate assessment of ISAR images [9]. Some methods evaluate ISAR image quality using local features extracted with a few convolution layers [8,9,10,11]; such features cannot capture interactions across regions and channels in the image, and hence these methods cannot evaluate the image at a global level. The Transformer model is able to capture global information from the image [12]. Through the self-attention mechanism [13,14,15], the Transformer encoder is capable of learning complex features and relationships in ISAR images, thereby facilitating a more comprehensive assessment of ISAR image quality. However, although previous works [3,16] employ Transformers with a large number of parameters to extract features for image quality evaluation, the relation between the extracted features and the quality score is not carefully considered, resulting in comparatively modest feature discrimination. In addition, existing deep learning-based methods neglect the distinctive characteristics of ISAR images. They generally adopt the MSE loss that is common in generic image quality evaluation; no loss function tailored to ISAR images has been designed, which hinders the optimization of ISAR image quality evaluation models.
We have designed the Gramian Transformer (Gram–T) to obtain features that are closely associated with image quality scores. Our approach computes the Gramian matrix of features to generate score tokens through a single attention layer. We propose a new network that incorporates the Gramian Transformer along with two attention mechanism blocks. Specifically, we apply the Gramian matrix to features extracted from Transformer encoders to derive score tokens that are tightly linked to these features. This allows us to explore image quality information during the feature extraction process rather than only at the final fully connected layer, effectively reducing the loss of critical information. As a result, we obtain more discriminative features and strengthen their relevance to image quality.
Additionally, our method utilizes cross-channel and cross-region attention mechanisms to enhance interactions between both channels and spatial features. The Inter-Region Attention Block (IRAB), which focuses on spatial relationships, captures distant semantic relationships between locations. Given that ISAR images are formed from target scattering points and the background contains substantial noise, the IRAB is used to extract local scattering point features. The key to ISAR image quality lies in the clarity of target imaging; therefore, the IRAB is beneficial for the quality assessment of ISAR images. The scattering intensity of targets in ISAR images is reflected across multiple channel dimensions. Therefore, the Channel Attention Block (CAB) is utilized to aggregate and mine multi-channel information to extract target representations, which facilitates the objective evaluation of ISAR image quality. This cross-dimensional approach allows the modules to synergistically improve interactions between local regions and various channels, thereby enhancing the representational capability of the algorithm.
In ISAR (Inverse Synthetic Aperture Radar) images, targets appear as isolated and sparse groups of strong scatterers, which reflect their electromagnetic scattering structures. Moreover, the imaging texture effects of these targets remain relatively stable at different angles. This leads to structural and texture similarities between patches in ISAR images, which encourages us to design structural and texture constraints to improve model performance.
In summary, the key research contributions and innovations of this paper include the following:
(1)
In this paper, Gram–T is designed for the quality assessment of ISAR images. The Gram–T model is highly adaptable and flexible when dealing with sequence data. By transforming an image into sequence data, the Transformer encoder can effectively capture spatial relationships and global information in the image. Furthermore, computing the Gramian matrix enhances the contrast between features representing different quality levels of ISAR images, thereby strengthening the association between ISAR image features and quality levels. In this way, a strong association between features and image quality can be explored, which is conducive to learning more discriminative features.
(2)
We employ the Channel Attention Block (CAB) and Inter-Region Attention Block (IRAB) to enhance interactions between channels and space within features. The spatial dimension-oriented attention module captures local scattering point features, while the channel-oriented attention module captures semantic association information on channels. The CAB and IRAB modules enhance the extraction of features that characterize target clarity and image quality.
(3)
Structure and texture constraints are proposed to boost model performance due to the exclusive properties of ISAR images. Extensive experiments are conducted on the ISAR dataset. Compared with various network architectures and the latest ISAR image evaluation methods, our method achieves the best performance. Ablation experiments demonstrate the effectiveness of each module.

2. Proposed Method

2.1. Overall Architecture

The Transformer encoder extracts global features from ISAR image patches through a multi-head attention mechanism. Then, the Gramian matrix of the features is computed to acquire score tokens strongly associated with those features through one attention layer. The Gramian matrix can be seen as the set of dot products between different features, capturing their similarity and diversity. Because the score token ties the similarity and diversity of features directly to the quality score, the features learn more information about the quality level. This encourages features extracted from images of similar quality levels to contain common information about those levels and features extracted from images of different quality levels to contain discriminative information. Moreover, image quality information can be explored during feature extraction rather than only at the final fully connected layer, substantially reducing information loss. The extracted features then pass through the CAB and IRAB to increase interactions among regions and channels. In this way, detailed information in the images can be explored and the representation becomes more robust. Figure 1 illustrates the overall architecture, which comprises Gram–T and two attention blocks.

2.2. Gramian Transformer

The Gramian Transformer (Gram–T) is designed to extract global information from images and strengthen the association between features and the image quality score. Gram–T comprises three principal components. The first is the linear projection of flattened patches, which transforms image patches into embeddings. The second is the Transformer encoder, which extracts features from the embeddings. The third is Gram attention, which obtains score tokens by computing the Gramian matrix.
The Gramian matrix is computed from the output features of the Transformer encoder, which are obtained by concatenating the outputs of the 7th–10th encoder layers. Figure 1 shows that the score token is assigned through Gram attention. We leverage the Gramian matrix to obtain pairwise feature similarities in order to directly increase the interaction between features and score tokens during feature extraction. Then, the score token is introduced to the final MLP stage to avoid significant information loss. On the one hand, we can promote agglomeration of features belonging to similar quality levels and diversity of features belonging to different quality levels. On the other hand, because the score token has an immediate impact on the loss, the loss function can directly drive its optimization. Ultimately, our model obtains more discriminative features. We obtain the projected feature $V_X = X W_C$ from the output feature $X \in \mathbb{R}^{N \times C \times L}$, where $W_C \in \mathbb{R}^{L \times S}$. Thus, the Gramian matrix is computed as $\mathrm{Gram}_X = (V_X)^T V_X$, which measures the pairwise similarity of features. Finally, our method introduces the score token, as follows:
$$\mathrm{score\ token} = \big( (V_X)^T V_X \big) \cdot \mathrm{Softmax}\big( Q K^T / \beta \big)$$
where $K = X W_K$ and $Q = X W_Q$, with $W_K, W_Q \in \mathbb{R}^{L \times S}$. This approach ensures that the score token is consistent with the features; the score token is no longer randomly initialized, so it directly affects the model parameters during training.
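For concreteness, the following PyTorch sketch shows one way the Gram attention step could be implemented. The projection size S, the assumption that S equals C (needed for the matrix product in the formula above to be defined), and the pooling of the result into a single token vector are illustrative choices on our part, not details taken from the paper.

```python
import torch
import torch.nn as nn


class GramAttention(nn.Module):
    """Sketch of the Gram attention step from the score-token formula above.

    x has shape (N, C, L) as in the text; W_C, W_K, W_Q map the last dimension
    from L to S. We assume S == C so that Gram(X) (S x S) and the attention map
    (C x C) can be multiplied as written, and we pool the result into a single
    token vector per image; both choices are illustrative assumptions.
    """

    def __init__(self, seq_len: int, dim: int, beta: float = 8.0):
        super().__init__()
        self.w_c = nn.Linear(seq_len, dim, bias=False)  # V_X = X W_C
        self.w_k = nn.Linear(seq_len, dim, bias=False)  # K   = X W_K
        self.w_q = nn.Linear(seq_len, dim, bias=False)  # Q   = X W_Q
        self.beta = beta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v = self.w_c(x)                                                 # (N, C, S)
        k = self.w_k(x)
        q = self.w_q(x)
        gram = v.transpose(1, 2) @ v                                    # (N, S, S): pairwise feature similarity
        attn = torch.softmax(q @ k.transpose(1, 2) / self.beta, dim=-1)  # (N, C, C)
        score_map = gram @ attn                                         # (N, S, C); requires S == C
        return score_map.mean(dim=1)                                    # (N, C) score token


# Example with illustrative sizes: batch 8, C = 256 channels, L = 784 tokens.
feats = torch.randn(8, 256, 784)
token = GramAttention(seq_len=784, dim=256)(feats)                      # -> (8, 256)
```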

2.3. Channel Attention Block

In regular self-attention (SA), the key–query dot product promotes feature interactions between patches in the image, thereby extracting features across the spatial dimension. However, the information embedded in channels is neglected. The CAB increases feature interactions among channels. We concatenate the feature maps of the 7th–10th layers of Gram–T to obtain the output feature $F \in \mathbb{R}^{N \times C \times L}$, where $N$ is the batch size and $L = H \times W$. $Q, K, V \in \mathbb{R}^{N \times C \times L}$ are obtained by passing $F$ through three independent projections. The CAB multiplies $Q \in \mathbb{R}^{N \times C \times L}$ with $K^T \in \mathbb{R}^{N \times L \times C}$ to obtain correlations between channels and then passes them through a softmax layer to obtain normalized attention weights. The output is obtained by multiplying the attention weights with $V$. The process is outlined as follows:
$$\hat{F} = \mathrm{Attn}(Q, K, V) + F$$
$$\mathrm{Attn}(Q, K, V) = V \cdot \mathrm{Softmax}\big( Q K^T / \beta \big)$$
where $\beta$ is a tunable parameter. The CAB module has two advantages: first, it integrates the feature maps of four different layers of Gram–T and assigns different weights to the channels according to their relevance to the quality score; second, it encodes semantic information among diverse channels, which facilitates the extraction of global information (Figure 2).
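A minimal PyTorch sketch of the channel attention computation above; the use of linear projections for Q, K, V and a learnable β are assumptions, since these details are not specified in the text.

```python
import torch
import torch.nn as nn


class ChannelAttentionBlock(nn.Module):
    """Sketch of the CAB: attention is computed over channels rather than patches.

    The input F has shape (N, C, L). The attention map (N, C, C) is applied to V,
    and the input is added back as a residual, following the formulas above.
    """

    def __init__(self, seq_len: int):
        super().__init__()
        self.proj_q = nn.Linear(seq_len, seq_len, bias=False)
        self.proj_k = nn.Linear(seq_len, seq_len, bias=False)
        self.proj_v = nn.Linear(seq_len, seq_len, bias=False)
        self.beta = nn.Parameter(torch.ones(1))  # tunable temperature beta

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        q, k, v = self.proj_q(f), self.proj_k(f), self.proj_v(f)          # each (N, C, L)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.beta, dim=-1)   # (N, C, C) channel correlations
        return attn @ v + f                                               # residual output, (N, C, L)
```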

2.4. Inter-Region Attention Block

The output feature $\hat{F} \in \mathbb{R}^{N \times C \times L}$ of the CAB module is convolved and rearranged to obtain $\tilde{F}_0 \in \mathbb{R}^{N \times C \times H/8 \times W/8}$, which is subsequently input into the IRAB. The IRAB module has three sub-modules: IRL, Conv, and Coef. The IRL includes the Intra-Region Attention layer, a Multi-Layer Perceptron, and the Inter-Region Attention layer. The Intra-Region Attention layer performs self-attention within a single region to improve interactions inside that region. The Inter-Region Attention layer computes attention across different regions using a sliding window over all regions. Coef is a scaling parameter $\alpha$. The formula is outlined as follows:
$$F_{out} = \alpha \cdot H_{CONV}\big( H_{IRL}(\tilde{F}_0) \big) + \tilde{F}_0$$
$\tilde{F}_0$ denotes the rearranged feature in $\mathbb{R}^{N \times C \times H/8 \times W/8}$ that is input into the IRAB. $\tilde{F}_0$ passes through the IRL, a convolution layer, and a ReLU layer, and the result is multiplied by the scale factor 0.8. The input $\tilde{F}_0$ is then added back to obtain $F_{out} \in \mathbb{R}^{N \times C \times H/8 \times W/8}$. This module further strengthens the connections between spatial regions and extracts fine-grained features across regions for image quality assessment (Figure 3).
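The following skeleton shows only the residual and scaling structure described above; the IRL sub-module (Intra-Region Attention, MLP, Inter-Region Attention) is left abstract, and the convolution configuration is an illustrative assumption. The default α = 0.8 follows the scale-factor ablation in Table 3.

```python
import torch
import torch.nn as nn


class IRAB(nn.Module):
    """Skeleton of the IRAB wiring: F_out = alpha * Conv(IRL(F0)) + F0."""

    def __init__(self, channels: int, irl: nn.Module, alpha: float = 0.8):
        super().__init__()
        self.irl = irl                       # intra-/inter-region attention sub-module (not detailed here)
        self.conv = nn.Sequential(           # convolution + ReLU; kernel size is an assumption
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.alpha = alpha                   # the Coef scaling parameter

    def forward(self, f0: torch.Tensor) -> torch.Tensor:
        # f0: (N, C, H/8, W/8), the rearranged CAB output
        return self.alpha * self.conv(self.irl(f0)) + f0


# Example with a placeholder IRL sub-module:
# irab = IRAB(channels=256, irl=nn.Identity())
```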

2.5. Loss Function

$F_{out}$ is input into the MLP block, which is composed of a linear layer, ReLU, dropout, another linear layer, and ReLU. The output $F' \in \mathbb{R}^{N \times H/8 \times W/8}$ denotes the prediction score of each patch. The score token output by Gram–T passes through an MLP to obtain an integral quality evaluation score. For an image, the total prediction score is then the sum of the prediction scores of all patches and this wholeness score. This refines the evaluation of image patches and makes the overall image quality evaluation more accurate. Owing to the texture and structure similarity among patches in an ISAR image, we propose structure and texture constraints to improve model performance on the ISAR image dataset. In addition, MSE loss is employed to measure the distance between the predicted quality score and the ground truth. The loss function is designed according to the characteristics of ISAR images and includes the following: (1) Squared error loss (MSE loss), where $N$ is the number of samples, $\hat{y}_i$ is the predicted quality score, and $y_i$ is the true label of the $i$-th ISAR image:
$$\mathrm{MSE}_{LOSS} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
(2) Texture and structure similarity constraint, where $N$ is the number of samples, $\mu_{ij}$ is the predicted quality score of the $j$-th patch of the $i$-th image, and $\bar{\mu}_i$ is the average predicted quality score over all patches of the $i$-th ISAR image:
$$\mathrm{TSSC}_{LOSS} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{L} \sum_{j=1}^{L} (\mu_{ij} - \bar{\mu}_i)^2$$
The final loss function is
$$L_{Quality} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 + \frac{1}{N} \sum_{i=1}^{N} \frac{1}{L} \sum_{j=1}^{L} (\mu_{ij} - \bar{\mu}_i)^2$$
where L is the total number of patches.
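A short PyTorch sketch of the combined loss, assuming image-level predictions and labels of shape (N,) and per-patch predictions of shape (N, L).

```python
import torch


def quality_loss(pred_scores: torch.Tensor, gt_scores: torch.Tensor,
                 patch_scores: torch.Tensor) -> torch.Tensor:
    """Sketch of the loss above: image-level MSE plus the texture/structure
    consistency term that pulls each patch score towards the mean patch score
    of its image. Assumed shapes: pred_scores, gt_scores (N,); patch_scores (N, L).
    """
    mse = torch.mean((pred_scores - gt_scores) ** 2)
    patch_mean = patch_scores.mean(dim=1, keepdim=True)             # per-image mean patch score
    tssc = ((patch_scores - patch_mean) ** 2).mean(dim=1).mean()    # averaged over patches, then images
    return mse + tssc
```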

3. Experimental Setup and Analysis of Results

We collected the dataset by gathering ISAR images from the Internet and generating simulated ISAR images, acquiring 564 ISAR images containing satellite, aircraft, vehicle, and ship targets, among others. ISAR images of satellite, aircraft, vehicle, and ship targets each account for about one fifth of the total; the remaining fifth consists of cluttered and blurred images. The experiments described in this paper were conducted on an NVIDIA GeForce RTX 2080 graphics card (NVIDIA, Santa Clara, CA, USA) with PyTorch v1.8. Input images were cropped to 224 × 224 pixels, with a vertical flip probability of 0.5. Each batch consisted of 8 images. For feature extraction, we utilized a pre-trained ViT model as the backbone. Due to the confidentiality of ISAR images, we created an ISAR image database using simulation software along with collected images. The ISAR images were divided into a training set and a test set in an 80:20 ratio, with labels assigned based on the average scores given by professionals. We recognize that the dataset, comprising 564 ISAR images, may not be sufficiently large or diverse to fully capture the variability and complexity of real-world scenarios. To address these concerns, we have implemented the following strategies (a minimal pre-processing sketch reflecting the settings above follows this list):
Data Augmentation: We have applied data augmentation techniques to artificially expand the dataset, introducing variations that can help the model learn more robust features.
Cross-Validation: We have employed cross-validation to ensure that the model’s performance is consistent across different subsets of the data, thereby reducing the risk of overfitting to a particular subset.
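As referenced above, a minimal pre-processing and backbone-loading sketch under the reported settings (224 × 224 crops, vertical flip probability 0.5, batch size 8, pre-trained ViT backbone); the timm model variant and the dataset class are placeholders rather than the authors' exact configuration.

```python
import timm  # assumed source of the pre-trained ViT backbone
from torchvision import transforms

# Pre-processing matching the reported settings; the dataset class itself is not shown.
train_transform = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ToTensor(),
])

# Pre-trained ViT backbone; the exact timm variant is an assumption.
backbone = timm.create_model("vit_base_patch16_224", pretrained=True)

# Batch size 8, 80:20 train/test split (train_dataset is a placeholder):
# from torch.utils.data import DataLoader
# train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
```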
The evaluation criteria are SRCC and PLCC. The Spearman rank-order correlation coefficient (SRCC) measures the monotonic agreement between two sets of rankings; its value ranges from −1 to 1, and a value of 1 indicates that the two rankings are identical.
$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$$
$N$ denotes the number of samples, and $d_i$ denotes the difference between the subjective and objective quality score rankings of the $i$-th image.
The Pearson linear correlation coefficient (PLCC) describes the correlation between the objective evaluation scores of the algorithm and the subjective scores of human observers. $s_i$ and $\hat{s}_i$ denote the predicted quality score and the subjective quality score of the $i$-th image, respectively, and $\mu_s$ and $\mu_{\hat{s}}$ denote their respective means.
$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (s_i - \mu_s)(\hat{s}_i - \mu_{\hat{s}})}{\sqrt{\sum_{i=1}^{N} (s_i - \mu_s)^2} \sqrt{\sum_{i=1}^{N} (\hat{s}_i - \mu_{\hat{s}})^2}}$$
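In practice, both metrics can be computed directly with SciPy, as in the short sketch below.

```python
import numpy as np
from scipy import stats


def srcc_plcc(pred: np.ndarray, gt: np.ndarray):
    """Compute SRCC and PLCC between predicted and ground-truth quality scores."""
    srcc = stats.spearmanr(pred, gt).correlation   # Spearman rank-order correlation
    plcc = stats.pearsonr(pred, gt)[0]             # Pearson linear correlation
    return srcc, plcc
```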
Figure 4 shows the loss, SRCC, and PLCC curves during training. The abscissa is the number of epochs, and the ordinate is the corresponding measurement value. It can be seen that the training loss decreases as the number of epochs increases. The PLCC grows stably, while the SRCC improves to a maximum of 0.9103. During the first 50 epochs, the fitting ability of the model increases and its performance on the dataset improves.
We show some typical ISAR images, including satellite and ship targets, in Figure 5. The first image is of average quality, the middle two images are of good quality, and the last image is of medium to good quality, effectively representing the overall distribution of the dataset. The ground truth score and prediction score are displayed below the images. Notably, there is only a small gap between the ground truth and predicted scores, indicating that our model predicts image quality accurately.
The experimental results in Table 1 show the performance of different image quality assessment methods (traditional methods, BRISQUE, NIQE, VGG16, Resnet18, Resnet34, Resnet50, ViT, CNN-based Regression, KNN-based Regression, VPSL, EBFF, and ours) on two metrics, SRCC and PLCC, which measure the correlation between predicted and actual results. Traditional methods such as MQEM, BRISQUE, and NIQE rely on manual feature extraction, and their performance on this task is relatively low, with SRCC and PLCC not reaching 0.7. VGG16, Resnet18, Resnet34, and Resnet50 are deep learning methods with different architectures. CNN-based Regression and KNN-based Regression [9] use a CNN and a KNN, respectively, to learn the mapping between ISAR image features and their corresponding quality scores; this method, proposed in 2023, features a relatively rudimentary network architecture and yields lower performance than ours. VPSL [17], proposed in 2023, aims to meet the auto-filtering requirements for ISAR images of space targets by introducing a hierarchical evaluation approach based on fusion features. However, the SVM classifier it uses cannot effectively extract features from ISAR images. EBFF [18] involves dual-channel input by incorporating the original ISAR images alongside eye-tracking-based heatmaps into a residual network; utilizing both ISAR images and heatmaps can lead to high computational costs and resource demands. The method in this paper is based on the Gram–T backbone architecture and the attention mechanism. According to the results in the table, our method achieves the best results in the SRCC and PLCC metrics, which are 0.9103 and 0.8627, respectively, with a total score of 1.773. This indicates that our method has high accuracy and relevance in predicting image quality, outperforming traditional methods and other typical deep learning models. The table also reports the inference time of each method on a 1080Ti GPU, showing that the ViT-based model achieves competitive inference times due to its efficient parallel processing, making it suitable for real-time applications such as ISAR image quality assessment.

4. Discussion of the Method

4.1. Ablation Studies

The ablation studies presented in Table 2 demonstrate the progressive improvement in performance as different components are added. The Gram–T model shows a significant improvement over traditional methods, with both SRCC and PLCC exceeding 0.8 and obtaining a total score of 1.730. The addition of the CAB module further enhances performance, increasing the total score to 1.744. The inclusion of the IRAB module leads to further improvements in both SRCC and PLCC, with the combined CAB and IRAB modules achieving the highest performance, with a total score of 1.773. These results highlight the effectiveness of the model structure and the attention mechanisms in improving the model’s ability to capture discriminative features and enhance image quality assessment.
Additionally, in Table 3, experiments on the selection of scale factors are conducted. It is obvious that the model performs best when the scale factor is 0.8.

4.2. Attention Heatmap Analysis

From the attention heatmaps in Figure 6, it can be seen that after Gram–T extracts the features, the CAB and IRAB modules constrain the attention weights. We selected four representative ISAR images: the first and second are of good quality, the third is slightly blurry, and the fourth is cluttered. The high-response areas of the attention focus on salient objects, which is consistent with human visual judgment. Therefore, this behavior is conducive to improving assessment performance.

4.3. Potential Applications of Gram–T

The Gram–T model, with its ability to capture global information and feature interactions, holds promise for applications beyond ISAR image quality assessment. In medical imaging, the model could be used to enhance the quality assessment of MRI scans, aiding in the diagnosis of various conditions. For instance, Mayo Clinic’s Neurology AI Program has demonstrated significant improvements in the accuracy of brain image interpretations by integrating AI with clinician expertise. Similarly, in general image quality assessments, the Gram–T model could be leveraged to extract robust image representations, which are beneficial for image assessment tasks.

5. Conclusions

In this paper, a new backbone model, called Gram–T, is proposed for the quality assessment of ISAR images. The Gramian Transformer computes the Gramian matrix of features to reinforce the score token through an attention layer, strengthening the correlation between features and the image quality score during feature extraction rather than only at the final fully connected layer. Furthermore, this paper applies the attention mechanism in both the channel and spatial dimensions: the spatial attention module captures long-range semantic associations across locations, while the channel attention module captures semantic associations across channels. Experimental results show that our method achieves the best results in the SRCC and PLCC metrics compared to state-of-the-art networks. Ablation studies demonstrate the effectiveness of the network's architecture and of the spatial and channel attention mechanisms.

Author Contributions

Conceptualization, J.Z.; methodology, J.Z.; software, J.Z.; validation, J.Z.; formal analysis, J.Z.; investigation, J.Z.; resources, J.Z.; data curation, J.Z.; writing—original draft preparation, J.Z.; writing—review and editing, Z.Z. and X.T.; visualization, J.Z.; supervision, Z.Z. and X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lee, S.J.; Lee, M.J.; Kim, K.T.; Bae, J.H. Classification of ISAR Images Using Variable Cross-Range Resolutions. IEEE Trans. Aerosp. Electron. Syst. 2018, 54, 2291–2303. [Google Scholar] [CrossRef]
  2. Benedek, C.; Martorella, M. Moving Target Analysis in ISAR Image Sequences with a Multiframe Marked Point Process Model. IEEE Trans. Geosci. Remote Sens. 2014, 52, 2234–2246. [Google Scholar] [CrossRef]
  3. Bosse, S.; Maniry, D.; Müller, K.; Wiegand, T.; Samek, W. Deep Neural Networks for No-Reference and Full-Reference Image Quality Assessment. IEEE Trans. Image Process. 2018, 27, 206–219. [Google Scholar] [CrossRef]
  4. Ebrahimi, S.; Vladimir, Y.M. Image Quality Improvement in Kidney Stone Detection on Computed Tomography Images. J. Image Graph. 2015, 3, 40–46. [Google Scholar] [CrossRef]
  5. Chandler, D.M. Seven challenges in image quality assessment: Past present and future research. Int. Sch. Res. Not. 2013, 2013, 905685. [Google Scholar] [CrossRef]
  6. Ju, Y.W.; Zhang, Y. Research on ISAR image quality evaluation. Syst. Eng. Electron. 2015, 37, 297–303. [Google Scholar]
  7. Huang, L.; Wang, Y.; Jin, S. A Quantitative Evaluation Approach for ISAR Image Performance. Radar Sci. Technol. 2017, 15, 43–49. [Google Scholar]
  8. Li, J.; Tian, B.; Li, S.; Wang, Y.; He, T.; Xu, S. Inverse Synthetic Aperture Radar Image Quality Assessment Based on BP Neural Network. In Proceedings of the 2023 8th International Conference on Signal and Image Processing (ICSIP), Wuxi, China, 8–10 July 2023; pp. 414–418. [Google Scholar]
  9. Jasinski, T.; Rosenberg, L.; Antipov, I. Automated ISAR Image Quality Assessment. In Proceedings of the 2023 IEEE International Radar Conference (RADAR), Sydney, Australia, 6–10 November 2023; pp. 1–5. [Google Scholar] [CrossRef]
  10. Madhusudana, P.C.; Birkbeck, N.; Wang, Y.; Adsumilli, B.; Bovik, A.C. Image quality assessment using contrastive learning. IEEE Trans. Image Process. 2022, 31, 4149–4161. [Google Scholar] [CrossRef] [PubMed]
  11. Zhao, K.; Yuan, K.; Sun, M.; Li, M.; Wen, X. Quality-aware pre-trained models for blind image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22302–22313. [Google Scholar]
  12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  13. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  14. Cao, M.; Fan, Y.; Zhang, Y.; Wang, J.; Yang, Y. Vdtr: Video deblurring with transformer. arXiv 2022, arXiv:2204.08023. [Google Scholar] [CrossRef]
  15. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  16. Ke, J.; Wang, Q.; Wang, Y.; Milanfar, P.; Yang, F. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 5148–5157. [Google Scholar]
  17. Tong, J.; Yang, Q.; Shen, L.; Li, B.; Chu, S. ISAR image evaluation method based on visual perception supervised learning. In Proceedings of the SPIE 12803, Fifth International Conference on Artificial Intelligence and Computer Science (AICS 2023), Wuhan, China, 26–28 July 2023. [Google Scholar]
  18. Zhang, J. ISAR Image Quality Grade Evaluation of Space Targets Based on Fusion Feature. In Proceedings of the World Conference on Intelligent and 3-D Technologies (WCI3DT 2022) Methods, Algorithms and Applications; Springer Nature: Singapore, 2023; pp. 53–63. [Google Scholar]
  19. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef]
  20. Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 2012, 20, 209–212. [Google Scholar] [CrossRef]
  21. Liu, S.; Deng, W. Very deep convolutional neural network based image classification using small training sample size. In Proceedings of the 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; pp. 730–734. [Google Scholar]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Figure 1. The architecture of the proposed model. The image is partitioned into 8 × 8-sized patches. Then, the linear projection layer performs a convolution operation on these patches to acquire patch embeddings, which are processed by the Transformer encoder. The Gram–T model computes the Gramian matrix of the extracted features to obtain score tokens through one attention layer. Moreover, the CAB and IRAB strengthen interactions between channels and regions within the features output by the Transformer encoder. Finally, the prediction score from the IRAB is added to the score token from Gram–T to obtain the final score.
Figure 2. CAB module.
Figure 3. IRAB module.
Figure 4. Display of the results in the training process.
Figure 5. Example images from the ISAR image dataset. The first caption row below the images gives the ground truth scores of the corresponding images, and the second row gives the prediction scores.
Figure 6. Attention heatmap analysis. The attention heatmap is on the left, and the original image is on the right. In our experiments, the blue color in the attention heatmap represents the high weight, and the red color represents low weight.
Table 1. Performance of different models on ISAR image quality assessment. Our model is compared with popular networks, including ResNet and ViT, and performs better than all compared methods on the ISAR image dataset.
| Methods | SRCC | PLCC | Score | Inference Time (s) |
|---|---|---|---|---|
| MQEM [7] | 0.6411 | 0.5652 | 1.206 | 0.30 |
| BRISQUE [19] | 0.6501 | 0.5844 | 1.235 | 0.20 |
| NIQE [20] | 0.5312 | 0.5382 | 1.069 | 0.10 |
| VGG16 [21] | 0.4564 | 0.4148 | 0.871 | 0.15 |
| Resnet18 | 0.7102 | 0.6507 | 1.361 | 0.24 |
| Resnet34 | 0.8697 | 0.8296 | 1.699 | 0.36 |
| Resnet50 [22] | 0.8712 | 0.8499 | 1.721 | 0.47 |
| ViT [12] | 0.8731 | 0.8467 | 1.720 | 0.28 |
| CNN-based Regression [9] | 0.5531 | 0.5437 | 1.097 | 0.43 |
| KNN-based Regression [9] | 0.4626 | 0.4782 | 0.941 | 0.22 |
| VPSL [17] | 0.6916 | 0.6372 | 1.329 | 0.34 |
| EBFF [18] | 0.7941 | 0.7219 | 1.516 | 0.65 |
| Ours | 0.9103 | 0.8627 | 1.773 | 0.31 |
Table 2. Ablation studies of our modules.
| Methods | SRCC | PLCC | Score |
|---|---|---|---|
| Gram–T | 0.8831 | 0.8477 | 1.730 |
| Gram–T + CAB | 0.8874 | 0.8567 | 1.744 |
| Gram–T + IRAB | 0.9072 | 0.8645 | 1.771 |
| Gram–T + CAB + IRAB | 0.9103 | 0.8627 | 1.773 |
Table 3. Ablation studies of scale factor.
| Scale Factor | SRCC | PLCC | Score |
|---|---|---|---|
| 0 | 0.8874 | 0.8567 | 1.744 |
| 0.5 | 0.9071 | 0.8560 | 1.763 |
| 1 | 0.8972 | 0.8551 | 1.752 |
| 0.8 | 0.9103 | 0.8627 | 1.773 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
