Interactive Change-Aware Transformer Network for Remote Sensing Image Change Captioning
Figure 1. A visualization of the existing method and our proposed method. (a) The existing method [3] uses a hierarchical approach that tends to integrate information focused on unchanged regions from each encoder layer, disrupting change-feature learning in the decoder and producing inferior change descriptions. Our proposed method attentively aggregates the essential features for more informative caption generation. (b) Existing methods [2,3,4,9] overlook object changes at various scales, producing inferior change descriptions. Ours extracts discriminative information across scales (e.g., small scales) for change captioning. Blue indicates that the word “house” attends to the particular image region, while reddish colors indicate a lower level of focus; the bluer the color, the higher the attention value.
Figure 2. Overview of the proposed ICT-Net. It consists of three components: a multi-scale feature extractor that extracts visual features, an Interactive Change-Aware Encoder (ICE) with a Multi-Layer Adaptive Fusion (MAF) module that captures the semantic changes between bitemporal features, and a change caption decoder with a Cross Gated-Attention (CGA) module that generates change descriptions.
Figure 3. Structure of the Multi-Layer Adaptive Fusion (MAF) module.
Figure 4. Structure of the Cross Gated-Attention (CGA) module.
Figure 5. Comparison of attention maps generated using DAE and DAE + CAE. M_larger and M_small denote the attention maps for large and small changes captured between bitemporal image features, respectively; I_t0 and I_t1 denote the input RS images. Regions appearing more blue indicate higher levels of attention. Red dotted boxes ground the small change areas to ease visualization.
Figure 6. Visualization of the attention maps generated by the caption decoder using the existing MBF [3] method and the proposed MAF. The word highlighted in red in the caption corresponds to the blue region in the generated attention map; regions appearing more blue indicate higher levels of attention.
Figure 7. Visualization of the multi-scale word and feature attention maps captured in the change caption decoder by the CGA module, where L_words and S_words denote the attention maps that capture large and small object changes, respectively, for each object word (highlighted in red) in the generated change caption. Red bounding boxes indicate the small-scale object change regions in image pairs (1), (2), and (3); pairs (4), (5), and (6) contain middle- to large-scale changes. Regions appearing more blue indicate higher levels of attention.
Figure 8. Qualitative results on the LEVIR-CC dataset. The I_t0 image was captured “before” and the I_t1 image was captured “after”; GT denotes the ground-truth caption. Red bounding boxes indicate the small-scale object change regions in image pairs (1) and (2); pairs (3) and (4) contain middle- to large-scale changes. Green and blue words highlight the correctly predicted change objects for the existing method (a) and ours (b), respectively.
Abstract
1. Introduction
- We propose an Interactive Change-Aware Transformer Network (ICT-Net) to accurately capture and describe changes in objects in remote sensing bitemporal images.
- We introduce the Interactive Change-Aware Encoder (ICE) equipped with the Multi-Layer Adaptive Fusion (MAF) module. It effectively captures change information from bitemporal features and extracts essential change-aware features from each encoder layer, contributing to improved change caption generation.
- We present the Cross Gated-Attention (CGA) module, designed to effectively exploit multi-scale change-aware representations during sentence generation. It enables the change caption decoder to relate words to multi-scale visual features and to single out the critical representations for better change captioning (illustrative sketches of the MAF and CGA ideas follow this list).
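To make the adaptive-fusion idea concrete, below is a minimal PyTorch sketch of gate-weighted aggregation of per-layer encoder outputs. It is only an illustration under our own naming (AdaptiveLayerFusion, the scoring head, tensor shapes); the paper's actual MAF design is defined in Section 3.2 and may differ in its internals.

```python
import torch
import torch.nn as nn

class AdaptiveLayerFusion(nn.Module):
    """Placeholder multi-layer fusion: each encoder layer's output is scored by a
    small gating head and the layers are combined as a softmax-weighted sum.
    This only illustrates adaptive per-layer aggregation, not the exact MAF."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, layer_feats):
        # layer_feats: list of L tensors, each of shape (B, N, d)
        stacked = torch.stack(layer_feats, dim=1)                 # (B, L, N, d)
        weights = self.score(stacked.mean(dim=2))                 # (B, L, 1) from pooled features
        weights = torch.softmax(weights, dim=1).unsqueeze(-1)     # (B, L, 1, 1)
        return (weights * stacked).sum(dim=1)                     # (B, N, d)
```

Likewise, the following is a hedged sketch of a gated cross-attention block in the spirit of CGA: decoder word features query one visual scale, and a sigmoid gate controls how much of the attended context is retained before the scales are fused. Class and parameter names are placeholders, not the paper's API.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Illustrative gated cross-attention: word queries attend to one visual scale,
    and a sigmoid gate decides how much of the attended context to keep."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)   # gate computed from [word; context]
        self.norm = nn.LayerNorm(d_model)

    def forward(self, words, visual):
        # words:  (B, T, d) decoder word features; visual: (B, N, d) one visual scale
        context, _ = self.attn(query=words, key=visual, value=visual)
        g = torch.sigmoid(self.gate(torch.cat([words, context], dim=-1)))
        return self.norm(words + g * context)

class MultiScaleGatedAttention(nn.Module):
    """Apply the gated cross-attention per scale and average the results."""
    def __init__(self, n_scales: int = 2, d_model: int = 512):
        super().__init__()
        self.blocks = nn.ModuleList([GatedCrossAttention(d_model) for _ in range(n_scales)])

    def forward(self, words, visual_scales):
        # visual_scales: list of (B, N_s, d) tensors, one per scale
        fused = sum(block(words, v) for block, v in zip(self.blocks, visual_scales))
        return fused / len(self.blocks)
```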
2. Related Works
2.1. Remote Sensing Image Change Captioning
2.2. Remote Sensing Image Captioning
2.3. Remote Sensing Change Detection
2.4. Natural Image Captioning
3. Methodology
3.1. Multi-Scale Feature Extraction
3.2. Interactive Change-Aware Encoder
3.3. Multi-Scale Change Caption Decoder
Algorithm 1: ICT-Net
Input: bitemporal image pair I ← (I_t0, I_t1)
Output: change caption
Step 1: Feature extraction
    for i in (t0, t1) do
        X_i ← Backbone(I_i)
    end
Step 2: Interactive Change-Aware Encoder (ICE)
    for l in (1 to L) do
        difference features ← DAE(previous-layer bitemporal features)
        change-aware features ← CAE(difference features, previous-layer bitemporal features)
        fused features ← MAF(change-aware features from the encoder layers)
    end
Step 3: Multi-scale change caption decoder
    for l in (1 to L) do
        word features ← Masked-Attention(previous-layer word features)
        attended features ← CGA(word features, fused encoder features)
        decoder output ← LN(Linear(attended features; word features))
    end
Step 4: Predict change caption
    P ← Softmax(Linear(decoder output))
    Use the probabilities P to predict the caption words y from the vocabulary
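As a rough companion to Algorithm 1 and the three-component overview in the Figure 2 caption, the sketch below wires placeholder modules (backbone, DAE, CAE, MAF, and decoder layers combining masked self-attention with CGA) into a single forward pass. Every class and argument name here is an assumed interface for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ICTNetSkeleton(nn.Module):
    """Structural sketch of Algorithm 1 with injected placeholder modules
    (illustrative only; not the official ICT-Net code)."""
    def __init__(self, backbone, dae_layers, cae_layers, maf, decoder_layers,
                 vocab_size, d_model=512):
        super().__init__()
        self.backbone = backbone                              # Step 1: CNN feature extractor
        self.dae_layers = nn.ModuleList(dae_layers)           # Step 2: difference-aware encoding
        self.cae_layers = nn.ModuleList(cae_layers)           # Step 2: change-aware encoding
        self.maf = maf                                        # Step 2: multi-layer adaptive fusion
        self.decoder_layers = nn.ModuleList(decoder_layers)   # Step 3: masked attention + CGA
        self.classifier = nn.Linear(d_model, vocab_size)      # Step 4: word prediction head

    def forward(self, img_t0, img_t1, word_embeddings):
        # Step 1: extract visual features for both acquisition dates.
        feats = (self.backbone(img_t0), self.backbone(img_t1))

        # Step 2: Interactive Change-Aware Encoder (ICE).
        layer_outputs = []
        for dae, cae in zip(self.dae_layers, self.cae_layers):
            diff = dae(feats)            # difference-aware features at this layer
            feats = cae(diff, feats)     # updated change-aware bitemporal features
            layer_outputs.append(feats)
        fused = self.maf(layer_outputs)  # adaptively fuse the per-layer outputs

        # Step 3: multi-scale change caption decoder.
        hidden = word_embeddings
        for layer in self.decoder_layers:
            hidden = layer(hidden, fused)  # masked self-attention followed by CGA

        # Step 4: per-word probabilities over the vocabulary.
        return torch.softmax(self.classifier(hidden), dim=-1)
```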
3.4. Training Objective
4. Experiments
4.1. Dataset and Evaluation Metrics
4.2. Experimental Setup
4.3. Comparison with State-of-the-Art Methods
4.4. Ablation Studies
4.4.1. Interactive Change-Aware Encoder
4.4.2. Multi-Layer Adaptive Fusion Module
4.4.3. Cross Gated-Attention Module
4.5. Qualitative Analysis
4.6. Parametric Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Chouaf, S.; Hoxha, G.; Smara, Y.; Melgani, F. Captioning Changes in Bi-Temporal Remote Sensing Images. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 2891–2894. [Google Scholar] [CrossRef]
- Hoxha, G.; Chouaf, S.; Melgani, F.; Smara, Y. Change Captioning: A New Paradigm for Multitemporal Remote Sensing Image Analysis. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5627414. [Google Scholar] [CrossRef]
- Liu, C.; Zhao, R.; Chen, H.; Zou, Z.; Shi, Z. Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5633520. [Google Scholar] [CrossRef]
- Liu, C.; Zhao, R.; Chen, J.; Qi, Z.; Zou, Z.; Shi, Z. A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning. TechRxiv 2023. [Google Scholar] [CrossRef]
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 2048–2057. [Google Scholar]
- Cornia, M.; Stefanini, M.; Baraldi, L.; Cucchiara, R. Meshed-memory transformer for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10578–10587. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Liu, C.; Yang, J.; Qi, Z.; Zou, Z.; Shi, Z. Progressive Scale-aware Network for Remote sensing Image Change Captioning. arXiv 2023, arXiv:2303.00355. [Google Scholar]
- Zhang, Z.; Zhang, W.; Yan, M.; Gao, X.; Fu, K.; Sun, X. Global visual feature and linguistic state guided attention for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5615216. [Google Scholar] [CrossRef]
- Huang, W.; Wang, Q.; Li, X. Denoising-based multiscale feature fusion for remote sensing image captioning. IEEE Geosci. Remote Sens. Lett. 2020, 18, 436–440. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Zhao, R.; Shi, Z.; Zou, Z. High-resolution remote sensing image captioning based on structured attention. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5603814. [Google Scholar] [CrossRef]
- Sumbul, G.; Nayak, S.; Demir, B. SD-RSIC: Summarization-Driven Deep Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6922–6934. [Google Scholar] [CrossRef]
- Wang, Y.; Zhang, W.; Zhang, Z.; Gao, X.; Sun, X. Multiscale Multiinteraction Network for Remote Sensing Image Captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2154–2165. [Google Scholar] [CrossRef]
- Hoxha, G.; Melgani, F.; Demir, B. Toward remote sensing image retrieval under a deep image captioning perspective. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4462–4475. [Google Scholar] [CrossRef]
- Chen, J.; Dai, X.; Guo, Y.; Zhu, J.; Mei, X.; Deng, M.; Sun, G. Urban Built Environment Assessment Based on Scene Understanding of High-Resolution Remote Sensing Imagery. Remote Sens. 2023, 15, 1436. [Google Scholar] [CrossRef]
- Zhang, X.; Li, Y.; Wang, X.; Liu, F.; Wu, Z.; Cheng, X.; Jiao, L. Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning. Remote Sens. 2023, 15, 579. [Google Scholar] [CrossRef]
- Zhou, H.; Du, X.; Xia, L.; Li, S. Self-Learning for Few-Shot Remote Sensing Image Captioning. Remote Sens. 2022, 14, 4606. [Google Scholar] [CrossRef]
- Ren, Z.; Gou, S.; Guo, Z.; Mao, S.; Li, R. A Mask-Guided Transformer Network with Topic Token for Remote Sensing Image Captioning. Remote Sens. 2022, 14, 2939. [Google Scholar] [CrossRef]
- Li, Y.; Fang, S.; Jiao, L.; Liu, R.; Shang, R. A Multi-Level Attention Model for Remote Sensing Image Captions. Remote Sens. 2020, 12, 939. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2183–2195. [Google Scholar] [CrossRef]
- Li, X.; Zhang, X.; Huang, W.; Wang, Q. Truncation Cross Entropy Loss for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2021, 59, 5246–5257. [Google Scholar] [CrossRef]
- Ma, X.; Zhao, R.; Shi, Z. Multiscale Methods for Optical Remote-Sensing Image Captioning. IEEE Geosci. Remote Sens. Lett. 2021, 18, 2001–2005. [Google Scholar] [CrossRef]
- Peng, X.; Zhong, R.; Li, Z.; Li, Q. Optical remote sensing image change detection based on attention mechanism and image difference. IEEE Trans. Geosci. Remote Sens. 2020, 59, 7296–7307. [Google Scholar] [CrossRef]
- Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607514. [Google Scholar] [CrossRef]
- Zhang, C.; Wang, L.; Cheng, S.; Li, Y. SwinSUNet: Pure transformer network for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5224713. [Google Scholar] [CrossRef]
- Tong, L.; Wang, Z.; Jia, L.; Qin, Y.; Wei, Y.; Yang, H.; Geng, Y. Fully Decoupled Residual ConvNet for Real-Time Railway Scene Parsing of UAV Aerial Images. IEEE Trans. Intell. Transp. Syst. 2022, 23, 14806–14819. [Google Scholar] [CrossRef]
- Cheng, G.; Wang, G.; Han, J. ISNet: Towards improving separability for remote sensing image change detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5623811. [Google Scholar] [CrossRef]
- Chen, G.; Zhao, Y.; Wang, Y.; Yap, K.H. SSN: Stockwell Scattering Network for SAR Image Change Detection. IEEE Geosci. Remote Sens. Lett. 2023, 20, 4001405. [Google Scholar] [CrossRef]
- Bao, T.; Fu, C.; Fang, T.; Huo, H. PPCNET: A Combined Patch-Level and Pixel-Level End-to-End Deep Network for High-Resolution Remote Sensing Image Change Detection. IEEE Geosci. Remote Sens. Lett. 2020, 17, 1797–1801. [Google Scholar] [CrossRef]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Saha, S.; Bovolo, F.; Bruzzone, L. Unsupervised Deep Change Vector Analysis for Multiple-Change Detection in VHR Images. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3677–3693. [Google Scholar] [CrossRef]
- Tang, X.; Zhang, H.; Mou, L.; Liu, F.; Zhang, X.; Zhu, X.X.; Jiao, L. An Unsupervised Remote Sensing Change Detection Method Based on Multiscale Graph Convolutional Network and Metric Learning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5609715. [Google Scholar] [CrossRef]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Gao, J.; Qing, L.; Li, L.; Cheng, Y.; Peng, Y. Multi-scale features based interpersonal relation recognition using higher-order graph neural network. Neurocomputing 2021, 456, 243–252. [Google Scholar] [CrossRef]
- Wu, K.; Yang, Y.; Liu, Q.; Zhang, X.P. Focal Stack Image Compression Based on Basis-Quadtree Representation. IEEE Trans. Multimed. 2023, 25, 3975–3988. [Google Scholar] [CrossRef]
- Wang, Y.; Hou, J.; Chau, L.P. Object counting in video surveillance using multi-scale density map regression. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2422–2426. [Google Scholar]
- Zhou, Y.; Wang, Y.; Chau, L.P. Moving Towards Centers: Re-Ranking With Attention and Memory for Re-Identification. IEEE Trans. Multimed. 2023, 25, 3456–3468. [Google Scholar] [CrossRef]
- Chen, S.; Sun, P.; Song, Y.; Luo, P. Diffusiondet: Diffusion model for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 19830–19843. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Jhamtani, H.; Berg-Kirkpatrick, T. Learning to Describe Differences Between Pairs of Similar Images. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4024–4034. [Google Scholar] [CrossRef]
- Qiu, Y.; Satoh, Y.; Suzuki, R.; Iwata, K.; Kataoka, H. 3D-Aware Scene Change Captioning From Multiview Images. IEEE Robot. Autom. Lett. 2020, 5, 4743–4750. [Google Scholar] [CrossRef]
- Tu, Y.; Yao, T.; Li, L.; Lou, J.; Gao, S.; Yu, Z.; Yan, C. Semantic Relation-aware Difference Representation Learning for Change Captioning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 63–73. [Google Scholar] [CrossRef]
- Qiu, Y.; Yamamoto, S.; Nakashima, K.; Suzuki, R.; Iwata, K.; Kataoka, H.; Satoh, Y. Describing and Localizing Multiple Changes With Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 1971–1980. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
- Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
- Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Park, D.H.; Darrell, T.; Rohrbach, A. Robust Change Captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
Method | B-1 | B-2 | B-3 | B-4 | M | R | C |
---|---|---|---|---|---|---|---|
Capt-Dual-Att [55] | 79.51 | 70.57 | 63.23 | 57.46 | 36.56 | 70.69 | 124.42 |
DUDA [55] | 81.44 | 72.22 | 64.24 | 57.79 | 37.15 | 71.04 | 124.32 |
[49] | 79.90 | 70.26 | 62.68 | 56.68 | 36.17 | 69.46 | 120.39 |
[49] | 80.42 | 70.87 | 62.86 | 56.38 | 37.29 | 70.32 | 124.44 |
PSNet [9] | 83.86 | 75.13 | 67.89 | 62.11 | 38.80 | 73.60 | 132.62 |
RSICCFormer [3] | 84.72 | 76.12 | 68.87 | 62.77 | 39.61 | 74.12 | 134.12 |
PromNet [4] | 83.66 | 75.73 | 69.10 | 63.54 | 38.82 | 73.72 | 136.44 |
Ours | 86.06 | 78.12 | 71.45 | 66.12 | 40.51 | 75.21 | 138.36 |
Method | B-1 | B-2 | B-3 | B-4 | M | R | C |
---|---|---|---|---|---|---|---|
CNN−RNN [1] | 71.85 | 60.40 | 52.18 | 45.94 | 27.43 | 54.13 | 71.64 |
[49] | 66.81 | 56.89 | 48.57 | 41.53 | 26.16 | 54.63 | 78.58 |
RSICCFormer [3] | 69.02 | 59.78 | 52.42 | 46.39 | 28.18 | 56.81 | 80.08 |
Ours | 72.40 | 62.62 | 55.03 | 48.92 | 30.22 | 58.52 | 85.93 |
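In the result tables, B-1 to B-4, M, R, and C abbreviate BLEU-1 to BLEU-4, METEOR, ROUGE-L, and CIDEr, the captioning metrics cited in Section 4.1. As a brief, hedged example, the scores could be computed with the pycocoevalcap toolkit (an assumed choice of tooling; the paper's own evaluation scripts are not reproduced here):

```python
# Hedged example: caption metrics with the pycocoevalcap toolkit
# (assumed tooling; METEOR additionally requires a Java runtime).
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Toy ground-truth and generated captions keyed by image-pair id.
gts = {0: ["a house is built on the bare land"],
       1: ["there is no change"]}
res = {0: ["a building appears on the land"],
       1: ["there is no change"]}

scorers = [(Bleu(4), ["B-1", "B-2", "B-3", "B-4"]),
           (Meteor(), "M"), (Rouge(), "R"), (Cider(), "C")]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)  # Bleu returns a list of four scores
    print(name, score)
```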
DAE | CAE | MAF | CGA | B-1 | B-2 | B-3 | B-4 | M | R | C |
---|---|---|---|---|---|---|---|---|---|---|
✗ | ✗ | ✗ | ✗ | 76.43 | 66.36 | 58.00 | 49.50 | 33.34 | 69.53 | 124.74 |
✓ | ✗ | ✗ | ✗ | 78.93 | 68.61 | 58.88 | 49.84 | 34.15 | 71.81 | 129.15 |
✗ | ✓ | ✗ | ✗ | 79.84 | 70.40 | 61.13 | 53.08 | 34.43 | 71.45 | 128.35 |
✓ | ✓ | ✗ | ✗ | 80.74 | 72.57 | 65.91 | 56.84 | 36.16 | 72.76 | 132.67 |
✓ | ✓ | ✓ | ✗ | 84.43 | 76.88 | 70.46 | 65.36 | 39.81 | 74.69 | 135.25 |
✓ | ✓ | ✓ | ✓ | 86.06 | 78.12 | 71.45 | 66.12 | 40.51 | 75.21 | 138.36 |
Test Range | ICE | MAF | CGA | B-1 | B-2 | B-3 | B-4 | M | R | C |
---|---|---|---|---|---|---|---|---|---|---|
Test Set (only no-change) | ✓ | ✗ | ✗ | 95.15 | 94.38 | 94.14 | 93.48 | 73.91 | 95.67 | - |
Test Set (only no-change) | ✓ | ✓ | ✗ | 95.80 | 95.36 | 95.13 | 94.97 | 74.62 | 96.24 | - |
Test Set (only no-change) | ✓ | ✓ | ✓ | 97.43 | 97.03 | 96.80 | 96.97 | 75.96 | 97.24 | - |
Test Set (only change) | ✓ | ✗ | ✗ | 71.61 | 57.40 | 45.44 | 36.19 | 23.21 | 48.17 | 55.50 |
Test Set (only change) | ✓ | ✓ | ✗ | 74.42 | 60.57 | 48.36 | 38.72 | 25.02 | 53.12 | 60.30 |
Test Set (only change) | ✓ | ✓ | ✓ | 76.50 | 62.21 | 49.95 | 40.37 | 25.78 | 52.85 | 89.82 |
Test Set (entire set) | ✓ | ✗ | ✗ | 80.74 | 72.57 | 65.91 | 56.84 | 36.16 | 72.76 | 132.67 |
Test Set (entire set) | ✓ | ✓ | ✗ | 84.43 | 76.88 | 70.46 | 65.36 | 39.81 | 74.69 | 135.25 |
Test Set (entire set) | ✓ | ✓ | ✓ | 86.06 | 78.12 | 71.45 | 66.12 | 40.51 | 75.21 | 138.36 |
Conv5 | Conv4 | Conv3 | B-1 | B-2 | B-3 | B-4 | M | R | C |
---|---|---|---|---|---|---|---|---|---|
✓ | ✗ | ✗ | 73.33 | 65.24 | 60.18 | 56.90 | 33.47 | 65.94 | 112.95 |
✗ | ✓ | ✗ | 85.02 | 77.16 | 70.89 | 65.61 | 39.32 | 74.68 | 134.91 |
✗ | ✗ | ✓ | 81.57 | 75.41 | 69.24 | 64.89 | 37.91 | 73.09 | 127.20 |
✓ | ✓ | ✗ | 84.29 | 76.08 | 69.58 | 64.59 | 39.96 | 74.30 | 132.35 |
✓ | ✓ | ✓ | 84.36 | 76.09 | 69.41 | 64.29 | 39.53 | 73.74 | 133.53 |
✗ | ✓ | ✓ | 86.06 | 78.12 | 71.45 | 66.12 | 40.51 | 75.21 | 138.36 |
E.L. | D.L. | B-1 | B-2 | B-3 | B-4 | M | R | C |
---|---|---|---|---|---|---|---|---|
1 | 1 | 82.70 | 73.07 | 64.49 | 56.78 | 37.23 | 74.45 | 133.57 |
1 | 2 | 82.63 | 73.44 | 65.35 | 58.32 | 37.81 | 74.84 | 135.38 |
2 | 1 | 86.06 | 78.12 | 71.45 | 66.12 | 40.51 | 75.21 | 138.36 |
2 | 2 | 84.16 | 76.01 | 69.49 | 64.38 | 39.51 | 74.13 | 134.01 |
2 | 3 | 85.16 | 76.75 | 69.68 | 64.09 | 40.01 | 74.83 | 136.85 |
3 | 1 | 83.05 | 74.19 | 66.79 | 60.91 | 38.59 | 74.02 | 133.37 |
Beam Size | B-1 | B-2 | B-3 | B-4 | M | R | C |
---|---|---|---|---|---|---|---|
1 | 84.27 | 75.51 | 67.87 | 67.71 | 39.66 | 74.38 | 135.26 |
2 | 86.12 | 78.07 | 71.15 | 65.54 | 40.29 | 75.09 | 137.86 |
3 | 86.06 | 78.12 | 71.45 | 66.12 | 40.51 | 75.21 | 138.36 |
4 | 85.58 | 77.57 | 71.01 | 65.85 | 40.26 | 74.99 | 137.55 |
5 | 85.30 | 77.32 | 70.83 | 65.76 | 40.24 | 74.84 | 137.28 |
6 | 85.27 | 77.28 | 70.81 | 65.78 | 40.25 | 74.84 | 137.32 |
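The beam-size sweep above peaks around a beam width of 3. For context, the snippet below is a generic, length-capped beam-search decoder over a step function that returns next-word log-probabilities; it is a textbook sketch under assumed names (step_fn, bos_id, eos_id), not the decoding code used in the paper, and it omits refinements such as length normalization.

```python
import torch

@torch.no_grad()
def beam_search(step_fn, bos_id, eos_id, beam_size=3, max_len=20):
    """Generic beam search: step_fn(tokens) must return a 1-D tensor of
    log-probabilities over the vocabulary for the next word."""
    beams = [([bos_id], 0.0)]          # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = step_fn(tokens)                   # (vocab_size,)
            top_lp, top_ids = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp))
        # Keep the best `beam_size` partial captions; move completed ones aside.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos_id else beams).append((tokens, score))
        if not beams:
            break
    return max(finished or beams, key=lambda c: c[1])[0]
```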