RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
Figure 1. Samples from the RS-instructions dataset: (a) image from the RSVQA-LR dataset and (b) image from the UCM dataset.
Figure 2. Architecture of RS-LLaVA: the image encoder and language decoder are frozen, and the model is fine-tuned with LoRA, which adds trainable low-rank weights to the original LLM.
Figure 3. Loss during fine-tuning on the UAV dataset.
Figure 4. Loss during fine-tuning on the RSVQA-LR dataset.
Figure 5. Loss during fine-tuning on the RS-instructions dataset.
Figure 6. Sample RS-LLaVA results on the UCM-captions dataset.
Figure 7. Sample RS-LLaVA results on the UAV dataset.
Figure 8. Sample RS-LLaVA results on the RSIVQA-DOTA dataset.
Figure 9. Sample RS-LLaVA results on the RSVQA-LR dataset.
Abstract
1. Introduction
- (1) We propose RS-LLaVA, a large vision-language model built on LLaVA [15] that jointly performs captioning and question answering on RS images. The model is adapted to RS data through LoRA fine-tuning (a minimal setup sketch follows this list).
- (2) We develop the RS-instructions dataset, a multi-task instruction-following dataset built by integrating diverse image-text pairs from captioning and VQA datasets.
- (3) We demonstrate the effectiveness of RS-LLaVA in multi-task mode against single-task state-of-the-art models, a promising step toward universal, multi-task models for RS data analysis.
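Since the image encoder and the language decoder stay frozen, only the low-rank adapter weights are trained. Below is a minimal sketch of attaching LoRA adapters to a Vicuna-style decoder with the Hugging Face peft library; the checkpoint name, rank, scaling, and target modules are illustrative assumptions, not the paper's reported settings:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the base language model (kept frozen); "lmsys/vicuna-7b-v1.5" is a
# publicly available Vicuna checkpoint, used here only as an example.
base_model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-7b-v1.5", torch_dtype=torch.float16
)

# LoRA re-parameterizes a weight matrix W as W + (alpha / r) * B @ A, where
# A and B are small rank-r matrices; only A and B receive gradient updates.
lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed value)
    lora_alpha=32,                        # scaling factor (assumed value)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Because the adapter matrices are a tiny fraction of the 7B/13B parameters, fine-tuning stays tractable on the modest RS datasets described in Section 3.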
2. Related Work
2.1. NLP in Remote Sensing
2.2. Multi-Tasking in NLP for RS
2.3. Vision-Language Models in General Computer Vision
2.4. Vision-Language Models in RS
3. The RS-Instructions Dataset
- UCM-captions [2] is a captioning dataset derived from the University of California Merced land-use (UCM) dataset [58], originally designed for scene classification. Each image belongs to one of 21 land-use classes, with 100 images per class for a total of 2100 RGB images. The images measure 256 × 256 pixels with a spatial resolution of 0.3048 m, and each is paired with five distinct captions, yielding 10,500 sentences in total. The dataset is split into training (80%, 1680 images), validation (10%, 210 images), and test (10%, 210 images) subsets.
- UAV [23] is a captioning dataset captured near the city of Civezzano, Italy, on 17 October 2012, by an unmanned aerial vehicle equipped with an EOS 550D camera. It comprises ten RGB images, each of size 5184 × 3456 pixels with a spatial resolution of 2 cm. Six images are allocated for training, one for validation, and three for testing. Crops of 256 × 256 pixels are extracted from these images: the training images yield 1746 crops, and the test images provide 882 crops. Each crop is associated with three descriptions authored by different annotators.
- RSVQA-LR [3] consists of 772 low-resolution images curated from seven tiles captured by the Sentinel-2 satellite over an area of 6.55 km² in the Netherlands. Each image measures 256 × 256 pixels in RGB, with a spatial resolution of 10 m. The images are split into 572 for training, 100 for validation, and 100 for testing. The dataset contains 77,232 questions in total, with each image annotated with approximately 100–101 questions. The questions cover four categories: object presence (answer: yes/no), comparisons between objects (answer: yes/no), rural/urban classification (answer: rural/urban), and object counting.
- RSIVQA-DOTA [30] is a VQA dataset based on the DOTA [59] object detection dataset. It includes questions about scenes, objects, relative locations, color, and shape, with 16,430 image/question/answer triplets in total. The questions are of three types: presence, counting, and others. The dataset is split into training (80%), validation (10%), and test (10%) sets.
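The RS-instructions dataset unifies these four sources into a single instruction-following format, turning each caption into an instruction-response pair and keeping each VQA question as its own instruction. A minimal Python sketch of such a conversion, assuming LLaVA-style conversation records; the make_record helper, file paths, and sample texts are illustrative, not the authors' exact schema or data:

```python
import json

def make_record(record_id, image_path, prompt, answer):
    """Wrap one image-text pair in a LLaVA-style conversation record.

    The "<image>" token marks where the visual features are spliced
    into the language model's input sequence.
    """
    return {
        "id": record_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{prompt}"},
            {"from": "gpt", "value": answer},
        ],
    }

# A captioning pair (e.g., from UCM-captions) becomes an instruction to
# describe the image; a VQA triplet (e.g., from RSVQA-LR) keeps its own
# question as the instruction. Texts below are invented for illustration.
records = [
    make_record("ucm_0001", "ucm/airplane01.tif",
                "Describe the image briefly.",
                "Two airplanes are parked next to the terminal."),
    make_record("rsvqa_lr_0001", "rsvqa_lr/tile_042.png",
                "Is a road present in the image?",
                "Yes"),
]

with open("rs_instructions_sample.json", "w") as f:
    json.dump(records, f, indent=2)
```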
4. The RS-LLaVA Model
4.1. Model Architecture
4.2. Model Training
5. Experimental Results
5.1. Experimental Settings
5.2. Evaluation Metrics
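The captioning results below are scored with BLEU-1 through BLEU-4, METEOR, ROUGE, and CIDEr, and the VQA results with per-category accuracy. A minimal sketch of computing the captioning metrics, assuming the pycocoevalcap package (a common tooling choice; the paper does not name its implementation, and the toy captions are invented):

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor  # shells out to Java
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Both mappings go from image id to a list of tokenized sentences:
# gts holds the reference captions, res holds one generated caption.
gts = {"img1": ["two airplanes are parked next to the terminal",
                "two planes stand near the airport terminal"]}
res = {"img1": ["two airplanes parked near the terminal"]}

scorers = [
    (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
    (Meteor(), "METEOR"),
    (Rouge(), "ROUGE"),
    (Cider(), "CIDEr"),
]

for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(name, list):  # Bleu returns one score per n-gram order
        for n, s in zip(name, score):
            print(f"{n}: {s:.4f}")
    else:
        print(f"{name}: {score:.4f}")
```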
5.3. Results
5.3.1. Results on the Captioning Task
5.3.2. Results of RS-LLaVA on the VQA Task
5.3.3. Comparison with State-of-the-Art Methods
5.3.4. Qualitative Results
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Bashmal, L.; Bazi, Y.; Melgani, F.; Al Rahhal, M.M.; Al Zuair, M.A. Language Integration in Remote Sensing: Tasks, datasets, and future directions. IEEE Geosci. Remote Sens. Mag. 2023, 11, 63–93. [Google Scholar] [CrossRef]
- Qu, B.; Li, X.; Tao, D.; Lu, X. Deep semantic understanding of high resolution remote sensing image. In Proceedings of the 2016 International Conference on Computer, Information and Telecommunication Systems (CITS), Kunming, China, 6–8 July 2016; pp. 1–5. [Google Scholar] [CrossRef]
- Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual Question Answering for Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566. [Google Scholar] [CrossRef]
- Abdullah, T.; Bazi, Y.; Al Rahhal, M.M.; Mekhalfi, M.L.; Rangarajan, L.; Zuair, M. TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images. Remote Sens. 2020, 12, 405. [Google Scholar] [CrossRef]
- Zhao, R.; Shi, Z. Text-to-Remote-Sensing-Image Generation with Structured Generative Adversarial Networks. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Bejiga, M.B.; Melgani, F.; Vascotto, A. Retro-Remote Sensing: Generating Images from Ancient Texts. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 950–960. [Google Scholar] [CrossRef]
- Bejiga, M.B.; Hoxha, G.; Melgani, F. Improving Text Encoding for Retro-Remote Sensing. IEEE Geosci. Remote Sens. Lett. 2021, 18, 622–626. [Google Scholar] [CrossRef]
- Zhan, Y.; Xiong, Z.; Yuan, Y. RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
- Liu, C.; Zhao, R.; Chen, H.; Zou, Z.; Shi, Z. Remote Sensing Image Change Captioning with Dual-Branch Transformers: A New Method and a Large Scale Dataset. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–20. [Google Scholar] [CrossRef]
- Hoxha, G.; Chouaf, S.; Melgani, F.; Smara, Y. Change Captioning: A New Paradigm for Multitemporal Remote Sensing Image Analysis. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
- Yuan, Z.; Mou, L.; Xiong, Z.; Zhu, X.X. Change Detection Meets Visual Question Answering. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
- Bashmal, L.; Bazi, Y.; Melgani, F.; Ricci, R.; Al Rahhal, M.M.; Zuair, M. Visual Question Generation From Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3279–3293. [Google Scholar] [CrossRef]
- OpenAI. ChatGPT. OpenAI API, 2023. Available online: https://openai.com/blog/chatgpt (accessed on 1 April 2024).
- OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual Instruction Tuning. arXiv 2023, arXiv:2304.08485. [Google Scholar]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
- Liu, C.; Zhao, R.; Shi, Z. Remote-Sensing Image Captioning Based on Multilayer Aggregated Transformer. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Shi, Z.; Zou, Z. Can a Machine Generate Humanlike Language Descriptions for a Remote Sensing Image? IEEE Trans. Geosci. Remote Sens. 2017, 55, 3623–3634. [Google Scholar] [CrossRef]
- Wang, B.; Lu, X.; Zheng, X.; Li, X. Semantic Descriptions of High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1274–1278. [Google Scholar] [CrossRef]
- Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2183–2195. [Google Scholar] [CrossRef]
- Sumbul, G.; Nayak, S.; Demir, B. SD-RSIC: Summarization-Driven Deep Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6922–6934. [Google Scholar] [CrossRef]
- Ramos, R.; Martins, B. Using Neural Encoder-Decoder Models with Continuous Outputs for Remote Sensing Image Captioning. IEEE Access 2022, 10, 24852–24863. [Google Scholar] [CrossRef]
- Hoxha, G.; Melgani, F. A Novel SVM-Based Decoder for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
- Li, X.; Zhang, X.; Huang, W.; Wang, Q. Truncation Cross Entropy Loss for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2021, 59, 5246–5257. [Google Scholar] [CrossRef]
- Huang, W.; Wang, Q.; Li, X. Denoising-Based Multiscale Feature Fusion for Remote Sensing Image Captioning. IEEE Geosci. Remote Sens. Lett. 2021, 18, 436–440. [Google Scholar] [CrossRef]
- Zia, U.; Riaz, M.M.; Ghafoor, A. Transforming remote sensing images to textual descriptions. Int. J. Appl. Earth Obs. Geoinf. 2022, 108, 102741. [Google Scholar] [CrossRef]
- Chen, Z.; Wang, J.; Ma, A.; Zhong, Y. TypeFormer: Multiscale Transformer with Type Controller for Remote Sensing Image Caption. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Zhuang, S.; Wang, P.; Wang, G.; Wang, D.; Chen, J.; Gao, F. Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Al Rahhal, M.M.; Bazi, Y.; Alsaleh, S.O.; Al-Razgan, M.; Mekhalfi, M.L.; Al Zuair, M.; Alajlan, N. Open-ended remote sensing visual question answering with transformers. Int. J. Remote Sens. 2022, 43, 6809–6823. [Google Scholar] [CrossRef]
- Zheng, X.; Wang, B.; Du, X.; Lu, X. Mutual Attention Inception Network for Remote Sensing Visual Question Answering. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
- Yuan, Z.; Mou, L.; Wang, Q.; Zhu, X.X. From Easy to Hard: Learning Language-Guided Curriculum for Visual Question Answering on Remote Sensing Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
- Bazi, Y.; Rahhal, M.M.A.; Mekhalfi, M.L.; Zuair, M.A.A.; Melgani, F. Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Yang, Q.; Ni, Z.; Ren, P. Meta captioning: A meta learning based remote sensing image captioning framework. ISPRS J. Photogramm. Remote Sens. 2022, 186, 190–200. [Google Scholar] [CrossRef]
- Wang, S.; Ye, X.; Gu, Y.; Wang, J.; Meng, Y.; Tian, J.; Hou, B.; Jiao, L. Multi-label semantic feature fusion for remote sensing image captioning. ISPRS J. Photogramm. Remote Sens. 2022, 184, 1–18. [Google Scholar] [CrossRef]
- Murali, N.; Shanthi, A.P. Remote Sensing Image Captioning via Multilevel Attention-Based Visual Question Answering. In Innovations in Computational Intelligence and Computer Vision; Roy, S., Sinwar, D., Perumal, T., Slowik, A., Tavares, J.M.R.S., Eds.; Springer Nature: Singapore, 2022; pp. 465–475. [Google Scholar]
- Wang, Q.; Huang, W.; Zhang, X.; Li, X. Word–Sentence Framework for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2021, 59, 10532–10543. [Google Scholar] [CrossRef]
- Zhang, Z.; Diao, W.; Zhang, W.; Yan, M.; Gao, X.; Sun, X. LAM: Remote Sensing Image Captioning with Label-Attention Mechanism. Remote Sens. 2019, 11, 2349. [Google Scholar] [CrossRef]
- Zhao, R.; Shi, Z.; Zou, Z. High-Resolution Remote Sensing Image Captioning Based on Structured Attention. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
- Wang, Y.; Zhang, W.; Zhang, Z.; Gao, X.; Sun, X. Multiscale Multiinteraction Network for Remote Sensing Image Captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 2154–2165. [Google Scholar] [CrossRef]
- Yuan, Z.; Li, X.; Wang, Q. Exploring Multi-Level Attention and Semantic Relationship for Remote Sensing Image Captioning. IEEE Access 2020, 8, 2608–2620. [Google Scholar] [CrossRef]
- Kandala, H.; Saha, S.; Banerjee, B.; Zhu, X.X. Exploring Transformer and Multilabel Classification for Remote Sensing Image Captioning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Ye, X.; Wang, S.; Gu, Y.; Wang, J.; Wang, R.; Hou, B.; Giunchiglia, F.; Jiao, L. A Joint-Training Two-Stage Method for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
- Hoxha, G.; Melgani, F.; Demir, B. Toward Remote Sensing Image Retrieval Under a Deep Image Captioning Perspective. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 4462–4475. [Google Scholar] [CrossRef]
- Ma, X.; Zhao, R.; Shi, Z. Multiscale Methods for Optical Remote-Sensing Image Captioning. IEEE Geosci. Remote Sens. Lett. 2021, 18, 2001–2005. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the International Conference on Machine Learning PMLR, Virtual, 18–24 July 2021. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. arXiv 2021, arXiv:2102.05918. [Google Scholar]
- Hu, R.; Singh, A. UniT: Multimodal Multitask Learning with a Unified Transformer. arXiv 2021, arXiv:2102.10772. [Google Scholar]
- Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. arXiv 2022, arXiv:2202.03052. [Google Scholar]
- Wang, J.; Yang, Z.; Hu, X.; Li, L.; Lin, K.; Gan, Z.; Liu, Z.; Liu, C.; Wang, L. GIT: A Generative Image-to-Text Transformer for Vision and Language. arXiv 2022, arXiv:2205.14100. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A Visual Language Model for Few-Shot Learning. arXiv 2022, arXiv:2204.14198. [Google Scholar]
- Gong, T.; Lyu, C.; Zhang, S.; Wang, Y.; Zheng, M.; Zhao, Q.; Liu, K.; Zhang, W.; Luo, P.; Chen, K. MultiModal-GPT: A Vision and Language Model for Dialogue with Humans. arXiv 2023, arXiv:2305.04790. [Google Scholar]
- Qiu, C.; Yu, A.; Yi, X.; Guan, N.; Shi, D.; Tong, X. Open Self-Supervised Features for Remote-Sensing Image Scene Classification Using Very Few Samples. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
- Li, X.; Wen, C.; Hu, Y.; Zhou, N. RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103497. [Google Scholar] [CrossRef]
- Rahhal, M.M.; Bazi, Y.; Elgibreen, H.; Zuair, M. Vision-Language Models for Zero-Shot Classification of Remote Sensing Images. Appl. Sci. 2023, 13, 12462. [Google Scholar] [CrossRef]
- Ricci, R.; Bazi, Y.; Melgani, F. Machine-to-Machine Visual Dialoguing with ChatGPT for Enriched Textual Image Description. Remote Sens. 2024, 16, 441. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
- Yang, Y.; Newsam, S. Bag-of-visual-words and Spatial Extensions for Land-use Classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS ’10), San Jose, CA, USA, 3–5 November 2010; ACM: New York, NY, USA, 2010; pp. 270–279. [Google Scholar] [CrossRef]
- Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
- Chiang, W.L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J.E.; et al. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. March 2023. Available online: https://lmsys.org/blog/2023-03-30-vicuna/ (accessed on 2 March 2024).
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics—ACL ’02, Philadelphia, PA, USA, 6–12 July 2002; p. 311. [Google Scholar] [CrossRef]
- Denkowski, M.; Lavie, A. Meteor Universal: Language Specific Translation Evaluation for Any Target Language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, MD, USA, 26–27 June 2014; pp. 376–380. [Google Scholar] [CrossRef]
- Lin, C.-Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; p. 10. [Google Scholar]
- Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar] [CrossRef]
- Zhang, Z.; Zhang, W.; Diao, W.; Yan, M.; Gao, X.; Sun, X. VAA: Visual Aligning Attention Model for Remote Sensing Image Captioning. IEEE Access 2019, 7, 137355–137364. [Google Scholar] [CrossRef]
- Li, Y.; Zhang, X.; Gu, J.; Li, C.; Wang, X.; Tang, X.; Jiao, L. Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
- Cheng, Q.; Huang, H.; Xu, Y.; Zhou, Y.; Li, H.; Wang, Z. NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
Dataset | Task | #Images | Image Size | Text per Image | Resolution (m) |
---|---|---|---|---|---|
UCM-captions [2] | Captioning | 2100 | 256 × 256 | 5 | 0.3048 |
UAV [23] | Captioning | 2628 | 256 × 256 | 3 | 0.02 |
RSVQA-LR [3] | VQA | 772 | 256 × 256 | 100–101 | 10 |
RSIVQA-DOTA [30] | VQA | 1868 | Varies | 3–24 | Varies |
RS-instructions dataset | Captioning + VQA | 7058 | Varies | Varies | Varies |
Training | Model (UCM-captions) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr | Training Time
---|---|---|---|---|---|---|---|---|---
Single | CLIP336-Vicuna7B | 88.77 | 82.84 | 77.62 | 72.78 | 46.75 | 83.72 | 343.21 | 2.10 h
  | CLIP336-Vicuna13B | 90.41 | 84.54 | 79.39 | 74.50 | 48.62 | 86.09 | 355.06 | 3.73 h
Joint | CLIP336-Vicuna7B | 88.70 | 82.88 | 77.70 | 72.84 | 47.98 | 85.17 | 349.43 | 11.04 h
  | CLIP336-Vicuna13B | 90.00 | 84.88 | 80.30 | 76.03 | 49.21 | 85.78 | 355.61 | 17.04 h
Training | Model (UAV) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr | Training Time
---|---|---|---|---|---|---|---|---|---
Single | CLIP336-Vicuna7B | 82.60 | 72.67 | 63.04 | 53.27 | 42.03 | 78.21 | 427.17 | 1.83 h
  | CLIP336-Vicuna13B | 82.64 | 72.51 | 62.57 | 52.79 | 42.77 | 79.15 | 423.18 | 3.81 h
Joint | CLIP336-Vicuna7B | 79.81 | 69.27 | 59.00 | 49.02 | 40.46 | 76.80 | 404.54 | 11.04 h
  | CLIP336-Vicuna13B | 79.82 | 69.60 | 58.84 | 49.24 | 40.14 | 76.28 | 390.30 | 17.04 h
Training | Model (RSVQA-LR) | Count | Presence | Comparisons | Urban/Rural | Average | Overall | Training Time
---|---|---|---|---|---|---|---|---
Single | CLIP336-Vicuna7B | 75.05 | 92.97 | 91.23 | 95.00 | 88.56 | 87.20 | 3.25 h
  | CLIP336-Vicuna13B | 75.87 | 92.32 | 91.37 | 95.00 | 88.64 | 87.22 | 7.10 h
Joint | CLIP336-Vicuna7B | 74.38 | 92.80 | 91.33 | 94.00 | 88.13 | 86.95 | 11.04 h
  | CLIP336-Vicuna13B | 73.76 | 92.27 | 91.37 | 95.00 | 88.10 | 86.58 | 19.40 h
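In the VQA tables, "Average" is the unweighted mean of the four per-category accuracies, while "Overall" is the accuracy over all questions and is therefore dominated by the more frequent question types (following the convention of the RSVQA benchmark [3]). As a quick check on the first row: $(75.05 + 92.97 + 91.23 + 95.00)/4 \approx 88.56$.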
Training | Model (RSIVQA-DOTA) | Count (RMSE) | Yes/No (Pr) | Yes/No (Re) | Yes/No (F1) | Training Time
---|---|---|---|---|---|---
Single | CLIP336-Vicuna7B | 221.40 | 91.49 | 72.07 | 80.63 | 1.95 h
  | CLIP336-Vicuna13B | 226.79 | 89.03 | 82.79 | 85.80 | 3.88 h
Joint | CLIP336-Vicuna7B | 209.47 | 85.26 | 86.15 | 85.70 | 11.04 h
  | CLIP336-Vicuna13B | 232.75 | 100 | 33.28 | 49.94 | 19.40 h
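For the Yes/No columns, F1 is the harmonic mean of precision and recall, $F1 = \frac{2 \cdot Pr \cdot Re}{Pr + Re}$; checking the first row gives $\frac{2 \cdot 91.49 \cdot 72.07}{91.49 + 72.07} \approx 80.63$. The last row also shows why F1 is reported: perfect precision (100) paired with low recall (33.28) still yields an F1 of only 49.94.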
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr
---|---|---|---|---|---|---|---
CSMLF [19] | 43.61 | 27.28 | 18.55 | 12.10 | 13.20 | 39.27 | 22.27 |
VGG19+LSTM [2] | 63.80 | 53.60 | 37.70 | 21.90 | 20.60 | - | 45.10 |
GoogLeNet-hard att. [20] | 83.75 | 76.22 | 70.42 | 65.62 | 44.89 | 79.62 | 320.01
VAA [65] | 81.92 | 75.11 | 69.27 | 63.87 | 43.80 | 78.24 | 339.46 |
Yuan et al. [41] | 83.30 | 77.12 | 71.54 | 66.23 | 43.71 | 77.63 | 316.84 |
ResNet18 MSF [25] | 83.06 | 75.98 | 69.72 | 63.45 | - | 73.18 | 329.56 |
SD-RSIC [21] | 74.80 | 66.40 | 59.80 | 53.80 | 39.00 | 69.50 | 213.20
Hoxha et al. [23] | 76.53 | 69.47 | 64.17 | 37.02 | 37.02 | 68.77 | 292.28 |
Li et al. [24] | 82.10 | 76.22 | 71.40 | 67.00 | 47.75 | 75.67 | 285.47 |
MSA [45] | 83.37 | 78.22 | 74.06 | 70.21 | 45.04 | 79.18 | 325.71 |
Word-sentence [37] | 79.31 | 72.37 | 66.71 | 62.02 | 43.95 | 71.32 | 278.71 |
Structured att. [39] | 85.38 | 80.35 | 75.72 | 71.49 | 46.32 | 81.41 | 334.89 |
Zia et al. [26] | 83.90 | 76.90 | 71.50 | 67.50 | 44.60 | - | 323.10 |
Li et al. [66] | 85.18 | 79.25 | 74.32 | 69.76 | 45.71 | 80.72 | 338.87 |
SCST [28] | 83.40 | 77.60 | 72.30 | 67.60 | - | 76.00 | 336.00 |
Wang et al. [40] | 84.30 | 77.50 | 71.10 | 65.10 | 45.30 | 78.50 | 338.10 |
MLCA [67] | 82.60 | 77.00 | 71.70 | 66.80 | 43.50 | 77.20 | 324.00 |
Ye et al. [43] | 86.96 | 82.24 | 77.88 | 73.76 | 49.06 | 83.64 | 371.02 |
CLIP336-Vicuna7B (Joint) | 88.70 | 82.88 | 77.70 | 72.84 | 47.98 | 85.17 | 349.43 |
CLIP336-Vicuna13B (Joint) | 90.00 | 84.88 | 80.30 | 76.03 | 49.21 | 85.78 | 355.61 |
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr
---|---|---|---|---|---|---|---
Hoxha et al. [23] | 68.84 | 58.05 | 48.33 | 39.22 | 32.81 | 69.63 | 391.31 |
Hoxha et al. [23] | 65.13 | 56.53 | 48.15 | 39.69 | 32.17 | 69.31 | 389.45 |
Bashmal et al. [12] | 77.11 | 66.45 | 55.99 | 45.17 | 38.18 | 75.19 | 390.27 |
CLIP336-Vicuna7B (Joint) | 79.81 | 69.27 | 59.00 | 49.02 | 40.46 | 76.80 | 404.54 |
CLIP336-Vicuna13B (Joint) | 79.82 | 69.60 | 58.84 | 49.24 | 40.14 | 76.28 | 390.30 |
Method | Count | Presence | Comparisons | Urban/Rural | Average | Overall |
---|---|---|---|---|---|---|
Lobry et al. [3] | 67.01 | 87.46 | 81.50 | 90.00 | 81.49 | 79.08 |
Yuan et al. [31] | 68.53 | 90.13 | 86.91 | 92.00 | 84.39 | 82.50 |
Bazi et al. [32] | 72.22 | 91.06 | 91.16 | 92.66 | 86.78 | 85.56 |
CLIP336-Vicuna7B (Joint) | 74.38 | 92.80 | 91.33 | 94.00 | 88.13 | 86.95 |
CLIP336-Vicuna13B (Joint) | 73.76 | 92.27 | 91.37 | 95.00 | 88.10 | 86.58 |