Abstract
Large Vision Language Models (LVLMs) have shown promise in remote sensing applications, yet struggle with “uncommon” objects that lack sufficient public labeled data. This paper presents Enhanced-RSCLIP, a novel dual-prompt architecture that combines text prompting with exemplar-image processing for cattle herd detection in satellite imagery. Our approach introduces a key innovation where an exemplar-image preprocessing module using crop-based or attention-based algorithms extracts focused object features which are fed as a dual stream to a contrastive learning framework that fuses textual descriptions with visual exemplar embeddings. We evaluated our method on a custom dataset of 260 satellite images across UK and Nigerian regions. Enhanced-RSCLIP with crop-based exemplar processing achieved 72% accuracy in cattle detection and 56.2% overall accuracy on cross-domain transfer tasks, significantly outperforming text-only CLIP (31% overall accuracy). The dual-prompt architecture enables effective few-shot learning and cross-regional transfer from data-rich (UK) to data-sparse (Nigeria) environments, demonstrating a 41% improvement over baseline approaches for uncommon object detection in satellite imagery.
| Original language | English |
|---|---|
| Article number | 3071 |
| Journal | Electronics |
| Volume | 14 |
| Issue number | 15 |
| DOIs | |
| Publication status | Published - 31 Jul 2025 |
Keywords
- LVLM
- LVM
- RS-CLIP
- few-shot
- remote sensing
- satellite imagery
ASJC Scopus subject areas
- Control and Systems Engineering
- Signal Processing
- Hardware and Architecture
- Computer Networks and Communications
- Electrical and Electronic Engineering
Fingerprint
Dive into the research topics of 'Large vision language model: enhanced-RSCLIP with exemplar-image prompting for uncommon object detection in satellite imagery'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver