🏆 Contributions

  1. We generate a novel instruction-tuning dataset aimed at equipping vision-language models with the ability to understand multispectral data.

  2. We introduce a first-of-its-kind spectral-domain vision-language framework designed to effectively process multispectral data.

  3. We demonstrate the benefits of language-grounded features, which lead to improved performance in classification and scene description generation tasks on multispectral imagery.

Visual Backbone

The visual backbone of Spectral-LLaVA leverages the encoder component of SpectralGPT to extract multispectral features. This encoder is designed to learn robust spectrally-aware visual representations from multispectral data, capturing essential spectral and spatial correlations. Unlike the original SpectralGPT framework, which includes masking and reconstruction, Spectral-LLaVA focuses solely on the encoder's pre-trained representations for feature extraction.
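
As an illustration of this design choice, the sketch below extracts frozen features from a small stand-in encoder. The stub architecture, band count, and tensor shapes are assumptions made for illustration; in Spectral-LLaVA the features come from the actual pre-trained SpectralGPT encoder.

import torch
import torch.nn as nn

# Stand-in encoder for illustration only; Spectral-LLaVA uses the pre-trained
# SpectralGPT encoder here, without the masking/reconstruction branch.
class SpectralEncoderStub(nn.Module):
    def __init__(self, in_bands=12, patch=8, dim=768, depth=2):
        super().__init__()
        # joint spatial-spectral patch embedding over all input bands
        self.patchify = nn.Conv2d(in_bands, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                     # x: (B, bands, H, W)
        tokens = self.patchify(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.blocks(tokens)                            # visual tokens Z_v

encoder = SpectralEncoderStub().eval()
for p in encoder.parameters():                  # encoder stays frozen;
    p.requires_grad = False                     # only its features are used

x_v = torch.randn(2, 12, 96, 96)                # multispectral patches X_v
with torch.no_grad():
    z_v = encoder(x_v)                          # Z_v, here (2, 144, 768)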

Multimodal Projector

Following the LLaVA framework, Spectral-LLaVA employs a trainable linear projection layer to align visual and language modalities. For an input image Xv, the pre-trained SpectralGPT encoder extracts multispectral features Zv = g(Xv). Then, a projection matrix W is applied to convert Zv into language embedding tokens Hv, matching the dimensionality of the word embedding space in the language model.
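
A minimal sketch of this projection is shown below; the feature and embedding dimensions are placeholders chosen for illustration, not the exact values used in the paper.

import torch
import torch.nn as nn

# Trainable linear projector: maps encoder features Z_v to language-embedding
# tokens H_v = W * Z_v. The dimensions below are illustrative assumptions.
vision_dim, llm_dim = 768, 4096
projector = nn.Linear(vision_dim, llm_dim)

z_v = torch.randn(2, 144, vision_dim)     # Z_v = g(X_v) from the spectral encoder
h_v = projector(z_v)                      # H_v, shape (2, 144, 4096)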

Large Language Model (LLM)

Spectral-LLaVA utilizes the LLaMA3 model as its language backbone. This decoder-only large language model is fine-tuned to integrate multispectral features and perform downstream tasks.
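
The sketch below illustrates the LLaVA-style fusion step: the projected visual tokens H_v are prepended to the prompt's word embeddings and the combined sequence is fed to a decoder-only language model. A small open model ("gpt2") stands in for the LLaMA3 backbone so the snippet runs without gated weights; the prompt text and token counts are illustrative.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in for the LLaMA3 backbone used in Spectral-LLaVA.
lm = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")

llm_dim = lm.config.hidden_size                      # 768 for gpt2
h_v = torch.randn(1, 144, llm_dim)                   # projected visual tokens H_v

ids = tok("Describe the scene.", return_tensors="pt").input_ids
h_text = lm.get_input_embeddings()(ids)              # word embeddings of the prompt

inputs_embeds = torch.cat([h_v, h_text], dim=1)      # [H_v ; text] token sequence
out = lm(inputs_embeds=inputs_embeds)
print(out.logits.shape)                              # (1, 144 + prompt_len, vocab_size)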

Spectral-LLaVA: Architecture

RS Multimodal Instruction Dataset

The pipeline starts by converting multispectral images into RGB-domain optical images, a critical step to standardize the input format and ensure compatibility with the state-of-the-art image captioning model ShareCaptioner, part of the ShareGPT4V project. This model is then employed to generate detailed captions describing the scene content. To improve caption accuracy and semantic richness, we integrate metadata, including image labels and spatial attributes, into the captioning process. Because the captions are model-generated, we treat the resulting language dataset as pseudo-data of uncertain accuracy; nevertheless, our experiments with language-grounded features demonstrate its utility. The results show that integrating the generated captions enhances the semantic representation of visual features, highlighting their value for contextual understanding.
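
A minimal sketch of the first two steps is given below. The Sentinel-2 band indices used for the RGB rendering, the percentile stretch, and the prompt wording are assumptions for illustration; the actual captions are produced by ShareCaptioner.

import numpy as np

# Hedged sketch: render an RGB-domain image from a multispectral patch and
# build a metadata-enriched captioning prompt. Band order and prompt wording
# are illustrative assumptions.
def multispectral_to_rgb(ms, rgb_bands=(3, 2, 1), low=2, high=98):
    """ms: (bands, H, W) reflectance array -> uint8 RGB via a percentile stretch."""
    rgb = ms[list(rgb_bands)].astype(np.float32)
    lo, hi = np.percentile(rgb, [low, high])
    rgb = np.clip((rgb - lo) / (hi - lo + 1e-6), 0.0, 1.0)
    return (rgb.transpose(1, 2, 0) * 255).astype(np.uint8)   # (H, W, 3)

def build_caption_prompt(label, lat, lon):
    """Fold metadata (class label, spatial attributes) into the caption request."""
    return (f"Describe this satellite scene in detail. "
            f"Known label: {label}. Approximate location: ({lat:.2f}, {lon:.2f}).")

ms_patch = np.random.rand(12, 64, 64)          # stand-in multispectral patch
rgb = multispectral_to_rgb(ms_patch)           # RGB image passed to the captioner
prompt = build_caption_prompt("River", 41.01, 28.97)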


Qualitative Results

Qualitative results of Spectral-LLaVA.

Spectral Features vs. Language-Grounded Spectral Features

First, as a qualitative evaluation, we apply t-SNE (using the scikit-learn implementation) to SpectralGPT vision-only features and to Spectral-LLaVA language-grounded features derived from class labels, computed on EuroSAT samples with known category labels. The results are visualized in the figure below.

t-SNE visualization of SpectralGPT vision-only features and Spectral-LLaVA language-grounded features. The language-grounded features show a better-clustered categorical structure than the vision-only features.
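
A sketch of how such a comparison can be produced with scikit-learn is given below; random arrays stand in for the actual SpectralGPT and Spectral-LLaVA features, and the feature dimensions are assumptions.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project both feature sets to 2-D with t-SNE and colour points by EuroSAT class.
rng = np.random.default_rng(0)
vision_feats = rng.normal(size=(1000, 768))       # stand-in vision-only features
grounded_feats = rng.normal(size=(1000, 4096))    # stand-in language-grounded features
labels = rng.integers(0, 10, size=1000)           # 10 EuroSAT categories

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, feats, title in zip(
        axes,
        (vision_feats, grounded_feats),
        ("SpectralGPT (vision-only)", "Spectral-LLaVA (language-grounded)")):
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(feats)
    ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=4)
    ax.set_title(title)
plt.tight_layout()
plt.savefig("tsne_comparison.png", dpi=150)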

BibTeX


@misc{karanfil2025multispectral,
  title={A Vision-Language Framework for Multispectral Scene Representation Using Language-Grounded Features},
  author={Enes Karanfil and Nevrez Imamoglu and Erkut Erdem and Aykut Erdem},
  year={2025},
  archivePrefix={arXiv},
  institution={AIST, Tokyo, Japan; Hacettepe University, Ankara, Turkey; Koç University, Istanbul, Turkey}
}
  

Acknowledgement

This work was supported by the AIST policy-based budget project, “R&D on Generative AI Foundation Models for the Physical Domain”.
