Visual Backbone
The visual backbone of Spectral-LLaVA uses the encoder component of SpectralGPT to extract multispectral features. This encoder learns robust, spectrally aware visual representations from multispectral data, capturing essential spectral and spatial correlations. Unlike the original SpectralGPT framework, which also includes masking and reconstruction stages, Spectral-LLaVA relies solely on the encoder's pre-trained representations for feature extraction.
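As a concrete illustration, the feature-extraction step can be sketched in PyTorch as below. The simplified encoder class, band count, and tensor shapes are assumptions for illustration only; the actual SpectralGPT encoder tokenizes 3D spectral-spatial cubes and is loaded from its pre-trained weights rather than defined from scratch.

```python
import torch
import torch.nn as nn

class SpectralGPTEncoder(nn.Module):
    """Simplified 2D stand-in for the pre-trained SpectralGPT encoder (assumption)."""
    def __init__(self, in_bands: int = 12, embed_dim: int = 768, patch: int = 16):
        super().__init__()
        # Patch embedding followed by transformer blocks, mirroring a ViT-style encoder.
        self.patch_embed = nn.Conv2d(in_bands, embed_dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x_v: torch.Tensor) -> torch.Tensor:
        tokens = self.patch_embed(x_v).flatten(2).transpose(1, 2)  # (B, N, D) patch tokens
        return self.blocks(tokens)                                 # Z_v = g(X_v)

encoder = SpectralGPTEncoder().eval()
for p in encoder.parameters():          # encoder weights stay frozen
    p.requires_grad_(False)

x_v = torch.randn(2, 12, 128, 128)      # batch of 12-band multispectral images (assumed shape)
with torch.no_grad():
    z_v = encoder(x_v)                  # Z_v: (2, 64, 768) multispectral feature tokens
```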
Multimodal Projector
Following the LLaVA framework, Spectral-LLaVA employs a trainable linear projection layer to align visual and language modalities.
For an input image $X_v$, the pre-trained SpectralGPT encoder extracts multispectral features $Z_v = g(X_v)$. A trainable projection matrix $W$ is then applied to convert $Z_v$ into language embedding tokens $H_v = W \cdot Z_v$, matching the dimensionality of the word embedding space in the language model.
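A minimal sketch of this projection is given below; the encoder and LLM hidden sizes (768 and 4096) are assumed values, not specified by the text.

```python
import torch
import torch.nn as nn

encoder_dim, llm_dim = 768, 4096                           # assumed SpectralGPT / LLaMA3 hidden sizes
projector = nn.Linear(encoder_dim, llm_dim, bias=False)    # trainable projection matrix W

z_v = torch.randn(2, 64, encoder_dim)                      # Z_v = g(X_v): (batch, patches, encoder_dim)
h_v = projector(z_v)                                       # H_v = W·Z_v: (batch, patches, llm_dim)
```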
Large Language Model (LLM)
Spectral-LLaVA uses LLaMA3 as its language backbone. This decoder-only large language model is fine-tuned to integrate the projected multispectral features and perform downstream tasks.
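One common way to feed such features to a decoder-only LLM is to prepend the projected visual tokens to the text token embeddings, as sketched below with Hugging Face `transformers`. The checkpoint name, the example prompt, and the simple prepend strategy are assumptions for illustration, not the exact Spectral-LLaVA recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"      # assumed LLaMA3 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Describe the land cover in this scene."     # illustrative instruction
text_ids = tokenizer(prompt, return_tensors="pt").input_ids
text_emb = llm.get_input_embeddings()(text_ids)       # (1, T, hidden_size)

# H_v would come from the projector; random here only to keep the sketch self-contained.
h_v = torch.randn(1, 64, llm.config.hidden_size, dtype=text_emb.dtype)

inputs_embeds = torch.cat([h_v, text_emb], dim=1)     # visual tokens first, then text tokens
outputs = llm(inputs_embeds=inputs_embeds)            # next-token logits over the full sequence
```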