The LLaVA (Large Language and Vision Assistant) model.
This provides the main model implementation, which combines a vision tower (CLIP) with
a language model (Llama) for multimodal capabilities. The architecture implements the training-free projection technique.
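
The general data flow described above (vision tower features, a projection into the language model's embedding space, then one combined sequence) can be sketched as follows. This is a minimal, dependency-free illustration and not the actual implementation: the function names, the tiny shapes, the fixed placeholder projection matrix, and the choice to place image embeddings ahead of the text embeddings are all assumptions made for the example. It does not attempt to reproduce the projection technique named above.

```python
# Hypothetical sketch: names, shapes, and values are illustrative assumptions,
# not taken from the actual implementation.
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def project_image_features(patch_features: Matrix, projection: Matrix) -> Matrix:
    """Map patch features (n_patches x d_vision) into the language model's
    embedding space (n_patches x d_text)."""
    return matmul(patch_features, projection)


def build_multimodal_sequence(image_embeds: Matrix, text_embeds: Matrix) -> Matrix:
    """Concatenate projected image embeddings with text token embeddings
    (image first here, which is an assumption of this sketch)."""
    return image_embeds + text_embeds


# Stand-in for vision tower output: 2 patches, d_vision = 3.
patch_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
# Placeholder projection (d_vision = 3 -> d_text = 2) with fixed values.
projection = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Stand-in for language model token embeddings: 3 tokens, d_text = 2.
text_embeds = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

image_embeds = project_image_features(patch_features, projection)
sequence = build_multimodal_sequence(image_embeds, text_embeds)
```

The key constraint the sketch illustrates is dimensional: after projection, every image embedding has the same width as a text token embedding, so the language model can consume both as a single sequence.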