Multimodal, multi-purpose model that combines a Gemma-based language model with SigLIP image understanding.
See PaLiGemma details at:
The model is a multimodal combination of:
- SigLIP vision encoder
- Gemma language model
- Cross-projection layers
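
As a rough illustration of how these three parts fit together, the sketch below follows the data flow only: image patch features from the vision encoder are mapped into the language model's embedding width and placed in front of the text token embeddings. It assumes the projection is a single linear layer, uses toy dimensions and plain `Vec<f32>` values, and every name in it (`Projector`, `build_input_sequence`) is hypothetical rather than part of this crate's API.

```rust
/// Toy stand-in for the projection between the vision and language halves.
/// Hypothetical type: weights are stored as `out_dim` rows of `in_dim` columns.
struct Projector {
    weight: Vec<Vec<f32>>,
    bias: Vec<f32>,
}

impl Projector {
    /// Apply `y = W x + b` to one image patch feature vector.
    fn project(&self, patch: &[f32]) -> Vec<f32> {
        self.weight
            .iter()
            .zip(&self.bias)
            .map(|(row, b)| row.iter().zip(patch).map(|(w, x)| w * x).sum::<f32>() + b)
            .collect()
    }
}

/// Build the sequence handed to the language model:
/// projected image patches first, then the text token embeddings.
/// Assumes text embeddings already have the language model's width.
fn build_input_sequence(
    projector: &Projector,
    image_patches: &[Vec<f32>],
    text_embeddings: &[Vec<f32>],
) -> Vec<Vec<f32>> {
    let mut sequence: Vec<Vec<f32>> = image_patches
        .iter()
        .map(|patch| projector.project(patch))
        .collect();
    sequence.extend(text_embeddings.iter().cloned());
    sequence
}
```

With two image patches of width 3, a projector to width 2, and one text embedding of width 2, `build_input_sequence` returns three vectors of width 2, with the image-derived vectors first. A real implementation would operate on batched tensors and learned weights; the nested vectors here only make the shapes easy to follow.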
References: