Mobile CLIP model that combines a lightweight vision encoder with a text encoder.
A mobile-optimized CLIP implementation that uses:
- FastViT as the vision encoder
- An OpenCLIP text encoder
- Projection layers that align the image and text feature spaces
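
The role of the projection layers can be illustrated with a minimal, dependency-free sketch. It does not use FastViT or OpenCLIP: the feature vectors and projection weights are random stand-ins, and the dimensions and helper names (`make_projection`, `project`, `l2_normalize`) are hypothetical, chosen only to show how two encoders with different output sizes can be mapped into one space and compared.

```python
import math
import random


def make_projection(in_dim, out_dim, rng):
    """Random stand-in for a learned linear projection (out_dim x in_dim)."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(weights, features):
    """Apply the linear projection: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


rng = random.Random(0)

# Hypothetical sizes: the two encoders emit features of different widths.
IMAGE_DIM, TEXT_DIM, EMBED_DIM = 8, 6, 4

# Stand-ins for the vision and text encoder outputs.
image_features = [rng.gauss(0, 1) for _ in range(IMAGE_DIM)]
text_features = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]

# One projection per modality maps both into the same EMBED_DIM space.
image_proj = make_projection(IMAGE_DIM, EMBED_DIM, rng)
text_proj = make_projection(TEXT_DIM, EMBED_DIM, rng)

image_embedding = l2_normalize(project(image_proj, image_features))
text_embedding = l2_normalize(project(text_proj, text_features))

# With unit-length embeddings, the dot product is the cosine similarity.
similarity = sum(a * b for a, b in zip(image_embedding, text_embedding))
```

In a trained model the projection weights are learned so that matching image and text pairs score a high similarity; here they are random, so the value itself carries no meaning.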
See model details at:
References: