Based on the BEiT self-supervised vision Transformer.
See “BEiT: BERT Pre-Training of Image Transformers”, Bao et al., 2021.