Vision Transformer (ViT) implementation.
The Vision Transformer applies the transformer architecture to image classification by splitting each image into fixed-size patches and processing them as a sequence of tokens.
Key characteristics:
- Image patches as sequence tokens
- Self-attention between patches
- Position embeddings
- CLS token for classification
- Layer normalization
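
To make the "patches as sequence tokens" idea concrete, here is a minimal sketch of the patch-splitting step using only the standard library. The helpers `patchify` and `sequence_len` are hypothetical names chosen for illustration and are not taken from this module's API. The sketch also assumes a flat, row-major (height, width, channels) buffer, which may differ from the tensor layout the actual implementation uses.

```rust
/// Splits an image stored as a flat, row-major (height, width, channels)
/// buffer into non-overlapping `patch` x `patch` patches.
/// Each returned patch is flattened to `patch * patch * channels` values.
/// Hypothetical helper for illustration; not part of this module's API.
fn patchify(
    image: &[f32],
    height: usize,
    width: usize,
    channels: usize,
    patch: usize,
) -> Vec<Vec<f32>> {
    assert_eq!(image.len(), height * width * channels, "buffer size mismatch");
    assert!(
        patch > 0 && height % patch == 0 && width % patch == 0,
        "image sides must be divisible by the patch size"
    );

    let mut patches = Vec::with_capacity((height / patch) * (width / patch));
    // Walk the patch grid row by row, left to right.
    for py in (0..height).step_by(patch) {
        for px in (0..width).step_by(patch) {
            let mut flat = Vec::with_capacity(patch * patch * channels);
            for y in py..py + patch {
                // One contiguous row of the patch: `patch` pixels, all channels.
                let start = (y * width + px) * channels;
                flat.extend_from_slice(&image[start..start + patch * channels]);
            }
            patches.push(flat);
        }
    }
    patches
}

/// Sequence length seen by the encoder: one token per patch plus the CLS token.
/// Also a hypothetical helper.
fn sequence_len(height: usize, width: usize, patch: usize) -> usize {
    (height / patch) * (width / patch) + 1
}
```

Under these assumptions, a 224x224 image with 16x16 patches yields 14 x 14 = 196 patch tokens, so `sequence_len(224, 224, 16)` is 197 once the CLS token is counted. In a full model, each flattened patch would then be linearly projected to the embedding dimension and summed with its position embedding before entering the encoder.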
References: