Module vit


Vision Transformer (ViT) implementation.

The Vision Transformer applies the transformer architecture to image classification by splitting an image into fixed-size patches and processing the patches as a sequence of tokens.

Key characteristics:

  • Image patches as sequence tokens
  • Self-attention between patches
  • Position embeddings
  • CLS token for classification
  • Layer normalization
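As a rough illustration of how patches become sequence tokens, the sketch below computes the token count for a square image. It assumes non-overlapping square patches and a single CLS token prepended to the patch tokens; `num_patches` and `sequence_length` are hypothetical helper names and are not part of this module's API.

```rust
/// Hypothetical helper, not part of this module's API.
/// Number of non-overlapping square patches in a square image.
/// Returns `None` if the patch size is zero or does not evenly
/// divide the image size.
fn num_patches(image_size: usize, patch_size: usize) -> Option<usize> {
    if patch_size == 0 || image_size % patch_size != 0 {
        return None;
    }
    let per_side = image_size / patch_size;
    Some(per_side * per_side)
}

/// Hypothetical helper: sequence length seen by the encoder,
/// assuming one token per patch plus one CLS token.
fn sequence_length(image_size: usize, patch_size: usize) -> Option<usize> {
    num_patches(image_size, patch_size).map(|n| n + 1)
}
```

Under these assumptions, a 224×224 image with 16×16 patches gives 14 × 14 = 196 patch tokens, so `sequence_length(224, 16)` is `Some(197)`. In this setup the position embeddings would also need one entry per token, CLS included.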

Structs

Config
Embeddings
Encoder
Model