Paper

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale


Architecture


Gif Version


Vision Transformer


  1. Split the image into N input patches, each of shape 3x16x16
  2. Linearly project each patch to a D-dimensional vector: a fully-connected layer (equivalent to a 16x16 conv with stride 16)
  3. Add a positional embedding: a learned D-dim vector per position
  4. Run through the Transformer encoder (output shape = input shape, D dims per token)
  5. Special extra input: a classification token ([CLS], a learned D-dim vector)
  6. Linearly project the [CLS] output to a C-dim vector of predicted class scores
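The steps above can be sketched end to end in NumPy. This is a minimal illustration with random (untrained) weights and arbitrary sizes (D=64, C=1000, 224x224 input); the Transformer encoder itself is elided, since it preserves the token shape:

```python
import numpy as np

def patchify(img, p=16):
    # (3, H, W) -> (N, 3*p*p) with N = (H/p) * (W/p) patches
    c, h, w = img.shape
    patches = img.reshape(c, h // p, p, w // p, p)
    return patches.transpose(1, 3, 0, 2, 4).reshape(-1, c * p * p)

rng = np.random.default_rng(0)
D = 64
img = rng.standard_normal((3, 224, 224))

x = patchify(img)                                 # step 1: (196, 768)
W_e = rng.standard_normal((x.shape[1], D)) * 0.02
tokens = x @ W_e                                  # step 2: (196, 64)
cls = rng.standard_normal((1, D))                 # step 5: learned [CLS] token
tokens = np.concatenate([cls, tokens])            # (197, 64)
pos = rng.standard_normal((tokens.shape[0], D))   # step 3: learned pos. embedding
tokens = tokens + pos
# step 4: L transformer encoder blocks would go here (shape stays 197 x D)
W_head = rng.standard_normal((D, 1000)) * 0.02
logits = tokens[0] @ W_head                       # step 6: classify from [CLS]
```

Note how the patch projection in step 2 is just one matrix multiply per patch, which is why it is equivalent to a single stride-16 convolution over the image.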

ViT vs ResNets


ViT vs CNN

CNN


In most CNNs (including ResNets), resolution decreases and the number of channels increases as you go deeper in the network (hierarchical architecture)
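The hierarchy is easy to see in the standard ResNet-50 stage shapes for a 224x224 input (values from the original ResNet architecture):

```python
# (stage, output channels, spatial resolution) for ResNet-50 on a 224x224 input
stages = [
    ("stem",    64, 56),   # 7x7 conv stride 2 + max-pool: 224 -> 56
    ("stage2", 256, 56),
    ("stage3", 512, 28),
    ("stage4", 1024, 14),
    ("stage5", 2048, 7),
]
for name, channels, res in stages:
    print(f"{name}: {channels} channels @ {res}x{res}")
```

Channels grow ~2x per stage while the spatial grid shrinks 2x per side, which is exactly the hierarchical pattern ViT does not follow.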

ViT


In a ViT, all blocks have the same resolution and number of channels (isotropic architecture)

→ may lose fine-grained details (high-resolution features), since the image is downsampled to 16x16 patches once at the input
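The isotropic layout can be checked with a toy sketch: a single-head encoder block (NumPy, random weights, layer norm omitted for brevity) applied 12 times, showing the token shape (197, D) never changes with depth:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 64, 197  # arbitrary toy sizes: 196 patches + [CLS]

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_block(x, params):
    Wq, Wk, Wv, Wo, W1, W2 = params
    # single-head self-attention with residual connection
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = x + attn @ Wo
    # MLP with residual connection (ReLU instead of GELU for simplicity)
    x = x + np.maximum(x @ W1, 0) @ W2
    return x

params = [rng.standard_normal(s) * 0.02
          for s in [(D, D)] * 4 + [(D, 4 * D), (4 * D, D)]]
x = rng.standard_normal((N, D))
for _ in range(12):        # 12 blocks, same input/output shape throughout
    x = encoder_block(x, params)
print(x.shape)             # (197, 64) at every depth
```

Every block maps (N, D) to (N, D), so no spatial detail is regained or lost inside the encoder; whatever the 16x16 patchification discarded stays discarded.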

Inductive bias

ViT has much less image-specific inductive bias than CNNs: CNNs bake locality, 2D neighborhood structure, and translation equivariance into every layer, whereas in ViT only the MLP layers are local and translation-equivariant, self-attention is global, and the 2D structure is used only when cutting the image into patches.