Paper
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Architecture


Vision Transformer

- N input patches, each of shape 3x16x16
- Linear projection of each patch to a D-dimensional vector: a fully connected layer (equivalent to a convolution)
- Add positional embedding: a learned D-dim vector per position
- Transformer! (output has the same shape as the input: one D-dim vector per token)
- Special extra input: a classification token (D dims, learned)
- Linear projection to a C-dim vector of predicted class scores
- Not quite, in practice (see the sketch after this list):
  - In step 2, the layer is a Conv2D (p x p, 3→D, stride = p)
  - The MLPs in the Transformer are stacks of 1x1 convolutions
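A minimal sketch of these steps in PyTorch. This is a toy version, not the paper's exact implementation (it reuses PyTorch's default TransformerEncoderLayer rather than ViT's pre-norm GELU blocks), and sizes such as patch_size=16, dim=768, depth=12 are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Toy ViT: patchify, embed, add positions, transform, classify from [CLS]."""
    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Step 2 "not quite": a p x p Conv2d with stride p applies the same
        # linear projection to every non-overlapping patch.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                 # learned classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))   # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)         # output shape = input shape
        self.head = nn.Linear(dim, num_classes)                               # D -> C class scores

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.patch_embed(x)                  # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) sequence of patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                      # same (B, N+1, D) shape out
        return self.head(x[:, 0])                # classify from the classification token

logits = MiniViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```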
ViT vs ResNets

- Claim: ViT models have “less inductive bias” than ResNets, so they need more pretraining data to learn good features
- (Not sure about this explanation: “inductive bias” is not a well-defined concept we can measure!)
- CNN → local processing, translation equivariance (shared kernel) ⇒ inductive bias (a small demo after this list)
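A toy illustration of that last point (my own example, not from the notes): because the kernel is shared across positions, shifting the input of a convolution shifts its output by the same amount. Circular padding is used here only so the equivariance is exact under torch.roll, with no boundary effects.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 3, 32, 32)

shift = 5
y_of_shifted = conv(torch.roll(x, shifts=shift, dims=-1))   # shift, then convolve
shifted_of_y = torch.roll(conv(x), shifts=shift, dims=-1)   # convolve, then shift

# Shared kernel => the two orders agree: shift in = shift out.
print(torch.allclose(y_of_shifted, shifted_of_y, atol=1e-5))  # True
```

A ViT has no such built-in constraint: the learned positional embeddings treat each position independently, so any shift-invariance has to come from data.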
ViT vs CNN
CNN


In most CNNs (including ResNets), the resolution decreases and the number of channels increases as you go deeper in the network (hierarchical architecture)
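A rough sketch of that shape progression (layer count and channel widths are illustrative, loosely ResNet-50-like, not quoted from the notes):

```python
import torch
import torch.nn as nn

# Each "stage" halves the spatial resolution (stride 2) and grows the channel count.
stages = nn.Sequential(
    nn.Conv2d(3,    64,   kernel_size=7, stride=2, padding=3),  # 224 -> 112
    nn.Conv2d(64,   256,  kernel_size=3, stride=2, padding=1),  # 112 -> 56
    nn.Conv2d(256,  512,  kernel_size=3, stride=2, padding=1),  # 56  -> 28
    nn.Conv2d(512,  1024, kernel_size=3, stride=2, padding=1),  # 28  -> 14
    nn.Conv2d(1024, 2048, kernel_size=3, stride=2, padding=1),  # 14  -> 7
)

x = torch.randn(1, 3, 224, 224)
for stage in stages:
    x = stage(x)
    print(tuple(x.shape))   # channels go up, H and W go down: (1, 64, 112, 112) ... (1, 2048, 7, 7)
```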
ViT


In a ViT, all blocks have the same resolution and number of channels (isotropic architecture)
→ may lose fine details (no high-resolution feature maps)
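By contrast, a sketch of the isotropic case (dimensions assume a ViT-B/16 on a 224x224 image, not stated in the notes): every block sees the same token grid, with no downsampling and no channel growth.

```python
import torch
import torch.nn as nn

dim, tokens = 768, 197   # 14*14 patches + 1 classification token
block = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
blocks = nn.TransformerEncoder(block, num_layers=12)

x = torch.randn(1, tokens, dim)
for layer in blocks.layers:
    x = layer(x)
    print(tuple(x.shape))   # (1, 197, 768) at every depth
```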
Inductive bias
ViT has much less image-specific inductive bias than CNNs.