Paper
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Architecture


Vision Transformer

- N input patches, each of shape 3x16x16
- Linear projection of each patch to a D-dimensional vector: a fully connected layer (equivalent to a convolution)
- Add positional embedding: a learned D-dim vector per position
- Transformer! (output has the same shape as the input: one D-dim vector per token)
- Special extra input: a classification token (D dims, learned)
- Linear projection to a C-dim vector of predicted class scores
- Not quite, in practice (see the sketch after this list):
  - In step 2, the layer is a Conv2D (p x p, 3→D, stride = p)
  - The MLPs in the Transformer are stacks of 1x1 convolutions
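A minimal sketch of these steps in PyTorch. This is a toy version, not the paper's exact implementation (it reuses PyTorch's default TransformerEncoderLayer rather than ViT's pre-norm GELU blocks), and sizes such as patch_size=16, dim=768, depth=12 are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Toy ViT: patchify, embed, add positions, transform, classify from [CLS]."""
    def __init__(self, img_size=224, patch_size=16, dim=768, depth=12, heads=12, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Step 2 "not quite": a p x p Conv2d with stride p applies the same
        # linear projection to every non-overlapping patch.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                 # learned classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))   # learned positional embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)         # output shape = input shape
        self.head = nn.Linear(dim, num_classes)                               # D -> C class scores

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.patch_embed(x)                  # (B, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) sequence of patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                      # same (B, N+1, D) shape out
        return self.head(x[:, 0])                # classify from the classification token

logits = MiniViT()(torch.randn(2, 3, 224, 224))  # -> shape (2, 1000)
```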
ViT vs ResNets

- Claim: ViT models have “less inductive bias” than ResNets, so they need more pretraining data to learn good features
- (Not sure about this explanation: “inductive bias” is not a well-defined concept we can measure!)
- CNN → local processing, translation equivariance (shared kernel) ⇒ inductive bias (a small demo after this list)
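A toy illustration of that last point (my own example, not from the notes): because the kernel is shared across positions, shifting the input of a convolution shifts its output by the same amount. Circular padding is used here only so the equivariance is exact under torch.roll, with no boundary effects.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, padding_mode="circular", bias=False)
x = torch.randn(1, 3, 32, 32)

shift = 5
y_of_shifted = conv(torch.roll(x, shifts=shift, dims=-1))   # shift, then convolve
shifted_of_y = torch.roll(conv(x), shifts=shift, dims=-1)   # convolve, then shift

# Shared kernel => the two orders agree: shift in = shift out.
print(torch.allclose(y_of_shifted, shifted_of_y, atol=1e-5))  # True
```

A ViT has no such built-in constraint: the learned positional embeddings treat each position independently, so any shift-invariance has to come from data.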
ViT vs CNN
CNN


In most CNNs (including ResNets), the resolution decreases and the number of channels increases as you go deeper in the network (hierarchical architecture)
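A rough sketch of that shape progression (layer count and channel widths are illustrative, loosely ResNet-50-like, not quoted from the notes):

```python
import torch
import torch.nn as nn

# Each "stage" halves the spatial resolution (stride 2) and grows the channel count.
stages = nn.Sequential(
    nn.Conv2d(3,    64,   kernel_size=7, stride=2, padding=3),  # 224 -> 112
    nn.Conv2d(64,   256,  kernel_size=3, stride=2, padding=1),  # 112 -> 56
    nn.Conv2d(256,  512,  kernel_size=3, stride=2, padding=1),  # 56  -> 28
    nn.Conv2d(512,  1024, kernel_size=3, stride=2, padding=1),  # 28  -> 14
    nn.Conv2d(1024, 2048, kernel_size=3, stride=2, padding=1),  # 14  -> 7
)

x = torch.randn(1, 3, 224, 224)
for stage in stages:
    x = stage(x)
    print(tuple(x.shape))   # channels go up, H and W go down: (1, 64, 112, 112) ... (1, 2048, 7, 7)
```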
ViT


In a ViT, all blocks have the same resolution and number of channels (isotropic architecture)
→ may lose fine details (no high-resolution feature maps)
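By contrast, a sketch of the isotropic case (dimensions assume a ViT-B/16 on a 224x224 image, not stated in the notes): every block sees the same token grid, with no downsampling and no channel growth.

```python
import torch
import torch.nn as nn

dim, tokens = 768, 197   # 14*14 patches + 1 classification token
block = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
blocks = nn.TransformerEncoder(block, num_layers=12)

x = torch.randn(1, tokens, dim)
for layer in blocks.layers:
    x = layer(x)
    print(tuple(x.shape))   # (1, 197, 768) at every depth
```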
Inductive bias
ViT has much less image-specific inductive bias than CNNs.