MLP-Mixer: An all-MLP Architecture for Vision

Review: Vision Transformer

ViT

Vision Transformer: Another Look

Q: Can we use something simpler than self-attention to mix across tokens?

A: MLP!
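
A minimal sketch of the idea (toy shapes and variable names assumed here, not from the paper): token mixing replaces self-attention with a single learned matrix applied along the token axis, shared across all channels.

```python
import jax
import jax.numpy as jnp

# Toy sizes: batch, tokens (patches), channels.
B, S, C = 2, 16, 8
x = jax.random.normal(jax.random.PRNGKey(0), (B, S, C))

# Token mixing: every output token is a learned linear combination of
# all input tokens; the same (S, S) matrix is shared by every channel.
W_tok = jax.random.normal(jax.random.PRNGKey(1), (S, S))
mixed = jnp.einsum('ts,bsc->btc', W_tok, x)  # (B, S, C)
```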

Architecture

MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Each Mixer layer contains one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include skip-connections, dropout, and layer norm applied on the channels.
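
The paper's appendix gives Flax pseudocode for this; a minimal sketch of one Mixer layer along those lines (dropout omitted):

```python
import jax.numpy as jnp
import flax.linen as nn

class MlpBlock(nn.Module):
    mlp_dim: int

    @nn.compact
    def __call__(self, x):
        # Two fully-connected layers with a GELU nonlinearity in between.
        y = nn.Dense(self.mlp_dim)(x)
        y = nn.gelu(y)
        return nn.Dense(x.shape[-1])(y)

class MixerBlock(nn.Module):
    tokens_mlp_dim: int
    channels_mlp_dim: int

    @nn.compact
    def __call__(self, x):  # x: (batch, tokens, channels)
        # Token mixing: layer norm, transpose so the MLP acts across
        # tokens, transpose back, with a skip-connection around it.
        y = nn.LayerNorm()(x)
        y = jnp.swapaxes(y, 1, 2)
        y = MlpBlock(self.tokens_mlp_dim, name='token_mixing')(y)
        y = jnp.swapaxes(y, 1, 2)
        x = x + y
        # Channel mixing: the same MLP pattern applied per token.
        y = nn.LayerNorm()(x)
        return x + MlpBlock(self.channels_mlp_dim, name='channel_mixing')(y)
```

Stacking these blocks between the per-patch embedding and a global-average-pooling + classifier head gives the full model.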

In the extreme case, our architecture can be seen as a very special CNN, which uses 1×1 convolutions for channel mixing, and single-channel depth-wise convolutions of a full receptive field and parameter sharing for token mixing.

→ It may look that way, but in terms of results it is still competitive.
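
A quick numerical check of the 1×1-convolution reading (toy shapes and variable names assumed): the channel-mixing dense layer applied at every token is exactly a width-1 convolution over the token sequence. The token-mixing MLP corresponds, in the same spirit, to a depth-wise filter whose receptive field spans all patches, with weights shared across channels.

```python
import jax
import jax.numpy as jnp

B, S, C = 2, 16, 8  # toy sizes: batch, tokens, channels
x = jax.random.normal(jax.random.PRNGKey(0), (B, S, C))
W = jax.random.normal(jax.random.PRNGKey(1), (C, C))

# Channel mixing: a dense layer applied independently at each token.
dense_out = x @ W

# The same computation as a 1x1 (width-1) convolution over the token
# sequence, with C input and C output channels.
conv_out = jax.lax.conv_general_dilated(
    x, W[None, :, :],                   # kernel: (width=1, in=C, out=C)
    window_strides=(1,), padding='VALID',
    dimension_numbers=('NWC', 'WIO', 'NWC'))

assert jnp.allclose(dense_out, conv_out, atol=1e-5)
```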