MLP-Mixer: An all-MLP Architecture for Vision
Q: Can we use something simpler than self-attention to mix across tokens?
A: MLP!
MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Each Mixer layer contains one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include skip-connections, dropout, and layer norm on the channels.
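A minimal sketch of one Mixer layer, assuming PyTorch (module and hyperparameter names such as `MixerLayer`, `tokens_hidden`, and `channels_hidden` are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class MlpBlock(nn.Module):
    """Two fully-connected layers with a GELU nonlinearity in between."""
    def __init__(self, dim, hidden_dim, dropout=0.0):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.act = nn.GELU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return self.drop(self.fc2(self.drop(self.act(self.fc1(x)))))

class MixerLayer(nn.Module):
    """Token-mixing MLP then channel-mixing MLP, each preceded by
    layer norm on the channels and wrapped in a skip-connection."""
    def __init__(self, num_tokens, channels, tokens_hidden, channels_hidden, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = MlpBlock(num_tokens, tokens_hidden, dropout)
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = MlpBlock(channels, channels_hidden, dropout)

    def forward(self, x):                              # x: (batch, tokens, channels)
        y = self.norm1(x).transpose(1, 2)              # (batch, channels, tokens)
        x = x + self.token_mlp(y).transpose(1, 2)      # mix across tokens
        x = x + self.channel_mlp(self.norm2(x))        # mix across channels
        return x
```

A full model would stack several such layers after the per-patch linear embedding and finish with global average pooling and the classifier head.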
In the extreme case, our architecture can be seen as a very special CNN, which uses 1×1 convolutions for channel mixing, and single-channel depth-wise convolutions of a full receptive field and parameter sharing for token mixing.
→ It may look that way, but the results are still competitive.
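To make the 1×1-convolution view concrete, here is a small check (again assuming PyTorch; names are illustrative) that channel mixing via a per-token linear layer is the same map as a `kernel_size=1` convolution over the token axis with tied weights:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 16, 64)                 # (batch, tokens, channels)

# Channel mixing as a per-token Linear ...
fc = nn.Linear(64, 64)
out_fc = fc(x)

# ... equals a 1x1 convolution along the token axis with the same parameters.
conv = nn.Conv1d(64, 64, kernel_size=1)
conv.weight.data = fc.weight.data.unsqueeze(-1)   # (out, in) -> (out, in, 1)
conv.bias.data = fc.bias.data
out_conv = conv(x.transpose(1, 2)).transpose(1, 2)

print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True
```

Token mixing is the analogous degenerate case: a depth-wise convolution whose kernel spans all tokens, with the kernel shared across channels.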