MLP-Mixer: An all-MLP Architecture for Vision

Review: Vision Transformer

ViT

Vision Transformer: Another Look

Q: Can we use something simpler than self-attention to mix across tokens?

A: MLP!
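
A minimal sketch of the idea (toy shapes and variable names assumed here, not from the paper): token mixing replaces self-attention with a single learned matrix applied along the token axis, shared across all channels.

```python
import jax
import jax.numpy as jnp

# Toy sizes: batch, tokens (patches), channels.
B, S, C = 2, 16, 8
x = jax.random.normal(jax.random.PRNGKey(0), (B, S, C))

# Token mixing: every output token is a learned linear combination of
# all input tokens; the same (S, S) matrix is shared by every channel.
W_tok = jax.random.normal(jax.random.PRNGKey(1), (S, S))
mixed = jnp.einsum('ts,bsc->btc', W_tok, x)  # (B, S, C)
```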

Architecture

MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Each Mixer layer contains one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include skip-connections, dropout, and layer norm applied on the channels.
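
The paper's appendix gives Flax pseudocode for this; a minimal sketch of one Mixer layer along those lines (dropout omitted):

```python
import jax.numpy as jnp
import flax.linen as nn

class MlpBlock(nn.Module):
    mlp_dim: int

    @nn.compact
    def __call__(self, x):
        # Two fully-connected layers with a GELU nonlinearity in between.
        y = nn.Dense(self.mlp_dim)(x)
        y = nn.gelu(y)
        return nn.Dense(x.shape[-1])(y)

class MixerBlock(nn.Module):
    tokens_mlp_dim: int
    channels_mlp_dim: int

    @nn.compact
    def __call__(self, x):  # x: (batch, tokens, channels)
        # Token mixing: layer norm, transpose so the MLP acts across
        # tokens, transpose back, with a skip-connection around it.
        y = nn.LayerNorm()(x)
        y = jnp.swapaxes(y, 1, 2)
        y = MlpBlock(self.tokens_mlp_dim, name='token_mixing')(y)
        y = jnp.swapaxes(y, 1, 2)
        x = x + y
        # Channel mixing: the same MLP pattern applied per token.
        y = nn.LayerNorm()(x)
        return x + MlpBlock(self.channels_mlp_dim, name='channel_mixing')(y)
```

Stacking these blocks between the per-patch embedding and a global-average-pooling + classifier head gives the full model.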

In the extreme case, our architecture can be seen as a very special CNN, which uses 1×1 convolutions for channel mixing, and single-channel depth-wise convolutions of a full receptive field and parameter sharing for token mixing.

→ It may look that way, but in terms of results it is still competitive.
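
A quick numerical check of the 1×1-convolution reading (toy shapes and variable names assumed): the channel-mixing dense layer applied at every token is exactly a width-1 convolution over the token sequence. The token-mixing MLP corresponds, in the same spirit, to a depth-wise filter whose receptive field spans all patches, with weights shared across channels.

```python
import jax
import jax.numpy as jnp

B, S, C = 2, 16, 8  # toy sizes: batch, tokens, channels
x = jax.random.normal(jax.random.PRNGKey(0), (B, S, C))
W = jax.random.normal(jax.random.PRNGKey(1), (C, C))

# Channel mixing: a dense layer applied independently at each token.
dense_out = x @ W

# The same computation as a 1x1 (width-1) convolution over the token
# sequence, with C input and C output channels.
conv_out = jax.lax.conv_general_dilated(
    x, W[None, :, :],                   # kernel: (width=1, in=C, out=C)
    window_strides=(1,), padding='VALID',
    dimension_numbers=('NWC', 'WIO', 'NWC'))

assert jnp.allclose(dense_out, conv_out, atol=1e-5)
```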