Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
(a) Swin Transformer builds hierarchical feature maps by merging image patches in deeper layers and has linear computational complexity with respect to input image size because self-attention is computed only within each local window. Thus, it can serve as a general-purpose backbone for both image classification and dense recognition tasks.
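To make the patch-merging step concrete, here is a minimal PyTorch sketch (not the authors' code): each 2x2 group of neighboring patch features is concatenated and linearly projected, halving the spatial resolution and doubling the channel dimension. The `PatchMerging` name and the (B, H, W, C) layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Sketch of a merging step: (B, H, W, C) -> (B, H/2, W/2, 2C)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                       # x: (B, H, W, C), H and W assumed even
        x0 = x[:, 0::2, 0::2, :]                # gather the four patches of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)     # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))         # (B, H/2, W/2, 2C)
```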
With an H x W grid of tokens, each attention matrix has $H^2W^2$ entries: quadratic in image size
Rather than allowing each token to attend to all other tokens, divide the tokens into windows of M x M tokens (here M=4) and compute attention only within each window
Total size of all attention matrices is now $M^2HW$
Linear in image size for fixed M! Swin uses M=7 throughout the network
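A quick arithmetic check of the two counts above, using the 56x56 token grid of Swin's first stage (a 224x224 image with 4x4 patches) and window size M=7:

```python
# Compare attention-matrix entries per head: global attention vs. M x M window attention.
H, W, M = 56, 56, 7                  # 56x56 tokens, window size 7
global_entries = (H * W) ** 2        # one (HW x HW) matrix: quadratic in image size
window_entries = (H * W) * M ** 2    # (HW/M^2) windows, each (M^2 x M^2): M^2 * HW total
print(global_entries, window_entries)    # 9834496 vs 153664 -> 64x fewer entries
```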
Problem: tokens only interact within the same window; there is no interaction between different windows
Solution: Alternate between normal windows and shifted windows in successive Transformer blocks
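A minimal sketch of how the alternating partitioning could be implemented, assuming a (B, H, W, C) feature map with H and W divisible by M. The cyclic shift by M//2 uses torch.roll; the attention mask that keeps wrapped-around tokens from attending to each other is omitted for brevity, and the function names are illustrative.

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)   # (num_windows*B, M*M, C)

def partition_for_block(x, M, shifted):
    # Even blocks: regular windows. Odd blocks: cyclically shift the map by (-M//2, -M//2)
    # first, so tokens that sat at window borders now share a window with their neighbors.
    if shifted:
        x = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
    return window_partition(x, M)
```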
ViT adds a positional embedding to the input tokens, encoding the absolute position of each token in the image.
Swin does not use positional embeddings; instead it encodes the relative position between patches when computing attention: $Q,K,V\in\mathbb{R}^{M^2\times d}$ (Query, Key, Value), $B\in\mathbb{R}^{M^2\times M^2}$ (learned biases)
$$ \text{Attention}(Q,K,V)=\text{SoftMax}\left(\frac{QK^\top}{\sqrt{d}}+B\right)V $$
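A minimal single-head sketch of window attention with the learned relative position bias $B$ from the formula above, assuming input windows of shape (num_windows*B, M*M, dim); the class and attribute names are illustrative, not the authors' API.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Single-head window attention with a learned relative position bias."""
    def __init__(self, dim, M):
        super().__init__()
        self.scale = dim ** -0.5                        # 1 / sqrt(d)
        self.qkv = nn.Linear(dim, 3 * dim)
        # One learnable bias per possible relative offset inside an M x M window.
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2))
        # Precompute, for every pair of window positions, its index into the bias table.
        coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
        coords = coords.flatten(1)                                   # (2, M*M)
        rel = coords[:, :, None] - coords[:, None, :]                # (2, M*M, M*M)
        idx = (rel[0] + M - 1) * (2 * M - 1) + (rel[1] + M - 1)      # (M*M, M*M)
        self.register_buffer("bias_idx", idx)

    def forward(self, x):                                # x: (num_windows*B, M*M, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale                # Q K^T / sqrt(d)
        attn = attn + self.bias_table[self.bias_idx]                 # add relative bias B
        return attn.softmax(dim=-1) @ v
```

Storing a $(2M-1)^2$ table rather than a free $M^2\times M^2$ matrix works because the bias depends only on the relative offset between the two positions, not on their absolute locations in the window.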