Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
(a) Swin Transformer builds hierarchical feature maps by merging image patches in deeper layers and has linear computational complexity with respect to input image size because self-attention is computed only within each local window. Thus, it can serve as a general-purpose backbone for both image classification and dense recognition tasks.
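To make the patch-merging step concrete, here is a minimal PyTorch sketch (not the authors' code): each 2x2 group of neighboring patch features is concatenated and linearly projected, halving the spatial resolution and doubling the channel dimension. The `PatchMerging` name and the (B, H, W, C) layout are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Sketch of a merging step: (B, H, W, C) -> (B, H/2, W/2, 2C)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                       # x: (B, H, W, C), H and W assumed even
        x0 = x[:, 0::2, 0::2, :]                # gather the four patches of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]
        x2 = x[:, 0::2, 1::2, :]
        x3 = x[:, 1::2, 1::2, :]
        x = torch.cat([x0, x1, x2, x3], dim=-1)     # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))         # (B, H/2, W/2, 2C)
```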
With an H x W grid of tokens, each attention matrix has $H^2W^2$ entries: quadratic in image size
Rather than allowing each token to attend to all other tokens, divide the tokens into windows of M x M tokens (here M=4) and compute attention only within each window
Total size of all attention matrices is now $M^2HW$
Linear in image size for fixed M! Swin uses M=7 throughout the network
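A quick arithmetic check of the two counts above, using the 56x56 token grid of Swin's first stage (a 224x224 image with 4x4 patches) and window size M=7:

```python
# Compare attention-matrix entries per head: global attention vs. M x M window attention.
H, W, M = 56, 56, 7                  # 56x56 tokens, window size 7
global_entries = (H * W) ** 2        # one (HW x HW) matrix: quadratic in image size
window_entries = (H * W) * M ** 2    # (HW/M^2) windows, each (M^2 x M^2): M^2 * HW total
print(global_entries, window_entries)    # 9834496 vs 153664 -> 64x fewer entries
```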
Problem: tokens only interact within the same window; there is no interaction between different windows
Solution: Alternate between normal windows and shifted windows in successive Transformer blocks
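A minimal sketch of how the alternating partitioning could be implemented, assuming a (B, H, W, C) feature map with H and W divisible by M. The cyclic shift by M//2 uses torch.roll; the attention mask that keeps wrapped-around tokens from attending to each other is omitted for brevity, and the function names are illustrative.

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)   # (num_windows*B, M*M, C)

def partition_for_block(x, M, shifted):
    # Even blocks: regular windows. Odd blocks: cyclically shift the map by (-M//2, -M//2)
    # first, so tokens that sat at window borders now share a window with their neighbors.
    if shifted:
        x = torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
    return window_partition(x, M)
```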
ViT adds a positional embedding to the input tokens, encoding the absolute position of each token in the image.
Swin does not use positional embeddings; instead it encodes the relative position between patches when computing attention: $Q,K,V\in\mathbb{R}^{M^2\times d}$ (Query, Key, Value), $B\in\mathbb{R}^{M^2\times M^2}$ (learned biases)
$$ \text{Attention}(Q,K,V)=\text{SoftMax}\left(\frac{QK^\top}{\sqrt{d}}+B\right)V $$
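A minimal single-head sketch of window attention with the learned relative position bias $B$ from the formula above, assuming input windows of shape (num_windows*B, M*M, dim); the class and attribute names are illustrative, not the authors' API.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Single-head window attention with a learned relative position bias."""
    def __init__(self, dim, M):
        super().__init__()
        self.scale = dim ** -0.5                        # 1 / sqrt(d)
        self.qkv = nn.Linear(dim, 3 * dim)
        # One learnable bias per possible relative offset inside an M x M window.
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2))
        # Precompute, for every pair of window positions, its index into the bias table.
        coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
        coords = coords.flatten(1)                                   # (2, M*M)
        rel = coords[:, :, None] - coords[:, None, :]                # (2, M*M, M*M)
        idx = (rel[0] + M - 1) * (2 * M - 1) + (rel[1] + M - 1)      # (M*M, M*M)
        self.register_buffer("bias_idx", idx)

    def forward(self, x):                                # x: (num_windows*B, M*M, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale                # Q K^T / sqrt(d)
        attn = attn + self.bias_table[self.bias_idx]                 # add relative bias B
        return attn.softmax(dim=-1) @ v
```

Storing a $(2M-1)^2$ table rather than a free $M^2\times M^2$ matrix works because the bias depends only on the relative offset between the two positions, not on their absolute locations in the window.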