$Q$: Query, $K$: Key, $V$: Value
$QK^T = [\text{sim}_{ij}]$: inner product → similarity between query $i$ and key $j$
$1/\sqrt{d_k}$ → normalization (scaling): prevents vanishing gradients by keeping the dot products from growing too large, which would otherwise saturate the softmax
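
The definitions above can be sketched in plain Python. This is a minimal illustration, assuming the standard form $\text{softmax}(QK^T/\sqrt{d_k})\,V$; the softmax step, the function names, and the toy matrices are assumptions for illustration and are not taken from these notes.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Q: n x d_k, K: m x d_k, V: m x d_v, all as lists of lists."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)

    # scores[i][j] = sim_ij / sqrt(d_k), where sim_ij = q_i . k_j
    scores = [
        [scale * sum(q * k for q, k in zip(q_row, k_row)) for k_row in K]
        for q_row in Q
    ]

    # Each row of weights is a distribution over the keys.
    weights = [softmax(row) for row in scores]

    # output[i] = sum_j weights[i][j] * V[j]
    d_v = len(V[0])
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(d_v)]
        for w_row in weights
    ]
    return output, weights


# Toy example (hypothetical values): each query matches exactly one key.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = scaled_dot_product_attention(Q, K, V)
```

In this toy case each query puts more weight on the key it matches, so each output row is pulled toward the corresponding row of $V$. Without the `scale` factor the scores grow with $d_k$, and the softmax rows move toward one-hot vectors where gradients are close to zero.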
Outputs:
Operations:
Inputs: