Paper
Attention Is All You Need
Attention with RNN

Repeat at every decoder timestep:
Inputs:
- Query vector: $q=s \in \R^{D_Q}$
- Input vectors: $X \in \R^{N_X \times D_X}$
- Similarity function: $f_\text{att}$
Computation:
- Similarities: $e \in \R^{N_X}$, where $e_i = f_\text{att}(q, X_i)$
- Attention weights: $a = \text{softmax}(e) \in \R^{N_X}$
- Output vector: $y = \sum_i a_i X_i \in \R^{D_X}$
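
A minimal NumPy sketch of this computation; the function name, shapes, and the dot-product choice for $f_\text{att}$ are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def attention(q, X, f_att):
    # Similarities e_i = f_att(q, X_i), shape (N_X,)
    e = np.array([f_att(q, X_i) for X_i in X])
    # Attention weights a = softmax(e), shape (N_X,)
    a = np.exp(e - e.max())
    a /= a.sum()
    # Output vector y = sum_i a_i * X_i, shape (D_X,)
    return a @ X

# Toy example with dot-product similarity (assumes D_Q == D_X)
rng = np.random.default_rng(0)
q = rng.normal(size=4)          # query, D_Q = 4
X = rng.normal(size=(5, 4))     # inputs, N_X = 5, D_X = 4
y = attention(q, X, f_att=np.dot)
```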
Attention Layer

Changes:
- Use the scaled dot product as the similarity function
- Multiple query vectors (→ a query matrix)
- Separate key and value: reusing the same $X$ for both roles may be non-optimal…
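
A sketch with all three changes applied (scaled dot-product similarity, a matrix $Q$ of queries, learned key/value projections). The weight names `W_K`, `W_V` and all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(E, axis=-1):
    E = E - E.max(axis=axis, keepdims=True)
    E = np.exp(E)
    return E / E.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, X, W_K, W_V):
    K = X @ W_K                          # keys,   shape (N_X, D_Q)
    V = X @ W_V                          # values, shape (N_X, D_V)
    E = Q @ K.T / np.sqrt(Q.shape[-1])   # scaled similarities, shape (N_Q, N_X)
    A = softmax(E, axis=-1)              # attention weights, one row per query
    return A @ V                         # outputs, shape (N_Q, D_V)

# Toy shapes (illustrative)
rng = np.random.default_rng(0)
N_Q, N_X, D_X, D_Q, D_V = 3, 5, 4, 4, 6
Q = rng.normal(size=(N_Q, D_Q))
X = rng.normal(size=(N_X, D_X))
W_K = rng.normal(size=(D_X, D_Q))
W_V = rng.normal(size=(D_X, D_V))
Y = scaled_dot_product_attention(Q, X, W_K, W_V)   # shape (3, 6)
```

Scaling by $\sqrt{D_Q}$ keeps the similarities from growing with the dimension, which would otherwise push the softmax toward near one-hot weights and shrink its gradients.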
Self-Attention Layer