You got me looking for attention~
You got me looking for attention~
Neural Machine Translation by Jointly Learning to Align and Translate
An encoder reads the input sentence, a sequence of vectors $\mathbf x = (x_1,...,x_{T_x})$, into a vector $c$. The most common approach is to use an RNN:
$$ h_t = f(x_t,h_{t-1}) \\ c=q(\{h_1,...,h_{T_x}\}) $$
where $h_t \in \mathbb{R}^n$ is the hidden state at time $t$, and $c$ is a vector generated from the sequence of hidden states. $f$ and $q$ are some nonlinear functions. For instance, one can use an LSTM as $f$ and $q(\{h_1,...,h_{T_x}\})=h_{T_x}$.
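To make the encoder concrete, here is a minimal NumPy sketch, assuming a plain tanh RNN for $f$ and $q(\{h_1,...,h_{T_x}\})=h_{T_x}$; the function and weight names (`encode`, `W_xh`, `W_hh`) are illustrative, not from the paper.

```python
import numpy as np

def encode(x_seq, W_xh, W_hh, b_h):
    """Tanh-RNN encoder: h_t = f(x_t, h_{t-1}); c = q({h_1,...,h_Tx}) = h_Tx."""
    h = np.zeros(W_hh.shape[0])            # h_0
    hs = []
    for x_t in x_seq:                      # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    hs = np.stack(hs)                      # all hidden states, shape (T_x, n)
    c = hs[-1]                             # summary vector: the last hidden state
    return hs, c

# toy usage: T_x = 5 input vectors of size 4, hidden size n = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
W_xh, W_hh, b_h = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8)
hs, c = encode(x, W_xh, W_hh, b_h)
print(hs.shape, c.shape)                   # (5, 8) (8,)
```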
The decoder is often trained to predict the next word $y_{t'}$, given the context vector $c$ and all the previously predicted words $\{y_1,...,y_{t'-1}\}$. In other words, the decoder defines a probability over the translation $\mathbf y$ by decomposing the joint probability into the ordered conditionals:
$$ p(\mathbf y)=\prod^{T_y}_{t=1}p(y_t|\{y_1,...,y_{t-1}\}, c) $$
where $\mathbf y=(y_1,...,y_{T_y})$.
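For a concrete instance, a three-word translation factorizes as
$$ p(\mathbf y)=p(y_1|c)\,p(y_2|\{y_1\},c)\,p(y_3|\{y_1,y_2\},c) $$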
With an RNN, each conditional probability is modeled as
$$ p(y_t|\{y_1,...,y_{t-1}\}, c)=g(y_{t-1}, s_t, c) $$
where $g$ is a nonlinear, potentially multi-layered, function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN.
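As a rough sketch of one decoder step (NumPy again, with illustrative weight names; the single softmax layer below stands in for the paper's deeper output network $g$):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                # numerically stabilized softmax
    return e / e.sum()

def decode_step(y_prev, s_prev, c, W_ys, W_ss, W_cs, W_out, b_s, b_out):
    """One decoder step: s_t = f(s_{t-1}, y_{t-1}, c); p(y_t | y_<t, c) = g(y_{t-1}, s_t, c)."""
    # f: tanh RNN over the previous state, previous target embedding, and the context vector
    s_t = np.tanh(W_ss @ s_prev + W_ys @ y_prev + W_cs @ c + b_s)
    # g: a softmax over the target vocabulary, fed with (y_{t-1}, s_t, c)
    p_y = softmax(W_out @ np.concatenate([y_prev, s_t, c]) + b_out)
    return s_t, p_y

# toy usage: embedding size 4, decoder state size 6, context size 8, vocabulary size 10
rng = np.random.default_rng(0)
y_prev, s_prev, c = rng.normal(size=4), rng.normal(size=6), rng.normal(size=8)
W_ys, W_ss, W_cs = rng.normal(size=(6, 4)), rng.normal(size=(6, 6)), rng.normal(size=(6, 8))
W_out, b_s, b_out = rng.normal(size=(10, 4 + 6 + 8)), np.zeros(6), np.zeros(10)
s_t, p_y = decode_step(y_prev, s_prev, c, W_ys, W_ss, W_cs, W_out, b_s, b_out)
print(p_y.sum())                            # ≈ 1.0: a distribution over the 10-word vocabulary
```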
For a decoder time step $t$ and an encoder position $i \in [1,T_x]$:
RNN hidden state: $s_t = f(s_{t-1}, y_{t-1}, c_t)$
Alignment score: $e_{t,i}=a(s_{t-1}, h_i)$
→ scores how well the inputs around position $i$ and the output at position $t$ match. Unlike the fixed $c$ above, the context vector $c_t$ is now computed separately for each target word from these scores (see the sketch below).
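A hedged NumPy sketch of this attention step, assuming the paper's additive alignment model $a(s_{t-1},h_i)=v_a^\top\tanh(W_a s_{t-1}+U_a h_i)$: the scores $e_{t,i}$ are softmax-normalized into weights $\alpha_{t,i}$, and $c_t$ is the resulting weighted sum of the encoder states $h_i$ (that is how the paper turns the scores into a per-step context vector). Weight names and dimensions here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, hs, W_a, U_a, v_a):
    """Compute e_{t,i} = a(s_{t-1}, h_i), weights alpha_{t,i}, and context c_t."""
    # additive alignment model: a(s, h) = v_a^T tanh(W_a s + U_a h)
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_i) for h_i in hs])
    alpha = softmax(e)                     # alpha_{t,i}: how much h_i contributes at step t
    c_t = alpha @ hs                       # c_t = sum_i alpha_{t,i} h_i
    return c_t, alpha

# toy usage: T_x = 5 encoder states of size 8, decoder state of size 6
rng = np.random.default_rng(0)
hs, s_prev = rng.normal(size=(5, 8)), rng.normal(size=6)
W_a, U_a, v_a = rng.normal(size=(10, 6)), rng.normal(size=(10, 8)), rng.normal(size=10)
c_t, alpha = attention_context(s_prev, hs, W_a, U_a, v_a)
print(c_t.shape, alpha.sum())              # (8,) 1.0
```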