You got me looking for attention~
You got me looking for attention~
Neural Machine Translation by Jointly Learning to Align and Translate
An encoder reads the input sentence, a sequence of vectors $\mathbf x = (x_1,...,x_{T_x})$, into a vector $c$. The most common approach is to use an RNN:
$$ h_t = f(x_t,h_{t-1}) \\ c=q(\{h_1,...,h_{T_x}\}) $$
where $h_t \in \mathbb{R}^n$ is the hidden state at time $t$, and $c$ is a vector generated from the sequence of hidden states. $f$ and $q$ are some nonlinear functions. For instance, one can use an LSTM as $f$ and $q(\{h_1,...,h_{T_x}\})=h_{T_x}$.
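To make the encoder concrete, here is a minimal NumPy sketch, assuming a plain tanh RNN for $f$ and $q(\{h_1,...,h_{T_x}\})=h_{T_x}$; the function and weight names (`encode`, `W_xh`, `W_hh`) are illustrative, not from the paper.

```python
import numpy as np

def encode(x_seq, W_xh, W_hh, b_h):
    """Tanh-RNN encoder: h_t = f(x_t, h_{t-1}); c = q({h_1,...,h_Tx}) = h_Tx."""
    h = np.zeros(W_hh.shape[0])            # h_0
    hs = []
    for x_t in x_seq:                      # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    hs = np.stack(hs)                      # all hidden states, shape (T_x, n)
    c = hs[-1]                             # summary vector: the last hidden state
    return hs, c

# toy usage: T_x = 5 input vectors of size 4, hidden size n = 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
W_xh, W_hh, b_h = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8)
hs, c = encode(x, W_xh, W_hh, b_h)
print(hs.shape, c.shape)                   # (5, 8) (8,)
```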
The decoder is often trained to predict the next word $y_{t'}$, given the context vector $c$ and all the previously predicted words $\{y_1,...,y_{t'-1}\}$. In other words, the decoder defines a probability over the translation $\mathbf y$ by decomposing the joint probability into the ordered conditionals:
$$ p(\mathbf y)=\prod^{T_y}_{t=1}p(y_t|\{y_1,...,y_{t-1}\}, c) $$
where $\mathbf y=(y_1,...,y_{T_y})$.
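For a concrete instance, a three-word translation factorizes as
$$ p(\mathbf y)=p(y_1|c)\,p(y_2|\{y_1\},c)\,p(y_3|\{y_1,y_2\},c) $$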
With an RNN, each conditional probability is modeled as
$$ p(y_t|\{y_1,...,y_{t-1}\}, c)=g(y_{t-1}, s_t, c) $$
where $g$ is a nonlinear, potentially multi-layered, function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN.
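As a rough sketch of one decoder step (NumPy again, with illustrative weight names; the single softmax layer below stands in for the paper's deeper output network $g$):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                # numerically stabilized softmax
    return e / e.sum()

def decode_step(y_prev, s_prev, c, W_ys, W_ss, W_cs, W_out, b_s, b_out):
    """One decoder step: s_t = f(s_{t-1}, y_{t-1}, c); p(y_t | y_<t, c) = g(y_{t-1}, s_t, c)."""
    # f: tanh RNN over the previous state, previous target embedding, and the context vector
    s_t = np.tanh(W_ss @ s_prev + W_ys @ y_prev + W_cs @ c + b_s)
    # g: a softmax over the target vocabulary, fed with (y_{t-1}, s_t, c)
    p_y = softmax(W_out @ np.concatenate([y_prev, s_t, c]) + b_out)
    return s_t, p_y

# toy usage: embedding size 4, decoder state size 6, context size 8, vocabulary size 10
rng = np.random.default_rng(0)
y_prev, s_prev, c = rng.normal(size=4), rng.normal(size=6), rng.normal(size=8)
W_ys, W_ss, W_cs = rng.normal(size=(6, 4)), rng.normal(size=(6, 6)), rng.normal(size=(6, 8))
W_out, b_s, b_out = rng.normal(size=(10, 4 + 6 + 8)), np.zeros(6), np.zeros(10)
s_t, p_y = decode_step(y_prev, s_prev, c, W_ys, W_ss, W_cs, W_out, b_s, b_out)
print(p_y.sum())                            # ≈ 1.0: a distribution over the 10-word vocabulary
```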
For a decoder time step $t$ and an encoder position $i \in [1,T_x]$:
RNN hidden state: $s_t = f(s_{t-1}, y_{t-1}, c_t)$
Alignment score: $e_{t,i}=a(s_{t-1}, h_i)$
→ scores how well the inputs around position $i$ and the output at position $t$ match. Unlike the fixed $c$ above, the context vector $c_t$ is now computed separately for each target word from these scores (see the sketch below).
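A hedged NumPy sketch of this attention step, assuming the paper's additive alignment model $a(s_{t-1},h_i)=v_a^\top\tanh(W_a s_{t-1}+U_a h_i)$: the scores $e_{t,i}$ are softmax-normalized into weights $\alpha_{t,i}$, and $c_t$ is the resulting weighted sum of the encoder states $h_i$ (that is how the paper turns the scores into a per-step context vector). Weight names and dimensions here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, hs, W_a, U_a, v_a):
    """Compute e_{t,i} = a(s_{t-1}, h_i), weights alpha_{t,i}, and context c_t."""
    # additive alignment model: a(s, h) = v_a^T tanh(W_a s + U_a h)
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_i) for h_i in hs])
    alpha = softmax(e)                     # alpha_{t,i}: how much h_i contributes at step t
    c_t = alpha @ hs                       # c_t = sum_i alpha_{t,i} h_i
    return c_t, alpha

# toy usage: T_x = 5 encoder states of size 8, decoder state of size 6
rng = np.random.default_rng(0)
hs, s_prev = rng.normal(size=(5, 8)), rng.normal(size=6)
W_a, U_a, v_a = rng.normal(size=(10, 6)), rng.normal(size=(10, 8)), rng.normal(size=10)
c_t, alpha = attention_context(s_prev, hs, W_a, U_a, v_a)
print(c_t.shape, alpha.sum())              # (8,) 1.0
```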