Paper

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Basic Architecture

스크린샷 2023-04-25 오전 11.40.17.png

Goal: given a single raw image, generate a caption $\mathbf y$, encoded as a sequence of 1-of-K (one-hot) encoded words

$\mathbf y=\{\mathbf y_1,\ldots,\mathbf y_C\},\quad \mathbf y_i\in\mathbb{R}^K$

K: vocabulary size, C: length of the caption
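
To make the 1-of-K encoding concrete, here is a minimal sketch in plain Python. The toy vocabulary, the example caption, and the helper name `one_hot_caption` are assumptions made up for illustration; they do not come from the paper. Each word becomes a length-K vector with a single 1, so a caption of C words becomes a C × K array.

```python
def one_hot_caption(words, vocab):
    """Encode a caption as a list of C one-hot vectors, each of length K."""
    index = {word: i for i, word in enumerate(vocab)}  # word -> position in vocab
    K = len(vocab)
    caption = []
    for word in words:
        vec = [0] * K          # all zeros ...
        vec[index[word]] = 1   # ... except a single 1 at the word's index
        caption.append(vec)
    return caption


# Hypothetical toy vocabulary (K = 5) and caption (C = 4), for illustration only.
vocab = ["<start>", "a", "bird", "flying", "<end>"]
y = one_hot_caption(["a", "bird", "flying", "<end>"], vocab)

for y_i in y:
    print(y_i)
```

Here `len(y)` plays the role of C and `len(y[0])` the role of K. A real vocabulary would be far larger, and a word missing from `vocab` raises a `KeyError` in this sketch.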

스크린샷 2023-04-25 오전 11.41.53.png

Repeat…

스크린샷 2023-04-25 오전 11.54.04.png

스크린샷 2023-04-25 오후 2.03.41.png