Goal: maximize the expected total reward (the sum of rewards over time)
Key difficulties in learning → Feedback is sequential, evaluative, and sampled
Sequential
→ Current actions affect future states and rewards, so the feedback for a decision may only arrive many steps later (credit assignment)
Evaluative
The goodness of feedback is only relative — no correct answer is given
→ must explore to understand the goodness of feedback
→ exploration incurs some opportunity loss (increases regret), but without exploring we cannot know which actions are good (see the bandit sketch below)
Opposite: supervised feedback — the correct answer is provided during learning
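A minimal sketch of evaluative feedback and the exploration–regret trade-off, as an ε-greedy bandit; the arm means, exploration rate, and horizon below are illustrative assumptions, not from the notes:

```python
import random

# Evaluative feedback sketch: a 3-armed bandit with hypothetical true mean
# rewards. The learner only sees a sampled reward for the arm it pulls,
# so it must explore to find the best arm; pulls of suboptimal arms
# accumulate regret (opportunity loss).
true_means = [0.2, 0.5, 0.8]          # unknown to the learner (assumed values)
eps = 0.1                             # exploration rate (assumed value)
counts = [0, 0, 0]
estimates = [0.0, 0.0, 0.0]
regret = 0.0

for t in range(10_000):
    if random.random() < eps:                           # explore
        a = random.randrange(len(true_means))
    else:                                               # exploit current estimate
        a = max(range(len(estimates)), key=estimates.__getitem__)
    r = 1.0 if random.random() < true_means[a] else 0.0  # sampled Bernoulli reward
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]       # incremental mean update
    regret += max(true_means) - true_means[a]            # opportunity loss this step

print(f"estimates={estimates}, cumulative regret={regret:.1f}")
```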
Sampled
→ need generalization
The state space is too large and complex to cover by sampling, so only a fraction of states is ever seen; when a new state appears, its value is predicted from similar previously seen states → a regression problem
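A toy sketch of this regression view, assuming a 1-D state space and a hypothetical k-nearest-neighbor predictor: values are known only at sampled states, and a new state's value is estimated from the most similar ones:

```python
# Generalization sketch: value samples exist only at visited states, so the
# value of a new state is predicted from the most similar states seen before
# (here: the mean over the k nearest sampled states). All numbers are
# illustrative assumptions.
sampled_states = [0.0, 1.0, 2.0, 3.0]      # states seen during sampling
sampled_values = [1.0, 0.8, 0.5, 0.1]      # observed returns at those states

def predict_value(s, k=2):
    """Regression view: average the values of the k most similar states."""
    nearest = sorted(range(len(sampled_states)),
                     key=lambda i: abs(sampled_states[i] - s))[:k]
    return sum(sampled_values[i] for i in nearest) / k

print(predict_value(1.4))   # unseen state: estimated from states 1.0 and 2.0
```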
Approximation in value space
$$ v_{t+1}(S_t)=\max_{a\in \mathcal A_t} E[r_t+\gamma v_t(S_{t+1})\mid S_t,\, a] $$
→ Approximate $v_t(S_{t+1})$ by $\tilde v_t(S_{t+1})$ using a neural network
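A minimal tabular sketch of one such backup, on a made-up two-state MDP (all transition probabilities and rewards are illustrative assumptions); when the state space is too large to enumerate like this, $v_t$ on the right-hand side is replaced by the learned $\tilde v_t$:

```python
# One-step lookahead backup on a tiny tabular MDP:
#   v_next(s) = max_a sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma * v(s'))
gamma = 0.9

# P[s][a] is a list of (prob, next_state, reward) triples — hypothetical MDP.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(0.5, 0, 1.0), (0.5, 1, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.5)]},
}
v = {0: 0.0, 1: 0.0}                   # current value estimate v_t

def backup(s):
    """Max over actions of the expected one-step return at state s."""
    return max(sum(p * (r + gamma * v[s2]) for p, s2, r in transitions)
               for transitions in P[s].values())

v_next = {s: backup(s) for s in v}     # v_{t+1}
print(v_next)
```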
Choose $\tilde v_t$ from a parametric class of functions
The family of functions $\tilde v_t(s;\theta_t)$, where $\theta_t=(\theta_{t,1},\dots,\theta_{t,m})$ is a vector of tunable scalar parameters, is called an approximation architecture
ex) Gaussian function
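As a sketch of such an architecture, a linear combination of Gaussian basis functions, $\tilde v(s;\theta)=\sum_i \theta_i \exp(-(s-c_i)^2/(2\sigma^2))$; the centers, width, and parameter values below are illustrative assumptions:

```python
import math

# Approximation architecture sketch: tilde_v(s; theta) is a weighted sum of
# Gaussian basis functions. Centers c_i, width sigma, and theta are assumed.
centers = [0.0, 1.0, 2.0, 3.0]       # basis-function centers c_i
sigma = 0.5                          # shared width
theta = [1.0, 0.8, 0.5, 0.1]         # tunable scalar parameters theta_{t,i}

def tilde_v(s, theta):
    """Evaluate the Gaussian approximation architecture at state s."""
    return sum(th * math.exp(-((s - c) ** 2) / (2 * sigma ** 2))
               for th, c in zip(theta, centers))

print(tilde_v(1.4, theta))           # value estimate at an arbitrary state
```

Note that this architecture is linear in $\theta_t$, so tuning the parameters on sampled values reduces to a linear regression problem, matching the regression view above.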