Goal: maximize the expected total reward (the sum of rewards over time)
Key difficulties in learning → Feedback is sequential, evaluative, and sampled
Sequential
→ Current actions affect future states and rewards, so the feedback for a decision may only arrive many steps later (credit assignment)
Evaluative
The goodness of feedback is only relative — no correct answer is given
→ must explore to understand the goodness of feedback
→ exploration incurs some opportunity loss (increases regret), but without exploring we cannot know which actions are good (see the bandit sketch below)
Opposite: supervised feedback — the correct answer is provided during learning
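A minimal sketch of evaluative feedback and the exploration–regret trade-off, as an ε-greedy bandit; the arm means, exploration rate, and horizon below are illustrative assumptions, not from the notes:

```python
import random

# Evaluative feedback sketch: a 3-armed bandit with hypothetical true mean
# rewards. The learner only sees a sampled reward for the arm it pulls,
# so it must explore to find the best arm; pulls of suboptimal arms
# accumulate regret (opportunity loss).
true_means = [0.2, 0.5, 0.8]          # unknown to the learner (assumed values)
eps = 0.1                             # exploration rate (assumed value)
counts = [0, 0, 0]
estimates = [0.0, 0.0, 0.0]
regret = 0.0

for t in range(10_000):
    if random.random() < eps:                           # explore
        a = random.randrange(len(true_means))
    else:                                               # exploit current estimate
        a = max(range(len(estimates)), key=estimates.__getitem__)
    r = 1.0 if random.random() < true_means[a] else 0.0  # sampled Bernoulli reward
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]       # incremental mean update
    regret += max(true_means) - true_means[a]            # opportunity loss this step

print(f"estimates={estimates}, cumulative regret={regret:.1f}")
```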
Sampled
→ need generalization
The state space is too large and complex to cover by sampling, so only a fraction of states is ever seen; when a new state appears, its value is predicted from similar previously seen states → a regression problem
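A toy sketch of this regression view, assuming a 1-D state space and a hypothetical k-nearest-neighbor predictor: values are known only at sampled states, and a new state's value is estimated from the most similar ones:

```python
# Generalization sketch: value samples exist only at visited states, so the
# value of a new state is predicted from the most similar states seen before
# (here: the mean over the k nearest sampled states). All numbers are
# illustrative assumptions.
sampled_states = [0.0, 1.0, 2.0, 3.0]      # states seen during sampling
sampled_values = [1.0, 0.8, 0.5, 0.1]      # observed returns at those states

def predict_value(s, k=2):
    """Regression view: average the values of the k most similar states."""
    nearest = sorted(range(len(sampled_states)),
                     key=lambda i: abs(sampled_states[i] - s))[:k]
    return sum(sampled_values[i] for i in nearest) / k

print(predict_value(1.4))   # unseen state: estimated from states 1.0 and 2.0
```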
Approximation in value space
$$ v_{t+1}(S_t)=\max_{a\in \mathcal A_t} E[r_t+\gamma v_t(S_{t+1})\mid S_t,\, a] $$
→ Approximate $v_t(S_{t+1})$ by $\tilde v_t(S_{t+1})$ using a neural network
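A minimal tabular sketch of one such backup, on a made-up two-state MDP (all transition probabilities and rewards are illustrative assumptions); when the state space is too large to enumerate like this, $v_t$ on the right-hand side is replaced by the learned $\tilde v_t$:

```python
# One-step lookahead backup on a tiny tabular MDP:
#   v_next(s) = max_a sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma * v(s'))
gamma = 0.9

# P[s][a] is a list of (prob, next_state, reward) triples — hypothetical MDP.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(0.5, 0, 1.0), (0.5, 1, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.5)]},
}
v = {0: 0.0, 1: 0.0}                   # current value estimate v_t

def backup(s):
    """Max over actions of the expected one-step return at state s."""
    return max(sum(p * (r + gamma * v[s2]) for p, s2, r in transitions)
               for transitions in P[s].values())

v_next = {s: backup(s) for s in v}     # v_{t+1}
print(v_next)
```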
Choose $\tilde v_t$ from a parametric class of functions
The family of functions $\tilde v_t(s;\theta_t)$, where $\theta_t=(\theta_{t,1},\dots,\theta_{t,m})$ is a vector of tunable scalar parameters, is called an approximation architecture
ex) Gaussian function
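As a sketch of such an architecture, a linear combination of Gaussian basis functions, $\tilde v(s;\theta)=\sum_i \theta_i \exp(-(s-c_i)^2/(2\sigma^2))$; the centers, width, and parameter values below are illustrative assumptions:

```python
import math

# Approximation architecture sketch: tilde_v(s; theta) is a weighted sum of
# Gaussian basis functions. Centers c_i, width sigma, and theta are assumed.
centers = [0.0, 1.0, 2.0, 3.0]       # basis-function centers c_i
sigma = 0.5                          # shared width
theta = [1.0, 0.8, 0.5, 0.1]         # tunable scalar parameters theta_{t,i}

def tilde_v(s, theta):
    """Evaluate the Gaussian approximation architecture at state s."""
    return sum(th * math.exp(-((s - c) ** 2) / (2 * sigma ** 2))
               for th, c in zip(theta, centers))

print(tilde_v(1.4, theta))           # value estimate at an arbitrary state
```

Note that this architecture is linear in $\theta_t$, so tuning the parameters on sampled values reduces to a linear regression problem, matching the regression view above.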