Contents
Basic procedure
Gathering experience samples
Learning to estimate a value function from those samples (prediction)
Improving a policy
Generalized Policy Iteration (GPI)
The policy-iteration idea, independent of the details of the policy evaluation (PE) and policy improvement (PI) processes (loop sketched below)
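A minimal sketch of the GPI loop, with the PE and PI steps passed in as arbitrary callables (`evaluate` and `improve` are hypothetical placeholders, not names from these notes):

```python
def generalized_policy_iteration(evaluate, improve, policy, n_iters=100):
    """GPI skeleton: `evaluate` and `improve` can be any PE / PI procedure
    (e.g. MC or TD prediction, and (epsilon-)greedy improvement)."""
    values = None
    for _ in range(n_iters):
        values = evaluate(policy)      # policy evaluation (PE)
        new_policy = improve(values)   # policy improvement (PI)
        if new_policy == policy:       # stop once the policy is stable
            break
        policy = new_policy
    return policy, values
```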
Monte Carlo Control
Policy evaluation = Monte Carlo prediction → the policy is improved after each episode
Policy improvement = $\epsilon$-greedy policy
Estimating “action-value” function $q_\pi(s,a)$ (Q-factor)
Rather than state-value function $v_\pi(s)$
Allows model-free RL (no MDP kernel required)
The MDP kernel is only needed when deriving q-values from v-values; see the relation below
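For reference, the standard one-step relation makes this explicit: computing $q_\pi$ from $v_\pi$ requires the transition kernel $p(s', r \mid s, a)$, whereas greedy improvement from $q_\pi$ needs no model at all:

$$
q_\pi(s,a) = \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr],
\qquad
\pi'(s) = \arg\max_a q_\pi(s,a)
$$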
vs. Policy Iteration
Algorithm
Backward view: returns are computed by sweeping each episode from its last step (sketch below)
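A rough sketch of every-visit Monte Carlo control with an $\epsilon$-greedy policy and the backward-view return computation, assuming a classic gym-style environment whose `reset()` returns a state and whose `step()` returns `(next_state, reward, done, info)`:

```python
import random
from collections import defaultdict

def mc_control(env, n_episodes=10_000, gamma=0.99, epsilon=0.1):
    """Every-visit Monte Carlo control with an epsilon-greedy policy (sketch)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)      # action-value estimates q(s, a)
    counts = defaultdict(lambda: [0] * n_actions)   # visit counts for incremental averaging

    def act(state):
        if random.random() < epsilon:               # explore with probability epsilon
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[state][a])  # greedy action

    for _ in range(n_episodes):
        # 1) Generate one full episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = act(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2) Backward view: sweep the episode from the end, accumulating the return G.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            counts[state][action] += 1
            # Incremental mean update of Q(s, a) toward the observed return G.
            Q[state][action] += (G - Q[state][action]) / counts[state][action]

    return Q
```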
Drawback of MC control
Must wait until an episode ends before updating → learning is slow
What if we use TD prediction instead of MC prediction?
SARSA (State-Action-Reward-State-Action)
Uses TD prediction → the policy is improved after every step (sketch below)
Bootstrapping on the current estimate → less variance but more bias
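A corresponding SARSA sketch under the same gym-style assumptions as above; the Q-value is updated after every step from the $(S, A, R, S', A')$ tuple, bootstrapping on the current estimate of $Q(S', A')$:

```python
import random
from collections import defaultdict

def sarsa(env, n_episodes=10_000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """SARSA: on-policy TD control, updating Q after every step (sketch)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)

    def act(state):
        if random.random() < epsilon:               # epsilon-greedy behavior policy
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[state][a])

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = act(state)
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = act(next_state)
            # TD target bootstraps on the current estimate Q(S', A'):
            # lower variance than a full MC return, but biased by that estimate.
            td_target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (td_target - Q[state][action])
            state, action = next_state, next_action

    return Q
```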