Contents
Basic procedure
Gathering experience samples
Learning to estimate a value function from those samples (prediction)
Improving a policy
Generalized Policy Iteration (GPI)
The policy-iteration idea, independent of the details of the policy evaluation (PE) and policy improvement (PI) processes (loop sketched below)
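A minimal sketch of the GPI loop, with the PE and PI steps passed in as arbitrary callables (`evaluate` and `improve` are hypothetical placeholders, not names from these notes):

```python
def generalized_policy_iteration(evaluate, improve, policy, n_iters=100):
    """GPI skeleton: `evaluate` and `improve` can be any PE / PI procedure
    (e.g. MC or TD prediction, and (epsilon-)greedy improvement)."""
    values = None
    for _ in range(n_iters):
        values = evaluate(policy)      # policy evaluation (PE)
        new_policy = improve(values)   # policy improvement (PI)
        if new_policy == policy:       # stop once the policy is stable
            break
        policy = new_policy
    return policy, values
```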
Monte Carlo Control
Policy evaluation = Monte Carlo prediction → the policy is improved after each episode
Policy improvement = $\epsilon$-greedy policy
Estimating “action-value” function $q_\pi(s,a)$ (Q-factor)
Rather than state-value function $v_\pi(s)$
Allows model-free RL (no MDP kernel required)
The MDP kernel is only needed when deriving q-values from v-values; see the relation below
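For reference, the standard one-step relation makes this explicit: computing $q_\pi$ from $v_\pi$ requires the transition kernel $p(s', r \mid s, a)$, whereas greedy improvement from $q_\pi$ needs no model at all:

$$
q_\pi(s,a) = \sum_{s',\,r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr],
\qquad
\pi'(s) = \arg\max_a q_\pi(s,a)
$$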
vs. Policy Iteration
Algorithm
Backward view: returns are computed by sweeping each episode from its last step (sketch below)
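A rough sketch of every-visit Monte Carlo control with an $\epsilon$-greedy policy and the backward-view return computation, assuming a classic gym-style environment whose `reset()` returns a state and whose `step()` returns `(next_state, reward, done, info)`:

```python
import random
from collections import defaultdict

def mc_control(env, n_episodes=10_000, gamma=0.99, epsilon=0.1):
    """Every-visit Monte Carlo control with an epsilon-greedy policy (sketch)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)      # action-value estimates q(s, a)
    counts = defaultdict(lambda: [0] * n_actions)   # visit counts for incremental averaging

    def act(state):
        if random.random() < epsilon:               # explore with probability epsilon
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[state][a])  # greedy action

    for _ in range(n_episodes):
        # 1) Generate one full episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            action = act(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # 2) Backward view: sweep the episode from the end, accumulating the return G.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            counts[state][action] += 1
            # Incremental mean update of Q(s, a) toward the observed return G.
            Q[state][action] += (G - Q[state][action]) / counts[state][action]

    return Q
```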
Drawback of MC control
Must wait until an episode ends before updating → learning is slow
What if we use TD prediction instead of MC prediction?
SARSA (State-Action-Reward-State-Action)
Uses TD prediction → the policy is improved after every step (sketch below)
Bootstrapping on the current estimate → less variance but more bias
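A corresponding SARSA sketch under the same gym-style assumptions as above; the Q-value is updated after every step from the $(S, A, R, S', A')$ tuple, bootstrapping on the current estimate of $Q(S', A')$:

```python
import random
from collections import defaultdict

def sarsa(env, n_episodes=10_000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """SARSA: on-policy TD control, updating Q after every step (sketch)."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: [0.0] * n_actions)

    def act(state):
        if random.random() < epsilon:               # epsilon-greedy behavior policy
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[state][a])

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = act(state)
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = act(next_state)
            # TD target bootstraps on the current estimate Q(S', A'):
            # lower variance than a full MC return, but biased by that estimate.
            td_target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (td_target - Q[state][action])
            state, action = next_state, next_action

    return Q
```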