Contents
Dynamic Programming (DP)
- A collection of algorithms that can be used to compute optimal policies, given a perfect model of the environment as a Markov Decision Process (MDP)
- Problem types:
- Finite horizon (episodic) vs. infinite horizon (continuing)
- Deterministic vs. stochastic
- Suboptimal solution to DP = approximate DP = neuro-dynamic programming = reinforcement learning
Problem formulation
- Find a policy $\pi$ that maximizes the expected reward sum, i.e.,
$$
\max_{a_t\in A_t(s_t)} E \left\{ \sum_t r_t(s_t, a_t) \right\}
$$
- An action sequence $\{a^*_0,\dots,a^*_t,\dots,a^*_{T-1}\}$ is optimal → the subsequence $\{a^*_t,\dots,a^*_{T-1}\}$ is optimal for the tail subproblem starting from $t$

DP algorithm (Backward induction)
Based on the principle of optimality (tail subproblem)
$$
v^*(s_t)=\max_{a_t\in A_t(s_t)} E \left[ r_t(s_t,a_t)+v^*(s_{t+1}\mid s_t,a_t) \right] \text{ for all possible } s_t
$$
- $v^*(s_{t+1}\mid s_t,a_t)$ → the optimal value of the tail subproblem starting from $s_{t+1}$ (see the backward-induction sketch below)
However,
- Requires the model (MDP kernel)
- Suffers from the curse of dimensionality
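As a concrete illustration of the backward-induction recursion above, here is a minimal Python sketch. It assumes a small finite-horizon MDP whose kernel is fully known and supplied as hypothetical inputs `P` (transition probabilities) and `R` (expected rewards $r_t(s_t,a_t)$); the toy numbers at the bottom are illustrative only.

```python
# Minimal backward-induction sketch for a finite-horizon MDP with a known kernel.
# P and R are hypothetical, illustrative inputs:
#   P[t][s][a] is a dict {next_state: probability}, R[t][s][a] is r_t(s, a).
import numpy as np

def backward_induction(P, R, num_states, num_actions, horizon):
    """Compute v*_t(s) and a greedy policy via v*_t(s) = max_a E[r_t(s,a) + v*_{t+1}(s')]."""
    v = np.zeros((horizon + 1, num_states))       # v[horizon] = 0 (no terminal reward)
    policy = np.zeros((horizon, num_states), dtype=int)
    for t in reversed(range(horizon)):            # solve tail subproblems first
        for s in range(num_states):
            q = np.empty(num_actions)
            for a in range(num_actions):
                expected_next = sum(p * v[t + 1][s_next]
                                    for s_next, p in P[t][s][a].items())
                q[a] = R[t][s][a] + expected_next
            v[t][s] = q.max()
            policy[t][s] = q.argmax()
    return v, policy

# Toy 2-state, 2-action example (hypothetical numbers), horizon 3.
H, S, A = 3, 2, 2
P = [[[{0: 0.9, 1: 0.1}, {1: 1.0}],               # transitions from state 0
      [{0: 0.5, 1: 0.5}, {1: 1.0}]]               # transitions from state 1
     for _ in range(H)]
R = [[[1.0, 0.0],                                 # rewards r_t(s, a)
      [0.0, 2.0]]
     for _ in range(H)]
v, pi = backward_induction(P, R, S, A, H)
print("v*_0:", v[0], "greedy policy at t=0:", pi[0])
```

Note that the sketch needs the full kernel `P` and sweeps every state at every stage, which is exactly the model requirement and curse of dimensionality mentioned above.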
Prediction
Without MDP kernel
- The DP method requires knowledge of the environment
- e.g., state transition probabilities, reward distributions
- How to estimate the value of a policy w/o the MDP kernel? → the prediction problem (see the Monte Carlo sketch below)
- Our goal is to find an optimal policy
- How to estimate the optimal value w/o MDP kernel?
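One standard model-free answer to the prediction problem is Monte Carlo estimation: average the returns observed after visiting each state, using only sampled episodes and no transition probabilities. Below is a minimal first-visit Monte Carlo prediction sketch; `sample_episode` is a hypothetical stand-in for interacting with the real environment under a fixed policy, and the toy chain at the bottom exists only to make the example runnable.

```python
# Minimal first-visit Monte Carlo prediction sketch: estimate v_pi(s) from
# sampled episodes only, without the MDP kernel.
import random
from collections import defaultdict

def mc_prediction(sample_episode, num_episodes, gamma=1.0):
    """Estimate v_pi(s) as the average return following first visits to s."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(num_episodes):
        episode = sample_episode()                # [(s_t, r_{t+1}), ...]
        # Compute the return G_t following each time step, working backwards.
        G, returns_after = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns_after[t] = G
        # First-visit update: only the first occurrence of each state counts.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns_after[t]
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

def sample_episode():
    # Hypothetical toy chain: start in state 0; each step, move to the terminal
    # state with probability 0.5 (reward 1) or stay in state 0 (reward 0).
    episode, s = [], 0
    while s == 0:
        if random.random() < 0.5:
            episode.append((0, 1.0))
            s = 1
        else:
            episode.append((0, 0.0))
    return episode

print(mc_prediction(sample_episode, num_episodes=5000))  # v(0) should be close to 1.0
```

The same episode data can also be used incrementally (temporal-difference prediction); the key point is that only sampled transitions, not the kernel, are needed to estimate values.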