Contents
Dynamic Programming (DP)
- A collection of algorithms that can be used to compute optimal policies, given a perfect model of the environment as a Markov Decision Process (MDP)
- Problem types:
- Finite horizon (episodic) vs. infinite horizon (continuing)
- Deterministic vs. stochastic
- Suboptimal solution to DP = approximate DP = neuro-dynamic programming = reinforcement learning
Problem formulation
- Find a policy $\pi$ that maximizes the expected reward sum, i.e.,
$$
\max_{a_t\in A_t(s_t)} E \left\{ \sum_t r_t(s_t, a_t) \right\}
$$
- An action sequence $\{a^*_0,\dots,a^*_t,\dots,a^*_{T-1}\}$ is optimal → the subsequence $\{a^*_t,\dots,a^*_{T-1}\}$ is optimal for the tail subproblem starting from $t$

DP algorithm (Backward induction)
Based on the principle of optimality (tail subproblem)
$$
v^*(s_t)=\max_{a_t\in A_t(s_t)} E \left[ r_t(s_t,a_t)+v^*(s_{t+1}\mid s_t,a_t) \right] \text{ for all possible } s_t
$$
- $v^*(s_{t+1}\mid s_t,a_t)$ → the optimal value of the tail subproblem starting from $s_{t+1}$ (see the backward-induction sketch below)
However,
- Requires the model (MDP kernel)
- Suffers from the curse of dimensionality
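As a concrete illustration of the backward-induction recursion above, here is a minimal Python sketch. It assumes a small finite-horizon MDP whose kernel is fully known and supplied as hypothetical inputs `P` (transition probabilities) and `R` (expected rewards $r_t(s_t,a_t)$); the toy numbers at the bottom are illustrative only.

```python
# Minimal backward-induction sketch for a finite-horizon MDP with a known kernel.
# P and R are hypothetical, illustrative inputs:
#   P[t][s][a] is a dict {next_state: probability}, R[t][s][a] is r_t(s, a).
import numpy as np

def backward_induction(P, R, num_states, num_actions, horizon):
    """Compute v*_t(s) and a greedy policy via v*_t(s) = max_a E[r_t(s,a) + v*_{t+1}(s')]."""
    v = np.zeros((horizon + 1, num_states))       # v[horizon] = 0 (no terminal reward)
    policy = np.zeros((horizon, num_states), dtype=int)
    for t in reversed(range(horizon)):            # solve tail subproblems first
        for s in range(num_states):
            q = np.empty(num_actions)
            for a in range(num_actions):
                expected_next = sum(p * v[t + 1][s_next]
                                    for s_next, p in P[t][s][a].items())
                q[a] = R[t][s][a] + expected_next
            v[t][s] = q.max()
            policy[t][s] = q.argmax()
    return v, policy

# Toy 2-state, 2-action example (hypothetical numbers), horizon 3.
H, S, A = 3, 2, 2
P = [[[{0: 0.9, 1: 0.1}, {1: 1.0}],               # transitions from state 0
      [{0: 0.5, 1: 0.5}, {1: 1.0}]]               # transitions from state 1
     for _ in range(H)]
R = [[[1.0, 0.0],                                 # rewards r_t(s, a)
      [0.0, 2.0]]
     for _ in range(H)]
v, pi = backward_induction(P, R, S, A, H)
print("v*_0:", v[0], "greedy policy at t=0:", pi[0])
```

Note that the sketch needs the full kernel `P` and sweeps every state at every stage, which is exactly the model requirement and curse of dimensionality mentioned above.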
Prediction
Without MDP kernel
- The DP method requires knowledge of the environment
- e.g., state transition probabilities, reward distributions
- How to estimate the value of a policy w/o the MDP kernel? → the prediction problem (see the Monte Carlo sketch below)
- Our goal is to find an optimal policy
- How to estimate the optimal value w/o MDP kernel?
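One standard model-free answer to the prediction problem is Monte Carlo estimation: average the returns observed after visiting each state, using only sampled episodes and no transition probabilities. Below is a minimal first-visit Monte Carlo prediction sketch; `sample_episode` is a hypothetical stand-in for interacting with the real environment under a fixed policy, and the toy chain at the bottom exists only to make the example runnable.

```python
# Minimal first-visit Monte Carlo prediction sketch: estimate v_pi(s) from
# sampled episodes only, without the MDP kernel.
import random
from collections import defaultdict

def mc_prediction(sample_episode, num_episodes, gamma=1.0):
    """Estimate v_pi(s) as the average return following first visits to s."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for _ in range(num_episodes):
        episode = sample_episode()                # [(s_t, r_{t+1}), ...]
        # Compute the return G_t following each time step, working backwards.
        G, returns_after = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns_after[t] = G
        # First-visit update: only the first occurrence of each state counts.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns_after[t]
                returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}

def sample_episode():
    # Hypothetical toy chain: start in state 0; each step, move to the terminal
    # state with probability 0.5 (reward 1) or stay in state 0 (reward 0).
    episode, s = [], 0
    while s == 0:
        if random.random() < 0.5:
            episode.append((0, 1.0))
            s = 1
        else:
            episode.append((0, 0.0))
    return episode

print(mc_prediction(sample_episode, num_episodes=5000))  # v(0) should be close to 1.0
```

The same episode data can also be used incrementally (temporal-difference prediction); the key point is that only sampled transitions, not the kernel, are needed to estimate values.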