Contents

Episodic task (finite horizon problem)

Episode (trial): a subsequence into which the agent-environment interaction naturally breaks
- each episode begins independently
- e.g., 10 games of chess (each game is one episode)
Return $G_t$: the sum of rewards received “after” time step $t$
- $G_t=R_{t+1}+R_{t+2}+...+R_T$
- T is the time of termination (= the end of the episode)
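
As a minimal sketch of the episodic return above (the function name and the reward list are made up for illustration), assume `rewards[k]` stores $R_{k+1}$, so the rewards received after step $t$ are `rewards[t:]`:

```python
def episodic_return(rewards, t=0):
    """Undiscounted return G_t = R_{t+1} + R_{t+2} + ... + R_T.

    Assumption: rewards[k] holds R_{k+1}, so len(rewards) == T.
    """
    return sum(rewards[t:])


# Hypothetical episode with T = 4: R_1 = 1, R_2 = 0, R_3 = 0, R_4 = 2
rewards = [1, 0, 0, 2]
g0 = episodic_return(rewards, t=0)  # 1 + 0 + 0 + 2 = 3
g2 = episodic_return(rewards, t=2)  # 0 + 2 = 2
gT = episodic_return(rewards, t=4)  # no rewards after termination, so 0
```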

Continuing task (infinite horizon problem)

Agent-environment interaction does not end
- Since $T = \infty$, the plain sum of rewards may diverge → discounting is needed
Discounted return $G_t$ with discount rate $\gamma$ ($0 <\gamma <1$)
- $G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+ \dots = \sum^\infty_{k=0}\gamma^kR_{t+k+1}$
- $G_t=R_{t+1}+\gamma G_{t+1}$ : recursive form
Subscript $t$ does not mean $G_t$ is a function of $t$.
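
The discounted sum and its recursive form can be checked against each other numerically. The sketch below works on a finite reward list (a truncation, since an infinite sum cannot be stored); the function names and the convention that `rewards[k]` holds $R_{k+1}$ are assumptions made for this example.

```python
def discounted_returns(rewards, gamma):
    """Compute [G_0, ..., G_{T-1}] backwards with G_t = R_{t+1} + gamma * G_{t+1}.

    Assumption: rewards[k] holds R_{k+1}; the return after the last reward is 0.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def discounted_return_direct(rewards, gamma, t=0):
    """Compute G_t from the definition: sum over k of gamma^k * R_{t+k+1}."""
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))


rewards = [1.0, 1.0, 1.0]
recursive = discounted_returns(rewards, gamma=0.5)      # [1.75, 1.5, 1.0]
direct = discounted_return_direct(rewards, gamma=0.5)   # 1 + 0.5 + 0.25 = 1.75
```

With a constant reward of 1 at every step and an infinite horizon, the geometric series gives $G_t = 1/(1-\gamma)$, which is finite for $\gamma < 1$; this is why discounting resolves the divergence.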

Computing the value

Policy $\pi$: a mapping from states to actions (or to a probability distribution over actions)
- $\pi(a|s)$ : probability of taking action $a$ in state $s$
Once the policy $\pi$ is fixed, the “expectation of the return” is well defined


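System behavior determined by the environment → dynamics we cannot control. For example, in a slippery environment the outcome of an action is still probabilistic.
Our control through the policy → the actions chosen according to the policy
The mapping $s \rightarrow E[G_t|S_t=s]$ lets us compute the expected return for each state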
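
To make the two sources of randomness concrete, here is a minimal Monte Carlo sketch that estimates $E[G_t|S_t=s]$ by averaging sampled returns. Everything in it is hypothetical: a single non-terminal state, a stochastic policy with $\pi(\text{go}|s)=0.7$, and a slippery environment in which "go" fails with probability 0.2. It illustrates the idea and is not a specific algorithm from these notes.

```python
import random

GAMMA = 0.9   # discount rate (assumed value)
P_GO = 0.7    # pi(go | s), hypothetical policy
SLIP = 0.2    # probability that "go" fails, hypothetical dynamics


def policy(state, rng):
    """Our control: sample an action from pi(. | state)."""
    return "go" if rng.random() < P_GO else "stay"


def env_step(state, action, rng):
    """Environment dynamics we cannot control: returns (next_state, reward, done)."""
    if action == "go" and rng.random() >= SLIP:
        return "goal", 1.0, True   # reached the terminal state
    return state, 0.0, False       # slipped or stayed: no progress, no reward


def sample_return(start, rng, max_steps=1000):
    """Roll out one episode and accumulate the discounted return."""
    g, discount, state = 0.0, 1.0, start
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = env_step(state, action, rng)
        g += discount * reward
        discount *= GAMMA
        if done:
            break
    return g


def estimate_value(start, episodes=20000, seed=0):
    """Average many sampled returns to approximate E[G_t | S_t = start]."""
    rng = random.Random(seed)
    return sum(sample_return(start, rng) for _ in range(episodes)) / episodes


# Each step reaches the goal with probability 0.7 * 0.8 = 0.56, so the exact
# value solves v = 0.56 * 1 + 0.44 * GAMMA * v.
exact = 0.56 / (1 - 0.44 * GAMMA)
estimate = estimate_value("start")
```

The estimate should land close to `exact`; the gap shrinks as `episodes` grows.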

Value function

Expected return from a state, or from a state-action pair, when following policy $\pi$
State-value function:

$v_\pi(s)=E_\pi[G_t|S_t=s]=E_\pi[R_{t+1}+\gamma G_{t+1}|S_t=s]$
Action-value function:

$q_\pi(s,a)=E_\pi[G_t|S_t=s,A_t=a]=E_\pi[R_{t+1}+\gamma G_{t+1}|S_t=s,A_t=a]$
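
The two definitions are linked through the policy. As a short worked identity (it follows from the definitions above by conditioning on $A_t$), averaging the action values over the actions the policy may choose gives the state value:

$v_\pi(s)=\sum_a \pi(a|s)\,q_\pi(s,a)$

For a deterministic policy that always picks the action $\pi(s)$ in state $s$, this reduces to $v_\pi(s)=q_\pi(s,\pi(s))$.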
Comparison