A discrete-time stochastic model → a sequence of possible events:
The probability of each event depends only on the state reached in the previous event
A stochastic process $\{X_n, n=0,1,2,...\}$
$X_n$ → state at (discrete) time step n
DTMC (Discrete-Time Markov Chain):
$$ \begin{align*} & P(X_{n+1}=j|X_n=i, X_{n-1}=i_{n-1}, ..., X_0=i_0) \\ & =P(X_{n+1}=j|X_n=i) \\ & = P_{ij} \end{align*} $$
where $P_{ij}$ is independent of the past history and of the time step (n)
Given $P=[P_{ij}]$, we can find the stationary probability distribution $\Pi_i$:
$\Pi_i=\text{Prob}\{\text{current state is } i\}$ = long-run fraction of time spent in state $i$
→ from $\Pi_i$ we can derive quantities such as the average time spent in state $i$ and the expected inter-visit (return) time to state $i$
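As a minimal sketch of the above, the stationary distribution of a DTMC can be computed as the left eigenvector of the transition matrix for eigenvalue 1 (the 3-state matrix below is a hypothetical example, not from the source):

```python
import numpy as np

# Hypothetical 3-state transition matrix; P[i, j] = P_ij, each row sums to 1.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# Stationary distribution Pi solves Pi @ P = Pi with sum(Pi) = 1,
# i.e. the left eigenvector of P for eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()

print(pi)        # long-run fraction of time spent in each state
print(1.0 / pi)  # expected inter-visit (return) time to each state
```

For an irreducible, positive-recurrent chain, the expected return time to state $i$ is $1/\Pi_i$, which is what the last line prints.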
In mathematics, an MDP (Markov Decision Process) is a discrete-time stochastic control process.
Its transition dynamics are expressed as p(next state | current state, action)
Formulating RL problems through MDP
Should satisfy the Markov property and the stationarity property
The kernel of an MDP describes the environment’s behavior
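A minimal sketch of an MDP kernel, assuming a toy 2-state, 2-action environment (all states, actions, and probabilities below are hypothetical): the kernel maps each (state, action) pair to a distribution over next states, depending only on the current state and action, not on the history.

```python
import random

# Hypothetical kernel: kernel[(state, action)] -> {next_state: probability}.
kernel = {
    (0, 'stay'): {0: 0.9, 1: 0.1},
    (0, 'move'): {0: 0.2, 1: 0.8},
    (1, 'stay'): {1: 0.95, 0: 0.05},
    (1, 'move'): {1: 0.3, 0: 0.7},
}

def step(state, action, rng=random):
    """Sample next_state ~ p(. | state, action) from the kernel."""
    dist = kernel[(state, action)]
    states = list(dist.keys())
    probs = list(dist.values())
    return rng.choices(states, weights=probs, k=1)[0]

next_state = step(0, 'move')
```

Stationarity here means the same `kernel` is used at every time step; the Markov property is reflected in `step` taking only the current state and action as inputs.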