2.1 Markov Decision Process (Part 1)
Markov Decision Process (MDP)
- A Markov Decision Process can model many real-world problems. It formally describes the framework of reinforcement learning.
- In an MDP, the environment is fully observable.
- Optimal control primarily deals with continuous MDPs
- Partially observable problems can be converted into MDPs
Markov Property
- The history of states: $h_t = \{s_1, s_2, s_3, \ldots, s_t\}$
- State $s_t$ is Markovian if and only if:
  $p(s_{t+1}|s_t) = p(s_{t+1}|h_t)$
  $p(s_{t+1}|s_t, a_t) = p(s_{t+1}|h_t, a_t)$
- "The future is independent of the past given the present"
Markov Process/Markov Chain
- State transition matrix $P$ specifies $p(s_{t+1}=s' \mid s_t=s)$:
  $$P=\begin{bmatrix} P(s_{1}|s_{1}) & P(s_{2}|s_{1}) & \cdots & P(s_{N}|s_{1})\\ P(s_{1}|s_{2}) & P(s_{2}|s_{2}) & \cdots & P(s_{N}|s_{2})\\ \vdots & \vdots & \ddots & \vdots\\ P(s_{1}|s_{N}) & P(s_{2}|s_{N}) & \cdots & P(s_{N}|s_{N}) \end{bmatrix}$$
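To make the transition matrix above concrete, here is a minimal NumPy sketch (not from the lecture) of storing $P$ and sampling a chain from it; the 3-state matrix, the seed, and the helper name `sample_episode` are made up purely for illustration.

```python
import numpy as np

# Hypothetical 3-state Markov chain: P[i, j] = p(s_{t+1} = j | s_t = i).
# The numbers are made up for illustration; each row sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.3, 0.7],
])
assert np.allclose(P.sum(axis=1), 1.0)  # every row is a probability distribution

rng = np.random.default_rng(0)

def sample_episode(P, start_state, length):
    """Sample a state trajectory of the given length, starting from start_state."""
    states = [start_state]
    for _ in range(length - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_episode(P, start_state=0, length=5))
```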
Example of MP
- Sample episodes starting from $s_3$:
  - $s_3, s_4, s_5, s_6, s_6$
  - $s_3, s_2, s_3, s_2, s_1$
  - $s_3, s_4, s_4, s_5, s_5$
Markov Reward Process (MRP)
- Markov Reward Process is a Markov Chain + reward
- Definition of Markov Reward Process (MRP)
- S is a (finite) set of states (s ∈ S)
- P is the dynamics/transition model that specifies $P(s_{t+1}=s'|s_t=s)$
- R is a reward function $R(s_t=s) = E[r_t|s_t=s]$
- Discount factor $\gamma \in [0,1]$
- With a finite number of states, R can be represented as a vector
Example of MRP
Reward: +5 in $s_1$, +10 in $s_7$, 0 in all other states, so we can represent $R = [5, 0, 0, 0, 0, 0, 10]$.
Return and Value function
- Definition of Horizon
  - The maximum number of time steps in each episode
  - Can be infinite; if it is finite, the process is called a finite Markov (reward) process
- Definition of Return
  - Discounted sum of rewards from time step t to the horizon:
    $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \ldots + \gamma^{T-t-1} R_T$
- Definition of the state value function $V_t(s)$ for an MRP
  - Expected return from t in state s:
    $V_t(s) = E[G_t|s_t=s] = E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \ldots + \gamma^{T-t-1} R_T \mid s_t=s]$
  - Present value of future rewards
Why Discount Factor γ
- Avoids infinite returns in cyclic Markov processes
- Uncertainty about the future may not be fully represented
- If the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows preference for immediate reward
- It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g if all sequences terminate.
- γ = 0: Only care about the immediate reward
- γ = 1: Future rewards are weighted the same as immediate rewards
Example of MRP
- Reward: +5 in $s_1$, +10 in $s_7$, 0 in all other states, so we can represent $R = [5, 0, 0, 0, 0, 0, 10]$
- Sample returns $G$ for 4-step episodes with $\gamma = 1/2$:
  - Return for $s_4, s_5, s_6, s_7$: $0 + \frac{1}{2}\times 0 + \frac{1}{4}\times 0 + \frac{1}{8}\times 10 = 1.25$
  - Return for $s_4, s_3, s_2, s_1$: $0 + \frac{1}{2}\times 0 + \frac{1}{4}\times 0 + \frac{1}{8}\times 5 = 0.625$
  - Return for $s_4, s_5, s_6, s_6$: $0$
- How do we compute the value function, e.g., the value of state $s_4$, $V(s_4)$?
Compute the Value of a Markov Reward Process
- Value function: expected return from starting in state s
  $V(s) = E[G_t|s_t=s] = E[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 R_{t+4} + \ldots + \gamma^{T-t-1} R_T \mid s_t=s]$
- The MRP value function satisfies the following Bellman equation:
  $V(s) = \underbrace{R(s)}_{\text{Immediate reward}} + \underbrace{\gamma \sum_{s'\in S} P(s'|s)V(s')}_{\text{Discounted sum of future rewards}}$
- Practice: derive the Bellman equation for V(s)
  - Hint: $V(s) = E[R_{t+1} + \gamma E[R_{t+2} + \gamma R_{t+3} + \ldots]\mid s_t=s]$
Understanding Bellman equation
- The Bellman equation describes the recursive relation between the values of states:
  $V(s) = R(s) + \gamma \sum_{s'\in S} P(s'|s)V(s')$
Matrix Form of Bellman Equation for MRP
Therefore, we can express the Bellman equation in matrix form: $V = R + \gamma P V$
- Analytic solution for the value of an MRP:
  $V = (I - \gamma P)^{-1} R$
- Matrix inversion has complexity $O(N^3)$ for $N$ states
- Only feasible for small MRPs (a small numerical sketch follows below)
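A minimal sketch of the analytic solution, assuming a hypothetical 3-state MRP; the P, R, and γ values below are made up for illustration and are not the lecture's river example.

```python
import numpy as np

# Hypothetical 3-state MRP; P, R and gamma are made up for illustration.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.3, 0.7],
])
R = np.array([5.0, 0.0, 10.0])
gamma = 0.5

# Analytic solution V = (I - gamma * P)^{-1} R.
# Solving the linear system is preferred over forming the inverse explicitly.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```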
Iterative Algorithms for Computing the Value of an MRP
- Iterative methods for large MRPs:
- Dynamic Programming
- Monte-Carlo evaluation
- Temporal-Difference learning
Monte Carlo Algorithm for Computing the Value of an MRP
Algorithm 1 Monte Carlo simulation to calculate MRP value function
- $i \leftarrow 0$, $G_t \leftarrow 0$
- while $i \neq N$ do
  - generate an episode, starting from state s and time t
  - using the generated episode, calculate the return $g = \sum_{i=t}^{H-1}\gamma^{i-t} r_i$
  - $G_t \leftarrow G_t + g$, $i \leftarrow i + 1$
- end while
- $V_t(s) \leftarrow G_t / N$
- For example, to calculate $V(s_4)$ we can generate many trajectories and then take the average of the returns (a code sketch follows below):
  - Return for $s_4, s_5, s_6, s_7$: $0 + \frac{1}{2}\times 0 + \frac{1}{4}\times 0 + \frac{1}{8}\times 10 = 1.25$
  - Return for $s_4, s_3, s_2, s_1$: $0 + \frac{1}{2}\times 0 + \frac{1}{4}\times 0 + \frac{1}{8}\times 5 = 0.625$
  - Return for $s_4, s_5, s_6, s_6$: $0$
  - more trajectories
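A minimal Monte Carlo sketch, again on the hypothetical 3-state MRP used above rather than the lecture's example; N fixed-horizon episodes are sampled and their discounted returns averaged. The function name `mc_value` and all numbers are illustrative assumptions.

```python
import numpy as np

# Same hypothetical 3-state MRP as above (not the lecture's river example).
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.3, 0.7],
])
R = np.array([5.0, 0.0, 10.0])
gamma, horizon, N = 0.5, 10, 10_000
rng = np.random.default_rng(0)

def mc_value(start_state):
    """Estimate V(start_state) as the average discounted return of N sampled episodes."""
    total = 0.0
    for _ in range(N):
        s, g, discount = start_state, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[s]            # collect the reward of the current state
            discount *= gamma
            s = rng.choice(len(P), p=P[s])  # step the chain
        total += g
    return total / N

print(mc_value(start_state=1))
```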
Iterative Algorithm for Computing the Value of an MRP
Algorithm 2 Iterative algorithm to calculate the MRP value function
- for all states $s \in S$: $V'(s) \leftarrow 0$, $V(s) \leftarrow \infty$
- while $||V - V'|| > \epsilon$ do
  - $V \leftarrow V'$
  - for all states $s \in S$: $V'(s) = R(s) + \gamma \sum_{s'\in S}P(s'|s)V(s')$
- end while
- return $V'(s)$ for all $s \in S$ (a code sketch follows below)
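A minimal sketch of this iterative Bellman backup in NumPy, on the same hypothetical 3-state MRP used in the earlier sketches; it should agree with the analytic solution up to the chosen tolerance.

```python
import numpy as np

# Same hypothetical 3-state MRP as in the sketches above.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.3, 0.7],
])
R = np.array([5.0, 0.0, 10.0])
gamma, eps = 0.5, 1e-8

# Iterate the Bellman backup V'(s) = R(s) + gamma * sum_s' P(s'|s) V(s')
# until the value vector stops changing.
V = np.zeros_like(R)
while True:
    V_new = R + gamma * P @ V
    if np.max(np.abs(V_new - V)) < eps:
        break
    V = V_new
print(V_new)  # matches the analytic solution (I - gamma * P)^{-1} R up to eps
```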
Markov Decision Process (MDP)
- A Markov Decision Process is a Markov Reward Process with decisions.
- Definition of MDP
  - S is a finite set of states
  - A is a finite set of actions
  - $P^a$ is the dynamics/transition model for each action: $P(s_{t+1}=s'|s_t=s, a_t=a)$
  - R is a reward function $R(s_t=s, a_t=a) = E[r_t|s_t=s, a_t=a]$
  - Discount factor $\gamma \in [0,1]$
- An MDP is a tuple $(S, A, P, R, \gamma)$
Policy in MDP
- A policy specifies what action to take in each state
- Given a state, it specifies a distribution over actions
- Policy: $\pi(a|s) = P(a_t=a|s_t=s)$
- Policies are stationary (time-independent): $A_t \sim \pi(a|s)$ for any $t > 0$
- Given an MDP $(S, A, P, R, \gamma)$ and a policy $\pi$:
  - The state sequence $S_1, S_2, \ldots$ is a Markov process $(S, P^{\pi})$
  - The state and reward sequence $S_1, R_1, S_2, R_2, \ldots$ is a Markov reward process $(S, P^{\pi}, R^{\pi}, \gamma)$, where
    $P^{\pi}(s'|s) = \sum_{a\in A}\pi(a|s)P(s'|s,a)$
    $R^{\pi}(s) = \sum_{a\in A}\pi(a|s)R(s,a)$
Comparison of MP/MRP and MDP
Value function for MDP
- The state-value function $v^{\pi}(s)$ of an MDP is the expected return starting from state s and following policy $\pi$:
  $v^{\pi}(s) = E_{\pi}[G_t|s_t=s]$
- The action-value function $q^{\pi}(s,a)$ is the expected return starting from state s, taking action a, and then following policy $\pi$:
  $q^{\pi}(s,a) = E_{\pi}[G_t|s_t=s, A_t=a]$
- The relation between $v^{\pi}(s)$ and $q^{\pi}(s,a)$:
  $v^{\pi}(s) = \sum_{a\in A}\pi(a|s)q^{\pi}(s,a)$
Bellman Expectation Equation
- The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state:
  $v^{\pi}(s) = E_{\pi}[R_{t+1} + \gamma v^{\pi}(s_{t+1})|s_t=s]$
- The action-value function can similarly be decomposed:
  $q^{\pi}(s,a) = E_{\pi}[R_{t+1} + \gamma q^{\pi}(s_{t+1}, A_{t+1})|s_t=s, A_t=a]$
Bellman Expectation Equation for $V^{\pi}$ and $Q^{\pi}$
$v^{\pi}(s) = \sum_{a\in A}\pi(a|s)q^{\pi}(s,a)$
$q^{\pi}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')$
Thus,
$v^{\pi}(s) = \sum_{a\in A}\pi(a|s)\left(R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$
$q^{\pi}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)\sum_{a'\in A}\pi(a'|s')q^{\pi}(s',a')$
Backup Diagram for V π V^{\pi} Vπ
$v^{\pi}(s) = \sum_{a\in A}\pi(a|s)\left(R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$
Backup Diagram for Q π Q^{\pi} Qπ
$q^{\pi}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)\sum_{a'\in A}\pi(a'|s')q^{\pi}(s',a')$
Policy Evaluation
- Evaluate the value of each state given a policy $\pi$: compute $v^{\pi}(s)$
- Also called (value) prediction
Example: Navigate the boat
Example: Policy Evaluation
- Two actions: Left and Right
- For all actions, reward: +5 in $s_1$, +10 in $s_7$, 0 in all other states, so we can represent $R = [5, 0, 0, 0, 0, 0, 10]$
- Take a deterministic policy $\pi(s) = \text{Left}$ and $\gamma = 0$ for every state s. What is the value of the policy?
  - $v^{\pi} = [5, 0, 0, 0, 0, 0, 10]$
- Iteration: $v_k^{\pi}(s) = r(s,\pi(s)) + \gamma\sum_{s'\in S}P(s'|s,\pi(s))v_{k-1}^{\pi}(s')$
- $R = [5, 0, 0, 0, 0, 0, 10]$
- Practice 1: deterministic policy $\pi(s) = \text{Left}$ and $\gamma = 0.5$ for every state s. What are the state values under the policy?
- Practice 2: stochastic policy $P(\pi(s)=\text{Left}) = 0.5$, $P(\pi(s)=\text{Right}) = 0.5$, and $\gamma = 0.5$ for every state s. What are the state values under the policy?
- Iteration: $v_k^{\pi}(s) = r(s,\pi(s)) + \gamma\sum_{s'\in S}P(s'|s,\pi(s))v_{k-1}^{\pi}(s')$
2.2 Markov Decision Process (Part 2)
Decision Making in a Markov Decision Process (MDP)
- Prediction (evaluate a given policy):
  - Input: MDP $\langle S, A, P, R, \gamma\rangle$ and policy $\pi$, or MRP $\langle S, P^{\pi}, R^{\pi}, \gamma\rangle$
  - Output: value function $v^{\pi}$
- Control (search for the optimal policy):
  - Input: MDP $\langle S, A, P, R, \gamma\rangle$
  - Output: optimal value function $v^{*}$ and optimal policy $\pi^{*}$
- Prediction and control in MDP can be solved by dynamic programming.
Dynamic programming
Dynamic programming is a very general solution method for problems which have two properties:
- Optimal substructure
- Principle of optimality applies
- Optimal solution can be decomposed into subproblems
- Overlapping subproblems
- Subproblems recur many times
- Solutions can be cached and reused
Markov decision processes satisfy both properties
- Bellman equation gives recursive decomposition
- Value function stores and reuses solutions
Policy evaluation on MDP
- Objective: evaluate a given policy $\pi$ for an MDP
- Output: the value function under the policy, $v^{\pi}$
- Solution: iteration on the Bellman expectation backup
- Algorithm: synchronous backup
  - At each iteration t+1, update $v_{t+1}(s)$ from $v_t(s')$ for all states $s \in S$, where $s'$ is a successor state of $s$:
    $v_{t+1}(s) = \sum_{a\in A}\pi(a|s)\left(R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v_t(s')\right)$
- Convergence: $v_1 \rightarrow v_2 \rightarrow \ldots \rightarrow v^{\pi}$ (a code sketch follows below)
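A minimal sketch of this synchronous backup, assuming a hypothetical 3-state, 2-action MDP; the P, R, and π arrays below (and the action names in the comments) are made up purely for illustration, not the lecture's example.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; P[a, s, s'], R[s, a] and pi[s, a]
# are made up purely for illustration.
n_states, n_actions = 3, 2
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # action 0 ("left")
P[1] = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]  # action 1 ("right")
R = np.array([[5.0, 0.0], [0.0, 0.0], [0.0, 10.0]])          # R[s, a]
pi = np.full((n_states, n_actions), 0.5)                      # uniform random policy
gamma, eps = 0.5, 1e-8

V = np.zeros(n_states)
while True:
    # v_{t+1}(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) v_t(s') ]
    q = R + gamma * np.einsum("ast,t->sa", P, V)  # q[s, a] under the current V
    V_new = (pi * q).sum(axis=1)
    if np.max(np.abs(V_new - V)) < eps:
        break
    V = V_new
print(V)
```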
Policy evaluation: Iteration on Bellman expectation backup
- Bellman expectation backup for a particular policy:
  $v_{t+1}(s) = \sum_{a\in A}\pi(a|s)\left(R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v_t(s')\right)$
- Or, in the form of the induced MRP $\langle S, P^{\pi}, R^{\pi}, \gamma\rangle$:
  $v_{t+1}(s) = R^{\pi}(s) + \gamma \sum_{s'\in S} P^{\pi}(s'|s)v_t(s')$
Evaluating a Random Policy in the Small Gridworld
Example 4.1 in the Sutton RL textbook
- Undiscounted episodic MDP ($\gamma = 1$)
- Nonterminal states 1, ..., 14
- Two terminal states (the two shaded squares)
- Actions leading out of the grid leave the state unchanged, e.g., $P(7|7,\text{right}) = 1$
- Reward is -1 until the terminal state is reached
- Transitions are deterministic given the action, e.g., $P(6|5,\text{right}) = 1$
- Uniform random policy: $\pi(l|\cdot) = \pi(r|\cdot) = \pi(u|\cdot) = \pi(d|\cdot) = 0.25$
- (a code sketch of this example follows below)
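A minimal sketch of evaluating the random policy on this gridworld, following the specification in the bullets above (γ = 1, reward -1 per step, out-of-grid moves leave the state unchanged, uniform random policy). The grid layout is from Sutton & Barto's Example 4.1; the helper name `step` and the sweep count are illustrative choices.

```python
import numpy as np

# 4x4 gridworld of Sutton & Barto, Example 4.1, as specified above:
# gamma = 1, reward -1 on every step, two terminal corner states,
# moves off the grid leave the state unchanged, uniform random policy.
SIZE = 4
TERMINAL = {(0, 0), (SIZE - 1, SIZE - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic transition; out-of-grid moves leave the state unchanged."""
    if state in TERMINAL:
        return state
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state

V = np.zeros((SIZE, SIZE))
for _ in range(1000):  # plenty of synchronous sweeps for convergence here
    V_new = np.zeros_like(V)
    for r in range(SIZE):
        for c in range(SIZE):
            if (r, c) in TERMINAL:
                continue
            # uniform random policy: average the backup over the four actions
            V_new[r, c] = sum(0.25 * (-1.0 + V[step((r, c), a)]) for a in ACTIONS)
    V = V_new
print(np.round(V, 1))
```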
A live demo on policy evaluation
$v^{\pi}(s) = \sum_{a\in A}\pi(a|s)\left(R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$
- https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
Optimal Value Function
- The optimal state-value function $v^{*}(s)$ is the maximum value function over all policies:
  $v^{*}(s) = \max_{\pi} v^{\pi}(s)$
- The optimal policy:
  $\pi^{*}(s) = \arg\max_{\pi} v^{\pi}(s)$
- An MDP is "solved" when we know the optimal value
- The optimal value function is unique, but there can be multiple optimal policies (e.g., two actions with the same optimal value)
Finding Optimal Policy
- An optimal policy can be found by maximizing over $q^{*}(s,a)$:
  $\pi^{*}(a|s) = \begin{cases} 1, & \text{if } a = \arg\max_{a\in A} q^{*}(s,a) \\ 0, & \text{otherwise} \end{cases}$
- There is always a deterministic optimal policy for any MDP
- If we know $q^{*}(s,a)$, we immediately have the optimal policy
Policy Search
- One option is exhaustive search: enumerate all policies and pick the best
- The number of deterministic policies is $|A|^{|S|}$ (e.g., with $|A| = 4$ and $|S| = 14$ there are already $4^{14} \approx 2.7 \times 10^{8}$ of them)
- Other approaches such as policy iteration and value iteration are more efficient
MDP Control
- Compute the optimal policy:
  $\pi^{*}(s) = \arg\max_{\pi} v^{\pi}(s)$
- The optimal policy for an MDP in an infinite-horizon problem (the agent acts forever) is:
  - Deterministic
  - Stationary (does not depend on the time step)
  - Unique? Not necessarily; there may be state-action pairs with identical optimal values
Improving a Policy through Policy Iteration
- Iterate through the two steps:
  - Evaluate the policy $\pi$ (computing $v$ given the current $\pi$)
  - Improve the policy by acting greedily with respect to $v^{\pi}$:
    $\pi' = \text{greedy}(v^{\pi})$
Policy Improvement
- Compute the state-action value of a policy $\pi_i$:
  $q^{\pi_i}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v^{\pi_i}(s')$
- Compute the new policy $\pi_{i+1}$ for all $s \in S$ following
  $\pi_{i+1}(s) = \arg\max_{a} q^{\pi_i}(s,a)$
- (a code sketch of the full policy-iteration loop follows below)
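A minimal sketch of the full policy-iteration loop, reusing the same hypothetical 3-state, 2-action MDP as in the policy-evaluation sketch above; P, R, γ, and the helper name `evaluate` are illustrative assumptions.

```python
import numpy as np

# Same hypothetical 3-state, 2-action MDP as in the policy-evaluation sketch.
n_states, n_actions = 3, 2
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
P[1] = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
R = np.array([[5.0, 0.0], [0.0, 0.0], [0.0, 10.0]])
gamma = 0.5

def evaluate(policy, eps=1e-8):
    """Policy evaluation for a deterministic policy (array of action indices)."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([R[s, policy[s]] + gamma * P[policy[s], s] @ V
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new

policy = np.zeros(n_states, dtype=int)            # start from "always action 0"
while True:
    V = evaluate(policy)                          # 1. policy evaluation
    q = R + gamma * np.einsum("ast,t->sa", P, V)  # q^{pi_i}(s, a)
    new_policy = np.argmax(q, axis=1)             # 2. greedy policy improvement
    if np.array_equal(new_policy, policy):
        break                                     # policy is stable, stop
    policy = new_policy
print(policy, V)
```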
Monotonic Improvement in Policy
- Consider a deterministic policy $a = \pi(s)$
- We improve the policy through
  $\pi'(s) = \arg\max_{a} q^{\pi}(s,a)$
- This improves the value from any state s over one step:
  $q^{\pi}(s,\pi'(s)) = \max_{a\in A} q^{\pi}(s,a) \geq q^{\pi}(s,\pi(s)) = v^{\pi}(s)$
- It therefore improves the value function, $v^{\pi'}(s) \geq v^{\pi}(s)$:
  $v^{\pi}(s) \leq q^{\pi}(s,\pi'(s)) = E_{\pi'}[R_{t+1} + \gamma v^{\pi}(S_{t+1}) \mid S_t = s]$
  $\leq E_{\pi'}[R_{t+1} + \gamma q^{\pi}(S_{t+1},\pi'(S_{t+1})) \mid S_t = s]$
  $\leq E_{\pi'}[R_{t+1} + \gamma R_{t+2} + \gamma^{2} q^{\pi}(S_{t+2},\pi'(S_{t+2})) \mid S_t = s]$
  $\leq E_{\pi'}[R_{t+1} + \gamma R_{t+2} + \ldots \mid S_t = s] = v^{\pi'}(s)$
- If improvements stop,
  $q^{\pi}(s,\pi'(s)) = \max_{a\in A} q^{\pi}(s,a) = q^{\pi}(s,\pi(s)) = v^{\pi}(s)$
- then the Bellman optimality equation is satisfied:
  $v^{\pi}(s) = \max_{a\in A} q^{\pi}(s,a)$
- Therefore $v^{\pi}(s) = v^{*}(s)$ for all $s \in S$, so $\pi$ is an optimal policy
Bellman Optimality Equation
- The optimal value functions satisfy the Bellman optimality equations:
  $v^{*}(s) = \max_{a} q^{*}(s,a)$
  $q^{*}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')$
- Thus,
  $v^{*}(s) = \max_{a}\left(R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')\right)$
  $q^{*}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)\max_{a'} q^{*}(s',a')$
Value Iteration: turning the Bellman Optimality Equation into an update rule
- If we know the solution to the subproblems $v^{*}(s')$, which is optimal,
- then the optimal $v^{*}(s)$ can be found by iterating over the following Bellman optimality backup rule:
  $v(s) \leftarrow \max_{a\in A}\left(R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')\right)$
- The idea of value iteration is to apply these updates iteratively
Algorithm of Value Iteration
- Objective: find the optimal policy $\pi$
- Solution: iteration on the Bellman optimality backup
- Value Iteration algorithm:
  - initialize $k = 1$ and $v_0(s) = 0$ for all states s
  - for $k = 1 : H$
    - for each state s:
      $q_{k+1}(s,a) = R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v_k(s')$
      $v_{k+1}(s) = \max_{a} q_{k+1}(s,a)$
    - $k \leftarrow k + 1$
- To retrieve the optimal policy after value iteration:
  $\pi(s) = \arg\max_{a}\left(R(s,a) + \gamma\sum_{s'\in S}P(s'|s,a)v_{k+1}(s')\right)$
- (a code sketch follows below)
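A minimal sketch of value iteration followed by policy extraction, on the same hypothetical 3-state, 2-action MDP used in the earlier sketches; P, R, γ, and H are illustrative assumptions.

```python
import numpy as np

# Same hypothetical 3-state, 2-action MDP as in the earlier sketches.
n_states, n_actions = 3, 2
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
P[1] = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
R = np.array([[5.0, 0.0], [0.0, 0.0], [0.0, 10.0]])
gamma, H = 0.5, 100

V = np.zeros(n_states)
for _ in range(H):
    # q_{k+1}(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s')
    q = R + gamma * np.einsum("ast,t->sa", P, V)
    V = q.max(axis=1)  # v_{k+1}(s) = max_a q_{k+1}(s,a)

# One-off policy extraction once the values have (approximately) converged.
q = R + gamma * np.einsum("ast,t->sa", P, V)
policy = np.argmax(q, axis=1)
print(V, policy)
```

Note that no explicit policy is maintained during the loop; it is extracted once at the end, which is the difference from policy iteration discussed further below.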
Example: Shortest Path
After the optimal values are reached, we run policy extraction to retrieve the optimal policy.
Demo of Policy Iteration and Value Iteration
- Policy iteration: iteration of policy evaluation and policy improvement (update)
- Value iteration
- https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
Policy iteration and value iteration on FrozenLake
- https://github.com/cuhkrlcourse/RLexample/tree/master/MDP
Difference between Policy Iteration and Value Iteration
- Policy iteration consists of policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.
- Value iteration consists of finding the optimal value function + one policy extraction. There is no need to repeat the two, because once the value function is optimal, the policy extracted from it is also optimal (i.e. converged).
- Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of v(s) after just one sweep of all states, regardless of convergence).
Summary for Prediction and Control in MDP
End
- Optional Homework 1 is available at https://github.com/cuhkrlcourse/ierg6130-assignment