Chapter 2: Markov Decision Processes

news2025/1/9 15:01:13

2.1 Markov Decision Processes (Part 1)

Markov Decision Process(MDP)


  1. Markov decision processes (MDPs) can model many real-world problems, and they formally describe the framework of reinforcement learning.
  2. Under an MDP, the environment is fully observable.
    1. Optimal control primarily deals with continuous MDPs
    2. Partially observable problems can be converted into MDPs

Markov Property

  1. The history of states: $h_{t}=\{s_{1},s_{2},s_{3},\dots,s_{t}\}$

  2. State $s_{t}$ is Markovian if and only if:
    $p(s_{t+1}|s_{t})=p(s_{t+1}|h_{t})$

    $p(s_{t+1}|s_{t},a_{t})=p(s_{t+1}|h_{t},a_{t})$

  3. “The future is independent of the past given the present”

Markov Process/Markov Chain

  1. State transition matrix $P$ specifies $p(s_{t+1}=s'|s_{t}=s)$
    $P=\begin{bmatrix} P(s_{1}|s_{1}) & P(s_{2}|s_{1}) & \dots & P(s_{N}|s_{1})\\ P(s_{1}|s_{2}) & P(s_{2}|s_{2}) & \dots & P(s_{N}|s_{2})\\ \vdots & \vdots & \ddots & \vdots\\ P(s_{1}|s_{N}) & P(s_{2}|s_{N}) & \dots & P(s_{N}|s_{N}) \end{bmatrix}$

Example of MP

  1. Sample episodes starting from $s_{3}$
    1. $s_{3},s_{4},s_{5},s_{6},s_{6}$
    2. $s_{3},s_{2},s_{3},s_{2},s_{1}$
    3. $s_{3},s_{4},s_{4},s_{5},s_{5}$
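Sampling an episode only requires repeatedly drawing the next state from the row of $P$ that belongs to the current state. The following is a minimal sketch in Python. The transition probabilities are an assumption made for illustration (the chain from the original figure is not reproduced here), and `transition_row` and `sample_episode` are hypothetical helper names.

```python
import random

# Hypothetical 7-state chain (index 0 is s1): move left with probability 0.4,
# stay with 0.2, move right with 0.4; moves off either end are clipped.
NUM_STATES = 7


def transition_row(state):
    """Return the row P(. | state) of the assumed transition matrix."""
    row = [0.0] * NUM_STATES
    for move, prob in ((-1, 0.4), (0, 0.2), (1, 0.4)):
        nxt = min(max(state + move, 0), NUM_STATES - 1)
        row[nxt] += prob
    return row


P = [transition_row(s) for s in range(NUM_STATES)]


def sample_episode(P, start, length, rng):
    """Sample `length` states, beginning at `start`, by following P."""
    episode = [start]
    while len(episode) < length:
        current = episode[-1]
        nxt = rng.choices(range(len(P)), weights=P[current])[0]
        episode.append(nxt)
    return episode


rng = random.Random(0)
episode = sample_episode(P, start=2, length=5, rng=rng)  # start from s3
print(["s%d" % (s + 1) for s in episode])
```

Each row of `P` sums to 1, so every row is a valid distribution over next states. Changing the seed yields different episodes from the same chain.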

Markov Reward Process (MRP)

  1. A Markov Reward Process is a Markov chain plus rewards
  2. Definition of Markov Reward Process (MRP)
    1. $S$ is a (finite) set of states ($s\in S$)
    2. $P$ is the dynamics/transition model that specifies $P(s_{t+1}=s'|s_{t}=s)$
    3. $R$ is a reward function $R(s_{t}=s)=E[r_{t}|s_{t}=s]$
    4. Discount factor $\gamma\in[0,1]$
  3. If the number of states is finite, $R$ can be written as a vector

Example of MRP

Reward: +5 in $s_{1}$, +10 in $s_{7}$, 0 in all other states, so we can represent the reward as the vector R = [5, 0, 0, 0, 0, 0, 10]

Return and Value function

  1. Definition of Horizon

    1. Maximum number of time steps in each episode
    2. Can be infinite; otherwise the process is called a finite Markov (reward) process
  2. Definition of Return

    1. Discounted sum of rewards from time step t to the horizon
      $G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}R_{t+4}+\dots+\gamma^{T-t-1}R_{T}$
  3. Definition of the state value function $V_{t}(s)$ for an MRP

    1. Expected return from t in state s
      $V_{t}(s)=E[G_{t}|s_{t}=s]=E[R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}R_{t+4}+\dots+\gamma^{T-t-1}R_{T}|s_{t}=s]$

    2. Present value of future rewards

Why Discount Factor γ

  1. Avoids infinite returns in cyclic Markov processes
  2. Uncertainty about the future may not be fully represented
  3. If the reward is financial, immediate rewards may earn more interest than delayed rewards
  4. Animal/human behaviour shows a preference for immediate reward
  5. It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.
    1. γ = 0: Only the immediate reward matters
    2. γ = 1: Future rewards count as much as the immediate reward

Example of MRP

  1. Reward: +5 in $s_{1}$, +10 in $s_{7}$, 0 in all other states, so we can represent the reward as the vector R = [5, 0, 0, 0, 0, 0, 10]
  2. Sample returns G for 4-step episodes with γ = 1/2
    1. return for $s_{4},s_{5},s_{6},s_{7}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 10=1.25$
    2. return for $s_{4},s_{3},s_{2},s_{1}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 5=0.625$
    3. return for $s_{4},s_{5},s_{6},s_{6}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 0=0$
  3. How do we compute the value function, for example the value $V(s_{4})$ of state $s_{4}$?
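The sample returns above can be reproduced with a few lines of Python. This is a minimal sketch: it follows the convention used in the example, where the k-th visited state contributes its reward discounted by $\gamma^{k}$, and the function name `discounted_return` is only illustrative.

```python
REWARDS = [5, 0, 0, 0, 0, 0, 10]  # R for s1..s7, taken from the example


def discounted_return(states, rewards, gamma):
    """Sum gamma**k * R(s_k) over the visited states (1-based state ids)."""
    return sum(gamma ** k * rewards[s - 1] for k, s in enumerate(states))


for episode in ([4, 5, 6, 7], [4, 3, 2, 1], [4, 5, 6, 6]):
    print(episode, discounted_return(episode, REWARDS, gamma=0.5))
# prints 1.25, 0.625 and 0.0 for the three episodes
```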

Compute the Value of a Markov Reward Process

  1. Value function: expected return from starting in state s
    $V(s)=E[G_{t}|s_{t}=s]=E[R_{t+1}+\gamma R_{t+2}+\gamma^{2}R_{t+3}+\gamma^{3}R_{t+4}+\dots+\gamma^{T-t-1}R_{T}|s_{t}=s]$

  2. The MRP value function satisfies the following Bellman equation:
    $V(s)=\underbrace{R(s)}_{\text{Immediate reward}}+\underbrace{\gamma\sum_{s'\in S}P(s'|s)V(s')}_{\text{Discounted sum of future reward}}$

  3. Practice: derive the Bellman equation for V(s)

    1. Hint: $V(s)=E[R_{t+1}+\gamma E[R_{t+2}+\gamma R_{t+3}+\dots]|s_{t}=s]$
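One possible derivation is sketched below. It assumes that the expected immediate reward in state $s$ is $R(s)$, as in the MRP definition, and it uses the law of total expectation together with the Markov property to condition on the next state.

```latex
\begin{aligned}
V(s) &= E[G_{t} \mid s_{t}=s] \\
     &= E[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \dots) \mid s_{t}=s] \\
     &= E[R_{t+1} \mid s_{t}=s] + \gamma E[G_{t+1} \mid s_{t}=s] \\
     &= R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, E[G_{t+1} \mid s_{t+1}=s'] \\
     &= R(s) + \gamma \sum_{s' \in S} P(s' \mid s)\, V(s')
\end{aligned}
```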

Understanding Bellman equation

  1. The Bellman equation describes the recursive relation between the value of a state and the values of its successor states
    $V(s)=R(s)+\gamma\sum_{s'\in S}P(s'|s)V(s')$

Matrix Form of Bellman Equation for MRP

Therefore, we can express V(s) using the matrix form:

$V=R+\gamma PV$

where $V$ and $R$ are column vectors with one entry per state and $P$ is the state transition matrix. Rearranging gives $(I-\gamma P)V=R$.

  1. Analytic solution for the value of an MRP: $V=(I-\gamma P)^{-1}R$
    1. Matrix inversion has complexity $O(N^{3})$ for N states
    2. Only practical for small MRPs
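To make the analytic solution concrete, the sketch below solves $(I-\gamma P)V=R$ for a tiny MRP in plain Python. Two things are assumptions for illustration: the two-state MRP itself, and the use of a hand-written Gauss-Jordan elimination (`solve_linear_system`) in place of an explicit matrix inverse. In practice a linear algebra library would normally do this step.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [A | b] with float entries.
    M = [[float(x) for x in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("I - gamma * P is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(n):
            if row != col:
                factor = M[row][col] / M[col][col]
                for k in range(col, n + 1):
                    M[row][k] -= factor * M[col][k]
    return [M[i][n] / M[i][i] for i in range(n)]


def mrp_value_analytic(P, R, gamma):
    """Return V solving (I - gamma * P) V = R."""
    n = len(R)
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear_system(A, R)


# Hypothetical two-state MRP: s1 pays reward 1 and moves to the absorbing,
# zero-reward state s2 with probability 0.5.
P = [[0.5, 0.5], [0.0, 1.0]]
R = [1.0, 0.0]
V = mrp_value_analytic(P, R, gamma=0.5)
print(V)
```

For this example the Bellman equation gives $V(s_{1})=1+0.5\times 0.5\times V(s_{1})$, so $V(s_{1})=4/3$ and $V(s_{2})=0$, which is what the solver returns.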

Iterative Algorithm for Computing Value of a MRP

  1. Iterative methods for large MRPs:
    1. Dynamic Programming
    2. Monte-Carlo evaluation
    3. Temporal-Difference learning

Monte Carlo Algorithm for Computing Value of a MRP

Algorithm 1 Monte Carlo simulation to calculate MRP value function

  1. $i\leftarrow 0,\ G_{t}\leftarrow 0$

  2. while $i\neq N$ do

  3.   generate an episode, starting from state s and time t

  4.   using the generated episode, calculate the return $g=\sum_{j=t}^{H-1}\gamma^{j-t}r_{j}$

  5.   $G_{t}\leftarrow G_{t}+g,\ i\leftarrow i+1$

  6. end while

  7. $V_{t}(s)\leftarrow G_{t}/N$

  8. For example: to calculate $V(s_{4})$ we can generate many trajectories and then take the average of the returns:

    1. return for $s_{4},s_{5},s_{6},s_{7}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 10=1.25$
    2. return for $s_{4},s_{3},s_{2},s_{1}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 5=0.625$
    3. return for $s_{4},s_{5},s_{6},s_{6}$: $0+\frac{1}{2}\times 0+\frac{1}{4}\times 0+\frac{1}{8}\times 0=0$
    4. more trajectories
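The Monte Carlo procedure can be sketched in Python as follows. The reward vector and γ = 1/2 come from the example. The transition model in `step` (0.4 left, 0.2 stay, 0.4 right, clipped at the ends) is an assumption, since the chain from the figure is not reproduced here, and all function names are illustrative.

```python
import random

REWARDS = [5, 0, 0, 0, 0, 0, 10]  # R for s1..s7 (index 0 is s1)
GAMMA = 0.5


def step(state, rng):
    """Assumed random walk: 0.4 left, 0.2 stay, 0.4 right, clipped at the ends."""
    move = rng.choices([-1, 0, 1], weights=[0.4, 0.2, 0.4])[0]
    return min(max(state + move, 0), len(REWARDS) - 1)


def episode_return(start, horizon, rng):
    """Generate one episode of `horizon` states and return its discounted return."""
    g, state = 0.0, start
    for k in range(horizon):
        g += GAMMA ** k * REWARDS[state]
        state = step(state, rng)
    return g


def mc_value(start, horizon=4, num_episodes=10000, seed=0):
    """Average the returns of `num_episodes` sampled episodes."""
    rng = random.Random(seed)
    total = sum(episode_return(start, horizon, rng) for _ in range(num_episodes))
    return total / num_episodes


print(mc_value(start=3))  # Monte Carlo estimate of V(s4) for 4-step episodes
```

Under the assumed chain, a 4-step episode from $s_{4}$ reaches $s_{7}$ or $s_{1}$ only by moving in the same direction three times, so the exact value is $0.4^{3}\times 1.25+0.4^{3}\times 0.625=0.12$. The estimate should approach this as the number of episodes grows.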

Iterative Algorithm for Computing Value of a MRP

Algorithm 1 Iterative Algorithm to calculate MRP value function

  1. for all states $s\in S$, $V'(s)\leftarrow 0$, $V(s)\leftarrow\infty$
  2. while $\|V-V'\|>\epsilon$ do
  3.   $V\leftarrow V'$
  4.   for all states $s\in S$, $V'(s)=R(s)+\gamma\sum_{s'\in S}P(s'|s)V(s')$
  5. end while
  6. return $V'(s)$ for all $s\in S$
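The iterative algorithm translates almost line by line into Python. In this sketch the max norm is used for $\|V-V'\|$, which is one possible choice, and the two-state MRP at the bottom is a hypothetical example chosen so the result can be checked by hand.

```python
def mrp_value_iterative(P, R, gamma, eps=1e-10):
    """Repeat the Bellman backup V'(s) = R(s) + gamma * sum_s' P(s'|s) V(s')."""
    n = len(R)
    v_new = [0.0] * n             # V' in the pseudocode
    v_old = [float("inf")] * n    # V in the pseudocode
    while max(abs(a - b) for a, b in zip(v_old, v_new)) > eps:
        v_old = v_new
        v_new = [
            R[s] + gamma * sum(P[s][s2] * v_old[s2] for s2 in range(n))
            for s in range(n)
        ]
    return v_new


# Hypothetical two-state MRP: s1 pays reward 1 and moves to the absorbing,
# zero-reward state s2 with probability 0.5.
P = [[0.5, 0.5], [0.0, 1.0]]
R = [1.0, 0.0]
print(mrp_value_iterative(P, R, gamma=0.5))
```

For this example the fixed point is $V(s_{1})=4/3$ and $V(s_{2})=0$, and the iteration converges to it because $\gamma<1$ makes each backup a contraction.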

Markov Decision Process (MDP)

  1. A Markov Decision Process is a Markov Reward Process with decisions.

  2. Definition of MDP

    1. $S$ is a finite set of states

    2. $A$ is a finite set of actions

    3. $P^{a}$ is the dynamics/transition model for each action

      $P(s_{t+1}=s'|s_{t}=s,a_{t}=a)$

    4. $R$ is a reward function $R(s_{t}=s,a_{t}=a)=E[r_{t}|s_{t}=s,a_{t}=a]$

    5. Discount factor $\gamma\in[0,1]$

  3. An MDP is a tuple: $(S,A,P,R,\gamma)$

Policy in MDP

  1. A policy specifies what action to take in each state

  2. Given a state, it specifies a distribution over actions

  3. Policy: $\pi(a|s)=P(a_{t}=a|s_{t}=s)$

  4. Policies are stationary (time-independent), $A_{t}\sim\pi(a|s)$ for any t > 0

  5. Given an MDP $(S,A,P,R,\gamma)$ and a policy $\pi$:

  6. The state sequence $S_{1},S_{2},\dots$ is a Markov process $(S,P^{\pi})$

  7. The state and reward sequence $S_{1},R_{1},S_{2},R_{2},\dots$ is a Markov reward process $(S,P^{\pi},R^{\pi},\gamma)$ where
    $P^{\pi}(s'|s)=\sum_{a\in A}\pi(a|s)P(s'|s,a)$

    $R^{\pi}(s)=\sum_{a\in A}\pi(a|s)R(s,a)$
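Averaging the MDP's dynamics and rewards over the policy gives the induced MRP. The sketch below does this for nested Python lists, where `P[s][a][s2]` is $P(s'|s,a)$, `R[s][a]` is $R(s,a)$ and `pi[s][a]` is $\pi(a|s)$. This data layout, the function name `induced_mrp` and the small two-state, two-action MDP are assumptions made for illustration.

```python
def induced_mrp(P, R, pi):
    """Collapse an MDP (P, R) and a policy pi into the MRP (P_pi, R_pi)."""
    n_states, n_actions = len(P), len(P[0])
    P_pi = [
        [sum(pi[s][a] * P[s][a][s2] for a in range(n_actions))
         for s2 in range(n_states)]
        for s in range(n_states)
    ]
    R_pi = [sum(pi[s][a] * R[s][a] for a in range(n_actions))
            for s in range(n_states)]
    return P_pi, R_pi


# Hypothetical MDP: in state 0, action 0 stays and action 1 moves to state 1;
# state 1 is absorbing under both actions.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
R = [[0.0, 1.0], [2.0, 2.0]]
pi = [[0.5, 0.5], [1.0, 0.0]]

P_pi, R_pi = induced_mrp(P, R, pi)
print(P_pi)  # [[0.5, 0.5], [0.0, 1.0]]
print(R_pi)  # [0.5, 2.0]
```

Once `P_pi` and `R_pi` are available, any MRP evaluation method can be applied to evaluate the policy.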

Comparison of MP/MRP and MDP


Value function for MDP

  1. The state-value function $v^{\pi}(s)$ of an MDP is the expected return starting from state s and following policy $\pi$
    $v^{\pi}(s)=E_{\pi}[G_{t}|s_{t}=s]$

  2. The action-value function $q^{\pi}(s,a)$ is the expected return starting from state s, taking action a, and then following policy $\pi$
    $q^{\pi}(s,a)=E_{\pi}[G_{t}|s_{t}=s,A_{t}=a]$

  3. We have the relation between $v^{\pi}(s)$ and $q^{\pi}(s,a)$
    $v^{\pi}(s)=\sum_{a\in A}\pi(a|s)q^{\pi}(s,a)$

Bellman Expectation Equation

  1. The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state,
    $v^{\pi}(s)=E_{\pi}[R_{t+1}+\gamma v^{\pi}(s_{t+1})|s_{t}=s]$

  2. The action-value function can similarly be decomposed
    $q^{\pi}(s,a)=E_{\pi}[R_{t+1}+\gamma q^{\pi}(s_{t+1},A_{t+1})|s_{t}=s,A_{t}=a]$

Bellman Expectation Equation for $V^{\pi}$ and $Q^{\pi}$

$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)q^{\pi}(s,a)$

$q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')$

Thus
$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$

$q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)\sum_{a'\in A}\pi(a'|s')q^{\pi}(s',a')$

Backup Diagram for $V^{\pi}$

$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$

Backup Diagram for $Q^{\pi}$

$q^{\pi}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)\sum_{a'\in A}\pi(a'|s')q^{\pi}(s',a')$

Policy Evaluation

  1. Evaluate the value of each state under a given policy $\pi$: compute $v^{\pi}(s)$
  2. Also called (value) prediction

Example: Navigate the boat


Example: Policy Evaluation

  1. Two actions: Left and Right

  2. For all actions, reward: +5 in $s_{1}$, +10 in $s_{7}$, 0 in all other states, so we can represent the reward as the vector R = [5, 0, 0, 0, 0, 0, 10]

  3. Take the deterministic policy $\pi(s)=$ Left with $\gamma=0$ for any state s. What is the value of the policy?

    1. $v^{\pi}=[5,0,0,0,0,0,10]$
  4. Iteration: $v_{k}^{\pi}(s)=r(s,\pi(s))+\gamma\sum_{s'\in S}P(s'|s,\pi(s))v_{k-1}^{\pi}(s')$

  5. Practice 1: Deterministic policy $\pi(s)=$ Left and $\gamma=0.5$ for any state s. What are the state values under the policy?

  6. Practice 2: Stochastic policy with $P(\pi(s)=\text{Left})=0.5$ and $P(\pi(s)=\text{Right})=0.5$, and $\gamma=0.5$ for any state s. What are the state values under the policy?
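The iteration can be tried out on the seven-state example with a short Python sketch. The transition dynamics come from a figure that is not reproduced here, so the code assumes that Left and Right move deterministically one state along the line and leave the state unchanged at the two ends. The names `next_state` and `evaluate_policy` are illustrative, and the resulting numbers depend on that assumption.

```python
REWARDS = [5, 0, 0, 0, 0, 0, 10]  # R(s, a) is the same for both actions


def next_state(state, action):
    """Assumed deterministic dynamics: -1 is Left, +1 is Right, clipped at the ends."""
    return min(max(state + action, 0), len(REWARDS) - 1)


def evaluate_policy(policy, gamma, sweeps=200):
    """policy[s] maps each action (-1 or +1) to its probability in state s."""
    v = [0.0] * len(REWARDS)
    for _ in range(sweeps):
        # Synchronous backup: every new value is computed from the old v.
        v = [
            REWARDS[s]
            + gamma * sum(p * v[next_state(s, a)] for a, p in policy[s].items())
            for s in range(len(REWARDS))
        ]
    return v


always_left = [{-1: 1.0} for _ in REWARDS]
uniform = [{-1: 0.5, +1: 0.5} for _ in REWARDS]

print(evaluate_policy(always_left, gamma=0.0))  # equals R when gamma = 0
print(evaluate_policy(always_left, gamma=0.5))
print(evaluate_policy(uniform, gamma=0.5))
```

With γ = 0 the dynamics do not matter and the values equal the reward vector, which matches the answer given in the example. For γ = 0.5 the printed values are only valid under the assumed dynamics.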

2.2 Markov Decision Processes (Part 2)

Decision Making in Markov Decision Process (MDP)

  1. Prediction (evaluate a given policy):
    1. Input: MDP $\langle S,A,P,R,\gamma\rangle$ and policy $\pi$, or MRP $\langle S,P^{\pi},R^{\pi},\gamma\rangle$
    2. Output: value function $v^{\pi}$
  2. Control (search for the optimal policy):
    1. Input: MDP $\langle S,A,P,R,\gamma\rangle$
    2. Output: optimal value function $v^{*}$ and optimal policy $\pi^{*}$
  3. Prediction and control in an MDP can be solved by dynamic programming.

Dynamic programming

Dynamic programming is a very general solution method for problems which have two properties:

  1. Optimal substructure
    1. Principle of optimality applies
    2. Optimal solution can be decomposed into subproblems
  2. Overlapping subproblems
    1. Subproblems recur many times
    2. Solutions can be cached and reused

Markov decision processes satisfy both properties

  1. Bellman equation gives recursive decomposition
  2. Value function stores and reuses solutions

Policy evaluation on MDP

  1. Objective: Evaluate a given policy $\pi$ for an MDP

  2. Output: the value function $v^{\pi}$ under the policy

  3. Solution: iteration on the Bellman expectation backup

  4. Algorithm: Synchronous backup

    1. At each iteration t+1

      update $v_{t+1}(s)$ from $v_{t}(s')$ for all states $s\in S$, where s' is a successor state of s
      $v_{t+1}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{t}(s')\right)$

  5. Convergence: $v_{1}\rightarrow v_{2}\rightarrow\dots\rightarrow v^{\pi}$

Policy evaluation: Iteration on Bellman expectation backup

Bellman expectation backup for a particular policy
$v_{t+1}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{t}(s')\right)$
Or, in the form of the MRP $\langle S,P^{\pi},R^{\pi},\gamma\rangle$,
$v_{t+1}(s)=R^{\pi}(s)+\gamma\sum_{s'\in S}P^{\pi}(s'|s)v_{t}(s')$

Evaluating a Random Policy in the Small Gridworld

Example 4.1 in the Sutton RL textbook

  1. Undiscounted episodic MDP ($\gamma=1$)
  2. Nonterminal states 1, …, 14
  3. Two terminal states (two shaded squares)
  4. An action leading out of the grid leaves the state unchanged, e.g., $P(7|7,\text{right})=1$
  5. Reward is -1 until the terminal state is reached
  6. Transitions are deterministic given the action, e.g., $P(6|5,\text{right})=1$
  7. Uniform random policy $\pi(l|\cdot)=\pi(r|\cdot)=\pi(u|\cdot)=\pi(d|\cdot)=0.25$
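The small gridworld is simple enough to evaluate in a few lines of Python. The sketch below applies synchronous Bellman expectation backups to the uniform random policy. It assumes the 4x4 layout of Example 4.1 with cells numbered row by row from 0 to 15 and the two terminal squares in the top-left and bottom-right corners. The stopping threshold `theta` and all names are illustrative choices.

```python
GAMMA = 1.0
SIZE = 4
TERMINALS = {0, SIZE * SIZE - 1}  # the two shaded corner squares
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def move(state, action):
    """Deterministic transition; a move off the grid leaves the state unchanged."""
    row, col = divmod(state, SIZE)
    new_row, new_col = row + action[0], col + action[1]
    if 0 <= new_row < SIZE and 0 <= new_col < SIZE:
        return new_row * SIZE + new_col
    return state


def evaluate_random_policy(theta=1e-10):
    """Repeat synchronous backups until the largest change is below theta."""
    v = [0.0] * (SIZE * SIZE)
    while True:
        new_v = list(v)
        for s in range(SIZE * SIZE):
            if s in TERMINALS:
                continue  # terminal values stay at 0
            new_v[s] = sum(0.25 * (-1.0 + GAMMA * v[move(s, a)]) for a in ACTIONS)
        delta = max(abs(a - b) for a, b in zip(v, new_v))
        v = new_v
        if delta < theta:
            return v


values = evaluate_random_policy()
for r in range(SIZE):
    print(["%6.1f" % values[r * SIZE + c] for c in range(SIZE)])
```

The first row should come out close to 0, -14, -20, -22, which agrees with the converged values shown for this example in the textbook.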

A live demo on policy evaluation

$v^{\pi}(s)=\sum_{a\in A}\pi(a|s)\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi}(s')\right)$

  1. https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html

Optimal Value Function

  1. The optimal state-value function $v^{*}(s)$ is the maximum value function over all policies
    $v^{*}(s)=\max_{\pi}v^{\pi}(s)$

  2. The optimal policy
    $\pi^{*}(s)=\arg\max_{\pi}v^{\pi}(s)$

  3. An MDP is “solved” when we know the optimal value

  4. There exists a unique optimal value function, but there may be multiple optimal policies (e.g., when two actions have the same optimal value)

Finding Optimal Policy

  1. An optimal policy can be found by maximizing over $q^{*}(s,a)$,
    $\pi^{*}(a|s)=\begin{cases} 1, & \text{if } a=\arg\max_{a\in A}q^{*}(s,a) \\ 0, & \text{otherwise} \end{cases}$

  2. There is always a deterministic optimal policy for any MDP

  3. If we know $q^{*}(s,a)$, we immediately have the optimal policy

Policy Search

  1. One option is to enumerate all policies and search for the best one
  2. The number of deterministic policies is $|A|^{|S|}$
  3. Other approaches such as policy iteration and value iteration are more efficient

MDP Control

  1. Compute the optimal policy
    $\pi^{*}(s)=\arg\max_{\pi}v^{\pi}(s)$

  2. The optimal policy for an MDP in an infinite horizon problem (agent acts forever) is

    1. Deterministic
    2. Stationary (does not depend on time step)
    3. Unique? Not necessarily, may have state-actions with identical optimal values

Improving a Policy through Policy Iteration

  1. Iterate through the two steps:

    1. Evaluate the policy $\pi$ (compute $v^{\pi}$ for the current $\pi$)

    2. Improve the policy by acting greedily with respect to $v^{\pi}$
      $\pi'=\text{greedy}(v^{\pi})$

Policy Improvement

  1. Compute the state-action value of a policy $\pi_{i}$:
    $q^{\pi_{i}}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{\pi_{i}}(s')$

  2. Compute the new policy $\pi_{i+1}$ for all $s\in S$ following
    $\pi_{i+1}(s)=\arg\max_{a}q^{\pi_{i}}(s,a)$
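Alternating evaluation and greedy improvement can be sketched in Python on the seven-state line world with rewards R = [5, 0, 0, 0, 0, 0, 10]. As before, the dynamics are an assumption (Left and Right move one state deterministically and are clipped at the ends), γ = 0.5 is chosen for illustration, and the function names are hypothetical.

```python
REWARDS = [5, 0, 0, 0, 0, 0, 10]
ACTIONS = (-1, +1)  # Left, Right
GAMMA = 0.5
N = len(REWARDS)


def next_state(s, a):
    """Assumed deterministic dynamics, clipped at both ends of the line."""
    return min(max(s + a, 0), N - 1)


def q_value(s, a, v):
    """q(s, a) = R(s, a) + gamma * v(s') for the single successor s'."""
    return REWARDS[s] + GAMMA * v[next_state(s, a)]


def evaluate(policy, sweeps=200):
    """Iterative policy evaluation for a deterministic policy."""
    v = [0.0] * N
    for _ in range(sweeps):
        v = [q_value(s, policy[s], v) for s in range(N)]
    return v


def policy_iteration():
    policy = [-1] * N  # start from "always Left"
    while True:
        v = evaluate(policy)
        # Greedy improvement: pick the action with the largest q in each state.
        improved = [max(ACTIONS, key=lambda a: q_value(s, a, v)) for s in range(N)]
        if improved == policy:
            return policy, v
        policy = improved


policy, v = policy_iteration()
print(policy)
print([round(x, 4) for x in v])
```

Under these assumptions the loop stops after a handful of improvement steps with the policy Left in $s_{1}$ to $s_{3}$ and Right in $s_{4}$ to $s_{7}$, and values [10, 5, 2.5, 2.5, 5, 10, 20].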


Monotonic Improvement in Policy

  1. Consider a deterministic policy $a=\pi(s)$

  2. We improve the policy through
    $\pi'(s)=\arg\max_{a}q^{\pi}(s,a)$

  3. This improves the value from any state s over one step,
    $q^{\pi}(s,\pi'(s))=\max_{a\in A}q^{\pi}(s,a)\geq q^{\pi}(s,\pi(s))=v^{\pi}(s)$

  4. It therefore improves the value function, $v^{\pi'}(s)\geq v^{\pi}(s)$

    $v^{\pi}(s)\leq q^{\pi}(s,\pi'(s))=E_{\pi'}[R_{t+1}+\gamma v^{\pi}(S_{t+1})|S_{t}=s]$

    $\leq E_{\pi'}[R_{t+1}+\gamma q^{\pi}(S_{t+1},\pi'(S_{t+1}))|S_{t}=s]$

    $\leq E_{\pi'}[R_{t+1}+\gamma R_{t+2}+\gamma^{2}q^{\pi}(S_{t+2},\pi'(S_{t+2}))|S_{t}=s]$

    $\leq E_{\pi'}[R_{t+1}+\gamma R_{t+2}+\dots|S_{t}=s]=v^{\pi'}(s)$

  5. If improvements stop,
    $q^{\pi}(s,\pi'(s))=\max_{a\in A}q^{\pi}(s,a)=q^{\pi}(s,\pi(s))=v^{\pi}(s)$

  6. Thus the Bellman optimality equation has been satisfied

    $v^{\pi}(s)=\max_{a\in A}q^{\pi}(s,a)$

  7. Therefore $v^{\pi}(s)=v^{*}(s)$ for all $s\in S$, so $\pi$ is an optimal policy

Bellman Optimality Equation

  1. The optimal value functions are reached by the Bellman optimality equations:
    $v^{*}(s)=\max_{a}q^{*}(s,a)$

    $q^{*}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')$

    thus
    $v^{*}(s)=\max_{a}\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')\right)$

    $q^{*}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)\max_{a'}q^{*}(s',a')$

Value Iteration by turning the Bellman Optimality Equation into an update rule

  1. Suppose we know the solution to the subproblem $v^{*}(s')$, which is optimal.

  2. Then the optimal $v^{*}(s)$ can be found by iterating the following Bellman optimality backup rule,
    $v(s)\leftarrow\max_{a\in A}\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v^{*}(s')\right)$

  3. The idea of value iteration is to apply these updates iteratively

Algorithm of Value Iteration

  1. Objective: find the optimal policy $\pi$

  2. Solution: iteration on the Bellman optimality backup

  3. Value Iteration algorithm:

    1. Initialize $k=0$ and $v_{0}(s)=0$ for all states s

    2. For $k=0:H-1$

      1. for each state s
        $q_{k+1}(s,a)=R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{k}(s')$

        $v_{k+1}(s)=\max_{a}q_{k+1}(s,a)$

      2. $k\leftarrow k+1$

    3. To retrieve the optimal policy after value iteration:
      $\pi(s)=\arg\max_{a}\left(R(s,a)+\gamma\sum_{s'\in S}P(s'|s,a)v_{H}(s')\right)$
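A compact Python sketch of value iteration followed by policy extraction is given below, again on the seven-state line world with R = [5, 0, 0, 0, 0, 0, 10]. The deterministic Left/Right dynamics, γ = 0.5, the horizon of 100 sweeps and the function names are assumptions made for illustration.

```python
REWARDS = [5, 0, 0, 0, 0, 0, 10]
ACTIONS = (-1, +1)  # Left, Right
GAMMA = 0.5
N = len(REWARDS)


def next_state(s, a):
    """Assumed deterministic dynamics, clipped at both ends of the line."""
    return min(max(s + a, 0), N - 1)


def value_iteration(horizon=100):
    """Apply the Bellman optimality backup `horizon` times, starting from v_0 = 0."""
    v = [0.0] * N
    for _ in range(horizon):
        q = [[REWARDS[s] + GAMMA * v[next_state(s, a)] for a in ACTIONS]
             for s in range(N)]
        v = [max(q_s) for q_s in q]
    return v


def extract_policy(v):
    """One greedy step with respect to the final value function."""
    return [
        max(ACTIONS, key=lambda a: REWARDS[s] + GAMMA * v[next_state(s, a)])
        for s in range(N)
    ]


v_star = value_iteration()
print([round(x, 4) for x in v_star])
print(extract_policy(v_star))
```

Unlike policy iteration, no explicit policy is maintained during the sweeps. The policy is read off once at the end, and under the stated assumptions it is Left in $s_{1}$ to $s_{3}$ and Right in $s_{4}$ to $s_{7}$.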

Example: Shortest Path


After the optimal values are reached, we run policy extraction to retrieve the optimal policy.

Demo of policy iteration and value Iteration

  1. Policy iteration: iteration of policy evaluation and policy improvement (update)

  2. Value iteration

  3. https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html

Policy iteration and value iteration on FrozenLake

  1. https://github.com/cuhkrlcourse/RLexample/tree/master/MDP

Difference between Policy Iteration and Value Iteration

  1. Policy iteration includes policy evaluation + policy improvement, and the two are repeated iteratively until the policy converges.

  2. Value iteration includes finding the optimal value function + one policy extraction. The two are not repeated, because once the value function is optimal, the policy extracted from it is also optimal (i.e. converged).

  3. Finding the optimal value function can also be seen as a combination of policy improvement (due to the max) and truncated policy evaluation (the reassignment of v(s) after just one sweep of all states, regardless of convergence).

Summary for Prediction and Control in MDP


End

  1. Optional Homework 1 is available at https://github.com/cuhkrlcourse/ierg6130-assignment
