Reinforcement Learning with Code 【Chapter 10. Actor Critic】


This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundation of Reinforcement Learning.
The code refers to Mofan's reinforcement learning course.

Contents

  • Reinforcement Learning with Code 【Chapter 10. Actor Critic】
      • 10.1 The simplest actor-critic algorithm (QAC)
      • 10.2 Advantage Actor-Critic (A2C)
      • 10.3 Off-policy Actor-Critic
    • Reference

10.1 The simplest actor-critic algorithm (QAC)

Recall that the idea of the policy gradient method is to search for an optimal policy by maximizing a scalar metric $J(\theta)$. The metric has three options: the average state value $\mathbb{E}[v_\pi(S)]$, the average one-step reward $\mathbb{E}[r_\pi(S)]$, or the state value of a specific starting state $s_0$.

According to the policy gradient theorem in Chapter 9, we know that

$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha \nabla_\theta J(\theta_t) \\
& = \theta_t + \alpha\, \mathbb{E}_{S\sim\eta,\, A\sim\pi} \big[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\big]
\end{aligned}
$$

where $\eta$ is a distribution over the states. Since the true gradient is unknown, we can use a stochastic gradient to approximate it; hence we have

$$
\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t)
$$

  • In policy gradient methods such as REINFORCE, the idea of Monte-Carlo estimation is used to approximate the true value $q_t(s_t,a_t)$: it is approximated by the episode return, $q_t(s_t,a_t)=\sum_{k=t+1}^T \gamma^{k-t-1}r_k$ (a short sketch of this estimate follows this list).
  • If $q_t(s_t,a_t)$ is estimated by value function approximation, and the value function is updated using the idea of TD learning, the corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be seen as a kind of policy gradient method.
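
To make the Monte-Carlo estimate above concrete, here is a small sketch that computes the discounted return for every step of one episode; the reward list and the discount factor are made-up illustrative values.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.9):
    """Monte-Carlo estimate used by REINFORCE:
    q_t(s_t, a_t) ~ sum_{k=t+1}^{T} gamma^{k-t-1} r_k,
    where rewards[t] stores r_{t+1}, the reward received after action a_t."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # backward recursion: G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# toy episode with five rewards (illustrative numbers only)
print(discounted_returns([0.0, 0.0, 1.0, 0.0, 2.0], gamma=0.9))
```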

When we use a parameterized value function $q(s,a;w)$ to approximate $q_t(s_t,a_t)$, and the value function is updated with the Sarsa idea of TD learning, the algorithm is called Q actor-critic (QAC). The core idea of QAC is

$$
\text{QAC:} \left\{
\begin{aligned}
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\, q(s_t,a_t;w_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w \big[r_{t+1}+\gamma q(s_{t+1},a_{t+1};w_t) - q(s_t,a_t;w_t)\big]\nabla_w q(s_t,a_t;w_t)
\end{aligned}
\right.
$$

That is, we use value function approximation to approximate the true q-value $q_t(s_t,a_t)$, and we use the idea of Sarsa to update the value function.

We can also write the objective functions behind the update rules:

$$
\text{QAC:} \left\{
\begin{aligned}
\text{Actor}: \max_\theta J(\theta) & = \mathbb{E}_{S\sim\eta,\, A\sim\pi}\big[\ln\pi(A|S;\theta)\, q(S,A;w_t)\big] \\
\text{Critic}: \min_w J(w) & = \mathbb{E}_{S\sim\eta,\, A\sim\pi}\big[\big(R + \gamma q(S^\prime,A^\prime;w) - q(S,A;w)\big)^2\big]
\end{aligned}
\right.
$$

Pseudocode

(Figure: QAC pseudocode)
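
To make the two update rules concrete, below is a minimal QAC sketch with tabular (one-hot linear) parameterizations on a made-up two-state, two-action MDP. The transition matrix, rewards, step sizes, and iteration count are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

rng = np.random.default_rng(0)

# a made-up 2-state, 2-action MDP (illustrative only)
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[s, a, s']
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.0, 1.0]])

theta = np.zeros((n_states, n_actions))   # actor: softmax policy pi(a|s; theta)
w = np.zeros((n_states, n_actions))       # critic: action value q(s, a; w)
alpha_theta, alpha_w = 0.01, 0.05

def policy(s):
    prefs = theta[s] - theta[s].max()     # numerically stable softmax
    p = np.exp(prefs)
    return p / p.sum()

s = 0
a = rng.choice(n_actions, p=policy(s))
for t in range(20000):
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    a_next = rng.choice(n_actions, p=policy(s_next))   # Sarsa: a_{t+1} ~ pi(.|s_{t+1})

    q_sa = w[s, a]                                     # q(s_t, a_t; w_t) used by the actor
    # Critic: Sarsa-style TD update of q
    w[s, a] += alpha_w * (r + gamma * w[s_next, a_next] - q_sa)
    # Actor: policy-gradient step weighted by q(s_t, a_t; w_t)
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0                              # grad of ln pi(a_t|s_t) w.r.t. theta[s_t, :]
    theta[s] += alpha_theta * q_sa * grad_log_pi

    s, a = s_next, a_next

print("learned policy per state:", np.round([policy(s) for s in range(n_states)], 3))
```

With these made-up rewards, the learned policy should drift toward action 0 in state 0 and action 1 in state 1, the choices that keep collecting reward 1.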

10.2 Advantage Actor-Critic (A2C)

The core idea of A2C is to introduce a baseline to reduce the estimation variance. That is,
$$
\mathbb{E}_{S\sim\eta,A\sim\pi}\big[\nabla_\theta \ln \pi(A|S;\theta_t)\,q_\pi(S,A)\big] = \mathbb{E}_{S\sim\eta,A\sim\pi}\big[\nabla_\theta \ln \pi(A|S;\theta_t)\big(q_\pi(S,A)-b(S)\big)\big]
$$
where the additional baseline $b(S)$ is a scalar function of $S$. Adding a baseline does not affect the expectation of the gradient, since
$$
\begin{aligned}
\mathbb{E}_{S\sim\eta,A\sim\pi}\big[\nabla_\theta \ln \pi(A|S;\theta_t)\,b(S)\big]
& = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t)\, \nabla_\theta \ln \pi(a|s;\theta_t)\,b(s) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t)\,b(s) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\,b(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\,b(s)\,\nabla_\theta\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\,b(s)\,\nabla_\theta 1 = 0
\end{aligned}
$$
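
A quick numerical sanity check of this invariance, using an assumed three-action softmax policy in a single state with made-up action values: the sample means with and without the baseline $b(s)=v_\pi(s)$ agree, while the variance with the baseline is typically much smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.1, 0.4])            # softmax preferences for 3 actions in one state
pi = np.exp(theta) / np.exp(theta).sum()      # pi(a|s; theta)
q = np.array([1.0, 3.0, 2.0])                 # made-up true action values q_pi(s, a)
v = (pi * q).sum()                            # baseline b(s) = v_pi(s)

def grad_log_pi(a):
    g = -pi.copy()
    g[a] += 1.0                               # grad of ln pi(a|s; theta) w.r.t. theta
    return g

plain, with_baseline = [], []
for _ in range(100_000):
    a = rng.choice(3, p=pi)
    plain.append(grad_log_pi(a) * q[a])
    with_baseline.append(grad_log_pi(a) * (q[a] - v))

plain, with_baseline = np.array(plain), np.array(with_baseline)
print("mean without baseline:", plain.mean(axis=0))          # the two means agree
print("mean with baseline:   ", with_baseline.mean(axis=0))
print("total variance without baseline:", plain.var(axis=0).sum())
print("total variance with baseline:   ", with_baseline.var(axis=0).sum())
```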

How do we find the optimal baseline? The derivation is omitted here; the optimal baseline is

$$
b^*(s) = \frac{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2\, q_\pi(s,A)\big]}{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2\big]}
$$

However, it is too complex to use in practice. If the weight $\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2$ is removed, we obtain a suboptimal baseline with a concise expression:

$$
b^\dagger (s) = \mathbb{E}_{A\sim\pi}[q_\pi(s,A)] = v_\pi(s)
$$

The suboptimal baseline is the state value of state $s$.

When $b(s)=v_\pi(s)$, the gradient-ascent update becomes
$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha\,\mathbb{E}\big[\nabla_\theta \ln\pi(A|S;\theta_t)\big(q_\pi(S,A)-v_\pi(S)\big)\big] \\
& = \theta_t + \alpha\,\mathbb{E}\big[\nabla_\theta\ln\pi(A|S;\theta_t)\, \delta_\pi(S,A)\big]
\end{aligned}
$$
Here,
$$
\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)
$$
is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)q_\pi(s,a)$ is the mean of the action values. If $\delta_\pi(s,a)>0$, the corresponding action has a value greater than the mean.

The stochastic version is
$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha\nabla_\theta \ln\pi(a_t|s_t;\theta_t)\big[q_t(s_t,a_t)-v_t(s_t)\big] \\
& = \theta_t + \alpha\nabla_\theta\ln\pi(a_t|s_t;\theta_t)\, \delta_t(s_t,a_t)
\end{aligned}
$$
We need to estimate $q_t(s_t,a_t)$ and $v_t(s_t)$, and there are several ways to do so:

  • If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by Monte-Carlo learning, the algorithm is called REINFORCE with a baseline.
  • If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C). In that case,

$$
\begin{aligned}
q_t(s_t,a_t) - v_t(s_t) & = r_{t+1} +\gamma q_t(s_{t+1},a_{t+1}) - v_t(s_t) \\
& \approx r_{t+1} +\gamma v_t(s_{t+1}) - v_t(s_t)
\end{aligned}
$$

Hence, we do not need to maintain two networks to represent $v_\pi(s)$ and $q_\pi(s,a)$; a single network representing $v_\pi(s)$ is enough.

In A2C we use one policy network $\pi(a|s;\theta)$ and one state-value network $v(s;w)$. The core idea of A2C is

$$
\text{A2C}: \left\{
\begin{aligned}
\text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta\, \delta_t\,\nabla_\theta \ln\pi(a_t|s_t;\theta_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w\, \delta_t\, \nabla_w v(s_t;w_t)
\end{aligned}
\right.
$$

We can also write the objective functions behind the update rules:

$$
\text{A2C}: \left\{
\begin{aligned}
\text{Advantage}: \Delta(S) & = R+\gamma v(S^\prime;w) - v(S;w) \\
\text{Actor}: \max_\theta J(\theta) & = \mathbb{E}_{S\sim\eta,\, A\sim\pi}\big[\ln\pi(A|S;\theta)\,\Delta(S)\big] \\
\text{Critic}: \min_w J(w) & = \mathbb{E}_{S\sim\eta}\big[\big(R + \gamma v(S^\prime;w) - v(S;w)\big)^2\big] = \mathbb{E}_{S\sim\eta}\big[\Delta(S)^2\big]
\end{aligned}
\right.
$$

Pseudocode

(Figure: A2C pseudocode)
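
Below is a minimal A2C sketch in the same spirit as the QAC example: a tabular softmax actor and a tabular state-value critic, both driven by the TD error $\delta_t$. The two-state MDP, step sizes, and iteration count are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# the same made-up 2-state, 2-action MDP as in the QAC sketch
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[s, a, s']
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.0, 1.0]])

theta = np.zeros((n_states, n_actions))   # actor: softmax policy pi(a|s; theta)
w = np.zeros(n_states)                    # critic: state value v(s; w)
alpha_theta, alpha_w = 0.01, 0.05

def policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

s = 0
for t in range(20000):
    a = rng.choice(n_actions, p=policy(s))
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # Advantage estimated by the TD error: delta_t = r_{t+1} + gamma*v(s_{t+1}) - v(s_t)
    delta = r + gamma * w[s_next] - w[s]

    # Critic: move v(s_t; w) toward the TD target
    w[s] += alpha_w * delta

    # Actor: policy-gradient step weighted by the advantage estimate
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log_pi

    s = s_next

print("learned policy per state:", np.round([policy(s) for s in range(n_states)], 3))
```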

10.3 Off-policy Actor-Critic

Importance Sampling

The key technique for converting the AC algorithm to an off-policy one is importance sampling. Consider a random variable $X\in\mathcal{X}$, and suppose that $p_0(X)$ is a probability distribution. Our goal is to estimate $\mathbb{E}_{X\sim p_0}[X]$. Suppose $p_1(X)$ is another probability distribution of $X$. How can we use data sampled from $p_1(X)$ to estimate $\mathbb{E}_{X\sim p_0}[X]$? The technique is importance sampling. Suppose we have some i.i.d. samples $\{x_i\}^n_{i=1}$ generated by the distribution $p_1(X)$. Then
$$
\mathbb{E}_{X\sim p_0}[X] = \sum_{x\in\mathcal{X}}p_0(x)\,x = \sum_{x\in\mathcal{X}}p_1(x)\underbrace{\frac{p_0(x)}{p_1(x)}x}_{f(x)} = \mathbb{E}_{X\sim p_1}[f(X)]
$$
$$
\mathbb{E}_{X\sim p_0}[X] = \mathbb{E}_{X\sim p_1}[f(X)] \approx \bar{f} = \frac{1}{n} \sum^n_{i=1}f(x_i) = \frac{1}{n} \sum^n_{i=1} \underbrace{\frac{p_0(x_i)}{p_1(x_i)}}_{\text{importance weight}}x_i
$$
An Example

Consider $X\in\mathcal{X}=\{+1,-1\}$. Suppose $p_0$ is a probability distribution satisfying
$$p_0(X=+1)=0.5, \qquad p_0(X=-1)=0.5$$
The expectation of $X$ over $p_0$ is
$$\mathbb{E}_{X\sim p_0}[X] = (+1)\times 0.5 + (-1) \times 0.5 = 0$$
Suppose $p_1$ is a probability distribution satisfying
$$p_1(X=+1)=0.8, \qquad p_1(X=-1)=0.2$$
The expectation of $X$ over $p_1$ is
$$\mathbb{E}_{X\sim p_1}[X] = (+1)\times 0.8 + (-1) \times 0.2 = 0.6$$
We can use the importance sampling technique to sample data under the distribution $p_1$ and still estimate $\mathbb{E}_{X\sim p_0}[X]$:
$$\mathbb{E}_{X\sim p_0}[X] \approx \frac{1}{n}\sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)}x_i$$
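The script below draws 300 samples from $p_1$ and tracks two running estimates: the plain sample average, which converges to $\mathbb{E}_{X\sim p_1}[X]=0.6$, and the importance-weighted average, which converges to $\mathbb{E}_{X\sim p_0}[X]=0$.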

```python
import numpy as np
import matplotlib.pyplot as plt

# reproducible
np.random.seed(0)

# the two outcomes and their probabilities under p0 and p1
elements = [1, -1]
probs1 = [0.5, 0.5]   # p0
probs2 = [0.8, 0.2]   # p1

# importance sampling: draw from p1, estimate the expectation under p0
sample_times = 300
sample_list = []        # raw samples drawn from p1
i_sample_list = []      # importance-weighted samples p0(x)/p1(x) * x
average_list = []       # running plain average   -> E_{p1}[X] = 0.6
importance_list = []    # running weighted average -> E_{p0}[X] = 0
for i in range(sample_times):
    sample = np.random.choice(elements, p=probs2)
    sample_list.append(sample)
    average_list.append(np.mean(sample_list))
    if sample == elements[0]:
        i_sample_list.append(probs1[0] / probs2[0] * sample)
    elif sample == elements[1]:
        i_sample_list.append(probs1[1] / probs2[1] * sample)
    importance_list.append(np.mean(i_sample_list))

plt.plot(range(len(sample_list)), sample_list, 'o', markerfacecolor='none', label='sample data')
plt.plot(range(len(average_list)), average_list, 'b--', label='average')
plt.plot(range(len(importance_list)), importance_list, 'g--', label='importance sampling')
plt.axhline(y=0.6, color='r', linestyle='--')
plt.axhline(y=0, color='r', linestyle='--')
plt.ylim(-1.5, 2.5)         # limit the visible y range
plt.xlim(0, sample_times)   # limit the visible x range
plt.legend(loc='upper right')
plt.show()
```

(Figure: the plain sample average converges to 0.6, while the importance-weighted average converges to 0.)

Off-policy policy gradient theorem

With importance sampling, we are ready to present the off-policy policy gradient theorem. Suppose that $\beta$ is a behavior policy. Our goal is to use the samples generated by the behavior policy $\beta$ to learn a target policy $\pi$ that maximizes the following metric:
$$
\max_\theta J(\theta) = \mathbb{E}_{S\sim d_\beta}[v_\pi(S)]
$$
Theorem 10.1 (Stochastic off-policy policy gradient theorem). In the discounted case where $\gamma\in(0,1)$, the gradient of $J(\theta)$ is
$$
\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\underbrace{\frac{\pi(A|S;\theta)}{\beta(A|S)}}_{\text{importance weight}} \nabla_\theta \ln \pi(A|S;\theta)\, q_\pi(S,A) \Big]
$$
where the state distribution $\rho$ is
$$
\rho(s) \triangleq \sum_{s^\prime\in\mathcal{S}} d_\beta(s^\prime) \Pr_\pi(s|s^\prime)
$$
and $\Pr_\pi(s|s^\prime)=\sum_{k=0}^\infty \gamma^k[P^k_\pi]_{s^\prime,s}=[(I-\gamma P_\pi)^{-1}]_{s^\prime,s}$ is the discounted total probability of transitioning from $s^\prime$ to $s$ under policy $\pi$.

The off-policy policy gradient is also invariant to an additional baseline $b(S)$. In particular, we have
$$
\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\frac{\pi(A|S;\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S;\theta) \big( q_\pi(S,A) - b(S) \big) \Big]
$$
When we take the state value as the baseline, $b(S)=v_\pi(S)$, we again obtain the advantage function
$$
\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)
$$
The corresponding stochastic gradient-ascent algorithm is
$$
\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)} \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\big(q_t(s_t,a_t)-v_t(s_t)\big)
$$
The advantage function can be replaced by the TD error, that is,
$$
q_t(s_t,a_t)-v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \triangleq \delta_t(s_t,a_t)
$$
In off-policy A2C we use the behavior policy $\beta$ to collect samples, and we learn one policy network $\pi(a|s;\theta)$ and one value network $v(s;w)$. The core idea of off-policy A2C is

$$
\text{off-policy A2C}: \left\{
\begin{aligned}
\text{Behavior policy}:\ & a_t \sim \beta(\cdot|s_t) \\
\text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)}\, \delta_t\,\nabla_\theta \ln\pi(a_t|s_t;\theta_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)}\,\delta_t\, \nabla_w v(s_t;w_t)
\end{aligned}
\right.
$$

We can also write the objective functions behind the update rules:

$$
\text{off-policy A2C}: \left\{
\begin{aligned}
\text{Behavior policy}:\ & a_t \sim \beta(\cdot|s_t) \\
\text{Advantage}: \Delta(S) & = R + \gamma v(S^\prime;w) - v(S;w) \\
\text{Actor}: \max_\theta J(\theta) & = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\frac{\pi(A|S;\theta)}{\beta(A|S)}\,\Delta(S)\, \ln\pi(A|S;\theta)\Big] \\
\text{Critic}: \min_w J(w) & = \mathbb{E}_{S\sim\rho}\big[\big(R + \gamma v(S^\prime;w) - v(S;w)\big)^2\big] = \mathbb{E}_{S\sim\rho}\big[\Delta(S)^2\big]
\end{aligned}
\right.
$$

Pseudocode

(Figure: off-policy A2C pseudocode)
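
Finally, a minimal off-policy A2C sketch on the same made-up MDP: actions are drawn from a fixed uniform behavior policy $\beta$, and both the actor and the critic updates are scaled by the importance weight $\pi(a_t|s_t;\theta)/\beta(a_t|s_t)$. As before, every environment detail and hyper-parameter is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# the same made-up 2-state, 2-action MDP as in the earlier sketches
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

theta = np.zeros((n_states, n_actions))                 # target policy pi(a|s; theta)
w = np.zeros(n_states)                                  # state value v(s; w)
alpha_theta, alpha_w = 0.01, 0.05
beta = np.full((n_states, n_actions), 1.0 / n_actions)  # fixed uniform behavior policy

def policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

s = 0
for t in range(50000):
    a = rng.choice(n_actions, p=beta[s])                # act with the behavior policy beta
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    rho = policy(s)[a] / beta[s, a]                     # importance weight pi / beta
    delta = r + gamma * w[s_next] - w[s]                # TD error as advantage estimate

    # Critic and actor updates, both scaled by the importance weight
    w[s] += alpha_w * rho * delta
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * rho * delta * grad_log_pi

    s = s_next

print("learned target policy per state:", np.round([policy(s) for s in range(n_states)], 3))
```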

Reference

Zhao Shiyu's course, Mathematical Foundation of Reinforcement Learning.
