Deep Reinforcement Learning (7): Policy Gradient


The goal of policy learning is to solve an optimization problem and learn the optimal policy function, or an approximation of it (such as a policy network).

1. The Policy Network

Suppose the action space is discrete, e.g. $\mathcal A=\{\text{left},\text{right},\text{up}\}$. The policy function $\pi$ is a conditional probability function:

$$\pi(a\mid s)=\mathbb P(A=a\mid S=s)$$

As with DQN, we can use a neural network $\pi(a\mid s;\boldsymbol\theta)$ to approximate the policy function $\pi(a\mid s)$, where $\boldsymbol\theta$ denotes the parameters of the network that we need to train.

Recall that the action-value function is defined as

$$Q_\pi(a_t,s_t)=\mathbb E_{A_{t+1},S_{t+1},\ldots}\big[U_t\mid A_t=a_t,S_t=s_t\big]$$

and the state-value function is defined as

$$V_\pi(s_t)=\mathbb E_{A_t\sim\pi(\cdot\mid s_t)}\big[Q_\pi(A_t,s_t)\big]$$

The state value depends both on the current state $s_t$ and on the parameters $\boldsymbol\theta$ of the policy network $\pi$.
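The second definition says that $V_\pi(s_t)$ is simply the $\pi$-weighted average of $Q_\pi(s_t,\cdot)$ over the actions. A tiny numerical sketch (the probabilities and Q-values below are invented purely for illustration):

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])    # pi(a | s) for actions {left, right, up}
q  = np.array([1.0, 2.0, -0.5])   # Q_pi(s, a) for the same actions

v = np.sum(pi * q)                # V_pi(s) = E_{A ~ pi}[Q_pi(s, A)] = 1.05
```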

To remove the dependence on the particular state, we take the expectation over the state $S_t$, obtaining

$$J(\boldsymbol\theta)=\mathbb E_{S_t}\big[V_\pi(S_t)\big]$$

This objective function eliminates the state $S$ and depends only on the parameters $\boldsymbol\theta$ of the policy network $\pi$; the better the policy, the larger $J$. Policy learning can therefore be cast as the optimization problem

$$\max_{\boldsymbol\theta}\quad J(\boldsymbol\theta)$$

Since this is a maximization problem, we can update $\boldsymbol\theta$ by gradient ascent on $J(\boldsymbol\theta)$; the key is to compute the gradient $\nabla_{\boldsymbol\theta}J(\boldsymbol\theta)$.
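A gradient-ascent update simply moves the parameters along the gradient rather than against it. A schematic sketch, assuming the gradient estimate is already available (the learning rate and the toy numbers are arbitrary):

```python
import numpy as np

def gradient_ascent_step(theta: np.ndarray, grad_J: np.ndarray, beta: float = 0.01) -> np.ndarray:
    """One update theta <- theta + beta * grad_J, which increases J for small beta."""
    return theta + beta * grad_J

theta  = np.zeros(3)
grad_J = np.array([0.5, -0.2, 1.0])   # pretend this came from a gradient estimator
theta  = gradient_ascent_step(theta, grad_J)
```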

2. Derivation of the Policy Gradient Theorem

Theorem (recursive formula): Let $S'$ be the state at the next time step. Then

$$\frac{\partial V_\pi(s)}{\partial\boldsymbol\theta}=\mathbb E_{A\sim\pi(\cdot\mid s;\boldsymbol\theta)}\left[\frac{\partial\ln\pi(A\mid s;\boldsymbol\theta)}{\partial\boldsymbol\theta}\cdot Q_\pi(s,A)+\gamma\cdot\mathbb E_{S'\sim p(\cdot\mid s,A)}\left[\frac{\partial V_\pi(S')}{\partial\boldsymbol\theta}\right]\right]\tag{2.1}$$

Proof:
$$\begin{aligned}
\frac{\partial V_\pi(s)}{\partial\boldsymbol\theta}
&=\frac{\partial}{\partial\boldsymbol\theta}\,\mathbb E_{A\sim\pi(\cdot\mid s;\boldsymbol\theta)}\big[Q_\pi(s,A)\big]\\
&=\frac{\partial}{\partial\boldsymbol\theta}\sum_{a\in\mathcal A}\pi(a\mid s;\boldsymbol\theta)\,Q_\pi(s,a)\\
&=\sum_{a\in\mathcal A}\left[\frac{\partial\pi(a\mid s;\boldsymbol\theta)}{\partial\boldsymbol\theta}\,Q_\pi(s,a)+\pi(a\mid s;\boldsymbol\theta)\,\frac{\partial Q_\pi(s,a)}{\partial\boldsymbol\theta}\right]\\
&=\sum_{a\in\mathcal A}\left[\pi(a\mid s;\boldsymbol\theta)\cdot\frac{\partial\ln\pi(a\mid s;\boldsymbol\theta)}{\partial\boldsymbol\theta}\cdot Q_\pi(s,a)+\pi(a\mid s;\boldsymbol\theta)\,\frac{\partial Q_\pi(s,a)}{\partial\boldsymbol\theta}\right]\\
&=\mathbb E_{A\sim\pi(\cdot\mid s;\boldsymbol\theta)}\left[\frac{\partial\ln\pi(A\mid s;\boldsymbol\theta)}{\partial\boldsymbol\theta}\cdot Q_\pi(s,A)\right]+\mathbb E_{A\sim\pi(\cdot\mid s;\boldsymbol\theta)}\left[\frac{\partial Q_\pi(s,A)}{\partial\boldsymbol\theta}\right]\\
&=\mathbb E_{A\sim\pi(\cdot\mid s;\boldsymbol\theta)}\left[\frac{\partial\ln\pi(A\mid s;\boldsymbol\theta)}{\partial\boldsymbol\theta}\cdot Q_\pi(s,A)+\frac{\partial Q_\pi(s,A)}{\partial\boldsymbol\theta}\right]
\end{aligned}$$

The fourth line uses the identity $\frac{\partial\pi}{\partial\boldsymbol\theta}=\pi\cdot\frac{\partial\ln\pi}{\partial\boldsymbol\theta}$. It remains to show that $\frac{\partial Q_\pi(s,a)}{\partial\boldsymbol\theta}=\gamma\,\mathbb E_{S'\sim p(\cdot\mid s,a)}\left[\frac{\partial V_\pi(S')}{\partial\boldsymbol\theta}\right]$. The Bellman equation gives

$$\begin{aligned}
Q_\pi(s,a)&=\mathbb E_{S'\sim p(\cdot\mid s,a)}\big[R(s,a,S')+\gamma\cdot V_\pi(S')\big]\\
&=\sum_{s'\in\mathcal S}p(s'\mid s,a)\cdot\big[R(s,a,s')+\gamma\cdot V_\pi(s')\big]\\
&=\sum_{s'\in\mathcal S}p(s'\mid s,a)\cdot R(s,a,s')+\gamma\cdot\sum_{s'\in\mathcal S}p(s'\mid s,a)\cdot V_\pi(s').
\end{aligned}$$

Once $s$, $a$, and $s'$ are observed, $p(s'\mid s,a)$ and $R(s,a,s')$ do not depend on the policy network $\pi$, so

$$\frac{\partial}{\partial\boldsymbol\theta}\big[p(s'\mid s,a)\cdot R(s,a,s')\big]=0.$$

It follows that

$$\begin{aligned}
\frac{\partial Q_\pi(s,a)}{\partial\boldsymbol\theta}
&=\sum_{s'\in\mathcal S}\underbrace{\frac{\partial}{\partial\boldsymbol\theta}\big[p(s'\mid s,a)\cdot R(s,a,s')\big]}_{=\,0}+\gamma\cdot\sum_{s'\in\mathcal S}\frac{\partial}{\partial\boldsymbol\theta}\big[p(s'\mid s,a)\cdot V_\pi(s')\big]\\
&=\gamma\cdot\sum_{s'\in\mathcal S}p(s'\mid s,a)\cdot\frac{\partial V_\pi(s')}{\partial\boldsymbol\theta}\\
&=\gamma\cdot\mathbb E_{S'\sim p(\cdot\mid s,a)}\left[\frac{\partial V_\pi(S')}{\partial\boldsymbol\theta}\right].
\end{aligned}$$

This completes the proof.

Define $\boldsymbol g(s,a;\boldsymbol\theta)\triangleq Q_\pi(s,a)\cdot\frac{\partial\ln\pi(a\mid s;\boldsymbol\theta)}{\partial\boldsymbol\theta}$, and suppose an episode ends after step $n$. Then

$$\begin{aligned}
\frac{\partial J(\boldsymbol\theta)}{\partial\boldsymbol\theta}
=&\;\mathbb E_{S_1,A_1}\big[\boldsymbol g(S_1,A_1;\boldsymbol\theta)\big]\\
&+\gamma\cdot\mathbb E_{S_1,A_1,S_2,A_2}\big[\boldsymbol g(S_2,A_2;\boldsymbol\theta)\big]\\
&+\gamma^2\cdot\mathbb E_{S_1,A_1,S_2,A_2,S_3,A_3}\big[\boldsymbol g(S_3,A_3;\boldsymbol\theta)\big]\\
&+\cdots\\
&+\gamma^{n-1}\cdot\mathbb E_{S_1,A_1,S_2,A_2,\cdots,S_n,A_n}\big[\boldsymbol g(S_n,A_n;\boldsymbol\theta)\big]
\end{aligned}\tag{2.2}$$
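In practice $Q_\pi$ is unknown, so equation (2.2) is usually estimated from one sampled episode, replacing $Q_\pi(s_t,a_t)$ with the observed discounted return $u_t$; this is the REINFORCE estimator, a standard Monte Carlo approximation rather than something derived above. A minimal sketch under that assumption, where `policy` maps a state tensor to action probabilities:

```python
import torch
from torch.distributions import Categorical

def policy_gradient_loss(policy, states, actions, rewards, gamma=0.99):
    """Surrogate loss whose gradient is a Monte Carlo estimate of (2.2):
    - sum_t gamma^t * u_t * d ln pi(a_t | s_t; theta) / d theta,
    with Q_pi(s_t, a_t) approximated by the observed return u_t (REINFORCE)."""
    n = len(rewards)
    returns, u = [0.0] * n, 0.0
    for t in reversed(range(n)):          # u_t = r_t + gamma * r_{t+1} + ...
        u = rewards[t] + gamma * u
        returns[t] = u

    loss = 0.0
    for t in range(n):
        probs = policy(states[t])                            # pi(. | s_t; theta)
        log_prob = Categorical(probs=probs).log_prob(actions[t])
        loss = loss - (gamma ** t) * returns[t] * log_prob   # minus sign: ascent on J
    return loss
```

Minimizing this loss with any optimizer performs one stochastic gradient-ascent step on $J(\boldsymbol\theta)$.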

Proof: By equation (2.1),

$$\begin{aligned}
\nabla_{\boldsymbol\theta}V_\pi(s_t)
&=\mathbb E_{A_t\sim\pi(\cdot\mid s_t;\boldsymbol\theta)}\left[\frac{\partial\ln\pi(A_t\mid s_t;\boldsymbol\theta)}{\partial\boldsymbol\theta}\cdot Q_\pi(s_t,A_t)+\gamma\cdot\mathbb E_{S_{t+1}\sim p(\cdot\mid s_t,A_t)}\big[\nabla_{\boldsymbol\theta}V_\pi(S_{t+1})\big]\right]\\
&=\mathbb E_{A_t\sim\pi(\cdot\mid s_t;\boldsymbol\theta)}\Big[\boldsymbol g(s_t,A_t;\boldsymbol\theta)+\gamma\cdot\mathbb E_{S_{t+1}}\big[\nabla_{\boldsymbol\theta}V_\pi(S_{t+1})\mid A_t,S_t=s_t\big]\Big]\\
&=\mathbb E_{A_t}\big[\boldsymbol g(s_t,A_t;\boldsymbol\theta)\mid S_t=s_t\big]+\gamma\,\mathbb E_{A_t}\Big[\mathbb E_{S_{t+1}}\big[\nabla_{\boldsymbol\theta}V_\pi(S_{t+1})\mid A_t,S_t=s_t\big]\,\Big|\,S_t=s_t\Big]\\
&=\mathbb E_{A_t}\big[\boldsymbol g(s_t,A_t;\boldsymbol\theta)\mid S_t=s_t\big]+\gamma\,\mathbb E_{A_t,S_{t+1}}\big[\nabla_{\boldsymbol\theta}V_\pi(S_{t+1})\mid S_t=s_t\big]
\end{aligned}$$
Likewise, $\nabla_{\boldsymbol\theta}V_\pi(S_{t+1})=\mathbb E_{A_{t+1}}\big[\boldsymbol g(S_{t+1},A_{t+1};\boldsymbol\theta)\mid S_{t+1}\big]+\gamma\,\mathbb E_{A_{t+1},S_{t+2}}\big[\nabla_{\boldsymbol\theta}V_\pi(S_{t+2})\mid S_{t+1}\big]$. Substituting this into the expression above gives

$$\begin{aligned}
\nabla_{\boldsymbol\theta}V_\pi(s_t)
&=\mathbb E_{A_t}\big[\boldsymbol g(s_t,A_t;\boldsymbol\theta)\mid S_t=s_t\big]+\gamma\,\mathbb E_{A_t,S_{t+1}}\big[\nabla_{\boldsymbol\theta}V_\pi(S_{t+1})\mid S_t=s_t\big]\\
&=\mathbb E_{A_t}\big[\boldsymbol g(s_t,A_t;\boldsymbol\theta)\mid S_t=s_t\big]+\gamma\,\mathbb E_{A_t,S_{t+1}}\Big[\mathbb E_{A_{t+1}}\big[\boldsymbol g(S_{t+1},A_{t+1};\boldsymbol\theta)\mid S_{t+1}\big]+\gamma\,\mathbb E_{A_{t+1},S_{t+2}}\big[\nabla_{\boldsymbol\theta}V_\pi(S_{t+2})\mid S_{t+1}\big]\,\Big|\,S_t=s_t\Big]\\
&=\mathbb E_{A_t}\big[\boldsymbol g(s_t,A_t;\boldsymbol\theta)\mid S_t=s_t\big]+\gamma\,\mathbb E_{A_t,S_{t+1}}\Big[\mathbb E_{A_{t+1}}\big[\boldsymbol g(S_{t+1},A_{t+1};\boldsymbol\theta)\mid S_{t+1},S_t=s_t,A_t\big]+\gamma\,\mathbb E_{A_{t+1},S_{t+2}}\big[\nabla_{\boldsymbol\theta}V_\pi(S_{t+2})\mid S_{t+1},S_t=s_t,A_t\big]\,\Big|\,S_t=s_t\Big]\quad\text{(Markov property)}\\
&=\mathbb E_{A_t}\big[\boldsymbol g(s_t,A_t;\boldsymbol\theta)\mid S_t=s_t\big]+\gamma\,\mathbb E_{A_t,S_{t+1},A_{t+1}}\big[\boldsymbol g(S_{t+1},A_{t+1};\boldsymbol\theta)\mid S_t=s_t\big]+\gamma^2\,\mathbb E_{A_t,S_{t+1},A_{t+1},S_{t+2}}\big[\nabla_{\boldsymbol\theta}V_\pi(S_{t+2})\mid S_t=s_t\big]
\end{aligned}$$
Repeating this substitution, we eventually obtain

$$\begin{aligned}
\frac{\partial V_\pi(S_1)}{\partial\boldsymbol\theta}
=&\;\mathbb E_{A_1}\big[\boldsymbol g(S_1,A_1;\boldsymbol\theta)\mid S_1\big]\\
&+\gamma\cdot\mathbb E_{A_1,S_2,A_2}\big[\boldsymbol g(S_2,A_2;\boldsymbol\theta)\mid S_1\big]\\
&+\gamma^2\cdot\mathbb E_{A_1,S_2,A_2,S_3,A_3}\big[\boldsymbol g(S_3,A_3;\boldsymbol\theta)\mid S_1\big]\\
&+\cdots\\
&+\gamma^{n-1}\cdot\mathbb E_{A_1,S_2,A_2,\cdots,S_n,A_n}\big[\boldsymbol g(S_n,A_n;\boldsymbol\theta)\mid S_1\big]\\
&+\gamma^{n}\cdot\mathbb E_{A_1,S_2,A_2,\cdots,S_n,A_n,S_{n+1}}\Big[\underbrace{\frac{\partial V_\pi(S_{n+1})}{\partial\boldsymbol\theta}}_{=\,0}\,\Big|\,S_1\Big]
\end{aligned}$$
The last term above equals zero because the episode ends after step $n$: there are no rewards after step $n+1$, so the return and the value at step $n+1$ are both zero. Finally, by the definition of $J(\boldsymbol\theta)$,

$$\frac{\partial J(\boldsymbol\theta)}{\partial\boldsymbol\theta}=\mathbb E_{S_1}\left[\frac{\partial V_\pi(S_1)}{\partial\boldsymbol\theta}\right]$$

and taking this expectation over $S_1$ in the display above yields equation (2.2).
This completes the proof.

Stationary distribution: A rigorous proof of the policy gradient theorem requires the stationary distribution of a Markov chain. Suppose the state $S'$ is generated as $S\rightarrow A\rightarrow S'$. Recall that the state-transition function $p(S'\mid S,A)$ is a probability mass function. Let $f(S)$ be the probability mass function of the state $S$. Then the marginal distribution $f(S')$ of the state $S'$ is

$$\begin{aligned}
f(S')&=\mathbb E_{S,A}\big[p(S'\mid A,S)\big]\\
&=\mathbb E_{S}\big[\mathbb E_{A}[p(S'\mid A,S)\mid S]\big]\\
&=\mathbb E_{S}\Big[\sum_{a\in\mathcal A}p(S'\mid a,S)\cdot\pi(a\mid S)\Big]\\
&=\sum_{s\in\mathcal S}\sum_{a\in\mathcal A}p(S'\mid a,s)\cdot\pi(a\mid s)\cdot f(s)
\end{aligned}$$

If $f(S')$ and $f(S)$ are the same probability mass function, i.e. $f(S)=f(S')$, then the Markov chain has reached its steady state, and $f(S)$ is the stationary probability mass function.
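To make the fixed-point condition $f(S)=f(S')$ concrete, the sketch below forms the state-to-state transition matrix induced by a fixed policy, $P_\pi(s'\mid s)=\sum_a\pi(a\mid s)\,p(s'\mid s,a)$, and finds its stationary distribution by power iteration; the tiny two-state, two-action MDP is invented purely for illustration.

```python
import numpy as np

# p[s, a, s'] : transition probabilities of a made-up 2-state, 2-action MDP.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
# pi[s, a] : an arbitrary fixed policy.
pi = np.array([[0.6, 0.4],
               [0.5, 0.5]])

# Induced chain: P[s, s'] = sum_a pi(a | s) * p(s' | s, a).
P = np.einsum('sa,sab->sb', pi, p)

# Power iteration: repeatedly apply f <- f P until f stops changing.
f = np.full(2, 0.5)
for _ in range(1000):
    f = f @ P
# f now satisfies f = f P up to numerical error, i.e. f(S) = f(S').
```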

Theorem:

f ( S ) f(S) f(S) 是马尔科夫链稳态时的概率质量 (密度) 函数。那么对于任意函数 G ( S ′ ) G\left(S^{\prime}\right) G(S),
E S ∼ f ( ⋅ ) [ E A ∼ π ( ⋅ ∣ S ; θ ) [ E S ′ ∼ p ( ⋅ ∣ s , A ) [ G ( S ′ ) ] ] ] = E S ′ ∼ f ( ⋅ ) [ G ( S ′ ) ] (2.3) \mathbb{E}_{S \sim f(\cdot)}\left[\mathbb{E}_{A \sim \pi(\cdot \mid S ; \boldsymbol{\theta})}\left[\mathbb{E}_{S^{\prime} \sim p(\cdot \mid s, A)}\left[G\left(S^{\prime}\right)\right]\right]\right]=\mathbb{E}_{S^{\prime} \sim f(\cdot)}\left[G\left(S^{\prime}\right)\right]\tag{2.3} ESf()[EAπ(S;θ)[ESp(s,A)[G(S)]]]=ESf()[G(S)](2.3)

Proof:
$$\begin{aligned}
\mathbb E_{S\sim f(\cdot)}\Big[\mathbb E_{A\sim\pi(\cdot\mid S;\boldsymbol\theta)}\big[\mathbb E_{S'\sim p(\cdot\mid S,A)}[G(S')]\big]\Big]
&=\mathbb E_{S\sim f(\cdot)}\big[\mathbb E_{A}[\mathbb E_{S'}[G(S')\mid S,A]\mid S]\big]\\
&=\mathbb E_{S\sim f(\cdot)}\big[\mathbb E_{A,S'}[G(S')\mid S]\big]\\
&=\mathbb E_{S,A,S'}\big[G(S')\big]\\
&=\mathbb E_{S'}\big[G(S')\big]
\end{aligned}$$

Since $S$ and $S'$ have the same distribution $f(\cdot)$, we have $\mathbb E_{S'}[G(S')]=\mathbb E_{S'\sim f(\cdot)}[G(S')]$.

Theorem (policy gradient theorem):

Let the objective function be $J(\boldsymbol\theta)=\mathbb E_{S\sim f(\cdot)}\big[V_\pi(S)\big]$, where $f(S)$ is the probability mass (density) function of the stationary distribution of the Markov chain. Then

$$\frac{\partial J(\boldsymbol\theta)}{\partial\boldsymbol\theta}=\big(1+\gamma+\gamma^2+\cdots+\gamma^{n-1}\big)\cdot\mathbb E_{S\sim f(\cdot)}\left[\mathbb E_{A\sim\pi(\cdot\mid S;\boldsymbol\theta)}\left[\frac{\partial\ln\pi(A\mid S;\boldsymbol\theta)}{\partial\boldsymbol\theta}\cdot Q_\pi(S,A)\right]\right]$$

Proof: Suppose the initial state $S_1$ follows the stationary distribution of the Markov chain, with probability mass function $f(S_1)$. For every $t=1,\cdots,n$, the action $A_t$ is sampled from the policy network:

$$A_t\sim\pi(\cdot\mid S_t;\boldsymbol\theta)$$
For any function $G$, repeatedly applying equation (2.3) gives:

$$\begin{aligned}
\mathbb E_{S_1,A_1,\ldots,A_{t-1},S_t}\big[G(S_t)\big]
&=\mathbb E_{S_1\sim f}\Big\{\mathbb E_{A_1\sim\pi,\,S_2\sim p}\big\{\mathbb E_{A_2,S_3,A_3,S_4,\cdots,A_{t-1},S_t}[G(S_t)]\big\}\Big\}\\
&=\mathbb E_{S_2\sim f}\big\{\mathbb E_{A_2,S_3,A_3,S_4,\cdots,A_{t-1},S_t}[G(S_t)]\big\}\\
&=\mathbb E_{S_2\sim f}\Big\{\mathbb E_{A_2\sim\pi,\,S_3\sim p}\big\{\mathbb E_{A_3,S_4,A_4,S_5,\cdots,A_{t-1},S_t}[G(S_t)]\big\}\Big\}\\
&=\mathbb E_{S_3\sim f}\big\{\mathbb E_{A_3,S_4,A_4,S_5,\cdots,A_{t-1},S_t}[G(S_t)]\big\}\\
&\;\;\vdots\\
&=\mathbb E_{S_{t-1}\sim f}\big\{\mathbb E_{A_{t-1}\sim\pi,\,S_t\sim p}\{G(S_t)\}\big\}\\
&=\mathbb E_{S_t\sim f}\big\{G(S_t)\big\}.
\end{aligned}$$
Recall $\boldsymbol g(s,a;\boldsymbol\theta)\triangleq Q_\pi(s,a)\cdot\frac{\partial\ln\pi(a\mid s;\boldsymbol\theta)}{\partial\boldsymbol\theta}$ and suppose the episode ends after step $n$. Combining equation (2.2) with the identity above gives:

$$\begin{aligned}
\frac{\partial J(\boldsymbol\theta)}{\partial\boldsymbol\theta}
=&\;\mathbb E_{S_1,A_1}\big[\boldsymbol g(S_1,A_1;\boldsymbol\theta)\big]
+\gamma\cdot\mathbb E_{S_1,A_1,S_2,A_2}\big[\boldsymbol g(S_2,A_2;\boldsymbol\theta)\big]
+\cdots
+\gamma^{n-1}\cdot\mathbb E_{S_1,A_1,\cdots,S_n,A_n}\big[\boldsymbol g(S_n,A_n;\boldsymbol\theta)\big]\\
=&\;\mathbb E_{S_1\sim f(\cdot)}\big\{\mathbb E_{A_1\sim\pi(\cdot\mid S_1;\boldsymbol\theta)}[\boldsymbol g(S_1,A_1;\boldsymbol\theta)]\big\}
+\gamma\cdot\mathbb E_{S_2\sim f(\cdot)}\big\{\mathbb E_{A_2\sim\pi(\cdot\mid S_2;\boldsymbol\theta)}[\boldsymbol g(S_2,A_2;\boldsymbol\theta)]\big\}
+\cdots
+\gamma^{n-1}\cdot\mathbb E_{S_n\sim f(\cdot)}\big\{\mathbb E_{A_n\sim\pi(\cdot\mid S_n;\boldsymbol\theta)}[\boldsymbol g(S_n,A_n;\boldsymbol\theta)]\big\}\\
=&\;\big(1+\gamma+\gamma^2+\cdots+\gamma^{n-1}\big)\cdot\mathbb E_{S\sim f(\cdot)}\big\{\mathbb E_{A\sim\pi(\cdot\mid S;\boldsymbol\theta)}[\boldsymbol g(S,A;\boldsymbol\theta)]\big\}.
\end{aligned}$$

This completes the proof.
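As a practical note, the theorem states the exact gradient as an expectation; an algorithm typically draws a single pair $(s,a)$, folds the geometric factor into the learning rate, and lets autograd differentiate the surrogate $-q\cdot\ln\pi(a\mid s;\boldsymbol\theta)$, whose gradient is $-q\cdot\partial\ln\pi/\partial\boldsymbol\theta$. The sketch below is one such stochastic update under these assumptions; the linear policy and the value `q_estimate` are placeholders, not part of the text above.

```python
import torch
from torch.distributions import Categorical

policy = torch.nn.Linear(4, 3)                       # minimal trainable policy logits
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

s = torch.zeros(4)                                   # a sampled state (placeholder values)
q_estimate = 1.7                                     # stand-in for Q_pi(s, a)

probs = torch.softmax(policy(s), dim=-1)             # pi(. | s; theta)
dist = Categorical(probs=probs)
a = dist.sample()                                    # A ~ pi(. | s; theta)

# Gradient of this loss w.r.t. theta is -q * d ln pi(a|s;theta) / d theta,
# so one SGD step on it is one stochastic gradient-ASCENT step on J(theta).
loss = -q_estimate * dist.log_prob(a)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```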
