Hands on RL: Off-policy Maximum Entropy Actor-Critic (SAC)


Table of Contents

  • Hands on RL: Off-policy Maximum Entropy Actor-Critic (SAC)
    • 1. Theory
      • 1.1 Maximum Entropy Reinforcement Learning (MERL)
      • 1.2 Soft Policy Evaluation and Soft Policy Improvement in SAC
      • 1.3 Two Q Value Neural Networks
      • 1.4 Tricks
      • 1.5 Pseudocode
    • 2. Implementation
      • 2.1 SAC for continuous action spaces
      • 2.2 SAC for discrete action spaces
    • Reference

The soft actor-critic (SAC) method is also known as the off-policy maximum entropy actor-critic algorithm.

1. Theory

1.1 Maximum Entropy Reinforcement Learning (MERL)

In the original maximum entropy RL formulation, the notion of entropy is introduced and defined as follows: for a random variable $x$ distributed according to $P$, the entropy $\mathcal{H}(P)$ of $x$ is

$$\mathcal{H}(P)=\mathbb{E}_{x\sim P}[-\log p(x)]$$

A standard RL algorithm seeks the policy that maximizes the expected cumulative reward:

$$\pi^*_{\text{std}} = \arg \max_\pi \sum_t \mathbb{E}_{(s_t,a_t)\sim \rho_\pi}\big[r(s_t, a_t)\big]$$

The maximum entropy RL objective adds an entropy bonus:

$$\pi^*_{\text{MERL}} = \arg\max_\pi \sum_t \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\big[r(s_t,a_t) + \alpha \mathcal{H}(\pi(\cdot|s_t))\big]$$

where $\rho_\pi$ denotes the distribution over state-action pairs induced by the policy $\pi$, and $\alpha$ is the temperature coefficient, which controls how much weight is placed on the entropy term.
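As a quick numerical illustration (added here, not part of the original derivation), the entropy bonus and the entropy-augmented reward can be evaluated with `torch.distributions`; the mean, standard deviation, reward, and temperature below are arbitrary placeholder values.

```python
import torch
from torch.distributions import Normal

# Hypothetical one-dimensional Gaussian policy output for a single state s_t.
mu, std = torch.tensor([0.0]), torch.tensor([1.0])
pi = Normal(mu, std)

alpha = 0.2      # temperature coefficient (placeholder)
reward = 1.0     # environment reward r(s_t, a_t) (placeholder)

entropy = pi.entropy().sum()            # H(pi(.|s_t)) = E[-log pi(a|s_t)]
soft_reward = reward + alpha * entropy  # the quantity inside the MERL objective
print(entropy.item(), soft_reward.item())
```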

Similarly, entropy can be introduced into the action-value and state-value functions of RL.

The standard RL value functions are

$$\begin{aligned} \text{standard Q function:} \quad Q^\pi(s,a) & = \mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\,\Big|\,s_0=s, a_0=a\Big] \\ \text{standard V function:} \quad V^\pi(s) & =\mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\,\Big|\,s_0=s\Big] \end{aligned}$$

Following the MERL objective, adding the entropy term to the value functions yields the soft value functions (SVF):

$$\begin{aligned} \text{soft Q function:} \quad Q_{\text{soft}}^\pi(s,a) & = \mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) + \alpha \sum_{t=1}^\infty\gamma^t\mathcal{H}\big(\pi(\cdot|s_t)\big) \,\Big|\,s_0=s, a_0=a\Big] \\ \text{soft V function:} \quad V_{\text{soft}}^\pi(s) & =\mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t \Big( r(s_t,a_t) + \alpha \mathcal{H}\big(\pi(\cdot|s_t)\big) \Big)\,\Big|\,s_0=s\Big] \end{aligned}$$

From these definitions we obtain the soft Bellman equation

$$\begin{aligned} Q_{\text{soft}}^\pi(s,a) & = \mathbb{E}_{s^\prime \sim p(s^\prime|s,a),\, a^\prime\sim \pi}\Big[r(s,a) + \gamma \Big( Q_{\text{soft}}^\pi(s^\prime,a^\prime) + \alpha \mathcal{H}\big(\pi(\cdot|s^\prime)\big) \Big)\Big] \\ & = \mathbb{E}_{s^\prime\sim p(s^\prime|s,a)} \big[r(s,a) + \gamma V_{\text{soft}}^\pi(s^\prime)\big] \end{aligned}$$

together with

$$\begin{aligned} V_{\text{soft}}^\pi(s) & = \mathbb{E}_{s_t,a_t\sim\rho_\pi}\Big[\sum_{t=0}^\infty \gamma^t \Big( r(s_t,a_t) + \alpha \mathcal{H}\big(\pi(\cdot|s_t)\big) \Big)\,\Big|\,s_0=s\Big] \\ & = \mathbb{E}_{a\sim\pi}\big[Q_{\text{soft}}^\pi(s,a)\big] + \alpha \mathcal{H}\big(\pi(\cdot|s)\big) \\ & = \mathbb{E}_{a\sim\pi}\big[Q_{\text{soft}}^\pi(s,a) - \alpha \log \pi(a|s)\big] \end{aligned}$$

1.2 Soft Policy Evaluation and Soft Policy Improvement in SAC

The value-iteration update for the soft Q function is

$$\begin{align} Q_{\text{soft}}^\pi(s,a) & = \mathbb{E}_{s^\prime \sim p(s^\prime|s,a),\, a^\prime\sim \pi}\Big[r(s,a) + \gamma \Big( Q_{\text{soft}}^\pi(s^\prime,a^\prime) + \alpha \mathcal{H}\big(\pi(\cdot|s^\prime)\big) \Big)\Big] \tag{1.1} \\ & = \mathbb{E}_{s^\prime\sim p(s^\prime|s,a)} \big[r(s,a) + \gamma V_{\text{soft}}^\pi(s^\prime)\big] \tag{1.2} \end{align}$$

and the value-iteration update for the soft V function is

$$V_{\text{soft}}^\pi(s)= \mathbb{E}_{a\sim\pi}\big[Q_{\text{soft}}^\pi(s,a) - \alpha \log \pi(a|s)\big] \tag{1.3}$$

If SAC maintains only a Q-value function, value iteration with Eq. (1.1) is sufficient; if it maintains both a Q function and a V function, Eqs. (1.2) and (1.3) are used instead.

The loss functions used during training are given directly below.

The loss for the V-value function is

$$J_V(\psi) = \mathbb{E}_{s_t\sim\mathcal{D}} \Big[\frac{1}{2}\big(V_{\psi}(s_t) - \mathbb{E}_{a_t\sim\pi_{\phi}}[Q_\theta(s_t,a_t)- \log \pi_\phi(a_t|s_t)]\big)^2 \Big]$$

The loss for the Q-value function is

$$\begin{aligned} J_Q(\theta) &= \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}} \Big[ \frac{1}{2}\Big( Q_\theta(s_t,a_t) - \hat{Q}(s_t,a_t) \Big)^2 \Big] \\ \hat{Q}(s_t,a_t) &= r(s_t,a_t) + \gamma\, \mathbb{E}_{s_{t+1}\sim p}\big[V_{\psi}(s_{t+1})\big] \\ \hat{Q}(s_t,a_t) &= r(s_t,a_t) + \gamma\, Q_\theta(s_{t+1},a_{t+1}) \end{aligned}$$

where the two expressions for $\hat{Q}$ correspond to bootstrapping from the V network or directly from the Q network.

The loss for the policy $\pi$ is

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim \mathcal{D}} \Big[ \mathbb{D}_{KL}\Big( \pi_\phi(\cdot|s_t) \,\Big\|\, \frac{\exp(Q_\theta(s_t,\cdot))}{Z_\theta(s_t)} \Big) \Big]$$

where $Z_\theta(s_t)$ is the partition function.

The KL divergence is defined as follows.

For a random variable $\xi$, suppose there are two probability distributions $P$ and $Q$, where $P$ is the true distribution and $Q$ is an approximation that is easier to obtain. If $\xi$ is a discrete random variable, the KL divergence from $P$ to $Q$ is defined as

$$\mathbb{D}_{KL}(P\,\|\,Q) = \sum_i P(i)\ln\frac{P(i)}{Q(i)}$$

If $\xi$ is a continuous random variable, the KL divergence from $P$ to $Q$ is defined as

$$\mathbb{D}_{KL}(P\,\|\,Q) = \int^\infty_{-\infty}p(x)\ln\frac{p(x)}{q(x)}\,dx$$

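As a small check of the discrete definition (added for illustration, with made-up distributions), the sum can be compared against `torch.distributions.kl_divergence`:

```python
import torch
from torch.distributions import Categorical, kl_divergence

P = Categorical(probs=torch.tensor([0.7, 0.2, 0.1]))  # "true" distribution (illustrative)
Q = Categorical(probs=torch.tensor([0.4, 0.4, 0.2]))  # approximating distribution (illustrative)

manual = torch.sum(P.probs * torch.log(P.probs / Q.probs))  # sum_i P(i) ln(P(i)/Q(i))
builtin = kl_divergence(P, Q)                               # same value from torch.distributions
print(manual.item(), builtin.item())
```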
Using the discrete definition of the KL divergence, the policy loss can be expanded as

$$\begin{align} J_\pi(\phi) & = \mathbb{E}_{s_t\sim \mathcal{D}} \Big[ \mathbb{D}_{KL}\Big( \pi_\phi(\cdot|s_t) \,\Big\|\, \frac{\exp(Q_\theta(s_t,\cdot))}{Z_\theta(s_t)} \Big) \Big] \nonumber \\ & = \mathbb{E}_{s_t\sim\mathcal{D}} \Big[ \sum_a \pi_\phi(a|s_t) \ln\frac{\pi_\phi(a|s_t)\,Z_\theta(s_t)}{\exp(Q_\theta(s_t,a))} \Big] \nonumber \\ & = \mathbb{E}_{s_t\sim\mathcal{D},\, a_t\sim \pi_\phi} \Big[ \ln\pi_\phi(a_t|s_t) + \ln Z_\theta(s_t) - Q_\theta(s_t,a_t) \Big] \nonumber \\ & = \mathbb{E}_{s_t\sim\mathcal{D},\, a_t\sim \pi_\phi} \Big[ \ln\pi_\phi(a_t|s_t) - Q_\theta(s_t,a_t) \Big] \tag{1.4} \end{align}$$

The last step holds because the partition function $Z_\theta$ does not depend on the policy parameters $\phi$, so it contributes nothing to the gradient and can be dropped from the objective.

The reparameterization trick is then introduced:

$$a_t = f_\phi(\epsilon_t;s_t), \qquad \epsilon_t\sim\mathcal{N}$$

Substituting this into Eq. (1.4) gives

$$J_\pi(\phi)=\mathbb{E}_{s_t\sim\mathcal{D},\, \epsilon_t\sim\mathcal{N}} \Big[ \ln\pi_\phi\big(f_\phi(\epsilon_t;s_t)\,\big|\,s_t\big) - Q_\theta\big(s_t,f_\phi(\epsilon_t;s_t)\big) \Big]$$
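A minimal sketch (under the assumption that `mu` and `log_std` stand in for the policy network's outputs) of what the reparameterization buys us: the noise $\epsilon_t$ is drawn from a fixed standard normal and transformed by the policy parameters, so the sampled action remains differentiable with respect to $\phi$. In the implementation in Section 2 this is exactly what `Normal(mu, std).rsample()` does.

```python
import torch

# Stand-ins for the policy network outputs mu_phi(s_t) and log sigma_phi(s_t).
mu = torch.tensor([0.3], requires_grad=True)
log_std = torch.tensor([-0.5], requires_grad=True)

eps = torch.randn_like(mu)        # eps_t ~ N(0, 1), independent of phi
a = mu + log_std.exp() * eps      # a_t = f_phi(eps_t; s_t)
a.sum().backward()                # gradients flow back into mu and log_std
print(mu.grad, log_std.grad)
```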

Finally, we simply keep collecting data and minimizing these loss functions until the algorithm converges to a solution.

1.3 Two Q Value Neural Networks

Unlike DDPG, whose policy network outputs the action directly, SAC's policy network outputs the mean and standard deviation of a Gaussian distribution over the continuous action space, and the action is then sampled from that Gaussian. SAC may or may not maintain a V-value network, but it must maintain two Q-value networks (each with its own target network) and one policy network $\pi$. Why two Q-value networks? To reduce the overestimation bias of the Q estimates: the smaller of the two Q values is used when computing the losses. Suppose the two Q networks are $Q_{\theta_1}, Q_{\theta_2}$ with corresponding target networks $Q_{\theta_1^-}, Q_{\theta_2^-}$.
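One detail worth spelling out, since it corresponds to the `tanh_log_prob` correction in the code below: the Gaussian sample $u$ is squashed through $\tanh$ to keep the action bounded, so the log-probability of the squashed action needs a change-of-variables correction (as in the SAC paper):

$$u \sim \mathcal{N}\big(\mu_\phi(s), \sigma_\phi(s)\big), \quad a = \tanh(u) \quad\Longrightarrow\quad \log \pi(a|s) = \log \mathcal{N}\big(u \mid \mu_\phi(s), \sigma_\phi(s)\big) - \sum_{i=1}^{D} \log\big(1 - \tanh^2(u_i)\big)$$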

The Q-value loss then becomes

$$\begin{align} J_Q(\theta) & = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}} \Big[ \frac{1}{2}\Big( Q_\theta(s_t,a_t) - \hat{Q}(s_t,a_t) \Big)^2 \Big] \tag{1.5} \\ \hat{Q}(s_t,a_t) & = r(s_t,a_t) + \gamma V_\psi(s_{t+1}) \tag{1.6} \\ V_\psi(s_{t+1}) & = Q_\theta(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1}|s_{t+1}) \tag{1.7} \end{align}$$

Combining Eqs. (1.5), (1.6), and (1.7) yields

$$J_Q(\theta) = \mathbb{E}_{(s_t,a_t)\sim\mathcal{D}} \Big[ \frac{1}{2}\Big( Q_\theta(s_t,a_t) - \Big(r(s_t,a_t)+\gamma \big(\min_{j=1,2}Q_{\theta_j^-}(s_{t+1},a_{t+1}) - \alpha \log\pi(a_{t+1}|s_{t+1}) \big) \Big) \Big)^2 \Big]$$

Because SAC is an off-policy algorithm, target Q networks $Q_{\theta^-}$ are used to make training more stable: there are two target networks, one for each Q network, and they are soft-updated in the same way as in DDPG.
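Before the full implementation, here is a condensed sketch of how this target is formed with the two target critics; it mirrors `calc_target` in Section 2.1, and the networks and batch tensors are assumed to exist as defined there.

```python
import torch

def soft_td_target(actor, target_critic1, target_critic2,
                   rewards, next_states, dones, gamma, log_alpha):
    # a' ~ pi(.|s') and its log-probability (reparameterized sample from the actor)
    next_actions, log_prob = actor(next_states)
    q1 = target_critic1(next_states, next_actions)
    q2 = target_critic2(next_states, next_actions)
    # V(s') = min_j Q_j^-(s', a') - alpha * log pi(a'|s')
    soft_v = torch.min(q1, q2) - log_alpha.exp() * log_prob
    # y = r + gamma * V(s'), with bootstrapping switched off at terminal states
    return rewards + gamma * soft_v * (1.0 - dones)
```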

The policy loss $J_\pi(\phi)$ derived earlier is

$$J_\pi(\phi) = \mathbb{E}_{s_t\sim\mathcal{D},\, a_t\sim \pi_\phi} \Big[ \ln\pi_\phi(a_t|s_t) - Q_\theta(s_t,a_t) \Big]$$

After the reparameterization trick it becomes

$$J_\pi(\phi)=\mathbb{E}_{s_t\sim\mathcal{D},\, \epsilon_t\sim\mathcal{N}} \Big[ \ln\pi_\phi\big(f_\phi(\epsilon_t;s_t)\,\big|\,s_t\big) - Q_\theta\big(s_t,f_\phi(\epsilon_t;s_t)\big) \Big]$$

Taking the minimum over the two Q networks, the objective is rewritten as

$$J_\pi(\phi)= \mathbb{E}_{s_t\sim\mathcal{D},\, \epsilon_t\sim\mathcal{N}} \Big[ \ln\pi_\phi\big(f_\phi(\epsilon_t;s_t)\,\big|\,s_t\big) - \min_{j=1,2}Q_{\theta_j}\big(s_t,f_\phi(\epsilon_t;s_t)\big) \Big]$$
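The corresponding policy update, again as a compact sketch of what the implementation in Section 2.1 computes (the networks, the `states` batch, and `log_alpha` are assumed as above):

```python
import torch

def sac_actor_loss(actor, critic1, critic2, states, log_alpha):
    new_actions, log_prob = actor(states)             # reparameterized actions and log pi
    q_min = torch.min(critic1(states, new_actions),
                      critic2(states, new_actions))   # clipped double Q
    # J_pi(phi) = E[ alpha * log pi(a|s) - min_j Q_j(s, a) ]
    return (log_alpha.exp() * log_prob - q_min).mean()
```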


1.4 Tricks

SAC also incorporates several tricks that are common in other algorithms: a replay buffer to provide (approximately) independent and identically distributed samples, and the target-network idea from Double DQN, here applied to two Q networks. The most important SAC-specific trick is automatic tuning of the entropy regularization coefficient. Choosing this coefficient well matters: different states call for different amounts of entropy. When the optimal action is uncertain, the entropy should be larger; when the optimal action is fairly clear, it can be smaller. To tune the coefficient automatically, SAC rewrites the RL objective as a constrained optimization problem:

$$\max_\pi \mathbb{E}_\pi\Big[ \sum_t r(s_t,a_t) \Big] \quad \text{s.t.} \quad \mathbb{E}_{(s_t,a_t)\sim\rho_\pi}\big[-\log \pi_t(a_t|s_t)\big] \ge \mathcal{H}_0$$

where $\mathcal{H}_0$ is a predefined minimum threshold on the policy entropy. After some mathematical simplification, the loss for the temperature $\alpha$ is

$$J(\alpha) = \mathbb{E}_{s_t\sim \mathcal{D},\, a_t\sim \pi(\cdot|s_t)}\big[-\alpha\log\pi_t(a_t|s_t)-\alpha\mathcal{H}_0\big]$$
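A short sketch of how this loss is used in practice, matching the `alpha_loss` in Section 2: only `log_alpha` receives gradients, so the policy's log-probabilities are detached (`log_prob` and `target_entropy` are assumed to come from the surrounding training code).

```python
import torch

def temperature_loss(log_alpha, log_prob, target_entropy):
    # J(alpha) = E[ -alpha * log pi(a|s) - alpha * H_0 ]
    #          = E[ alpha * ( -log pi(a|s) - H_0 ) ]
    entropy = -log_prob.detach()   # treat the policy as fixed for this update
    return (log_alpha.exp() * (entropy - target_entropy)).mean()
```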

1.5 Pseudocode

[Image: SAC pseudocode]
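Since the pseudocode figure is not reproduced here, the following outline, reconstructed from the implementation in Section 2, summarizes one training iteration:

  • Sample an action $a_t \sim \pi_\phi(\cdot|s_t)$, step the environment, and store $(s_t, a_t, r_t, s_{t+1}, d_t)$ in the replay buffer.
  • Sample a mini-batch from the buffer and compute the soft TD target using the two target critics.
  • Update both Q networks by minimizing the mean-squared error to this target.
  • Update the policy network by minimizing $\alpha\log\pi_\phi - \min_{j=1,2} Q_{\theta_j}$ over reparameterized actions.
  • Update the temperature $\alpha$ by minimizing $J(\alpha)$.
  • Soft-update the two target Q networks.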

2. Implementation

2.1 SAC for continuous action spaces

We use the continuous-action environment Pendulum-v1 from Gymnasium. The full code is given below.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

from tqdm import tqdm
import collections
import random
import numpy as np
import matplotlib.pyplot as plt
import gym

# replay buffer
class ReplayBuffer():
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)
    
    def add(self, s, a, r, s_, d):
        # store transitions as (state, action, reward, next_state, done)
        self.buffer.append((s, a, r, s_, d))

    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*transitions)
        return np.array(states), actions, rewards, np.array(next_states), dones
    
    def size(self):
        return len(self.buffer)

# Actor
class PolicyNet_Continuous(nn.Module):
    """动作空间符合高斯分布,输出动作空间的均值mu,和标准差std"""
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound):
        super(PolicyNet_Continuous, self).__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(in_features=state_dim, out_features=hidden_dim),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(in_features=hidden_dim, out_features=action_dim)
        self.fc_std = nn.Sequential(
            nn.Linear(in_features=hidden_dim, out_features=action_dim),
            nn.Softplus()
        )
        self.action_bound = action_bound

    def forward(self, s):
        x = self.fc1(s)
        mu = self.fc_mu(x)
        std = self.fc_std(x)
        distribution = Normal(mu, std)
        normal_sample = distribution.rsample()
        normal_log_prob = distribution.log_prob(normal_sample)
        # get action limit to [-1,1]
        action = torch.tanh(normal_sample)
        # get tanh_normal log probability
        tanh_log_prob = normal_log_prob - torch.log(1 - action.pow(2) + 1e-7)  # action is already tanh(normal_sample)
        # get action bounded
        action = action * self.action_bound
        return action, tanh_log_prob


# Critic
class QValueNet_Continuous(nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(QValueNet_Continuous, self).__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(in_features=state_dim + action_dim, out_features=hidden_dim),
            nn.ReLU()
        )
        self.fc2 = nn.Sequential(
            nn.Linear(in_features=hidden_dim, out_features=hidden_dim),
            nn.ReLU()
        )
        self.fc_out = nn.Linear(in_features=hidden_dim, out_features=1)
    
    def forward(self, s, a):
        cat = torch.cat([s,a], dim=1)
        x = self.fc1(cat)
        x = self.fc2(x)
        return self.fc_out(x)

# maximize entropy deep reinforcement learning SAC
class SAC_Continuous():
    def __init__(self, state_dim, hidden_dim, action_dim, action_bound,
                    actor_lr, critic_lr, alpha_lr, target_entropy, tau, gamma,
                    device):
        # actor
        self.actor = PolicyNet_Continuous(state_dim, hidden_dim, action_dim, action_bound).to(device)
        # two critics
        self.critic1 = QValueNet_Continuous(state_dim, hidden_dim, action_dim).to(device)
        self.critic2 = QValueNet_Continuous(state_dim, hidden_dim, action_dim).to(device)
        # two target critics
        self.target_critic1 = QValueNet_Continuous(state_dim, hidden_dim, action_dim).to(device)
        self.target_critic2 = QValueNet_Continuous(state_dim, hidden_dim, action_dim).to(device)
        # initialize with same parameters
        self.target_critic1.load_state_dict(self.critic1.state_dict())
        self.target_critic2.load_state_dict(self.critic2.state_dict())
        # specify optimizers
        self.optimizer_actor = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.optimizer_critic1 = torch.optim.Adam(self.critic1.parameters(), lr=critic_lr)
        self.optimizer_critic2 = torch.optim.Adam(self.critic2.parameters(), lr=critic_lr)
        # optimizing log(alpha) instead of alpha keeps training numerically stable
        self.log_alpha = torch.tensor(np.log(0.01), dtype=torch.float, requires_grad = True)
        self.optimizer_log_alpha = torch.optim.Adam([self.log_alpha], lr=alpha_lr)

        self.target_entropy = target_entropy
        self.gamma = gamma
        self.tau = tau
        self.device = device
    
    def take_action(self, state):
        state = torch.tensor(np.array([state]), dtype=torch.float).to(self.device)
        action, _ = self.actor(state)
        return [action.item()]
    
    # calculate td target
    def calc_target(self, rewards, next_states, dones):
        next_action, log_prob = self.actor(next_states)
        entropy = -log_prob
        q1_values = self.target_critic1(next_states, next_action)
        q2_values = self.target_critic2(next_states, next_action)
        next_values = torch.min(q1_values, q2_values) + self.log_alpha.exp() * entropy
        td_target = rewards + self.gamma * next_values * (1-dones)
        return td_target

    # soft update method
    def soft_update(self, net, target_net):
        for param_target, param in zip(target_net.parameters(), net.parameters()):
            param_target.data.copy_(param_target.data * (1.0-self.tau) + param.data * self.tau)
        
    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1,1).to(self.device)
        actions = torch.tensor(transition_dict['actions'], dtype=torch.float).view(-1,1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1,1).to(self.device)

        rewards = (rewards + 8.0) / 8.0     # rescale the Pendulum reward to ease training

        # update two Q-value network
        td_target = self.calc_target(rewards, next_states, dones).detach()
        critic1_loss = torch.mean(F.mse_loss(td_target, self.critic1(states, actions)))
        critic2_loss = torch.mean(F.mse_loss(td_target, self.critic2(states, actions)))

        self.optimizer_critic1.zero_grad()
        critic1_loss.backward()
        self.optimizer_critic1.step()
        self.optimizer_critic2.zero_grad()
        critic2_loss.backward()
        self.optimizer_critic2.step()

        # update policy network
        new_actions, log_prob = self.actor(states)
        entropy = - log_prob
        q1_value = self.critic1(states, new_actions)
        q2_value = self.critic2(states, new_actions)
        actor_loss = torch.mean(-self.log_alpha.exp() * entropy - torch.min(q1_value, q2_value))
        self.optimizer_actor.zero_grad()
        actor_loss.backward()
        self.optimizer_actor.step()

        # update temperature alpha
        alpha_loss = torch.mean((entropy - self.target_entropy).detach() * self.log_alpha.exp())
        self.optimizer_log_alpha.zero_grad()
        alpha_loss.backward()
        self.optimizer_log_alpha.step()

        # soft update target Q-value network
        self.soft_update(self.critic1, self.target_critic1)
        self.soft_update(self.critic2, self.target_critic2)


def train_off_policy_agent(env, agent, num_episodes, replay_buffer, minimal_size, batch_size, render, seed_number):
    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes/10), desc='Iteration %d'%(i+1)) as pbar:
            for i_episode in range(int(num_episodes/10)):
                observation, _ = env.reset(seed=seed_number)
                done = False
                episode_return = 0

                while not done:
                    if render:
                        env.render()
                    action = agent.take_action(observation)
                    observation_, reward, terminated, truncated, _ = env.step(action)
                    done = terminated or truncated
                    replay_buffer.add(observation, action, reward, observation_, done)
                    # swap states
                    observation = observation_
                    episode_return += reward
                    if replay_buffer.size() > minimal_size:
                        b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
                        transition_dict = {
                            'states': b_s,
                            'actions': b_a,
                            'rewards': b_r,
                            'next_states': b_ns,
                            'dones': b_d
                        }
                        agent.update(transition_dict)
                return_list.append(episode_return)
                if(i_episode+1) % 10 == 0:
                    pbar.set_postfix({
                        'episode': '%d'%(num_episodes/10 * i + i_episode + 1),
                        'return': "%.3f"%(np.mean(return_list[-10:]))
                    })
                pbar.update(1)
    env.close()
    return return_list

def moving_average(a, window_size):
    cumulative_sum = np.cumsum(np.insert(a, 0, 0)) 
    middle = (cumulative_sum[window_size:] - cumulative_sum[:-window_size]) / window_size
    r = np.arange(1, window_size-1, 2)
    begin = np.cumsum(a[:window_size-1])[::2] / r
    end = (np.cumsum(a[:-window_size:-1])[::2] / r)[::-1]
    return np.concatenate((begin, middle, end))

def plot_curve(return_list, mv_return, algorithm_name, env_name):
    episodes_list = list(range(len(return_list)))
    plt.plot(episodes_list, return_list, c='gray', alpha=0.6)
    plt.plot(episodes_list, mv_return)
    plt.xlabel('Episodes')
    plt.ylabel('Returns')
    plt.title('{} on {}'.format(algorithm_name, env_name))
    plt.show()



if __name__ == "__main__":

    # reproducible
    seed_number = 0
    random.seed(seed_number)
    np.random.seed(seed_number)
    torch.manual_seed(seed_number)

    num_episodes = 150      # number of episodes
    hidden_dim = 128        # hidden layers dimension
    gamma = 0.98            # discounted rate
    actor_lr = 1e-4         # lr of actor
    critic_lr = 1e-3        # lr of critic
    alpha_lr = 1e-4
    tau = 0.005             # soft update parameter
    buffer_size = 10000
    minimal_size = 1000
    batch_size = 64

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    env_name = 'Pendulum-v1'

    render = False
    if render:
        env = gym.make(id=env_name, render_mode='human')
    else:
        env = gym.make(id=env_name)

    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]  
    action_bound = env.action_space.high[0]
    # target entropy is set to the negative of the action-space dimension
    target_entropy = - env.action_space.shape[0]

    replaybuffer = ReplayBuffer(buffer_size)
    agent = SAC_Continuous(state_dim, hidden_dim, action_dim, action_bound, actor_lr, critic_lr, alpha_lr, target_entropy, tau, gamma, device)
    return_list = train_off_policy_agent(env, agent, num_episodes, replaybuffer, minimal_size, batch_size, render, seed_number)

    mv_return = moving_average(return_list, 9)
    plot_curve(return_list, mv_return, 'SAC', env_name)

The resulting return curve is shown below.

[Image: return curve of SAC on Pendulum-v1]

2.2 SAC for discrete action spaces

We use the discrete-action environment CartPole-v1 from Gymnasium.
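With a discrete action space, expectations over actions can be computed exactly from the policy's output probabilities instead of being estimated from samples; this is what the code below does when it forms `entropy` and `min_qvalue`. In formula form (a sketch, using the notation of Section 1):

$$V(s) = \sum_{a} \pi_\phi(a|s)\Big(\min_{j=1,2} Q_{\theta_j}(s,a) - \alpha \log \pi_\phi(a|s)\Big)$$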

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

from tqdm import tqdm
import collections
import random
import numpy as np
import matplotlib.pyplot as plt
import gym

# replay buffer
class ReplayBuffer():
    def __init__(self, capacity):
        self.buffer = collections.deque(maxlen=capacity)
    
    def add(self, s, a, r, s_, d):
        # store transitions as (state, action, reward, next_state, done)
        self.buffer.append((s, a, r, s_, d))

    def sample(self, batch_size):
        transitions = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*transitions)
        return np.array(states), actions, rewards, np.array(next_states), dones
    
    def size(self):
        return len(self.buffer)

# Actor
class PolicyNet_Discrete(nn.Module):
    """动作空间服从离散的概率分布,输出每个动作的概率值"""
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(PolicyNet_Discrete, self).__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(in_features=state_dim, out_features=hidden_dim),
            nn.ReLU()
        )
        self.fc2 = nn.Sequential(
            nn.Linear(in_features=hidden_dim, out_features=action_dim),
            nn.Softmax(dim=1)
        )

    def forward(self, s):
        x = self.fc1(s)
        return self.fc2(x)


# Critic
class QValueNet_Discrete(nn.Module):
    def __init__(self, state_dim, hidden_dim, action_dim):
        super(QValueNet_Discrete, self).__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(in_features=state_dim, out_features=hidden_dim),
            nn.ReLU()
        )
        self.fc2 = nn.Linear(in_features=hidden_dim, out_features=action_dim)
    
    def forward(self, s):
        x = self.fc1(s)
        return self.fc2(x)

# maximize entropy deep reinforcement learning SAC
class SAC_Discrete():
    def __init__(self, state_dim, hidden_dim, action_dim,
                    actor_lr, critic_lr, alpha_lr, target_entropy, tau, gamma,
                    device):
        # actor
        self.actor = PolicyNet_Discrete(state_dim, hidden_dim, action_dim).to(device)
        # two critics
        self.critic1 = QValueNet_Discrete(state_dim, hidden_dim, action_dim).to(device)
        self.critic2 = QValueNet_Discrete(state_dim, hidden_dim, action_dim).to(device)
        # two target critics
        self.target_critic1 = QValueNet_Discrete(state_dim, hidden_dim, action_dim).to(device)
        self.target_critic2 = QValueNet_Discrete(state_dim, hidden_dim, action_dim).to(device)
        # initialize with same parameters
        self.target_critic1.load_state_dict(self.critic1.state_dict())
        self.target_critic2.load_state_dict(self.critic2.state_dict())
        # specify optimizers
        self.optimizer_actor = torch.optim.Adam(self.actor.parameters(), lr=actor_lr)
        self.optimizer_critic1 = torch.optim.Adam(self.critic1.parameters(), lr=critic_lr)
        self.optimizer_critic2 = torch.optim.Adam(self.critic2.parameters(), lr=critic_lr)
        # optimizing log(alpha) instead of alpha keeps training numerically stable
        self.log_alpha = torch.tensor(np.log(0.01), dtype=torch.float, requires_grad = True)
        self.optimizer_log_alpha = torch.optim.Adam([self.log_alpha], lr=alpha_lr)

        self.target_entropy = target_entropy
        self.gamma = gamma
        self.tau = tau
        self.device = device
    
    def take_action(self, state):
        state = torch.tensor(np.array([state]), dtype=torch.float).to(self.device)
        probs = self.actor(state)
        action_dist = Categorical(probs)
        action = action_dist.sample()
        return action.item()
    
    # calculate td target
    def calc_target(self, rewards, next_states, dones):
        next_probs = self.actor(next_states)
        next_log_probs = torch.log(next_probs + 1e-8)
        entropy = -torch.sum(next_probs * next_log_probs, dim=1, keepdim=True)
        q1_values = self.target_critic1(next_states)
        q2_values = self.target_critic2(next_states)
        min_qvalue = torch.sum(next_probs * torch.min(q1_values, q2_values),
                               dim=1,
                               keepdim=True)
        next_value = min_qvalue + self.log_alpha.exp() * entropy
        td_target = rewards + self.gamma * next_value * (1 - dones)
        return td_target

    # soft update method
    def soft_update(self, net, target_net):
        for param_target, param in zip(target_net.parameters(), net.parameters()):
            param_target.data.copy_(param_target.data * (1.0-self.tau) + param.data * self.tau)
        
    def update(self, transition_dict):
        states = torch.tensor(transition_dict['states'], dtype=torch.float).to(self.device)
        rewards = torch.tensor(transition_dict['rewards'], dtype=torch.float).view(-1,1).to(self.device)
        actions = torch.tensor(transition_dict['actions'], dtype=torch.int64).view(-1,1).to(self.device)
        next_states = torch.tensor(transition_dict['next_states'], dtype=torch.float).to(self.device)
        dones = torch.tensor(transition_dict['dones'], dtype=torch.float).view(-1,1).to(self.device)


        # update two Q-value network
        td_target = self.calc_target(rewards, next_states, dones).detach()
        critic1_loss = torch.mean(F.mse_loss(td_target, self.critic1(states).gather(dim=1,index=actions)))
        critic2_loss = torch.mean(F.mse_loss(td_target, self.critic2(states).gather(dim=1,index=actions)))

        self.optimizer_critic1.zero_grad()
        critic1_loss.backward()
        self.optimizer_critic1.step()
        self.optimizer_critic2.zero_grad()
        critic2_loss.backward()
        self.optimizer_critic2.step()

        # update policy network
        probs = self.actor(states)
        log_probs = torch.log(probs + 1e-8)
        entropy = -torch.sum(probs * log_probs, dim=1, keepdim=True)
        q1_value = self.critic1(states)
        q2_value = self.critic2(states)
        min_qvalue = torch.sum(probs * torch.min(q1_value, q2_value),
                               dim=1,
                               keepdim=True)  # expectation taken directly over the action probabilities
        actor_loss = torch.mean(-self.log_alpha.exp() * entropy - min_qvalue)
        self.optimizer_actor.zero_grad()
        actor_loss.backward()
        self.optimizer_actor.step()

        # update temperature alpha
        alpha_loss = torch.mean((entropy - self.target_entropy).detach() * self.log_alpha.exp())
        self.optimizer_log_alpha.zero_grad()
        alpha_loss.backward()
        self.optimizer_log_alpha.step()

        # soft update target Q-value network
        self.soft_update(self.critic1, self.target_critic1)
        self.soft_update(self.critic2, self.target_critic2)


def train_off_policy_agent(env, agent, num_episodes, replay_buffer, minimal_size, batch_size, render, seed_number):
    return_list = []
    for i in range(10):
        with tqdm(total=int(num_episodes/10), desc='Iteration %d'%(i+1)) as pbar:
            for i_episode in range(int(num_episodes/10)):
                observation, _ = env.reset(seed=seed_number)
                done = False
                episode_return = 0

                while not done:
                    if render:
                        env.render()
                    action = agent.take_action(observation)
                    observation_, reward, terminated, truncated, _ = env.step(action)
                    done = terminated or truncated
                    replay_buffer.add(observation, action, reward, observation_, done)
                    # swap states
                    observation = observation_
                    episode_return += reward
                    if replay_buffer.size() > minimal_size:
                        b_s, b_a, b_r, b_ns, b_d = replay_buffer.sample(batch_size)
                        transition_dict = {
                            'states': b_s,
                            'actions': b_a,
                            'rewards': b_r,
                            'next_states': b_ns,
                            'dones': b_d
                        }
                        agent.update(transition_dict)
                return_list.append(episode_return)
                if(i_episode+1) % 10 == 0:
                    pbar.set_postfix({
                        'episode': '%d'%(num_episodes/10 * i + i_episode + 1),
                        'return': "%.3f"%(np.mean(return_list[-10:]))
                    })
                pbar.update(1)
    env.close()
    return return_list

def moving_average(a, window_size):
    cumulative_sum = np.cumsum(np.insert(a, 0, 0)) 
    middle = (cumulative_sum[window_size:] - cumulative_sum[:-window_size]) / window_size
    r = np.arange(1, window_size-1, 2)
    begin = np.cumsum(a[:window_size-1])[::2] / r
    end = (np.cumsum(a[:-window_size:-1])[::2] / r)[::-1]
    return np.concatenate((begin, middle, end))

def plot_curve(return_list, mv_return, algorithm_name, env_name):
    episodes_list = list(range(len(return_list)))
    plt.plot(episodes_list, return_list, c='gray', alpha=0.6)
    plt.plot(episodes_list, mv_return)
    plt.xlabel('Episodes')
    plt.ylabel('Returns')
    plt.title('{} on {}'.format(algorithm_name, env_name))
    plt.show()



if __name__ == "__main__":

    # reproducible
    seed_number = 0
    random.seed(seed_number)
    np.random.seed(seed_number)
    torch.manual_seed(seed_number)

    num_episodes = 200     # number of episodes
    hidden_dim = 128        # hidden layers dimension
    gamma = 0.98            # discounted rate
    actor_lr = 1e-3         # lr of actor
    critic_lr = 1e-2        # lr of critic
    alpha_lr = 1e-2
    tau = 0.005             # soft update parameter
    buffer_size = 10000
    minimal_size = 500
    batch_size = 64


    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    env_name = 'CartPole-v1'

    render = False
    if render:
        env = gym.make(id=env_name, render_mode='human')
    else:
        env = gym.make(id=env_name)

    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.n  

    # target entropy is set to -1
    target_entropy = -1

    replaybuffer = ReplayBuffer(buffer_size)
    agent = SAC_Discrete(state_dim, hidden_dim, action_dim, actor_lr, critic_lr, alpha_lr, target_entropy, tau, gamma, device)
    return_list = train_off_policy_agent(env, agent, num_episodes, replaybuffer, minimal_size, batch_size, render, seed_number)

    mv_return = moving_average(return_list, 9)
    plot_curve(return_list, mv_return, 'SAC', env_name)

Reference

Hands on RL

SAC (Soft Actor-Critic) Reading Notes

Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
