Deep Reinforcement Learning (9): Improving the Policy Gradient
1. Policy Gradient with a Baseline
Theorem:
Let $b$ be an arbitrary function that does not depend on $A$ (it may depend on the state $S$). Using $b$ as a baseline for the action-value function $Q_\pi(S, A)$ has no effect on the policy gradient:
$$
\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})=\mathbb{E}_S\left[\mathbb{E}_{A \sim \pi(\cdot \mid S ; \boldsymbol{\theta})}\left[\left(Q_\pi(S, A)-b\right) \cdot \nabla_{\boldsymbol{\theta}} \ln \pi(A \mid S ; \boldsymbol{\theta})\right]\right].
$$
Proof: It suffices to show that the term contributed by the baseline has zero expectation:
$$
\begin{aligned}
\mathbb{E}_{S}\big[\mathbb{E}_{A\sim\pi(\cdot\mid S;\boldsymbol\theta)}[b\cdot\nabla_{\boldsymbol\theta}\ln \pi(A\mid S;\boldsymbol\theta)]\big]
&=\mathbb{E}_{S,A}\big[b\cdot \nabla_{\boldsymbol\theta}\ln \pi(A\mid S;\boldsymbol\theta)\big]\\
&=\sum_{s,a} b\cdot\nabla_{\boldsymbol\theta}\pi(a\mid s;\boldsymbol\theta)\,\frac{p(s,a)}{\pi(a\mid s;\boldsymbol\theta)}\\
&=\sum_{s,a} b\cdot\nabla_{\boldsymbol\theta}\pi(a\mid s;\boldsymbol\theta)\cdot p(s)\\
&=\sum_{s}\Big[b\cdot p(s)\sum_{a}\nabla_{\boldsymbol\theta}\pi(a\mid s;\boldsymbol\theta)\Big]\\
&=\sum_{s}\Big[b\cdot p(s)\,\nabla_{\boldsymbol\theta}\sum_{a}\pi(a\mid s;\boldsymbol\theta)\Big]\\
&=\sum_{s}\big[b\cdot p(s)\,\nabla_{\boldsymbol\theta}1\big]\\
&=0.
\end{aligned}
$$
Therefore the policy gradient $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$ can be approximated by the following stochastic gradient:
$$
\boldsymbol{g}_b(s, a ; \boldsymbol{\theta})=\left[Q_\pi(s, a)-b\right] \cdot \nabla_{\boldsymbol{\theta}} \ln \pi(a \mid s ; \boldsymbol{\theta}).
$$
No matter what value $b$ takes, $\mathbb{E}_{S,A}[\boldsymbol g_{b}(S,A;\boldsymbol \theta)]$ equals the policy gradient, so $\boldsymbol g_b$ is an unbiased estimate; its variance, however, depends on the choice of $b$:
$$
\begin{aligned}
\mathrm{Var}
&=\mathbb{E}_{S,A}\!\left[\big\|\boldsymbol{g}_b(S, A ; \boldsymbol{\theta})-\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})\big\|^2\right]\\
&=\mathbb{E}_{S,A}\!\left[\big\|\boldsymbol{g}_b(S, A ; \boldsymbol{\theta})\big\|^2\right]-\big\|\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})\big\|^2\\
&=\mathbb{E}_{S,A}\!\left[\big(Q_{\pi}(S,A)-b\big)^2\,\big\|\nabla_{\boldsymbol{\theta}}\ln \pi(A\mid S;\boldsymbol{\theta})\big\|^2\right]-\big\|\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})\big\|^2.
\end{aligned}
$$
Since $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$ is a constant that does not depend on $b$, we only need to minimize
$$
\mathbb{E}_{S,A}\!\left[\big(Q_{\pi}(S,A)-b\big)^2\,\big\|\nabla_{\boldsymbol{\theta}}\ln \pi(A\mid S;\boldsymbol{\theta})\big\|^2\right]
$$
over $b$.
$$
\begin{aligned}
\mathbb{E}_{S,A}\!\left[\big(Q_{\pi}(S,A)-b\big)^2\,\big\|\nabla_{\boldsymbol{\theta}}\ln \pi(A\mid S;\boldsymbol{\theta})\big\|^2\right]
&=\mathbb{E}_{S}\!\left[\mathbb{E}_{A\sim \pi(\cdot\mid S;\boldsymbol \theta)}\!\left[\big(Q_{\pi}(S,A)-b\big)^2\,\big\|\nabla_{\boldsymbol \theta}\ln\pi(A\mid S;\boldsymbol \theta)\big\|^2\right]\right]\\
&=\mathbb{E}_{S}\!\left[\sum_{a}\frac{\big\|\nabla_{\boldsymbol \theta}\pi(a\mid S;\boldsymbol \theta)\big\|^2}{\pi(a\mid S;\boldsymbol \theta)}\,\big(Q_{\pi}(S,a)-b\big)^2\right].
\end{aligned}
$$
Thus, for each state, minimizing the variance amounts to minimizing a squared error weighted by
$$
w(a\mid S)\;\triangleq\;\frac{\big\|\nabla_{\boldsymbol \theta}\pi(a\mid S;\boldsymbol \theta)\big\|^2}{\pi(a\mid S;\boldsymbol \theta)}\;=\;\pi(a\mid S;\boldsymbol \theta)\,\big\|\nabla_{\boldsymbol \theta}\ln\pi(a\mid S;\boldsymbol \theta)\big\|^2,
$$
treated as an (unnormalized) density over actions. The minimizer is the weighted mean of $Q_\pi(S,A)$:
$$
\begin{aligned}
b^\star
&=\frac{\sum_{a} w(a\mid S)\,Q_{\pi}(S,a)}{\sum_{a} w(a\mid S)}\\
&=\frac{\mathbb{E}_{A \sim \pi(\cdot\mid S;\boldsymbol\theta)}\!\left[\big\|\nabla_{\boldsymbol\theta} \ln \pi(A \mid S;\boldsymbol\theta)\big\|^2\, Q_\pi(S, A)\right]}{\mathbb{E}_{A \sim \pi(\cdot\mid S;\boldsymbol\theta)}\!\left[\big\|\nabla_{\boldsymbol\theta} \ln \pi(A \mid S;\boldsymbol\theta)\big\|^2\right]}.
\end{aligned}
$$
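To see why the weighted mean minimizes the weighted squared error, set its derivative with respect to $b$ to zero:
$$
\frac{\partial}{\partial b}\sum_{a} w(a\mid S)\,\big(Q_\pi(S,a)-b\big)^2
=-2\sum_{a} w(a\mid S)\,\big(Q_\pi(S,a)-b\big)=0
\quad\Longrightarrow\quad
b^\star=\frac{\sum_{a} w(a\mid S)\,Q_\pi(S,a)}{\sum_{a} w(a\mid S)}.
$$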
In practice, we use $b=\mathbb E_{A\sim \pi(\cdot\mid S)}[Q_{\pi}(S,A)]=V_\pi(S)$ as an approximate substitute for $b^\star$.
Using the state value $V_\pi(s)$ as the baseline gives an unbiased estimate of the policy gradient:
$$
\boldsymbol{g}(s, a ; \boldsymbol{\theta})=\left[Q_\pi(s, a)-V_\pi(s)\right] \cdot \nabla_{\boldsymbol{\theta}} \ln \pi(a \mid s ; \boldsymbol{\theta}).
$$
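As a quick numerical illustration of both points (unbiasedness and variance reduction), the sketch below estimates the stochastic gradient for a hypothetical one-state bandit with a softmax policy, once without a baseline and once with $b=V_\pi(s)$. The logits, the $Q_\pi$ values, and the sample count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-state bandit: softmax policy over 3 actions.
theta = np.array([0.2, -0.1, 0.4])   # policy parameters (logits), illustrative
Q = np.array([1.0, 3.0, 2.0])        # assumed true action values Q_pi(s, a)

def pi(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    # For a softmax policy: d/d theta log pi(a) = onehot(a) - pi(theta).
    g = -pi(theta)
    g[a] += 1.0
    return g

p = pi(theta)
V = np.dot(p, Q)                     # baseline b = V_pi(s)

n = 100_000
g_no_base, g_base = [], []
for _ in range(n):
    a = rng.choice(3, p=p)
    glog = grad_log_pi(theta, a)
    g_no_base.append(Q[a] * glog)        # b = 0
    g_base.append((Q[a] - V) * glog)     # b = V_pi(s)

g_no_base, g_base = np.array(g_no_base), np.array(g_base)
print("mean without baseline:", g_no_base.mean(axis=0))
print("mean with baseline:   ", g_base.mean(axis=0))
# Summed per-component variance (trace of the covariance), as in the Var above.
print("var  without baseline:", g_no_base.var(axis=0).sum())
print("var  with baseline:   ", g_base.var(axis=0).sum())
```

The two empirical means should agree up to sampling noise, while the summed variance is typically noticeably smaller with the baseline.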
REINFORCE replaces the action value $Q_\pi(s, a)$ with the actually observed return $u$; we do the same here. In addition, we use a neural network $v(s ; \boldsymbol{w})$ to approximate the state-value function $V_\pi(s)$. With these substitutions, $\boldsymbol{g}(s, a ; \boldsymbol{\theta})$ is approximated by:
$$
\tilde{\boldsymbol{g}}(s, a ; \boldsymbol{\theta})=[u-v(s ; \boldsymbol{w})] \cdot \nabla_{\boldsymbol{\theta}} \ln \pi(a \mid s ; \boldsymbol{\theta}).
$$
We can use $\tilde{\boldsymbol{g}}(s, a ; \boldsymbol{\theta})$ as an approximation of the policy gradient $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$ and update the policy-network parameters:
$$
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\beta \cdot \tilde{\boldsymbol{g}}(s, a ; \boldsymbol{\theta}).
$$
The value network is trained by regression. Recall that the state value is the expected return:
$$
V_\pi\left(s_t\right)=\mathbb{E}\left[U_t \mid S_t=s_t\right],
$$
where the expectation integrates out the actions $A_t, A_{t+1}, \cdots, A_n$ and the states $S_{t+1}, \cdots, S_n$. The goal of training the value network is to make $v(s_t ; \boldsymbol{w})$ fit $V_\pi(s_t)$, i.e., fit the expectation of $u_t$. Define the loss function:
$$
L(\boldsymbol{w})=\frac{1}{2 n} \sum_{t=1}^n\left[v\left(s_t ; \boldsymbol{w}\right)-u_t\right]^2.
$$
Let $\widehat{v}_t=v\left(s_t ; \boldsymbol{w}\right)$. The gradient of the loss is:
$$
\nabla_{\boldsymbol{w}} L(\boldsymbol{w})=\frac{1}{n} \sum_{t=1}^n\left(\widehat{v}_t-u_t\right) \cdot \nabla_{\boldsymbol{w}} v\left(s_t ; \boldsymbol{w}\right).
$$
Do one gradient-descent update of $\boldsymbol{w}$:
$$
\boldsymbol{w} \leftarrow \boldsymbol{w}-\alpha \cdot \nabla_{\boldsymbol{w}} L(\boldsymbol{w}).
$$
The rest of the training procedure is the same as in REINFORCE; a rough code sketch of one training episode is given below.
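The following is a minimal PyTorch sketch of REINFORCE with a learned baseline, assuming a Gym-style environment with `reset()`/`step()`; the network sizes, learning rates, and the `env` object are illustrative assumptions. It collects a trajectory, computes the returns $u_t$, ascends $[u_t - v(s_t;\boldsymbol w)]\cdot\nabla_{\boldsymbol\theta}\ln\pi(a_t\mid s_t;\boldsymbol\theta)$, and fits $v(s_t;\boldsymbol w)$ to $u_t$ by regression.

```python
import torch
import torch.nn as nn

# Illustrative sizes and step sizes; state_dim / num_actions depend on the task.
state_dim, num_actions, gamma = 4, 2, 0.99
alpha, beta = 1e-3, 1e-3   # value-network / policy-network learning rates

policy_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy_opt = torch.optim.SGD(policy_net.parameters(), lr=beta)
value_opt = torch.optim.SGD(value_net.parameters(), lr=alpha)

def run_episode(env):
    """Collect one trajectory with the current policy pi(a | s; theta)."""
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        s_t = torch.as_tensor(s, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy_net(s_t))
        a = dist.sample()
        s, r, done, _ = env.step(a.item())
        states.append(s_t); actions.append(a); rewards.append(r)
    return states, actions, rewards

def update(states, actions, rewards):
    # Monte Carlo returns u_t (discounted reward-to-go).
    returns, u = [], 0.0
    for r in reversed(rewards):
        u = r + gamma * u
        returns.append(u)
    returns.reverse()
    u_t = torch.tensor(returns, dtype=torch.float32)
    s_t, a_t = torch.stack(states), torch.stack(actions)

    # Policy update: ascend [u_t - v(s_t; w)] * grad_theta log pi(a_t | s_t; theta).
    with torch.no_grad():
        baseline = value_net(s_t).squeeze(-1)          # v(s_t; w) as the baseline
    log_pi = torch.distributions.Categorical(logits=policy_net(s_t)).log_prob(a_t)
    policy_loss = -((u_t - baseline) * log_pi).mean()  # minus sign: SGD minimizes
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    # Value update: regression of v(s_t; w) onto u_t, i.e. the loss L(w) above.
    v_t = value_net(s_t).squeeze(-1)
    value_loss = 0.5 * ((v_t - u_t) ** 2).mean()
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()
```

Averaging over the episode instead of summing only rescales the effective step size.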
2. Advantage Actor-Critic (A2C)
Training the value network: REINFORCE uses the Monte Carlo method to compute every return $u_t$ directly, so the value network approximating $V_\pi(s)$ can be trained directly against those returns. Actor-critic does not use Monte Carlo; instead, it trains by bootstrapping, based on the Bellman equation:
$$
\begin{aligned}
V_\pi\left(s_t\right)&=\mathbb{E}_{A_t, S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right) \mid S_t=s_t\right]\\
&= \mathbb{E}_{A_t}\!\left[\mathbb{E}_{S_{t+1}}\left[R_{t}+\gamma \cdot V_{\pi}(S_{t+1})\mid S_t=s_t,A_t\right] \mid S_t=s_t\right].
\end{aligned}
$$
Starting from the state $s_t$, select the action $a_t$ according to the policy $\pi(A\mid S)$, then sample the next state $s_{t+1}$ according to the transition probability $p(S_{t+1}\mid A_t, S_t)$, and observe the reward $r_t$.
A Monte Carlo approximation of the right-hand side then gives the TD target $\widehat{y}_t=r_t+\gamma\cdot v(s_{t+1};\boldsymbol{w})$.
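For instance, with hypothetical values $r_t = 1$, $\gamma = 0.9$, and $v(s_{t+1};\boldsymbol{w}) = 2$, the TD target would be $\widehat{y}_t = 1 + 0.9 \times 2 = 2.8$.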
Concretely, the value-network parameters $\boldsymbol{w}$ are updated as follows. Define the loss function:
$$
L(\boldsymbol{w}) \triangleq \frac{1}{2}\left[v\left(s_t ; \boldsymbol{w}\right)-\widehat{y}_t\right]^2.
$$
Let $\widehat{v}_t \triangleq v\left(s_t ; \boldsymbol{w}\right)$. The gradient of the loss is:
$$
\nabla_{\boldsymbol{w}} L(\boldsymbol{w})=\underbrace{\left(\widehat{v}_t-\widehat{y}_t\right)}_{\text{TD error } \delta_t} \cdot \nabla_{\boldsymbol{w}} v\left(s_t ; \boldsymbol{w}\right).
$$
Define the TD error as $\delta_t \triangleq \widehat{v}_t-\widehat{y}_t$. Do one gradient-descent update of $\boldsymbol{w}$:
$$
\boldsymbol{w} \leftarrow \boldsymbol{w}-\alpha \cdot \delta_t \cdot \nabla_{\boldsymbol{w}} v\left(s_t ; \boldsymbol{w}\right).
$$
Training the policy network: start from the Bellman equation:
$$
Q_\pi\left(s_t, a_t\right)=\mathbb{E}_{S_{t+1} \sim p\left(\cdot \mid s_t, a_t\right)}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right)\right].
$$
Replacing $Q_\pi\left(s_t, a_t\right)$ in the approximate policy gradient $\boldsymbol{g}\left(s_t, a_t ; \boldsymbol{\theta}\right)$ with this expectation gives:
$$
\begin{aligned}
\boldsymbol{g}\left(s_t, a_t ; \boldsymbol{\theta}\right) & =\left[Q_\pi\left(s_t, a_t\right)-V_\pi\left(s_t\right)\right] \cdot \nabla_{\boldsymbol{\theta}} \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right) \\
& =\left[\mathbb{E}_{S_{t+1}}\left[R_t+\gamma \cdot V_\pi\left(S_{t+1}\right)\right]-V_\pi\left(s_t\right)\right] \cdot \nabla_{\boldsymbol{\theta}} \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right).
\end{aligned}
$$
After the agent executes the action $a_t$, the environment returns the new state $s_{t+1}$ and the reward $r_t$; using $s_{t+1}$ and $r_t$ to form a Monte Carlo approximation of the expectation above gives:
$$
\boldsymbol{g}\left(s_t, a_t ; \boldsymbol{\theta}\right) \approx\left[r_t+\gamma \cdot V_\pi\left(s_{t+1}\right)-V_\pi\left(s_t\right)\right] \cdot \nabla_{\boldsymbol{\theta}} \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right).
$$
Further replacing the state-value function $V_\pi(s)$ with the value network $v(s ; \boldsymbol{w})$ gives:
$$
\tilde{\boldsymbol{g}}\left(s_t, a_t ; \boldsymbol{\theta}\right) \triangleq\big[\underbrace{r_t+\gamma \cdot v\left(s_{t+1} ; \boldsymbol{w}\right)}_{\text{TD target } \widehat{y}_t}-v\left(s_t ; \boldsymbol{w}\right)\big] \cdot \nabla_{\boldsymbol{\theta}} \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right).
$$
Recall the TD target and TD error defined above:
$$
\widehat{y}_t \triangleq r_t+\gamma \cdot v\left(s_{t+1} ; \boldsymbol{w}\right) \quad \text{and} \quad \delta_t \triangleq v\left(s_t ; \boldsymbol{w}\right)-\widehat{y}_t.
$$
Therefore $\tilde{\boldsymbol{g}}$ can be written as:
$$
\tilde{\boldsymbol{g}}\left(s_t, a_t ; \boldsymbol{\theta}\right) \triangleq-\delta_t \cdot \nabla_{\boldsymbol{\theta}} \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}\right).
$$
Since $\tilde{\boldsymbol{g}}$ approximates $\boldsymbol{g}$, it also approximates the policy gradient $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta})$. Use $\tilde{\boldsymbol{g}}$ to update the policy-network parameters $\boldsymbol{\theta}$:
$$
\boldsymbol{\theta} \leftarrow \boldsymbol{\theta}+\beta \cdot \tilde{\boldsymbol{g}}\left(s_t, a_t ; \boldsymbol{\theta}\right).
$$
Training procedure: let the current policy-network parameters be $\boldsymbol{\theta}_{\text{now}}$ and the current value-network parameters be $\boldsymbol{w}_{\text{now}}$. Execute the following steps to update them to $\boldsymbol{\theta}_{\text{new}}$ and $\boldsymbol{w}_{\text{new}}$ (a minimal code sketch of one such update follows the list):
- Observe the current state $s_t$, let the policy network decide, $a_t \sim \pi\left(\cdot \mid s_t ; \boldsymbol{\theta}_{\text{now}}\right)$, and have the agent execute the action $a_t$.
- Observe the reward $r_t$ and the new state $s_{t+1}$ from the environment.
- Let the value network score both states:
$$
\widehat{v}_t=v\left(s_t ; \boldsymbol{w}_{\text{now}}\right) \quad \text{and} \quad \widehat{v}_{t+1}=v\left(s_{t+1} ; \boldsymbol{w}_{\text{now}}\right).
$$
- Compute the TD target and TD error:
$$
\widehat{y}_t=r_t+\gamma \cdot \widehat{v}_{t+1} \quad \text{and} \quad \delta_t=\widehat{v}_t-\widehat{y}_t.
$$
- Update the value network:
$$
\boldsymbol{w}_{\text{new}} \leftarrow \boldsymbol{w}_{\text{now}}-\alpha \cdot \delta_t \cdot \nabla_{\boldsymbol{w}} v\left(s_t ; \boldsymbol{w}_{\text{now}}\right).
$$
- Update the policy network:
$$
\boldsymbol{\theta}_{\text{new}} \leftarrow \boldsymbol{\theta}_{\text{now}}-\beta \cdot \delta_t \cdot \nabla_{\boldsymbol{\theta}} \ln \pi\left(a_t \mid s_t ; \boldsymbol{\theta}_{\text{now}}\right).
$$
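A minimal PyTorch sketch of one such update is shown below, assuming a Gym-style `env.step()` and illustrative network sizes and learning rates; `a2c_step` follows the steps above for a single transition (with a standard terminal-state mask added for the bootstrap term).

```python
import torch
import torch.nn as nn

state_dim, num_actions, gamma = 4, 2, 0.99   # illustrative sizes
alpha, beta = 1e-3, 1e-3                     # value / policy learning rates

policy_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
value_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
policy_opt = torch.optim.SGD(policy_net.parameters(), lr=beta)
value_opt = torch.optim.SGD(value_net.parameters(), lr=alpha)

def a2c_step(env, s):
    """One A2C update starting from state s; returns (next_state, done)."""
    s_t = torch.as_tensor(s, dtype=torch.float32)

    # Observe s_t, sample a_t ~ pi(. | s_t; theta_now) and execute it.
    dist = torch.distributions.Categorical(logits=policy_net(s_t))
    a_t = dist.sample()
    s_next, r_t, done, _ = env.step(a_t.item())

    # Let the value network score s_t and s_{t+1}.
    v_t = value_net(s_t).squeeze(-1)
    with torch.no_grad():
        v_next = value_net(torch.as_tensor(s_next, dtype=torch.float32)).squeeze(-1)

    # TD target and TD error (bootstrap is zeroed at episode end, a standard detail).
    y_t = r_t + gamma * v_next * (0.0 if done else 1.0)
    delta_t = (v_t - y_t).detach()

    # Value network: w <- w - alpha * delta_t * grad_w v(s_t; w).
    value_loss = delta_t * v_t               # gradient w.r.t. w is delta_t * grad_w v(s_t; w)
    value_opt.zero_grad(); value_loss.backward(); value_opt.step()

    # Policy network: theta <- theta - beta * delta_t * grad_theta ln pi(a_t | s_t; theta).
    policy_loss = delta_t * dist.log_prob(a_t)
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()

    return s_next, done
```

Because `delta_t` is detached, plain SGD on `value_loss` and `policy_loss` reproduces exactly the two update rules in the list above.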