[Paper, 2000] Local convergence proofs for policy gradient methods such as REINFORCE and actor-critic

Note: I do not fully understand some parts of the proofs.

SUTTON R S, MCALLESTER D A, SINGH S P, et al. Policy gradient methods for reinforcement learning with function approximation [C] // Advances in neural information processing systems, 2000: 1057-1063.

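Contents

Abstract
Introduction
1 Policy Gradient Theorem
2 Policy Gradient with Approximation
3 Application to Deriving Algorithms and to Advantages
4 Convergence of Policy Iteration with Function Approximation
Acknowledgments
References
Appendix: Proof of Theorem 1

Abstract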

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable.
函数近似对强化学习至关重要，但近似一个价值函数并从中确定策略的标准方法迄今为止在理论上被证明是难以解决的。
In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters.
在本文中，我们探索了一种替代方法，其中策略由其自己的函数近似器显式表示，独立于价值函数，并根据期望奖励相对于策略参数的梯度进行更新。
Williams’s REINFORCE method and actor-critic methods are examples of this approach.
Williams 的 REINFORCE 方法和 actor-critic 方法都是这种方法的例子。
Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function.
我们的主要新结果是表明梯度可以写成一种适合于由近似动作-价值或优势函数 辅助的经验估计的形式。
Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy.
利用这一结果，我们首次证明了具有任意可微函数近似的策略迭代收敛于局部最优策略。

Introduction

Large applications of reinforcement learning (RL) require the use of generalizing function approximators such as neural networks, decision-trees, or instance-based methods.
强化学习 (RL) 的大规模应用需要使用泛化函数近似器，如神经网络、决策树或基于实例的方法。
The dominant approach for the last decade has been the value-function approach, in which all function approximation effort goes into estimating a value function, with the action-selection policy represented implicitly as the “greedy” policy with respect to the estimated values (e.g., as the policy that selects in each state the action with highest estimated value).
在过去十年中，占主导地位的方法是价值函数方法，其中所有的函数近似努力都用于估计价值函数，动作选择策略隐含地表示为相对于估计的价值的“贪婪”策略 (例如，作为在每个状态中选择具有最高价值估计的动作的策略)。
The value-function approach has worked well in many applications, but has several limitations.
价值函数方法在许多应用程序中工作得很好，但有一些限制。
First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e.g., see Singh, Jaakkola, and Jordan, 1994).
首先，它倾向于寻找确定性策略，而最优策略通常是随机的，以特定概率选择不同动作(例如，参见 Singh, Jaakkola, and Jordan, 1994)。
Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected.
其次，一个动作的估计值的任意微小变化都可能导致它被选中或不被选中。
Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996).
这种不连续变化被认为是为遵循价值函数方法的算法建立收敛保证的关键障碍 (Bertsekas和Tsitsiklis, 1996)。
For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996).
例如，Q-learning、Sarsa 和动态规划方法都被证明不能收敛于简单 MDPs 和简单函数近似器的任何策略 (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996)。
This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of “best” is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods.
即使在改变策略之前的每一步都找到了最佳近似值，无论“最佳”的概念是在均方误差意义上还是在残差梯度、时序差分和动态规划方法的稍微不同的意义上，也可能发生这种情况。

In this paper we explore an alternative approach to function approximation in RL.
在本文中我们探讨了强化学习中函数近似的另一种方法。
Rather than approximating a value function and using that to compute a deterministic policy, we approximate a stochastic policy directly using an independent function approximator with its own parameters.
我们不是近似一个价值函数并使用它来计算确定性策略，而是直接使用具有自己参数的独立函数近似器 来 近似随机策略。
For example, the policy might be represented by a neural network whose input is a representation of the state, whose output is action selection probabilities, and whose weights are the policy parameters.
例如，策略可能由神经网络表示，其输入是状态的表示，其输出是动作选择概率，其权重是策略参数。
Let $\theta$ denote the vector of policy parameters and $\rho$ the performance of the corresponding policy (e.g., the average reward per step).
设 $\theta$ 表示策略参数的向量， $\rho$ 表示相应策略的性能(例如，每一步的平均奖励)。
Then, in the policy gradient approach, the policy parameters are updated approximately proportional to the gradient:
然后，在策略梯度方法中，策略参数的更新 与梯度近似成正比:

$\Delta \theta\approx \alpha \frac{\partial\rho}{\partial \theta}~~~~~~~~~~(1)$

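where $\alpha$ is a positive-definite step size.
If the above can be achieved, then $\theta$ can usually be assured to converge to a locally optimal policy in the performance measure $\rho$.
Unlike the value-function approach, here small changes in $\theta$ can cause only small changes in the policy and in the state-visitation distribution.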

↓ [Result 1 that is proved + other work reaching similar results + the differences]

In this paper we prove that an unbiased estimate of the gradient (1) can be obtained from experience using an approximate value function satisfying certain properties.
本文证明了用满足一定性质的近似价值函数，可以从经验中得到梯度 (1) 的无偏估计。
Williams’s (1988, 1992) REINFORCE algorithm also finds an unbiased estimate of the gradient, but without the assistance of a learned value function.
Williams (1988,1992) 的 REINFORCE 算法也找到了梯度的无偏估计，但没有习得的价值函数的帮助。
REINFORCE learns much more slowly than RL methods using value functions and has received relatively little attention.
REINFORCE 比使用价值函数的强化学习方法学得慢得多，并且受到的关注相对较少。
Learning a value function and using it to reduce the variance of the gradient estimate appears to be essential for rapid learning.
学习一个价值函数并用它来减小梯度估计的方差，似乎对于快速学习是必不可少的。
Jaakkola, Singh and Jordan (1995) proved a result very similar to ours for the special case of function approximation corresponding to tabular POMDPs. 【partially observable Markov decision problems】
Jaakkola, Singh 和 Jordan(1995) 对于表格形式的 POMDPs 对应的函数近似的特殊情况证明了与我们非常相似的结果。
Our result strengthens theirs and generalizes it to arbitrary differentiable function approximators.
我们的结果加强了他们的结论，并将其推广到任意可微函数近似器。
Konda and Tsitsiklis (in prep.) independently developed a very similar result to ours.
See also Baxter and Bartlett (in prep.) and Marbach and Tsitsiklis (1998).
Konda 和 Tsitsiklis(准备中) 独立开发了与我们非常相似的结果。
参见 Baxter and Bartlett (in prep.) 和 Marbach and Tsitsiklis(1998)。

↓ [Result 2 that is proved + other work reaching similar results + the differences]

Our result also suggests a way of proving the convergence of a wide variety of algorithms based on “actor-critic” or policy-iteration architectures (e.g., Barto, Sutton, and Anderson, 1983; Sutton, 1984; Kimura and Kobayashi, 1998).
我们的结果还提出了一种方法来证明基于 “actor-critic” 或策略迭代架构的各种算法的收敛性 。
In this paper we take the first step in this direction by proving for the first time that a version of policy iteration with general differentiable function approximation is convergent to a locally optimal policy. [a first]
在本文中，我们在这个方向上迈出了第一步，首次证明了具有一般可微函数近似的策略迭代收敛于局部最优策略。
Baird and Moore (1999) obtained a weaker but superficially similar result for their VAPS family of methods. [VAPS = Value and Policy Search] [Comparison with a recent, contemporaneous, similar method: the differences and where the other method falls short]
Baird 和 Moore(1999) 在他们的 VAPS 系列方法中得到了一个较弱但表面上相似的结果。
Like policy-gradient methods, VAPS includes separately parameterized policy and value functions updated by gradient methods.
与策略梯度方法一样，VAPS 包括分别参数化的，由梯度方法更新的策略和价值函数。
However, VAPS methods do not climb the gradient of performance (expected long-term reward), but of a measure combining performance and value-function accuracy.
然而，VAPS 方法不是沿着性能(长期奖励期望)的梯度往上爬，而是结合性能和价值函数准确性的测量。
As a result, VAPS does not converge to a locally optimal policy, except in the case that no weight is put upon value-function accuracy, in which case VAPS degenerates to REINFORCE.
因此，VAPS 不会收敛到局部最优策略，除非在不重视价值函数准确性的情况下，在这种情况下，VAPS 退化为 REINFORCE。
Similarly, Gordon’s (1995) fitted value iteration is also convergent and value-based, but does not find a locally optimal policy.
同样，Gordon(1995) 的拟合价值迭代也是收敛的，基于价值，但没有找到局部最优策略。

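1 Policy Gradient Theorem

We consider the standard reinforcement learning framework (see, e.g., Sutton and Barto, 1998), in which a learning agent interacts with a Markov decision process (MDP).
The state, action, and reward at each time $t\in\{0,1,2,\cdots\}$ are denoted $s_t \in {\cal S}$, $a_t \in {\cal A}$, and $r_t \in {\frak R}$ respectively.
The environment's dynamics are characterized by state transition probabilities ${\cal P}_{ss^\prime}^a=Pr\{s_{t+1}=s^\prime | s_t=s,a_t=a\}$ and expected rewards ${\cal R}_s^a=E\{r_{t+1} | s_t=s,a_t=a\}$, $\forall ~s,s^\prime\in {\cal S}, a\in {\cal A}$.
The agent's decision-making procedure at each time is characterized by a policy $\pi(s,a,\theta)=Pr\{a_t= a|s_t =s, \theta\}$, $\forall ~s \in {\cal S},a \in {\cal A}$, where $\theta\in{\frak R}^l$, for $l \ll |{\cal S}|$, is a parameter vector.
We assume that $\pi$ is differentiable with respect to its parameter, i.e., that $\frac{\partial \pi(s,a)}{\partial \theta}$ exists.
We also usually write just $\pi(s,a)$ for $\pi(s,a,\theta)$.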


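With function approximation, two ways of formulating the agent's objective are useful.
One is the average reward formulation, in which policies are ranked according to their long-term expected reward per step, $\rho(\pi)$:

$\rho(\pi)=\lim\limits_{n\to\infty}\frac{1}{n}E\{r_1+r_2+\cdots+r_n|\pi\}=\sum\limits_sd^\pi(s)\sum\limits_a\pi(s,a){\cal R}_s^a$

where $d^{\pi}(s) = \lim_{t\to\infty}Pr\{s_t= s|s_0,\pi\}$ is the stationary distribution of states under $\pi$, which we assume exists and is independent of $s_0$ for all policies.
In the average reward formulation, the value of a state-action pair given a policy is defined as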

$Q^\pi(s,a)=\sum\limits_{t=1}^\infty E\{r_t-\rho(\pi)|s_0=s,a_0=a,\pi\}$
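To make the average-reward definition concrete, here is a minimal sketch that approximates $d^\pi$ by power iteration and then evaluates $\rho(\pi)=\sum_s d^\pi(s)\sum_a\pi(s,a){\cal R}_s^a$. The two-state MDP, the uniform policy, and the function names are all assumptions made up for illustration; they are not taken from the paper.

```python
# Hypothetical 2-state, 2-action MDP; every number below is made up for illustration.
P = [[[0.9, 0.1], [0.2, 0.8]],   # P[s][a][s2] = Pr{s_{t+1}=s2 | s_t=s, a_t=a}
     [[0.7, 0.3], [0.1, 0.9]]]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[s][a] = expected reward for action a in state s
pi = [[0.5, 0.5], [0.5, 0.5]]    # pi[s][a], a fixed stochastic policy


def stationary_distribution(P, pi, iters=200):
    """Approximate d^pi by repeatedly pushing a state distribution through pi and P."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[s] * pi[s][a] * P[s][a][s2]
                 for s in range(n) for a in range(len(pi[s])))
             for s2 in range(n)]
    return d


def average_reward(P, R, pi):
    """rho(pi) = sum_s d^pi(s) sum_a pi(s,a) R_s^a."""
    d = stationary_distribution(P, pi)
    return sum(d[s] * pi[s][a] * R[s][a]
               for s in range(len(P)) for a in range(len(pi[s])))


d = stationary_distribution(P, pi)
rho = average_reward(P, R, pi)
print(d, rho)
```

For this toy MDP the state-to-state transition matrix under $\pi$ has rows (0.55, 0.45) and (0.4, 0.6), so $d^\pi=(8/17, 9/17)$ and $\rho(\pi)=13/17$.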

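The second formulation we cover is that in which there is a designated start state $s_0$, and we care only about the long-term reward obtained from it.
We will give our results only once, but they will apply to this formulation as well under the definitions

$\rho(\pi)=E\Big\{\sum\limits_{t=1}^{\infty}\gamma^{t-1}r_t\Big|s_0,\pi\Big\}$

$Q^\pi(s,a)=E\Big\{\sum\limits_{k=1}^{\infty}\gamma^{k-1}r_{t+k}\Big|s_t=s,a_t=a,\pi\Big\}$

where $\gamma \in [0,1]$ is a discount rate ($\gamma=1$ is allowed only in episodic tasks).
In this formulation, we define $d^\pi(s)$ as a discounted weighting of states encountered starting at $s_0$ and then following $\pi$: $d^\pi(s)=\sum_{t=0}^\infty \gamma ^t Pr \{s_t=s|s_0,\pi\}$.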

Our first result concerns the gradient of the performance metric with respect to the policy parameter:
我们的第一个结果涉及性能指标相对于策略参数的梯度:

Theorem 1 (Policy Gradient). For any MDP, in either the average-reward or the start-state formulation,

$\frac{\partial \rho}{\partial \theta}=\sum\limits_sd^\pi(s)\sum\limits_a\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)~~~~~~~~~~(2)$

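Proof of Theorem 1 (from the appendix):

Goal: show that $\frac{\partial \rho}{\partial \theta}=\sum\limits_sd^\pi(s)\sum\limits_a\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)$

We prove the theorem first for the average-reward formulation and then for the start-state formulation.

Note: this is similar to a coin toss, where the two outcomes, heads and tails, each end up with probability 0.5.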

$\begin{aligned}\frac{\partial V^\pi(s)}{\partial \theta}&\xlongequal{def}\frac{\partial}{\partial \theta}\sum\limits_a\pi(s,a)Q^\pi(s,a)~~~~~\forall ~s\in {\cal S}\\ &=\sum\limits_a\Bigg[\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}Q^\pi(s,a)\Bigg]\\ &=\sum\limits_a\Bigg[\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}\Bigg[\textcolor{blue}{{\cal R}_s^a-\rho(\pi)+\sum\limits_{s^\prime}{\cal P}_{ss^\prime}^a V^\pi(s^\prime)}\Bigg]\Bigg]\\ &=\sum\limits_a\Bigg[\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\Bigg[\textcolor{blue}{-\frac{\partial \rho}{\partial \theta}}+\sum\limits_{s^\prime}{\cal P}_{ss^\prime}^a \frac{\partial V^\pi(s^\prime)}{\partial \theta}\Bigg]\Bigg]\end{aligned}$

Since $\frac{\partial \rho}{\partial \theta}$ does not depend on $a$ and $\sum\limits_a\pi(s,a)=1$, rearranging gives

$\frac{\partial \rho}{\partial \theta}=\sum\limits_a\Bigg[\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\sum\limits_{s^\prime}{\cal P}_{ss^\prime}^a \frac{\partial V^\pi(s^\prime)}{\partial \theta}\Bigg]-\frac{\partial V^\pi(s)}{\partial \theta}$

Summing both sides over the stationary distribution $d^\pi$,

$\sum\limits_sd^\pi(s)\frac{\partial \rho}{\partial \theta}=\sum\limits_sd^\pi(s)\sum\limits_a\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)+\sum\limits_sd^\pi(s)\sum\limits_a\pi(s,a)\sum\limits_{s^\prime}{\cal P}_{ss^\prime}^a \frac{\partial V^\pi(s^\prime)}{\partial \theta}-\sum\limits_sd^\pi(s)\frac{\partial V^\pi(s)}{\partial \theta}$

Since $d^\pi$ is stationary,

$\underbrace{\sum\limits_sd^\pi(s)}_{\textcolor{blue}{1}}\frac{\partial \rho}{\partial \theta}=\sum\limits_sd^\pi(s)\sum\limits_a\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)+\underbrace{\sum\limits_sd^\pi(s)\sum\limits_a\pi(s,a)\sum\limits_{s^\prime}{\cal P}_{ss^\prime}^a}_{\textcolor{blue}{\sum\limits_{s^\prime}d^\pi(s^\prime)}} \frac{\partial V^\pi(s^\prime)}{\partial \theta}-\sum\limits_sd^\pi(s)\frac{\partial V^\pi(s)}{\partial \theta}$

$\frac{\partial \rho}{\partial \theta}=\sum\limits_sd^\pi(s)\sum\limits_a\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)$

——————————————————
For the start-state formulation:

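$\begin{aligned}\frac{\partial V^\pi(s)}{\partial \theta}&\xlongequal{def}\frac{\partial}{\partial \theta}\sum\limits_a\pi(s,a)Q^\pi(s,a)~~~~~\forall ~s\in {\cal S}\\ &=\sum\limits_a\Bigg[\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}Q^\pi(s,a)\Bigg]\\ &=\sum\limits_a\Bigg[\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\frac{\partial}{\partial \theta}\Bigg[{\cal R}_s^a+\sum\limits_{s^\prime}\textcolor{blue}{\gamma}{\cal P}_{ss^\prime}^a V^\pi(s^\prime)\Bigg]\Bigg]\\ &=\sum\limits_a\Bigg[\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)+\pi(s,a)\sum\limits_{s^\prime}\gamma{\cal P}_{ss^\prime}^a \frac{\partial V^\pi(s^\prime)}{\partial \theta}\Bigg]~~~~~~~~~~~~(7)\\ &= \sum\limits_x\sum\limits_{k=0}^\infty\gamma^k Pr(s\to x,k,\pi)\sum\limits_a \frac{\partial\pi(x,a)}{\partial \theta}Q^\pi(x,a)\end{aligned}$

The last line follows after several steps of unrolling (7). As shorthand (mine, not the paper's), write $g(s)=\sum_a\frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)$ and $p(s\to s^\prime)=\sum_a\pi(s,a){\cal P}_{ss^\prime}^a$, so that (7) reads $\frac{\partial V^\pi(s)}{\partial \theta}=g(s)+\gamma\sum_{s^\prime}p(s\to s^\prime)\frac{\partial V^\pi(s^\prime)}{\partial \theta}$. Substituting this identity into its own right-hand side gives $g(s)+\gamma\sum_{s^\prime}p(s\to s^\prime)g(s^\prime)+\gamma^2\sum_{s^{\prime\prime}}Pr(s\to s^{\prime\prime},2,\pi)\frac{\partial V^\pi(s^{\prime\prime})}{\partial \theta}$, and each further substitution adds one more term $\gamma^k\sum_x Pr(s\to x,k,\pi)g(x)$ while pushing the remaining $\frac{\partial V^\pi}{\partial \theta}$ term one step further out. Here $Pr(s\to x,k,\pi)$ is the probability of going from state $s$ to state $x$ in $k$ steps under policy $\pi$ (for $k=0$ it is 1 if $x=s$ and 0 otherwise).
It is then immediate that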

$\begin{aligned}\frac{\partial \textcolor{blue}{\rho}}{\partial \theta}&=\frac{\partial}{\partial \theta}E\Big\{\sum\limits_{t=1}^\infty\gamma^{t-1}r_t\Big|s_0,\pi\Big\}=\frac{\partial}{\partial \theta}\textcolor{blue}{V^\pi(s_0)}\\ &=\sum\limits_s\sum\limits_{k=0}^\infty\gamma^k Pr(s_0\to s,k,\pi)\sum\limits_a \frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)\\ &=\sum\limits_sd^\pi(s)\sum\limits_a \frac{\partial\pi(s,a)}{\partial \theta}Q^\pi(s,a)\end{aligned}$
Q.E.D.
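As a sanity check of Theorem 1 in the start-state formulation, the sketch below compares the right-hand side of (2) with a central finite-difference estimate of $\frac{\partial \rho}{\partial \theta}$ on a tiny MDP. The MDP tables `P` and `R`, the tabular softmax policy, and all function names are assumptions made up for illustration; none of them come from the paper.

```python
import math

GAMMA = 0.9
N_S, N_A = 2, 2
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]  # P[s][a][s2], made-up numbers
R = [[1.0, 0.0], [0.0, 2.0]]                               # R[s][a], made-up numbers


def policy(theta):
    """Tabular softmax policy: pi[s][a] is proportional to exp(theta[s][a])."""
    pi = []
    for s in range(N_S):
        z = [math.exp(t) for t in theta[s]]
        pi.append([x / sum(z) for x in z])
    return pi


def q_values(pi, sweeps=2000):
    """Iterative policy evaluation; returns (Q^pi, V^pi)."""
    V = [0.0] * N_S
    for _ in range(sweeps):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(N_S))
              for a in range(N_A)] for s in range(N_S)]
        V = [sum(pi[s][a] * Q[s][a] for a in range(N_A)) for s in range(N_S)]
    return Q, V


def discounted_weighting(pi, s0=0, steps=2000):
    """d[s] = sum_t gamma^t Pr{s_t = s | s0, pi} (truncated after `steps` terms)."""
    p = [1.0 if s == s0 else 0.0 for s in range(N_S)]
    d, weight = [0.0] * N_S, 1.0
    for _ in range(steps):
        d = [d[s] + weight * p[s] for s in range(N_S)]
        p = [sum(p[s] * pi[s][a] * P[s][a][s2] for s in range(N_S) for a in range(N_A))
             for s2 in range(N_S)]
        weight *= GAMMA
    return d


def rho(theta, s0=0):
    """Start-state performance: rho = V^pi(s0)."""
    return q_values(policy(theta))[1][s0]


def policy_gradient(theta, s0=0):
    """Right-hand side of (2): sum_s d(s) sum_a dpi(s,a)/dtheta * Q(s,a)."""
    pi = policy(theta)
    Q, _ = q_values(pi)
    d = discounted_weighting(pi, s0)
    grad = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for b in range(N_A):
            # Tabular softmax: dpi(s,a)/dtheta[s][b] = pi(s,a) * (1[a == b] - pi(s,b))
            grad[s][b] = d[s] * sum(pi[s][a] * ((a == b) - pi[s][b]) * Q[s][a]
                                    for a in range(N_A))
    return grad


def finite_difference(theta, s0=0, eps=1e-5):
    """Central differences of rho with respect to each theta[s][b]."""
    grad = [[0.0] * N_A for _ in range(N_S)]
    for s in range(N_S):
        for b in range(N_A):
            up = [row[:] for row in theta]
            dn = [row[:] for row in theta]
            up[s][b] += eps
            dn[s][b] -= eps
            grad[s][b] = (rho(up, s0) - rho(dn, s0)) / (2 * eps)
    return grad


theta = [[0.3, -0.2], [0.1, 0.4]]
analytic = policy_gradient(theta)
numeric = finite_difference(theta)
max_gap = max(abs(analytic[s][b] - numeric[s][b])
              for s in range(N_S) for b in range(N_A))
print(max_gap)
```

Note that no derivative of $d^\pi$ is ever computed in `policy_gradient`, yet the two gradients agree up to finite-difference error.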

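Marbach and Tsitsiklis (1998) first discussed this way of expressing the gradient for the average-reward formulation, based on a related expression in terms of the state-value function due to Jaakkola, Singh, and Jordan (1995) and Cao and Chen (1997).
We extend their results to the start-state formulation and provide simpler and more direct proofs.
Williams's (1988, 1992) theory of REINFORCE algorithms can also be viewed as implying (2).
In any event, the key aspect of both expressions for the gradient is that they contain no terms of the form $\frac{\partial d^\pi(s)}{\partial \theta}$: the effect of policy changes on the distribution of states does not appear.
This is convenient for approximating the gradient by sampling.
For example, if $s$ is sampled from the distribution obtained by following $\pi$, then $\sum_a\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)$ is an unbiased estimate of $\frac{\partial \rho}{\partial \theta}$.
Of course, $Q^\pi(s, a)$ is also not normally known and must be estimated.
One approach is to use the actual returns, $R_t=\sum\limits_{k=1}^\infty \big(r_{t+k}-\rho(\pi)\big)$ (or $R_t=\sum\limits_{k=1}^\infty\gamma^{k-1}r_{t+k}$ in the start-state formulation), as an approximation for each $Q^\pi(s_t, a_t)$.
This leads to Williams's episodic REINFORCE algorithm, $\Delta\theta_t\propto \frac{\partial \pi(s_t,a_t)}{\partial \theta}R_t\frac{1}{\pi(s_t,a_t)}$ (the $\frac{1}{\pi(s_t,a_t)}$ corrects for the oversampling of actions preferred by $\pi$), which is known to follow $\frac{\partial \rho}{\partial \theta}$ in expected value (Williams, 1988, 1992).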
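The REINFORCE update above can be sketched in a few lines. This is only an illustration under stated assumptions: the task is a made-up single-state problem with one-step episodes (action 0 pays 1, action 1 pays 0), the policy is a softmax over per-action preferences, and the names `reinforce` and `discounted_returns` are hypothetical. For a softmax policy, $\frac{\partial \pi(s,a)}{\partial \theta_b}\frac{1}{\pi(s,a)}$ equals $1[a=b]-\pi(s,b)$, which is what the inner loop uses.

```python
import math
import random


def softmax(prefs):
    z = [math.exp(p) for p in prefs]
    total = sum(z)
    return [x / total for x in z]


def discounted_returns(rewards, gamma):
    """R_t = sum_{k>=1} gamma^(k-1) r_{t+k}, where rewards[t] holds r_{t+1}."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def reinforce(num_episodes=2000, alpha=0.1, gamma=1.0, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]            # one preference per action, single state
    action_reward = [1.0, 0.0]    # made-up task: action 0 pays 1, action 1 pays 0
    for _ in range(num_episodes):
        pi = softmax(theta)
        a = 0 if rng.random() < pi[0] else 1
        episode = [(a, action_reward[a])]          # a one-step episode
        returns = discounted_returns([r for _, r in episode], gamma)
        for (a_t, _), R_t in zip(episode, returns):
            for b in range(len(theta)):
                # Delta theta_b = alpha * (dpi(a_t)/dtheta_b / pi(a_t)) * R_t
                theta[b] += alpha * ((a_t == b) - pi[b]) * R_t
    return softmax(theta)


final_policy = reinforce()
print(final_policy)
```

In this toy task the update is nonzero only when the rewarded action is sampled, so the probability of action 0 can only increase over training.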

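2 Policy Gradient with Approximation

Now consider the case in which $Q^\pi$ is approximated by a learned function approximator.
If the approximation is sufficiently good, we might hope to use it in place of $Q^\pi$ in (2) and still point roughly in the direction of the gradient.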
For example, Jaakkola, Singh, and Jordan (1995) proved that for the special case of function approximation arising in a tabular POMDP one could assure positive inner product with the gradient, which is sufficient to ensure improvement for moving in that direction.
例如，Jaakkola, Singh, and Jordan(1995) 证明，对于表格形式 POMDP 【部分可观察 MDP】中产生的函数近似的特殊情况，可以保证与梯度的正内积，这足以保证在该方向上移动的改进。
Here we extend their result to general function approximation and prove equality with the gradient.
本文将其结果推广到一般函数近似，并证明与梯度相等。

Let $f_w:{\cal S} \times {\cal A}\to{\frak R}$ be our approximation to $Q^\pi$, with parameter $w$.
It is natural to learn $f_w$ by following $\pi$ and updating $w$ by a rule such as $\Delta w_t \propto \frac{\partial}{\partial w}[\hat Q^\pi(s_t,a_t) -f_w(s_t,a_t)] ^2 \propto [\hat Q^\pi(s_t,a_t) -f_w(s_t,a_t)]\frac{\partial f_w(s_t,a_t)}{\partial w}$, where $\hat Q^\pi(s_t,a_t)$ is some unbiased estimator of $Q^\pi(s_t,a_t)$, perhaps $R_t$.
When such a process has converged to a local optimum, then

$\sum\limits_s d^\pi(s)\sum\limits_a\pi(s,a)[Q^\pi(s,a)-f_w(s,a)]\frac{\partial f_w(s,a)}{\partial w}=0~~~~~~~~~~(3)$

Theorem 2 (Policy Gradient with Function Approximation). If $f_w$ satisfies (3) and is compatible with the policy parameterization in the sense that

$\frac{\partial f_w(s,a)}{\partial \textcolor{blue}{w}}=\frac{\partial \pi(s,a)}{\partial \textcolor{blue}{\theta}}\frac{1}{\pi(s,a)}~~~~~~~~~~(4)$

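then

$\frac{\partial \rho}{\partial \theta}=\sum\limits_sd^\pi(s)\sum\limits_a\frac{\partial \pi(s,a)}{\partial \theta}f_w(s,a)~~~~~~~~~~(5)$

Footnote 1 (on condition (4)): Tsitsiklis (personal communication) points out that $f_w$ being linear in the features given on the right-hand side may be the only way to satisfy this condition.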

Proof:

Goal: show that $\frac{\partial \rho}{\partial \theta}=\sum\limits_sd^\pi(s)\sum\limits_a\frac{\partial \pi(s,a)}{\partial \theta}f_w(s,a)$

Combining (3) and (4) gives

$\sum\limits_s d^\pi(s)\sum\limits_a\frac{\partial \pi(s,a)}{\partial \theta}[Q^\pi(s,a)-f_w(s,a)]=0~~~~~~~~~~(6)$

(substitute (4) into (3); the factor $\pi(s,a)$ cancels).

This tells us that the error in $f_w(s,a)$ is orthogonal to the gradient of the policy parameterization.
Because the expression above is zero, we can subtract it from the policy gradient theorem (2) to yield

$\begin{aligned}\frac{\partial \rho}{\partial \theta}&=\underbrace{\sum\limits_sd^\pi(s)\sum\limits_a\frac{\partial \pi(s,a)}{\partial \theta}Q^\pi(s,a)}_{\text{Theorem 1, Eq. (2)}}-\underbrace{\sum\limits_s d^\pi(s)\sum\limits_a\frac{\partial \pi(s,a)}{\partial \theta}[Q^\pi(s,a)-f_w(s,a)]}_{\text{Eq. (6), equal to } 0}\\ &=\sum\limits_sd^\pi(s)\sum\limits_a\frac{\partial \pi(s,a)}{\partial \theta}\Big[Q^\pi(s,a)-Q^\pi(s,a)+f_w(s,a)\Big]\\ &=\sum\limits_sd^\pi(s)\sum\limits_a\frac{\partial \pi(s,a)}{\partial \theta}f_w(s,a)\end{aligned}$
Q.E.D.
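Theorem 2 can also be checked numerically. The sketch below, again in the start-state formulation, fits a compatible critic by following the expected form of the update rule for $w$ until the left-hand side of (3) vanishes, and then compares the gradient computed with $f_w$ against the one computed with the true $Q^\pi$. The toy MDP, the tabular softmax policy, the step size `lr`, and every function name are assumptions chosen for illustration; the paper does not prescribe this procedure.

```python
import math

GAMMA = 0.9
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]]]  # P[s][a][s2], made-up numbers
R = [[1.0, 0.0], [0.0, 2.0]]                               # R[s][a], made-up numbers
S, A = range(2), range(2)


def softmax_policy(theta):
    out = []
    for s in S:
        z = [math.exp(theta[s][a]) for a in A]
        out.append([x / sum(z) for x in z])
    return out


def evaluate(pi, s0=0, sweeps=2000):
    """Return Q^pi and the discounted state weighting d^pi from start state s0."""
    V = [0.0, 0.0]
    p, d, disc = [1.0 if s == s0 else 0.0 for s in S], [0.0, 0.0], 1.0
    for _ in range(sweeps):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][t] * V[t] for t in S) for a in A] for s in S]
        V = [sum(pi[s][a] * Q[s][a] for a in A) for s in S]
        d = [d[s] + disc * p[s] for s in S]
        p = [sum(p[s] * pi[s][a] * P[s][a][t] for s in S for a in A) for t in S]
        disc *= GAMMA
    return Q, d


def psi(pi, s, a):
    """Compatible features from (4): dpi(s,a)/dtheta / pi(s,a), flattened over theta[s2][b]."""
    return [((a == b) - pi[s][b]) if s2 == s else 0.0 for s2 in S for b in A]


def fit_critic(pi, Q, d, lr=0.05, iters=5000):
    """Follow the expected w-update until the left-hand side of (3) is (numerically) zero."""
    w = [0.0] * 4
    for _ in range(iters):
        step = [0.0] * 4
        for s in S:
            for a in A:
                feat = psi(pi, s, a)
                err = Q[s][a] - sum(wi * fi for wi, fi in zip(w, feat))
                for i in range(4):
                    step[i] += d[s] * pi[s][a] * err * feat[i]
        w = [wi + lr * si for wi, si in zip(w, step)]
    return w


def gradient(pi, d, values):
    """sum_s d(s) sum_a dpi(s,a)/dtheta * values[s][a], using dpi = pi * psi."""
    g = [0.0] * 4
    for s in S:
        for a in A:
            feat = psi(pi, s, a)
            for i in range(4):
                g[i] += d[s] * pi[s][a] * feat[i] * values[s][a]
    return g


pi = softmax_policy([[0.3, -0.2], [0.1, 0.4]])
Q, d = evaluate(pi)
w = fit_critic(pi, Q, d)
f = [[sum(wi * fi for wi, fi in zip(w, psi(pi, s, a))) for a in A] for s in S]
gap = max(abs(x - y) for x, y in zip(gradient(pi, d, Q), gradient(pi, d, f)))
print(gap)
```

The fitted $f_w$ differs from $Q^\pi$ (it has zero mean in each state), yet the gradient it produces matches the true one, which is exactly the content of (5).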

3 Application to Deriving Algorithms and to Advantages

Given a policy parameterization, Theorem 2 can be used to derive an appropriate form for the value-function parameterization.
给定策略参数化，定理 2 可用于推导价值函数参数化的适当形式。
For example, consider a policy that is a Gibbs distribution in a linear combination of features:
例如，考虑一个策略，它是特征线性组合的吉布斯分布:

$\pi(s,a)=\frac{e^{\theta^T\phi_{sa}}}{\sum_be^{\theta^T\phi_{sb}}}~~~~~~~\forall~s\in{\cal S},a\in{\cal A}$

where each $\phi_{sa}$ is an $l$-dimensional feature vector characterizing state-action pair $s, a$. Meeting the compatibility condition (4) requires that

$\frac{\partial f_w(s,a)}{\partial w}=\frac{\partial \pi(s,a)}{\partial \theta}\frac{1}{\pi(s,a)}=\phi_{sa}-\sum\limits_b\pi(s,b)\phi_{sb}$

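so that the natural parameterization of $f_w$ is

$f_w(s,a)=w^T\Big[\phi_{sa}-\sum\limits_b\pi(s,b)\phi_{sb}\Big]$

In other words, $f_w$ must be linear in the same features as the policy, except normalized to be mean zero for each state.
Other algorithms can easily be derived for a variety of nonlinear policy parameterizations, such as multi-layer backpropagation networks.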
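The compatible features for the Gibbs policy can be verified directly. The sketch below builds $\phi_{sa}-\sum_b\pi(s,b)\phi_{sb}$ for a single state, checks it against a finite-difference estimate of $\frac{\partial \log \pi(s,a)}{\partial \theta}$ (which equals $\frac{\partial \pi(s,a)}{\partial \theta}\frac{1}{\pi(s,a)}$), and confirms that the resulting $f_w$ has zero mean under $\pi$. The feature vectors, `theta`, `w`, and the helper names are made-up assumptions for illustration.

```python
import math

# Made-up feature vectors phi[a] for one fixed state s with three actions (l = 2).
phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.2, -0.4]
w = [0.7, -0.3]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def gibbs_policy(theta):
    """pi(s,a) = exp(theta^T phi_sa) / sum_b exp(theta^T phi_sb)."""
    scores = [math.exp(dot(theta, f)) for f in phi]
    total = sum(scores)
    return [sc / total for sc in scores]


def compatible_features(theta):
    """phi_sa - sum_b pi(s,b) phi_sb for every action a."""
    pi = gibbs_policy(theta)
    mean = [sum(pi[b] * phi[b][i] for b in range(len(phi))) for i in range(len(theta))]
    return [[f[i] - mean[i] for i in range(len(theta))] for f in phi]


def f_w(theta, w):
    """Compatible critic: linear in the mean-centred policy features."""
    return [dot(w, feat) for feat in compatible_features(theta)]


def grad_log_pi_numeric(theta, a, eps=1e-5):
    """Central finite differences of log pi(s,a) with respect to theta."""
    grad = []
    for i in range(len(theta)):
        up, dn = theta[:], theta[:]
        up[i] += eps
        dn[i] -= eps
        grad.append((math.log(gibbs_policy(up)[a]) - math.log(gibbs_policy(dn)[a]))
                    / (2 * eps))
    return grad


pi = gibbs_policy(theta)
features = compatible_features(theta)
max_gap = max(abs(features[a][i] - grad_log_pi_numeric(theta, a)[i])
              for a in range(3) for i in range(2))
state_mean = sum(pi[a] * f_w(theta, w)[a] for a in range(3))
print(max_gap, state_mean)
```

Because the centred features average to zero under $\pi$, the zero-mean property holds for any choice of $w$, not just the one used here.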

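The careful reader will have noticed that the form given above for $f_w$ requires that it have zero mean for each state: $\sum_a\pi(s,a)f_w(s,a) = 0, \forall ~s \in {\cal S}$.
In this sense it is better to think of $f_w$ as an approximation of the advantage function, $A^\pi(s,a) = Q^\pi(s,a)-V^\pi(s)$ (much as in Baird, 1993), rather than of $Q^\pi$.
Our convergence requirement (3) is really that $f_w$ get the relative value of the actions correct in each state, not the absolute value, nor the variation from state to state.
Our results can be viewed as a justification for the special status of advantages as the target for value function approximation in RL.
In fact, our (2), (3), and (5) can all be generalized to include an arbitrary function of state added to the value function or its approximation.
For example, (5) can be generalized to $\frac{\partial \rho}{\partial \theta}=\sum_sd^\pi(s)\sum_a\frac{\partial \pi(s,a)}{\partial \theta}[f_w(s,a)+v(s)]$, where $v:{\cal S}\to{\frak R}$ is an arbitrary function.
(This follows immediately because $\sum_a\frac{\partial \pi(s,a)}{\partial \theta}=0,~~\forall~s\in{\cal S}$.) The choice of $v$ does not affect any of our theorems, but can substantially affect the variance of the gradient estimators.
The issues here are entirely analogous to those in the use of reinforcement baselines in earlier work (e.g., Williams, 1992; Dayan, 1991; Sutton, 1984).
In practice, $v$ should presumably be set to the best available approximation of $V^\pi$.
Our results establish that this approximation process can proceed without affecting the expected evolution of $f_w$ and $\pi$.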
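The effect of the state-dependent function $v$ can be illustrated with exact expectations in a single state. The sketch below computes the mean and variance of the one-sample estimator $\frac{\partial \pi(s,a)}{\partial \theta_b}\frac{1}{\pi(s,a)}\,(Q(s,a)-v)$ with $a\sim\pi$, once with $v=0$ and once with $v=V^\pi(s)$. Subtracting $v$ is the same as adding the function $-v$, so it is covered by the generalization above. The action values and the helper name `estimator_stats` are made-up assumptions.

```python
import math


def softmax(prefs):
    z = [math.exp(p) for p in prefs]
    total = sum(z)
    return [x / total for x in z]


def estimator_stats(pi, q, v, b=0):
    """Mean and variance of g(a) = (dpi(a)/dtheta_b / pi(a)) * (q[a] - v), with a ~ pi."""
    # For a softmax policy, dpi(a)/dtheta_b / pi(a) = 1[a == b] - pi(b).
    samples = [((a == b) - pi[b]) * (q[a] - v) for a in range(len(pi))]
    mean = sum(p * g for p, g in zip(pi, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(pi, samples))
    return mean, var


pi = softmax([0.0, 0.0])                    # single state, two equally likely actions
q = [10.0, 11.0]                            # made-up action values
v_pi = sum(p * x for p, x in zip(pi, q))    # state value under pi

mean_plain, var_plain = estimator_stats(pi, q, v=0.0)
mean_base, var_base = estimator_stats(pi, q, v=v_pi)
print(mean_plain, var_plain, mean_base, var_base)
```

With these numbers both estimators have mean -0.25, while the variance drops from 27.5625 without a baseline to 0 with $v=V^\pi(s)$; in general the reduction depends on the problem.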

4 Convergence of Policy Iteration with Function Approximation

Given Theorem 2, we can prove for the first time that a form of policy iteration with function approximation is convergent to a locally optimal policy.
给定定理 2，我们首次证明了一种带函数近似的策略迭代 收敛于 局部最优策略。

Theorem 3 (Policy Iteration with Function Approximation).
定理 3 (函数近似的策略迭代)。
Let $\pi$ and $f_w$ be any differentiable function approximators for the policy and value function respectively that satisfy the compatibility condition (4) and for which $\max_{~\theta,s,a,i,j}|\frac{\partial ^2\pi(s,a)}{\partial\theta_i\partial \theta_j}|<B < \infty$.
Let $\{\alpha_k\}_{k=0}^\infty$ be any step-size sequence such that $\lim_{k \to \infty}\alpha_k=0$ and $\sum_k\alpha_k=\infty$.
Then, for any MDP with bounded rewards, the sequence $\{\rho(\pi_k) \}_{k=0}^\infty$, defined by any $\theta_0$, $\pi_k=\pi(\cdot,\cdot,\theta_k)$, and

$w_k=w$ such that $\sum\limits_s d^{\pi_k}(s)\sum\limits_a\pi_k(s,a)[Q^{\pi_k}(s,a)-f_w(s,a)]\frac{\partial f_w(s,a)}{\partial w}=0$ (this is (3) with $\pi$ replaced by $\pi_k$)

$\theta_{k+1}=\theta_k+\alpha_k\underbrace{\sum\limits_sd^{\pi_k}(s)\sum\limits_a\frac{\partial {\pi_k}(s,a)}{\partial \theta}f_{w_k}(s,a)}_{\text{right-hand side of (5) with } \pi \text{ replaced by } \pi_k}$

converges such that $\lim_{k\to\infty}\frac{\partial \rho(\pi_k)}{\partial \theta}=0$.

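Proof:
Our Theorem 2 assures that the $\theta_k$ update is in the direction of the gradient.
The bounds on $\frac{\partial ^2\pi(s,a)}{\partial\theta_i\partial \theta_j}$ and on the MDP's rewards together assure us that $\frac{\partial ^2\rho}{\partial\theta_i\partial \theta_j}$ is also bounded.
These, together with the step-size requirements, are the necessary conditions to apply Proposition 3.5 from page 96 of Bertsekas and Tsitsiklis (1996), which assures convergence to a local optimum.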
Source of Proposition 3.5:

Bertsekas, D. P., Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.

I could not find a copy of this source.

An unrelated source:

https://www.vfu.bg/en/e-Learning/Math–Bertsekas_Tsitsiklis_Introduction_to_probability.pdf

Acknowledgments

The authors wish to thank Martha Steenstrup and Doina Precup for comments, and Michael Kearns for insights into the notion of optimal policy under function approximation.
作者希望感谢 Martha Steenstrup 和 Doina Precup 的评论，以及 Michael Kearns 对函数近似下最优策略概念的见解。