Reinforcement Learning with Code 【Chapter 10. Actor Critic】


This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundation of Reinforcement Learning.
The code refers to Mofan's reinforcement learning course.

Contents

  • Reinforcement Learning with Code 【Chapter 10. Actor Critic】
      • 10.1 The simplest actor-critic algorithm (QAC)
      • 10.2 Advantage Actor-Critic (A2C)
      • 10.3 Off-policy Actor-Critic
    • Reference

10.1 The simplest actor-critic algorithm (QAC)

Recall that the idea of the policy gradient method is to search for an optimal policy by maximizing a scalar metric $J(\theta)$. The metric has three options: the average state value $\mathbb{E}[v_\pi(S)]$, the average one-step reward $\mathbb{E}[r_\pi(S)]$, or the state value of a specific starting state $s_0$.

According to the policy gradient theorem in Chapter 9, we know that

$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha \nabla_\theta J(\theta_t) \\
& = \theta_t + \alpha\, \mathbb{E}_{S\sim\eta,\, A\sim\pi} \big[\nabla_\theta \ln \pi(A|S,\theta_t)\, q_\pi(S,A)\big]
\end{aligned}
$$

where $\eta$ is a distribution over the states. Since the true gradient is unknown, we can use a stochastic gradient to approximate it; hence we have

$$
\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t)
$$

  • In policy gradient methods such as REINFORCE, the idea of Monte-Carlo estimation is used to approximate the true value $q_t(s_t,a_t)$: it is approximated by the episode return, $q_t(s_t,a_t)=\sum_{k=t+1}^T \gamma^{k-t-1}r_k$ (a short sketch of this estimate follows this list).
  • If $q_t(s_t,a_t)$ is estimated by value function approximation, and the value function is updated using the idea of TD learning, the corresponding algorithms are usually called actor-critic. Therefore, actor-critic methods can be seen as a kind of policy gradient method.
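
To make the Monte-Carlo estimate above concrete, here is a small sketch that computes the discounted return for every step of one episode; the reward list and the discount factor are made-up illustrative values.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.9):
    """Monte-Carlo estimate used by REINFORCE:
    q_t(s_t, a_t) ~ sum_{k=t+1}^{T} gamma^{k-t-1} r_k,
    where rewards[t] stores r_{t+1}, the reward received after action a_t."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g   # backward recursion: G_t = r_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

# toy episode with five rewards (illustrative numbers only)
print(discounted_returns([0.0, 0.0, 1.0, 0.0, 2.0], gamma=0.9))
```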

When we use a parameterized value function $q(s,a;w)$ to approximate $q_t(s_t,a_t)$, and the value function is updated with the Sarsa idea of TD learning, the algorithm is called Q actor-critic (QAC). The core idea of QAC is

$$
\text{QAC:} \left\{
\begin{aligned}
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\, q(s_t,a_t;w_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w \big[r_{t+1}+\gamma q(s_{t+1},a_{t+1};w_t) - q(s_t,a_t;w_t)\big]\nabla_w q(s_t,a_t;w_t)
\end{aligned}
\right.
$$

That is, we use value function approximation to approximate the true q-value $q_t(s_t,a_t)$, and we use the idea of Sarsa to update the value function.

We can also write the objective functions behind the update rules:

$$
\text{QAC:} \left\{
\begin{aligned}
\text{Actor}: \max_\theta J(\theta) & = \mathbb{E}_{S\sim\eta,\, A\sim\pi}\big[\ln\pi(A|S;\theta)\, q(S,A;w_t)\big] \\
\text{Critic}: \min_w J(w) & = \mathbb{E}_{S\sim\eta,\, A\sim\pi}\big[\big(R + \gamma q(S^\prime,A^\prime;w) - q(S,A;w)\big)^2\big]
\end{aligned}
\right.
$$

Pseudocode

(Figure: QAC pseudocode)
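
To make the two update rules concrete, below is a minimal QAC sketch with tabular (one-hot linear) parameterizations on a made-up two-state, two-action MDP. The transition matrix, rewards, step sizes, and iteration count are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

rng = np.random.default_rng(0)

# a made-up 2-state, 2-action MDP (illustrative only)
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[s, a, s']
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.0, 1.0]])

theta = np.zeros((n_states, n_actions))   # actor: softmax policy pi(a|s; theta)
w = np.zeros((n_states, n_actions))       # critic: action value q(s, a; w)
alpha_theta, alpha_w = 0.01, 0.05

def policy(s):
    prefs = theta[s] - theta[s].max()     # numerically stable softmax
    p = np.exp(prefs)
    return p / p.sum()

s = 0
a = rng.choice(n_actions, p=policy(s))
for t in range(20000):
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    a_next = rng.choice(n_actions, p=policy(s_next))   # Sarsa: a_{t+1} ~ pi(.|s_{t+1})

    q_sa = w[s, a]                                     # q(s_t, a_t; w_t) used by the actor
    # Critic: Sarsa-style TD update of q
    w[s, a] += alpha_w * (r + gamma * w[s_next, a_next] - q_sa)
    # Actor: policy-gradient step weighted by q(s_t, a_t; w_t)
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0                              # grad of ln pi(a_t|s_t) w.r.t. theta[s_t, :]
    theta[s] += alpha_theta * q_sa * grad_log_pi

    s, a = s_next, a_next

print("learned policy per state:", np.round([policy(s) for s in range(n_states)], 3))
```

With these made-up rewards, the learned policy should drift toward action 0 in state 0 and action 1 in state 1, the choices that keep collecting reward 1.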

10.2 Advantage Actor-Critic (A2C)

The core idea of A2C is to introduce a baseline to reduce the estimation variance. That is,
$$
\mathbb{E}_{S\sim\eta,A\sim\pi}\big[\nabla_\theta \ln \pi(A|S;\theta_t)\,q_\pi(S,A)\big] = \mathbb{E}_{S\sim\eta,A\sim\pi}\big[\nabla_\theta \ln \pi(A|S;\theta_t)\big(q_\pi(S,A)-b(S)\big)\big]
$$
where the additional baseline $b(S)$ is a scalar function of $S$. Adding a baseline does not affect the expectation of the gradient, since
$$
\begin{aligned}
\mathbb{E}_{S\sim\eta,A\sim\pi}\big[\nabla_\theta \ln \pi(A|S;\theta_t)\,b(S)\big]
& = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t)\, \nabla_\theta \ln \pi(a|s;\theta_t)\,b(s) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t)\,b(s) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\,b(s)\sum_{a\in\mathcal{A}}\nabla_\theta\pi(a|s;\theta_t) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\,b(s)\,\nabla_\theta\sum_{a\in\mathcal{A}}\pi(a|s;\theta_t) \\
& = \sum_{s\in\mathcal{S}}\eta(s)\,b(s)\,\nabla_\theta 1 = 0
\end{aligned}
$$
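
A quick numerical sanity check of this invariance, using an assumed three-action softmax policy in a single state with made-up action values: the sample means with and without the baseline $b(s)=v_\pi(s)$ agree, while the variance with the baseline is typically much smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.2, -0.1, 0.4])            # softmax preferences for 3 actions in one state
pi = np.exp(theta) / np.exp(theta).sum()      # pi(a|s; theta)
q = np.array([1.0, 3.0, 2.0])                 # made-up true action values q_pi(s, a)
v = (pi * q).sum()                            # baseline b(s) = v_pi(s)

def grad_log_pi(a):
    g = -pi.copy()
    g[a] += 1.0                               # grad of ln pi(a|s; theta) w.r.t. theta
    return g

plain, with_baseline = [], []
for _ in range(100_000):
    a = rng.choice(3, p=pi)
    plain.append(grad_log_pi(a) * q[a])
    with_baseline.append(grad_log_pi(a) * (q[a] - v))

plain, with_baseline = np.array(plain), np.array(with_baseline)
print("mean without baseline:", plain.mean(axis=0))          # the two means agree
print("mean with baseline:   ", with_baseline.mean(axis=0))
print("total variance without baseline:", plain.var(axis=0).sum())
print("total variance with baseline:   ", with_baseline.var(axis=0).sum())
```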

How do we find the optimal baseline? The derivation is omitted here; the optimal baseline is

$$
b^*(s) = \frac{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2\, q_\pi(s,A)\big]}{\mathbb{E}_{A\sim\pi}\big[\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2\big]}
$$

However, it is too complex to use in practice. If the weight $\|\nabla_\theta \ln\pi(A|s;\theta_t)\|^2$ is removed, we obtain a suboptimal baseline with a concise expression:

$$
b^\dagger (s) = \mathbb{E}_{A\sim\pi}[q_\pi(s,A)] = v_\pi(s)
$$

The suboptimal baseline is the state value of state $s$.

When $b(s)=v_\pi(s)$, the gradient-ascent update becomes
$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha\,\mathbb{E}\big[\nabla_\theta \ln\pi(A|S;\theta_t)\big(q_\pi(S,A)-v_\pi(S)\big)\big] \\
& = \theta_t + \alpha\,\mathbb{E}\big[\nabla_\theta\ln\pi(A|S;\theta_t)\, \delta_\pi(S,A)\big]
\end{aligned}
$$
Here,
$$
\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)
$$
is called the advantage function, which reflects the advantage of one action over the others. More specifically, note that $v_\pi(s)=\sum_{a\in\mathcal{A}}\pi(a|s)q_\pi(s,a)$ is the mean of the action values. If $\delta_\pi(s,a)>0$, the corresponding action has a value greater than the mean.

The stochastic version is
$$
\begin{aligned}
\theta_{t+1} & = \theta_t + \alpha\nabla_\theta \ln\pi(a_t|s_t;\theta_t)\big[q_t(s_t,a_t)-v_t(s_t)\big] \\
& = \theta_t + \alpha\nabla_\theta\ln\pi(a_t|s_t;\theta_t)\, \delta_t(s_t,a_t)
\end{aligned}
$$
We need to estimate $q_t(s_t,a_t)$ and $v_t(s_t)$, and there are several ways to do so:

  • If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by Monte-Carlo learning, the algorithm is called REINFORCE with a baseline.
  • If $q_t(s_t,a_t)$ and $v_t(s_t)$ are estimated by TD learning, the algorithm is usually called advantage actor-critic (A2C). In that case,

$$
\begin{aligned}
q_t(s_t,a_t) - v_t(s_t) & = r_{t+1} +\gamma q_t(s_{t+1},a_{t+1}) - v_t(s_t) \\
& \approx r_{t+1} +\gamma v_t(s_{t+1}) - v_t(s_t)
\end{aligned}
$$

Hence, we do not need to maintain two networks to represent $v_\pi(s)$ and $q_\pi(s,a)$; a single network representing $v_\pi(s)$ is enough.

In A2C we use one policy network $\pi(a|s;\theta)$ and one state-value network $v(s;w)$. The core idea of A2C is

$$
\text{A2C}: \left\{
\begin{aligned}
\text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta\, \delta_t\,\nabla_\theta \ln\pi(a_t|s_t;\theta_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w\, \delta_t\, \nabla_w v(s_t;w_t)
\end{aligned}
\right.
$$

We can also write the objective functions behind the update rules:

$$
\text{A2C}: \left\{
\begin{aligned}
\text{Advantage}: \Delta(S) & = R+\gamma v(S^\prime;w) - v(S;w) \\
\text{Actor}: \max_\theta J(\theta) & = \mathbb{E}_{S\sim\eta,\, A\sim\pi}\big[\ln\pi(A|S;\theta)\,\Delta(S)\big] \\
\text{Critic}: \min_w J(w) & = \mathbb{E}_{S\sim\eta}\big[\big(R + \gamma v(S^\prime;w) - v(S;w)\big)^2\big] = \mathbb{E}_{S\sim\eta}\big[\Delta(S)^2\big]
\end{aligned}
\right.
$$

Pseudocode

(Figure: A2C pseudocode)
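
Below is a minimal A2C sketch in the same spirit as the QAC example: a tabular softmax actor and a tabular state-value critic, both driven by the TD error $\delta_t$. The two-state MDP, step sizes, and iteration count are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# the same made-up 2-state, 2-action MDP as in the QAC sketch
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[s, a, s']
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.0, 1.0]])

theta = np.zeros((n_states, n_actions))   # actor: softmax policy pi(a|s; theta)
w = np.zeros(n_states)                    # critic: state value v(s; w)
alpha_theta, alpha_w = 0.01, 0.05

def policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

s = 0
for t in range(20000):
    a = rng.choice(n_actions, p=policy(s))
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    # Advantage estimated by the TD error: delta_t = r_{t+1} + gamma*v(s_{t+1}) - v(s_t)
    delta = r + gamma * w[s_next] - w[s]

    # Critic: move v(s_t; w) toward the TD target
    w[s] += alpha_w * delta

    # Actor: policy-gradient step weighted by the advantage estimate
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log_pi

    s = s_next

print("learned policy per state:", np.round([policy(s) for s in range(n_states)], 3))
```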

10.3 Off-policy Actor-Critic

Importance Sampling

The key technique for converting the AC algorithm to an off-policy one is importance sampling. Consider a random variable $X\in\mathcal{X}$, and suppose that $p_0(X)$ is a probability distribution. Our goal is to estimate $\mathbb{E}_{X\sim p_0}[X]$. Suppose $p_1(X)$ is another probability distribution of $X$. How can we use data sampled from $p_1(X)$ to estimate $\mathbb{E}_{X\sim p_0}[X]$? The technique is importance sampling. Suppose we have some i.i.d. samples $\{x_i\}^n_{i=1}$ generated by the distribution $p_1(X)$. Then
$$
\mathbb{E}_{X\sim p_0}[X] = \sum_{x\in\mathcal{X}}p_0(x)\,x = \sum_{x\in\mathcal{X}}p_1(x)\underbrace{\frac{p_0(x)}{p_1(x)}x}_{f(x)} = \mathbb{E}_{X\sim p_1}[f(X)]
$$
$$
\mathbb{E}_{X\sim p_0}[X] = \mathbb{E}_{X\sim p_1}[f(X)] \approx \bar{f} = \frac{1}{n} \sum^n_{i=1}f(x_i) = \frac{1}{n} \sum^n_{i=1} \underbrace{\frac{p_0(x_i)}{p_1(x_i)}}_{\text{importance weight}}x_i
$$
An Example

Consider $X\in\mathcal{X}=\{+1,-1\}$. Suppose $p_0$ is a probability distribution satisfying
$$p_0(X=+1)=0.5, \qquad p_0(X=-1)=0.5$$
The expectation of $X$ over $p_0$ is
$$\mathbb{E}_{X\sim p_0}[X] = (+1)\times 0.5 + (-1) \times 0.5 = 0$$
Suppose $p_1$ is a probability distribution satisfying
$$p_1(X=+1)=0.8, \qquad p_1(X=-1)=0.2$$
The expectation of $X$ over $p_1$ is
$$\mathbb{E}_{X\sim p_1}[X] = (+1)\times 0.8 + (-1) \times 0.2 = 0.6$$
We can use the importance sampling technique to sample data under the distribution $p_1$ and still estimate $\mathbb{E}_{X\sim p_0}[X]$:
$$\mathbb{E}_{X\sim p_0}[X] \approx \frac{1}{n}\sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)}x_i$$
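The script below draws 300 samples from $p_1$ and tracks two running estimates: the plain sample average, which converges to $\mathbb{E}_{X\sim p_1}[X]=0.6$, and the importance-weighted average, which converges to $\mathbb{E}_{X\sim p_0}[X]=0$.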

```python
import numpy as np
import matplotlib.pyplot as plt

# reproducible
np.random.seed(0)

# the two outcomes and their probabilities under p0 and p1
elements = [1, -1]
probs1 = [0.5, 0.5]   # p0
probs2 = [0.8, 0.2]   # p1

# importance sampling: draw from p1, estimate the expectation under p0
sample_times = 300
sample_list = []        # raw samples drawn from p1
i_sample_list = []      # importance-weighted samples p0(x)/p1(x) * x
average_list = []       # running plain average   -> E_{p1}[X] = 0.6
importance_list = []    # running weighted average -> E_{p0}[X] = 0
for i in range(sample_times):
    sample = np.random.choice(elements, p=probs2)
    sample_list.append(sample)
    average_list.append(np.mean(sample_list))
    if sample == elements[0]:
        i_sample_list.append(probs1[0] / probs2[0] * sample)
    elif sample == elements[1]:
        i_sample_list.append(probs1[1] / probs2[1] * sample)
    importance_list.append(np.mean(i_sample_list))

plt.plot(range(len(sample_list)), sample_list, 'o', markerfacecolor='none', label='sample data')
plt.plot(range(len(average_list)), average_list, 'b--', label='average')
plt.plot(range(len(importance_list)), importance_list, 'g--', label='importance sampling')
plt.axhline(y=0.6, color='r', linestyle='--')
plt.axhline(y=0, color='r', linestyle='--')
plt.ylim(-1.5, 2.5)         # limit the visible y range
plt.xlim(0, sample_times)   # limit the visible x range
plt.legend(loc='upper right')
plt.show()
```

(Figure: the plain sample average converges to 0.6, while the importance-weighted average converges to 0.)

Off-policy policy gradient theorem

With importance sampling, we are ready to present the off-policy policy gradient theorem. Suppose that $\beta$ is a behavior policy. Our goal is to use the samples generated by the behavior policy $\beta$ to learn a target policy $\pi$ that maximizes the following metric:
$$
\max_\theta J(\theta) = \mathbb{E}_{S\sim d_\beta}[v_\pi(S)]
$$
Theorem 10.1 (Stochastic off-policy policy gradient theorem). In the discounted case where $\gamma\in(0,1)$, the gradient of $J(\theta)$ is
$$
\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\underbrace{\frac{\pi(A|S;\theta)}{\beta(A|S)}}_{\text{importance weight}} \nabla_\theta \ln \pi(A|S;\theta)\, q_\pi(S,A) \Big]
$$
where the state distribution $\rho$ is
$$
\rho(s) \triangleq \sum_{s^\prime\in\mathcal{S}} d_\beta(s^\prime) \Pr_\pi(s|s^\prime)
$$
and $\Pr_\pi(s|s^\prime)=\sum_{k=0}^\infty \gamma^k[P^k_\pi]_{s^\prime,s}=[(I-\gamma P_\pi)^{-1}]_{s^\prime,s}$ is the discounted total probability of transitioning from $s^\prime$ to $s$ under policy $\pi$.

The off-policy policy gradient is also invariant to an additional baseline $b(S)$. In particular, we have
$$
\nabla_\theta J(\theta) = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\frac{\pi(A|S;\theta)}{\beta(A|S)} \nabla_\theta \ln \pi(A|S;\theta) \big( q_\pi(S,A) - b(S) \big) \Big]
$$
When we take the state value as the baseline, $b(S)=v_\pi(S)$, we again obtain the advantage function
$$
\delta_\pi(S,A) = q_\pi(S,A) - v_\pi(S)
$$
The corresponding stochastic gradient-ascent algorithm is
$$
\theta_{t+1} = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)} \nabla_\theta \ln\pi(a_t|s_t;\theta_t)\big(q_t(s_t,a_t)-v_t(s_t)\big)
$$
The advantage function can be replaced by the TD error, that is,
$$
q_t(s_t,a_t)-v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \triangleq \delta_t(s_t,a_t)
$$
In off-policy A2C we use the behavior policy $\beta$ to collect samples, and we learn one policy network $\pi(a|s;\theta)$ and one value network $v(s;w)$. The core idea of off-policy A2C is

$$
\text{off-policy A2C}: \left\{
\begin{aligned}
\text{Behavior policy}:\ & a_t \sim \beta(\cdot|s_t) \\
\text{Advantage}: \delta_t & = r_{t+1} + \gamma v(s_{t+1};w_t) - v(s_t;w_t) \\
\text{Actor}: \theta_{t+1} & = \theta_t + \alpha_\theta \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)}\, \delta_t\,\nabla_\theta \ln\pi(a_t|s_t;\theta_t) \\
\text{Critic}: w_{t+1} & = w_t + \alpha_w \frac{\pi(a_t|s_t;\theta_t)}{\beta(a_t|s_t)}\,\delta_t\, \nabla_w v(s_t;w_t)
\end{aligned}
\right.
$$

We can also write the objective functions behind the update rules:

$$
\text{off-policy A2C}: \left\{
\begin{aligned}
\text{Behavior policy}:\ & a_t \sim \beta(\cdot|s_t) \\
\text{Advantage}: \Delta(S) & = R + \gamma v(S^\prime;w) - v(S;w) \\
\text{Actor}: \max_\theta J(\theta) & = \mathbb{E}_{S\sim\rho,\, A\sim\beta}\Big[\frac{\pi(A|S;\theta)}{\beta(A|S)}\,\Delta(S)\, \ln\pi(A|S;\theta)\Big] \\
\text{Critic}: \min_w J(w) & = \mathbb{E}_{S\sim\rho}\big[\big(R + \gamma v(S^\prime;w) - v(S;w)\big)^2\big] = \mathbb{E}_{S\sim\rho}\big[\Delta(S)^2\big]
\end{aligned}
\right.
$$

Pseudocode

(Figure: off-policy A2C pseudocode)
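
Finally, a minimal off-policy A2C sketch on the same made-up MDP: actions are drawn from a fixed uniform behavior policy $\beta$, and both the actor and the critic updates are scaled by the importance weight $\pi(a_t|s_t;\theta)/\beta(a_t|s_t)$. As before, every environment detail and hyper-parameter is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# the same made-up 2-state, 2-action MDP as in the earlier sketches
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])

theta = np.zeros((n_states, n_actions))                 # target policy pi(a|s; theta)
w = np.zeros(n_states)                                  # state value v(s; w)
alpha_theta, alpha_w = 0.01, 0.05
beta = np.full((n_states, n_actions), 1.0 / n_actions)  # fixed uniform behavior policy

def policy(s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

s = 0
for t in range(50000):
    a = rng.choice(n_actions, p=beta[s])                # act with the behavior policy beta
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]

    rho = policy(s)[a] / beta[s, a]                     # importance weight pi / beta
    delta = r + gamma * w[s_next] - w[s]                # TD error as advantage estimate

    # Critic and actor updates, both scaled by the importance weight
    w[s] += alpha_w * rho * delta
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * rho * delta * grad_log_pi

    s = s_next

print("learned target policy per state:", np.round([policy(s) for s in range(n_states)], 3))
```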

Reference

Zhao Shiyu's course, Mathematical Foundation of Reinforcement Learning.
