Reinforcement Learning with Code
This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Much of the material is referenced from ZhaoShiyu's Mathematical Foundation of Reinforcement Learning.
Table of Contents
- Reinforcement Learning with Code
- Chapter 7. Temporal-Difference Learning
- 7.1 TD learning of state value
- 7.2 TD learning of action value: Sarsa
- 7.3 TD learning of action value: Expected Sarsa
- 7.4 TD learning of action values: n-step Sarsa
- 7.5 TD learning of optimal action values: Q-learning
- Reference
Chapter 7. Temporal-Difference Learning
Temporal-difference (TD) algorithms can be seen as special Robbins-Monro (RM) algorithms that solve the expectation form of the Bellman equation or the Bellman optimality equation.
7.1 TD learning of state value
Recall the Bellman equation in section 2.2
$$
v_\pi(s) = \sum_a \pi(a|s) \Big(\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a)\, v_\pi(s^\prime) \Big) \quad \text{(elementwise form)}
$$

$$
v_\pi = r_\pi + \gamma P_\pi v_\pi \quad \text{(matrix-vector form)}
$$
where
$$
[r_\pi]_s \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)r, \qquad [P_\pi]_{s,s^\prime} \triangleq \sum_a \pi(a|s)\, p(s^\prime|s,a)
$$
Recall the definition of state value: the state value is the mean of all possible returns starting from a state, i.e., the expectation of the return obtained from that state. We can therefore rewrite the above equation as
$$
\textcolor{red}{v_\pi = \mathbb{E}[R+\gamma v_\pi]} \quad \text{(matrix-vector form)}
$$

$$
\textcolor{red}{v_\pi(s) = \mathbb{E}[R+\gamma v_\pi(S^\prime) \mid S=s]}, \quad s\in\mathcal{S} \quad \text{(elementwise form)}
$$
where $S$, $S^\prime$, and $R$ are random variables representing the current state, the next state, and the immediate reward, respectively. This equation is also called the Bellman expectation equation.
We can use the Robbins-Monro (RM) algorithm introduced in Chapter 6 to solve the Bellman expectation equation. Reformulate the problem as finding the root $v_\pi(s)$ of $g(v_\pi(s)) = v_\pi(s) - \mathbb{E}[R+\gamma v_\pi(S^\prime) \mid S=s] = 0$, where $S^\prime$ is sampled from the transition distribution given $S=s$. We can only obtain measurements corrupted by noise:
$$
\begin{aligned}
\tilde{g}(v_\pi(s),\eta) & = v_\pi(s) - \big(r+\gamma v_\pi(s^\prime)\big) \\
& = \underbrace{v_\pi(s) - \mathbb{E}[R+\gamma v_\pi(S^\prime) \mid S=s]}_{g(v_\pi(s))} + \underbrace{\Big( \mathbb{E}[R+\gamma v_\pi(S^\prime) \mid S=s] - \big(r+\gamma v_\pi(s^\prime)\big) \Big)}_{\eta}
\end{aligned}
$$
Hence, according to the Robbins-Monro algorithm, we can get the TD learning algorithm as
$$
v_{k+1}(s) = v_k(s) - \alpha_k \Big(v_k(s) - \big(r_k+\gamma v_\pi(s^\prime_k)\big) \Big)
$$
We make some modifications to remove the assumptions required above. One modification is that the sampled data $\{(s, r_k, s_k^\prime)\}$ is replaced by the sequential samples $\{(s_t, r_{t+1}, s_{t+1})\}$ generated along an episode; another is that the unknown $v_\pi(s^\prime_k)$ is replaced by the current estimate $v_t(s_{t+1})$. Due to these modifications the algorithm is called temporal-difference learning. Written more concisely:
$$
\text{TD learning} : \left\{
\begin{aligned}
\textcolor{red}{\underbrace{v_{t+1}(s_t)}_{\text{new estimation}}} & \textcolor{red}{\;= \underbrace{v_t(s_t)}_{\text{current estimation}} - \,\alpha_t(s_t) \overbrace{\Big[v_t(s_t) - \underbrace{\big(r_{t+1} +\gamma v_t(s_{t+1})\big)}_{\text{TD target } \bar{v}_t} \Big]}^{\text{TD error or innovation } \delta_t}} \\
\textcolor{red}{v_{t+1}(s)} & \textcolor{red}{\;= v_t(s)}, \quad \text{for all } s\ne s_t
\end{aligned}
\right.
$$
where $t=0,1,2,\dots$. Here, $v_t(s_t)$ is the estimate of the state value $v_\pi(s_t)$, and $\alpha_t(s_t)$ is the learning rate for $s_t$ at time $t$. Moreover,
$$
\bar{v}_t \triangleq r_{t+1}+\gamma v_t(s_{t+1})
$$
is called the TD target and
$$
\delta_t \triangleq v_t(s_t) - \big(r_{t+1}+\gamma v_t(s_{t+1})\big) = v_t(s_t) - \bar{v}_t
$$
is called the TD error. The TD error reflects the discrepancy between the current estimate $v_t$ and the true state value $v_\pi$.
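To make the update rule concrete, here is a minimal Python sketch of tabular TD(0) state-value estimation. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`, and `env.num_states`) and the `policy(s)` callable are illustrative assumptions, not part of the original note.

```python
import numpy as np

def td0_state_values(env, policy, num_episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0): estimate v_pi for a fixed policy (illustrative env interface assumed)."""
    v = np.zeros(env.num_states)              # current estimates v_t(s)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                     # act according to the given policy pi
            s_next, r, done = env.step(a)
            td_target = r + gamma * (0.0 if done else v[s_next])
            td_error = v[s] - td_target       # delta_t = v_t(s_t) - (r_{t+1} + gamma * v_t(s_{t+1}))
            v[s] -= alpha * td_error          # update only the visited state s_t
            s = s_next
    return v
```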
7.2 TD learning of action value: Sarsa
Sarsa is an algorithm that directly estimates action values. Estimating action values is important because the policy can be improved based on them.
Recall the Bellman equation of action value in section 2.5
$$
\begin{aligned}
q_\pi(s,a) & = \sum_r p(r|s,a)r + \gamma\sum_{s^\prime} p(s^\prime|s,a)\, v_\pi(s^\prime) \\
& = \sum_r p(r|s,a)r + \gamma\sum_{s^\prime} p(s^\prime|s,a) \sum_{a^\prime \in \mathcal{A}(s^\prime)}\pi(a^\prime|s^\prime)\, q_\pi(s^\prime,a^\prime) \quad \text{(elementwise form)}
\end{aligned}
$$
Using the chain rule of conditional probability, $p(a,b)=p(b)\,p(a|b)$, we have
$$
\begin{aligned}
p(s^\prime, a^\prime \mid s,a) & = p(s^\prime|s,a)\, p(a^\prime|s^\prime, s, a) && \text{(conditional probability)} \\
& = p(s^\prime|s,a)\, p(a^\prime|s^\prime) && \text{(conditional independence)} \\
& = p(s^\prime|s,a)\, \pi(a^\prime|s^\prime)
\end{aligned}
$$
Substituting this into the Bellman equation above, we have
$$
q_\pi(s,a) = \sum_r p(r|s,a)r + \gamma \sum_{s^\prime} \sum_{a^\prime} p(s^\prime,a^\prime|s,a)\, q_\pi(s^\prime,a^\prime)
$$
Regard $p(r|s,a)$ and $p(s^\prime,a^\prime|s,a)$ as the distributions of the random variables $R$ and $(S^\prime, A^\prime)$, respectively. Then the above equation can be rewritten in expectation form:
$$
\textcolor{red}{ q_\pi(s,a) = \mathbb{E}\Big[ R + \gamma q_\pi(S^\prime,A^\prime) \,\Big|\, S=s, A=a\Big] }, \quad \text{for all } s,a \quad \text{(expectation form)}
$$
where $R$, $S$, $S^\prime$, and $A^\prime$ are random variables denoting the immediate reward, the current state, the next state, and the next action, respectively.
Hence, we can use the Robbins-Monro algorithm to solve the Bellman equation of action values. We can define
$$
g(q_\pi(s,a)) \triangleq q_\pi(s,a) - \mathbb{E}\Big[ R + \gamma q_\pi(S^\prime,A^\prime) \,\Big|\, S=s, A=a\Big]
$$
We can only obtain observations corrupted by noise:
$$
\begin{aligned}
\tilde{g}\big(q_\pi(s,a),\eta\big) & = q_\pi(s,a) - \big(r+\gamma q_\pi(s^\prime,a^\prime)\big) \\
& = \underbrace{q_\pi(s,a) - \mathbb{E}\Big[ R + \gamma q_\pi(S^\prime,A^\prime)\,\Big|\, S=s, A=a\Big]}_{g(q_\pi(s,a))} + \underbrace{\Big(\mathbb{E}\Big[ R + \gamma q_\pi(S^\prime,A^\prime) \,\Big|\, S=s, A=a\Big] - \big(r+\gamma q_\pi(s^\prime,a^\prime)\big) \Big)}_{\eta}
\end{aligned}
$$
Hence, according to the Robbins-Monro algorithm, we can get Sarsa as
$$
q_{k+1}(s,a) = q_k(s,a) - \alpha_k \Big[ q_k(s,a) - \big(r_k+\gamma q_k(s^\prime_k,a^\prime_k) \big) \Big]
$$
Similar to TD learning of state values in the last section, we modify the above equation: the sampled data $(s,a,r_k,s^\prime_k,a^\prime_k)$ is replaced by $(s_t,a_t,r_{t+1},s_{t+1},a_{t+1})$. Hence, Sarsa becomes
$$
\text{Sarsa} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{\;= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - \big(r_{t+1} +\gamma q_t(s_{t+1},a_{t+1})\big) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{\;= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$
where $t=0,1,2,\dots$. Here, $q_t(s_t,a_t)$ is the estimate of the action value of $(s_t,a_t)$, and $\alpha_t(s_t,a_t)$ is the learning rate depending on $(s_t,a_t)$.
Sarsa is nothing but an action-value version of the TD algorithm. Sarsa is usually combined with a policy improvement step such as the $\epsilon$-greedy strategy. One point should be noted: in the $q$-value update step, unlike model-based policy iteration or value iteration, where the values of all states are updated in each iteration, Sarsa only updates the single state-action pair visited at time step $t$.
Pseudocode:
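The pseudocode figure is not reproduced here; as a substitute, below is a minimal Python sketch of tabular Sarsa combined with ε-greedy policy improvement. The environment interface (`env.num_states`, `env.num_actions`, `env.reset()`, `env.step(a)`) is an illustrative assumption.

```python
import numpy as np

def sarsa(env, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Sarsa with an epsilon-greedy policy (illustrative env interface assumed)."""
    q = np.zeros((env.num_states, env.num_actions))

    def eps_greedy(s):
        # policy improvement step: epsilon-greedy w.r.t. the current q estimates
        if np.random.rand() < epsilon:
            return np.random.randint(env.num_actions)
        return int(np.argmax(q[s]))

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)                # A_{t+1}, sampled from the same policy
            td_target = r + gamma * (0.0 if done else q[s_next, a_next])
            q[s, a] -= alpha * (q[s, a] - td_target)   # update only the visited (s_t, a_t)
            s, a = s_next, a_next
    return q
```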
7.3 TD learning of action value: Expected Sarsa
Recall the Bellman equation of action value
$$
q_\pi(s,a) = \sum_r p(r|s,a)r + \gamma\sum_{s^\prime} p(s^\prime|s,a)\, v_\pi(s^\prime) \quad \text{(elementwise form)}
$$
Regard $p(r|s,a)$ and $p(s^\prime|s,a)$ as the distributions of the random variables $R$ and $S^\prime$, respectively. Then we have the expectation form of the Bellman equation of action values:
$$
q_\pi(s,a) = \mathbb{E}[R + \gamma v_\pi(S^\prime) \mid S=s, A=a] \quad \text{(expectation form)} \qquad (1)
$$
According to the definition of state value we have
$$
\begin{aligned}
\mathbb{E}[q_\pi(s, A) \mid s] & = \sum_{a\in\mathcal{A}(s)} \pi(a|s)\, q_\pi(s,a) = v_\pi(s) \\
\Rightarrow\; \mathbb{E}[q_\pi(S^\prime, A) \mid S^\prime] & = v_\pi(S^\prime) \qquad (2)
\end{aligned}
$$
Substituting $(2)$ into $(1)$, we have
$$
\textcolor{red}{q_\pi(s,a) = \mathbb{E} \Big[ R+\gamma\, \mathbb{E}\big[ q_\pi(S^\prime, A)\mid S^\prime \big] \,\Big|\, S=s, A=a \Big]}, \quad \text{for all } s,a \quad \text{(expectation form)}
$$
Rewrite it in root-finding form:
$$
g(q_\pi(s,a)) \triangleq q_\pi(s,a) - \mathbb{E} \Big[ R+\gamma\, \mathbb{E}\big[ q_\pi(S^\prime, A)\mid S^\prime \big] \,\Big|\, S=s, A=a \Big]
$$
We can only obtain the observation corrupted by noise $\eta$:
$$
\begin{aligned}
\tilde{g}(q_\pi(s,a), \eta) & = q_\pi(s,a) - \Big(r + \gamma\, \mathbb{E}\big[ q_\pi(s^\prime, A)\mid s^\prime \big] \Big) \\
& = \underbrace{q_\pi(s,a) - \mathbb{E} \Big[ R+\gamma\, \mathbb{E}\big[ q_\pi(S^\prime, A)\mid S^\prime \big] \,\Big|\, S=s, A=a \Big]}_{g(q_\pi(s,a))} \\
&\quad + \underbrace{\mathbb{E} \Big[ R+\gamma\, \mathbb{E}\big[ q_\pi(S^\prime, A)\mid S^\prime \big] \,\Big|\, S=s, A=a \Big] - \Big(r + \gamma\, \mathbb{E}\big[ q_\pi(s^\prime, A)\mid s^\prime \big] \Big)}_{\eta}
\end{aligned}
$$
Hence, we can apply the Robbins-Monro algorithm to find the root of $g(q_\pi(s,a))$:
$$
q_{k+1}(s,a) = q_k(s,a) - \alpha_k(s,a) \Big[ q_k(s,a) - \Big(r_k + \gamma\, \mathbb{E}\big[ q_k(s^\prime_k, A)\mid s^\prime_k \big] \Big) \Big]
$$
Similar to TD learning of state values, we modify the above equation: the sampled data $(s,a,r_k,s^\prime_k)$ is replaced by $(s_t,a_t,r_{t+1},s_{t+1})$. Hence, Expected Sarsa becomes
$$
\text{Expected-Sarsa} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{\;= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - \big( r_{t+1} +\gamma\, \mathbb{E}\big[q_t(s_{t+1},A)\mid s_{t+1}\big] \big) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{\;= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$
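Compared with Sarsa, the only change is that the sampled $q_t(s_{t+1},a_{t+1})$ is replaced by the expectation $\mathbb{E}[q_t(s_{t+1},A)\mid s_{t+1}] = \sum_a \pi_t(a|s_{t+1})\, q_t(s_{t+1},a)$. Below is a minimal Python sketch of this update under an ε-greedy policy; the environment interface is again an illustrative assumption.

```python
import numpy as np

def expected_sarsa(env, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Expected Sarsa (illustrative env interface assumed)."""
    n_actions = env.num_actions
    q = np.zeros((env.num_states, n_actions))

    def policy_probs(s):
        # epsilon-greedy distribution pi(a|s) over all actions
        probs = np.full(n_actions, epsilon / n_actions)
        probs[int(np.argmax(q[s]))] += 1.0 - epsilon
        return probs

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = np.random.choice(n_actions, p=policy_probs(s))
            s_next, r, done = env.step(a)
            # expected next-state value: sum_a pi(a|s') q(s', a)
            expected_q = np.dot(policy_probs(s_next), q[s_next])
            td_target = r + gamma * (0.0 if done else expected_q)
            q[s, a] -= alpha * (q[s, a] - td_target)
            s = s_next
    return q
```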
7.4 TD learning of action values: n-step Sarsa
Recall that the action value is defined as
$$
q_\pi(s,a) = \mathbb{E}[G_t \mid S_t=s, A_t=a]
$$
The discounted return $G_t$ can be written in different forms as
$$
\begin{aligned}
\text{Sarsa} \longleftarrow \quad G_t^{(1)} & = R_{t+1} + \gamma q_\pi(S_{t+1},A_{t+1}) \\
G_t^{(2)} & = R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2},A_{t+2}) \\
& \;\;\vdots \\
n\text{-step Sarsa} \longleftarrow \quad G_t^{(n)} & = R_{t+1} + \gamma R_{t+2} + \cdots +\gamma^n q_\pi(S_{t+n},A_{t+n}) \\
& \;\;\vdots \\
\text{Monte Carlo} \longleftarrow \quad G_t^{(\infty)} & = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3}+ \cdots
\end{aligned}
$$
It should be noted that $G_t = G_t^{(1)} = G_t^{(2)} = G_t^{(n)} = G_t^{(\infty)}$, where the superscripts merely indicate the different decomposition structures of $G_t$.
Sarsa aims to solve
$$
q_\pi(s,a) = \mathbb{E}[G_t^{(1)} \mid s,a] = \mathbb{E}\big[R_{t+1}+\gamma q_\pi(S_{t+1},A_{t+1}) \mid s,a \big]
$$
MC learning aims to solve
$$
q_\pi(s,a) = \mathbb{E}[G_t^{(\infty)} \mid s,a] = \mathbb{E}\big[R_{t+1}+\gamma R_{t+2} + \gamma^2 R_{t+3}+\cdots \mid s,a\big]
$$
$n$-step Sarsa aims to solve
$$
q_\pi(s,a) = \mathbb{E}[G_t^{(n)} \mid s,a] = \mathbb{E}\big[R_{t+1}+\gamma R_{t+2} +\cdots + \gamma^n q_\pi(S_{t+n},A_{t+n}) \mid s,a\big]
$$
The $n$-step Sarsa algorithm is
$$
n\text{-step Sarsa} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{\;= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - \big(r_{t+1}+ \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n},a_{t+n})\big) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{\;= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$
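Here is a rough Python sketch of the $n$-step Sarsa update. Because $q_t(s_t,a_t)$ can only be updated after the next $n$ transitions are observed, the sketch buffers an episode and updates each visited pair once enough rewards are available (truncating the return at the end of the episode). The environment interface is an illustrative assumption, and the code favors clarity over efficiency.

```python
import numpy as np

def n_step_sarsa(env, n=3, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular n-step Sarsa with an epsilon-greedy policy (illustrative env interface assumed)."""
    q = np.zeros((env.num_states, env.num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(env.num_actions)
        return int(np.argmax(q[s]))

    for _ in range(num_episodes):
        states = [env.reset()]
        actions = [eps_greedy(states[0])]
        rewards = []
        done = False
        while not done:
            s_next, r, done = env.step(actions[-1])
            rewards.append(r)
            if not done:
                states.append(s_next)
                actions.append(eps_greedy(s_next))
            if len(rewards) >= n:
                # update the pair visited n steps ago with an n-step return
                t = len(rewards) - n
                g = sum(gamma**k * rewards[t + k] for k in range(n))
                if not done:
                    g += gamma**n * q[states[t + n], actions[t + n]]
                q[states[t], actions[t]] -= alpha * (q[states[t], actions[t]] - g)
        # remaining pairs near the end of the episode use truncated returns
        for t in range(max(0, len(rewards) - n + 1), len(rewards)):
            g = sum(gamma**k * rewards[t + k] for k in range(len(rewards) - t))
            q[states[t], actions[t]] -= alpha * (q[states[t], actions[t]] - g)
    return q
```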
7.5 TD learning of optimal action values: Q-learning
It should be noted that Sarsa can only estimate the action values of a given policy. It must be combined with a policy improvement step to find optimal policies and hence their optimal action values. By contrast, Q-learning can directly estimate optimal action values.
Recall the Bellman optimality equation of state value in section 3.2
$$
\begin{aligned}
v(s) & = \max_\pi \sum_{a\in\mathcal{A}(s)} \pi(a|s) \Big[ \sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^{\prime}|s,a)\, v(s^{\prime}) \Big] \\
v(s) & = \max_{a\in\mathcal{A}(s)} \Big[\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a)\, v(s^\prime) \Big]
\end{aligned}
$$
where $v(s) \triangleq \max_{a\in\mathcal{A}(s)} q(s,a)$. Hence we have
$$
\begin{aligned}
\max_{a\in\mathcal{A}(s)} q(s,a) & = \max_{a\in\mathcal{A}(s)} \Big[\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a)\, v(s^\prime) \Big] \\
\max_{a\in\mathcal{A}(s)} q(s,a) & = \max_{a\in\mathcal{A}(s)} \Big[\sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) \max_{a^\prime\in\mathcal{A}(s^\prime)} q(s^\prime,a^\prime) \Big] \\
\Rightarrow\; q(s,a) & = \sum_r p(r|s,a)r + \gamma \sum_{s^\prime} p(s^\prime|s,a) \max_{a^\prime\in\mathcal{A}(s^\prime)} q(s^\prime,a^\prime) \quad \text{(elementwise form)}
\end{aligned}
$$
Rewrite it into expectation form
$$
\textcolor{red}{ q(s,a) = \mathbb{E}\Big[R+\gamma \max_{a^\prime\in\mathcal{A}(S^\prime)} q(S^\prime,a^\prime) \,\Big|\, S=s, A=a \Big] }, \quad \text{for all } s,a \quad \text{(expectation form)}
$$
This equation is the Bellman optimality equation expressed in terms of action values.
Rewrite it in root-finding form:
$$
g(q(s,a)) \triangleq q(s,a) - \mathbb{E}\Big[R+\gamma \max_{a^\prime\in\mathcal{A}(S^\prime)} q(S^\prime,a^\prime) \,\Big|\, S=s, A=a \Big]
$$
We can only obtain the observation corrupted by noise:
$$
\begin{aligned}
\tilde{g}(q(s,a),\eta) & = q(s,a) - \Big(r + \gamma \max_{a^\prime\in\mathcal{A}(s^\prime)} q(s^\prime,a^\prime) \Big) \\
& = \underbrace{q(s,a) - \mathbb{E}\Big[R+\gamma \max_{a^\prime\in\mathcal{A}(S^\prime)} q(S^\prime,a^\prime) \,\Big|\, S=s, A=a \Big]}_{g(q(s,a))} \\
&\quad + \underbrace{\mathbb{E}\Big[R+\gamma \max_{a^\prime\in\mathcal{A}(S^\prime)} q(S^\prime,a^\prime) \,\Big|\, S=s, A=a \Big] - \Big(r + \gamma \max_{a^\prime\in\mathcal{A}(s^\prime)} q(s^\prime,a^\prime) \Big)}_{\eta}
\end{aligned}
$$
Hence, we can apply the Robbins-Monro algorithm to find the root:
$$
q_{k+1}(s,a) = q_k(s,a) - \alpha_k(s,a) \Big[q_k(s,a) - \Big(r_k + \gamma \max_{a^\prime\in\mathcal{A}(s^\prime_k)} q_k(s^\prime_k,a^\prime) \Big) \Big]
$$
Similar to TD learning of state values, we modify the above equation: the sampled data $(s,a,r_k,s^\prime_k)$ is replaced by $(s_t,a_t,r_{t+1},s_{t+1})$. Hence, Q-learning becomes
$$
\text{Q-learning} : \left\{
\begin{aligned}
\textcolor{red}{q_{t+1}(s_t,a_t)} & \textcolor{red}{\;= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[q_t(s_t,a_t) - \big(r_{t+1}+ \gamma \max_{a^\prime\in\mathcal{A}(s_{t+1})} q_t(s_{t+1},a^\prime)\big) \Big]} \\
\textcolor{red}{q_{t+1}(s,a)} & \textcolor{red}{\;= q_t(s,a)}, \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$
Off-policy vs on-policy:
There exist two policies in a TD learning task: the behavior policy and the target policy. The behavior policy is used to generate experience samples. The target policy is constantly updated toward an optimal policy. When the behavior policy is the same as the target policy, the learning is called on-policy. When they are different, the learning is called off-policy.
The advantage of off-policy learning over on-policy learning is that it can search for optimal policies based on experience samples generated by any other policy.
How do we determine whether an algorithm is on-policy or off-policy? If the algorithm solves a Bellman equation, it is on-policy, because the Bellman equation finds the state or action values under a given policy. If the algorithm solves the Bellman optimality equation, it is off-policy, because the Bellman optimality equation does not involve any particular policy, and hence the behavior policy and the target policy can be different.
Online learning vs offline learning:
Online learning refers to the case where the value and policy can be updated as soon as an experience sample is obtained. Offline learning refers to the case where the update can only be done after all experience samples have been collected. For example, TD learning is online, whereas Monte Carlo learning is offline.
Pseudocode:
(On-policy version)
(Off-policy version)
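The pseudocode figures are not reproduced here; as a substitute, below is a minimal Python sketch of the off-policy version of Q-learning: an ε-greedy behavior policy generates the samples, while the target policy is the greedy policy implicit in the max operator of the update. The environment interface is an illustrative assumption.

```python
import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning, off-policy version (illustrative env interface assumed)."""
    q = np.zeros((env.num_states, env.num_actions))

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # behavior policy: epsilon-greedy, used only to generate experience
            if np.random.rand() < epsilon:
                a = np.random.randint(env.num_actions)
            else:
                a = int(np.argmax(q[s]))
            s_next, r, done = env.step(a)
            # target policy: greedy, realized through the max over next-state action values
            td_target = r + gamma * (0.0 if done else np.max(q[s_next]))
            q[s, a] -= alpha * (q[s, a] - td_target)
            s = s_next

    greedy_policy = np.argmax(q, axis=1)   # the learned target policy is greedy w.r.t. q
    return q, greedy_policy
```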
Reference
Professor Zhao Shiyu's course, Mathematical Foundation of Reinforcement Learning