【RL】Basic Concepts in Reinforcement Learning

news2025/7/3 9:39:19

Lecture 1: Basic Concepts in Reinforcement Learning

MDP (Markov Decision Process)

Key Elements of MDP

Set

State: the set of states $\mathcal{S}$

Action: the set of actions $\mathcal{A}(s)$ associated with each state $s \in \mathcal{S}$

Reward: the set of rewards $\mathcal{R}(s, a)$ associated with taking action $a \in \mathcal{A}(s)$ at state $s \in \mathcal{S}$; each reward is a real number

Probability distribution

State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is
$p(s' | s, a)$
Reward probability: at state $s$, taking action $a$, the probability of getting reward $r$ is
$p(r | s, a)$

Policy

Policy: at state $s \in \mathcal{S}$, the probability of choosing action $a \in \mathcal{A}(s)$ is
$\pi (a | s)$

Markov Property

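Markov property: the memoryless property. The next state and the next reward depend only on the current state and action, not on the earlier history.
$\begin{align*} &p(s_{t + 1} | a_{t + 1}, s_t, ..., a_1, s_0) = p(s_{t + 1} | a_{t + 1}, s_t) \\ &p(r_{t + 1} | a_{t + 1}, s_t, ..., a_1, s_0) = p(r_{t + 1} | a_{t + 1}, s_t) \end{align*}$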

Grid-World Example

The grid-world example below illustrates the concepts above.

[Figure: the grid world]

Key Elements

State: the position of each cell is a state, so there are 9 states $s_1, s_2, ..., s_9$.

State space: $\mathcal{S} = \{ s_i \}^9_{i=1}$

Action: each state has five possible actions

$a_1$ : move upwards
$a_2$ : move rightwards
$a_3$ : move downwards
$a_4$ : move leftwards
$a_5$ : stay unchanged

Action space of a state: the set of all possible actions at that state, $\mathcal{A}(s)= \{ a_i \}^5_{i=1}$
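The state space and action space above can be written down directly. The following is a minimal sketch, assuming the 9 cells form a 3×3 grid numbered row by row (an assumption consistent with the transitions used later, such as $s_2 \to s_5$ when moving down). The names `STATES`, `ACTIONS`, `action_space`, `to_cell`, and `to_state` are illustrative, not from the lecture.

```python
# States s_1..s_9, assumed to be laid out row by row on a 3x3 grid.
N_ROWS, N_COLS = 3, 3
STATES = list(range(1, N_ROWS * N_COLS + 1))

# Actions a_1..a_5, stored as (row offset, column offset).
ACTIONS = {
    1: (-1, 0),  # a_1: move upwards
    2: (0, 1),   # a_2: move rightwards
    3: (1, 0),   # a_3: move downwards
    4: (0, -1),  # a_4: move leftwards
    5: (0, 0),   # a_5: stay unchanged
}


def action_space(state):
    """A(s): in this example every state has the same five actions."""
    return list(ACTIONS)


def to_cell(state):
    """Convert a state index (1-based) to a (row, col) pair (0-based)."""
    return divmod(state - 1, N_COLS)


def to_state(row, col):
    """Convert a (row, col) pair (0-based) back to a state index (1-based)."""
    return row * N_COLS + col + 1
```

Keeping the action space as a function of the state mirrors the notation $\mathcal{A}(s)$, even though here it returns the same list for every state.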

state transition

When taking an action, the agent may move from one state to another. For example, $s_1 \stackrel{a_2}{\longrightarrow}s_2$

Tabular representation: use a table to describe the state transitions
State transition probability: use probabilities to describe the state transitions
- Intuition: at state $s_1$, if we choose action $a_2$, the next state is $s_2$
- Math:
  $\begin{align*} &p(s_2 | s_1, a_2) = 1\\ &p(s_i | s_1, a_2) = 0 \;\;\; \forall i \ne 2 \end{align*}$
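A deterministic transition such as $p(s_2 | s_1, a_2) = 1$ can be sketched as a function. This is only an illustration under two assumptions that the notes do not state explicitly: the grid is 3×3 and numbered row by row, and an attempt to leave the grid leaves the agent in its current cell. `next_state` and `transition_prob` are hypothetical helper names.

```python
N_ROWS, N_COLS = 3, 3
# Actions a_1..a_5 as (row offset, column offset): up, right, down, left, stay.
MOVES = {1: (-1, 0), 2: (0, 1), 3: (1, 0), 4: (0, -1), 5: (0, 0)}


def next_state(s, a):
    """Successor of state s under action a in a deterministic grid world."""
    row, col = divmod(s - 1, N_COLS)
    d_row, d_col = MOVES[a]
    new_row, new_col = row + d_row, col + d_col
    # Assumption: trying to leave the grid keeps the agent where it is.
    if not (0 <= new_row < N_ROWS and 0 <= new_col < N_COLS):
        return s
    return new_row * N_COLS + new_col + 1


def transition_prob(s_next, s, a):
    """p(s' | s, a): 1 for the single successor state, 0 for all others."""
    return 1.0 if s_next == next_state(s, a) else 0.0


# p(s_i | s_1, a_2) for i = 1..9: all the mass sits on s_2.
row_for_s1_a2 = [transition_prob(i, 1, 2) for i in range(1, 10)]
```

A stochastic environment would replace the 0/1 values with a full distribution over $s'$ that still sums to 1 for each $(s, a)$ pair.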

Policy

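A policy tells the agent which action to take at each state.

Intuitive representation:

[Figure: the policy drawn on the grid]

Based on this policy, the optimal paths for different start and end cells are as follows:

[Figures: paths under the policy]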

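Mathematical representation: use conditional probabilities. For example, a stochastic policy at state $s_1$:
$\begin{align*} &\pi(a_1 | s_1) = 0\\ &\pi(a_2 | s_1) = 0.5 \\ &\pi(a_3 | s_1) = 0.5 \\ &\pi(a_4 | s_1) = 0 \\ &\pi(a_5 | s_1) = 0 \end{align*}$
Tabular representation: use a table to describe the policy

[Figure: tabular representation of the policy]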
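A tabular policy maps each state to a distribution over actions, and acting under it means sampling from that distribution. The sketch below stores only the row for $s_1$ given in the example; the dictionary layout and the name `sample_action` are illustrative choices rather than part of the lecture.

```python
import random

# pi(a | s_1) from the example, keyed by action index (a_1..a_5).
POLICY = {
    1: {1: 0.0, 2: 0.5, 3: 0.5, 4: 0.0, 5: 0.0},
}


def sample_action(policy, state, rng=random):
    """Draw one action index according to pi(. | state)."""
    probs = policy[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Usage: at s_1 the agent moves right (a_2) or down (a_3) with equal probability.
action = sample_action(POLICY, 1)
```

A deterministic policy is the special case in which one action has probability 1 and the rest have probability 0.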

Reward

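A reward is a real number obtained after taking an action.

A positive reward encourages taking that action
A negative reward penalizes taking that action
A zero reward neither encourages nor penalizes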

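For the grid-world example, the rewards can be designed as the following four cases:

If the agent attempts to get out of the grid boundary: $r_{bound}=-1$
If the agent attempts to enter a forbidden cell (blue): $r_{forbid}=-1$
If the agent reaches the target cell: $r_{target}=1$
Otherwise: $r = 0$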

A reward can be interpreted as a human-machine interface through which we guide the agent to behave as we expect.

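Tabular representation: use a table to list the reward of each state-action pair
Mathematical description: use conditional probabilities
$\begin{align*} &p(r = -1 | s_1, a_1) = 1 \\ &p(r \ne -1 | s_1, a_1) = 0 \end{align*}$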
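The four reward cases can be sketched as a deterministic reward function of the state and action. Several details here are assumptions: the grid is 3×3 and numbered row by row, the target is $s_9$ (the cell where the example trajectory collects $r = 1$), and the forbidden cells are placed at $s_6$ and $s_7$ purely for illustration, since their positions are not given in the text.

```python
N_ROWS, N_COLS = 3, 3
# Actions a_1..a_5 as (row offset, column offset): up, right, down, left, stay.
MOVES = {1: (-1, 0), 2: (0, 1), 3: (1, 0), 4: (0, -1), 5: (0, 0)}

TARGET = 9                     # assumed target cell s_9
FORBIDDEN = frozenset({6, 7})  # hypothetical forbidden cells, not from the text

R_BOUND, R_FORBID, R_TARGET, R_OTHER = -1, -1, 1, 0


def reward(s, a):
    """Reward for taking action a at state s, following the four cases."""
    row, col = divmod(s - 1, N_COLS)
    d_row, d_col = MOVES[a]
    new_row, new_col = row + d_row, col + d_col
    # Case 1: the agent tries to leave the grid.
    if not (0 <= new_row < N_ROWS and 0 <= new_col < N_COLS):
        return R_BOUND
    s_next = new_row * N_COLS + new_col + 1
    # Case 2: the agent tries to enter a forbidden cell.
    if s_next in FORBIDDEN:
        return R_FORBID
    # Case 3: the agent reaches (or stays in) the target cell.
    if s_next == TARGET:
        return R_TARGET
    # Case 4: everything else.
    return R_OTHER
```

With these assumptions, `reward(1, 1)` is $-1$, matching $p(r = -1 | s_1, a_1) = 1$: moving up from $s_1$ runs into the boundary.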

Trajectory and Return

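Consider the trajectory in the figure below:

[Figure: a trajectory from $s_1$ to $s_9$]

A trajectory is a state-action-reward chain:
$s_1 \xrightarrow[r=0]{a_2} s_2\xrightarrow[r=0]{a_3} s_5\xrightarrow[r=0]{a_3} s_8\xrightarrow[r=1]{a_2} s_9$
The return is the sum of all rewards collected along the trajectory:
$\text{return}=0 + 0 + 0 + 1 = 1$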
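The return of this trajectory can be checked with a few lines of code. The sketch stores each step as a (state, action, reward) tuple copied from the chain above; the tuple layout and the name `undiscounted_return` are illustrative.

```python
# Each step is (state, action, reward); the final state s_9 is implied by the last step.
trajectory = [
    (1, 2, 0),  # s_1 --a_2--> s_2, r = 0
    (2, 3, 0),  # s_2 --a_3--> s_5, r = 0
    (5, 3, 0),  # s_5 --a_3--> s_8, r = 0
    (8, 2, 1),  # s_8 --a_2--> s_9, r = 1
]


def undiscounted_return(steps):
    """Sum of all rewards collected along the trajectory."""
    return sum(r for _, _, r in steps)


total = undiscounted_return(trajectory)  # 0 + 0 + 0 + 1 = 1
```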

Discounted return

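Consider the trajectory in the figure below:

[Figure: a trajectory that stays at $s_9$ after reaching it]

It can be written as:
$s_1 \xrightarrow[]{a_2} s_2 \xrightarrow[]{a_3} s_5 \xrightarrow[]{a_3} s_8 \xrightarrow[]{a_2} s_9 \xrightarrow[]{a_5} s_9 \xrightarrow[]{a_5} s_9 ...$
The return is:
$\text{return} = 0 + 0 + 0 + 1 + 1 + 1 + ... = \infty$
Since this return diverges, we need to introduce a discount rate $\gamma \in [0, 1)$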

Discounted return:
$\begin{align*} \text{discounted return} &= 0 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 1 + \gamma^4 \cdot 1 + ... \\ & = \gamma^3(1 + \gamma + \gamma^2 + ...)\\ &=\gamma^3 \frac{1}{1 - \gamma} \end{align*}$
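The closed form $\gamma^3 \frac{1}{1 - \gamma}$ can be checked numerically by truncating the infinite reward sequence. The sketch below assumes $\gamma = 0.9$ and a cutoff of 500 steps after reaching the target; both values are arbitrary choices for illustration, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence, with t starting at 0."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


gamma = 0.9  # assumed discount rate; any value in [0, 1) works

# Truncated version of the reward sequence 0, 0, 0, 1, 1, 1, ...
rewards = [0, 0, 0] + [1] * 500

approx = discounted_return(rewards, gamma)
closed_form = gamma ** 3 / (1 - gamma)

# The truncation error is gamma^503 / (1 - gamma), which is negligible here.
print(approx, closed_form)
```

Both values come out at about 7.29 for $\gamma = 0.9$. A smaller $\gamma$ makes the agent more short-sighted, because rewards far in the future are weighted less.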