Reinforcement Learning 2. State Value and Bellman Equation


Contents

0.Outline

1.Motivating examples 

Motivating example 1: Why is return important?

Motivating example 2: How to calculate return?

2.State value

3.Bellman equation: Derivation 

Deriving the Bellman equation

An illustrative example

Exercise

4.Bellman equation: Matrix-vector form 

Matrix-vector form of the Bellman equation

Illustrative examples 

5.Bellman equation: Solve the state values 

6.Action value 

Action value

Illustrative example for action value

7.Summary


0.Outline

In this lecture:

• A core concept: state value

• A fundamental tool: the Bellman equation

1 Motivating examples

2 State value

3 Bellman equation: Derivation

4 Bellman equation: Matrix-vector form

5 Bellman equation: Solve the state values

6 Action value

7 Summary

1.Motivating examples 

Motivating example 1: Why is return important?

• What is return? The (discounted) sum of the rewards obtained along a trajectory.

• Why is return important? See the following examples.

• Question: From the starting point s1, which policy is the “best”? Which is the “worst”?

• Intuition: the first is the best and the second is the worst, because of the forbidden area.

• Math: can we use mathematics to describe such intuition?

Return can be used to evaluate policies. Consider three policies on the same grid world (the figures are omitted here):

• Based on policy 1 (left figure), starting from s1, the discounted return return1 is the discounted sum of the rewards collected along its trajectory.

• Based on policy 2 (middle figure), starting from s1, the discounted return return2 is computed the same way.

• Policy 3 (right figure) is stochastic! Starting from s1, two trajectories are possible, so return3 is the probability-weighted average of the discounted returns of the two trajectories.

That is, the return is the discounted sum of the rewards obtained under each policy; it is used to evaluate how good different policies are.

In summary, starting from s1,

return1 > return3 > return2

The above inequality suggests that the first policy is the best and the second policy is the worst, which is exactly the same as our intuition.

Calculating the return is important for evaluating a policy.
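As a concrete instance of such a calculation (the reward values are assumed here for illustration, since the original figures are omitted): if a trajectory yields reward 0 on the first step and reward 1 on every step afterwards, then

$$\text{return} = 0 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = \gamma\,(1 + \gamma + \gamma^2 + \cdots) = \frac{\gamma}{1 - \gamma}$$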

Motivating example 2: How to calculate return?

While return is important, how do we calculate it?

Method 1: by definition. Let $v_i$ denote the return obtained starting from $s_i$ ($i = 1, 2, 3, 4$); each $v_i$ is an infinite discounted sum of the rewards collected along the trajectory from $s_i$.

Method 2: notice that the return from one state equals the immediate reward plus $\gamma$ times the return from the next state (for example, $v_1 = r_1 + \gamma v_2$ if the trajectory from s1 goes to s2; the concrete transitions are in the omitted figure). The returns rely on each other. Bootstrapping!

How to solve these equations? Write them in the following matrix-vector form:

$$v = r + \gamma P v$$

where $v = [v_1, \dots, v_4]^\top$ stacks the returns, $r$ stacks the immediate rewards, and $P$ is the state transition matrix.

This is the Bellman equation (for this specific deterministic problem)!!

• Though simple, it demonstrates the core idea: the value of one state relies on the values of other states.

• The matrix-vector form makes it clearer how to solve for the state values.

Exercise: Consider the policy shown in the figure (omitted). Please write out the relations among the returns (that is, write out the Bellman equation).

Answer:

v1 = 0 + γv3

v2 = 1 + γv4

v3 = 1 + γv4

v4 = 1 + γv4

Exercise: How to solve them? Solve from the last equation to the first: $v_4 = 1 + \gamma v_4$ gives $v_4 = \frac{1}{1-\gamma}$; then $v_3 = 1 + \gamma v_4 = \frac{1}{1-\gamma}$, $v_2 = 1 + \gamma v_4 = \frac{1}{1-\gamma}$, and finally $v_1 = \gamma v_3 = \frac{\gamma}{1-\gamma}$.
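To sanity-check this numerically, here is a minimal sketch (the vector r and matrix P below are transcribed from the four equations above; γ = 0.9 is an arbitrary choice):

```python
import numpy as np

gamma = 0.9  # discount rate, any value in (0, 1)

# v = r + gamma * P v, transcribed from:
# v1 = 0 + gamma*v3, v2 = 1 + gamma*v4, v3 = 1 + gamma*v4, v4 = 1 + gamma*v4
r = np.array([0.0, 1.0, 1.0, 1.0])
P = np.array([[0.0, 0.0, 1.0, 0.0],   # s1 -> s3
              [0.0, 0.0, 0.0, 1.0],   # s2 -> s4
              [0.0, 0.0, 0.0, 1.0],   # s3 -> s4
              [0.0, 0.0, 0.0, 1.0]])  # s4 -> s4

# Solve (I - gamma*P) v = r rather than forming the inverse explicitly
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)  # [9. 10. 10. 10.]: v1 = gamma/(1-gamma), the rest 1/(1-gamma)
```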

2.State value

Consider the following single-step process:

$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1}$$

Note that $S_t$, $A_t$, $R_{t+1}$ are all random variables.

This step is governed by the following probability distributions:

• $S_t \to A_t$ is governed by $\pi(A_t = a \mid S_t = s)$ (determined by the policy)

• $S_t, A_t \to R_{t+1}$ is governed by $p(R_{t+1} = r \mid S_t = s, A_t = a)$ (determined by the reward probability)

• $S_t, A_t \to S_{t+1}$ is governed by $p(S_{t+1} = s' \mid S_t = s, A_t = a)$ (determined by the state transition probability)

At this moment, we assume we know the model (i.e., the probability distributions)!

Consider the following multi-step trajectory:

$$S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, \dots$$

The discounted return is

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$

• $\gamma \in (0, 1)$ is a discount rate.

• $G_t$ is also a random variable since $R_{t+1}, R_{t+2}, \dots$ are random variables.

The expectation (also called the expected value or mean) of $G_t$ is defined as the state-value function, or simply the state value:

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$$

Remarks:

• It is a function of s. It is a conditional expectation with the condition that the state starts from s.

• It is based on the policy π. For a different policy, the state value may be different.

Q: What is the relationship between return and state value?

A: The state value is the mean of all possible returns that can be obtained starting from a state. If everything (policy, reward, and state transition) is deterministic, only one return is possible from each state, and the state value equals that return.

3.Bellman equation: Derivation 

• While state value is important, how do we calculate it? The answer lies in the Bellman equation.

• In a word, the Bellman equation describes the relationship among the values of all states.

• Next, we derive the Bellman equation.

• There is some math.

• We already have the intuition.

Deriving the Bellman equation

Consider a random trajectory:

$$S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, \dots$$

The return $G_t$ can be written as

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) = R_{t+1} + \gamma G_{t+1}$$

Then, it follows from the definition of the state value that

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s]$$

Next, calculate the two terms, respectively.

First, calculate the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$:

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$$

The first equality is the law of total probability: the expectation of $R_{t+1}$ in state $s$ equals the probability of taking each action $a$ in $s$ times the expectation of $R_{t+1}$ given $(s, a)$, summed over actions. The second equality expands the inner expectation: the probability of obtaining reward $r$ given $(s, a)$ times the value $r$, summed over rewards.

Note that

• This is the mean of the immediate rewards.

Second, calculate the second term $\mathbb{E}[G_{t+1} \mid S_t = s]$:

$$\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s)$$

Note that

• This is the mean of the future rewards.

• $\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$ due to the memoryless Markov property.

Therefore, we have

$$v_\pi(s) = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right], \quad \forall s \in \mathcal{S}.$$

Highlights:

• The above equation is called the Bellman equation, which characterizes the relationship among the state-value functions of different states.

• It consists of two terms: the immediate reward term and the future reward term.

• A set of equations: every state has an equation like this!!! 

Highlights: symbols in this equation

• vπ(s) and vπ(s′) are state values to be calculated. Bootstrapping!

• π(a|s) is a given policy. Solving the equation is called policy evaluation.

• p(r|s, a) and p(s′|s, a) represent the dynamic model. What if the model is known or unknown?
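To make the elementwise form concrete, here is a small sketch of evaluating the right-hand side of the Bellman equation for one state when the model is known. The dictionary-based structures (policy, rewards, transitions) are hypothetical conveniences of this sketch, not notation from the lecture:

```python
from typing import Dict, List, Tuple

def bellman_rhs(s: str,
                v: Dict[str, float],                                          # current state values
                policy: Dict[str, Dict[str, float]],                          # policy[s][a] = pi(a|s)
                rewards: Dict[Tuple[str, str], List[Tuple[float, float]]],    # (s,a) -> [(r, p(r|s,a))]
                transitions: Dict[Tuple[str, str], List[Tuple[str, float]]],  # (s,a) -> [(s', p(s'|s,a))]
                gamma: float) -> float:
    """sum_a pi(a|s) [ sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s') ]"""
    total = 0.0
    for a, pi_a in policy[s].items():
        immediate = sum(p * r for r, p in rewards[(s, a)])                # mean immediate reward
        future = sum(p * v[s_next] for s_next, p in transitions[(s, a)])  # mean future value
        total += pi_a * (immediate + gamma * future)
    return total
```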

An illustrative example

Write out the Bellman equation according to the general expression above for the policy shown in the figure (omitted). This example is simple because the policy is deterministic.

First, consider the state value of s1:

• $\pi(a = a_3 \mid s_1) = 1$ and $\pi(a \neq a_3 \mid s_1) = 0$.

• $p(s' = s_3 \mid s_1, a_3) = 1$ and $p(s' \neq s_3 \mid s_1, a_3) = 0$.

• $p(r = 0 \mid s_1, a_3) = 1$ and $p(r \neq 0 \mid s_1, a_3) = 0$.

Substituting them into the Bellman equation gives

$$v_\pi(s_1) = 0 + \gamma\, v_\pi(s_3)$$

Similarly, writing out the Bellman equation for the other states (which mirror the earlier exercise), it can be obtained that

$$v_\pi(s_2) = 1 + \gamma\, v_\pi(s_4), \quad v_\pi(s_3) = 1 + \gamma\, v_\pi(s_4), \quad v_\pi(s_4) = 1 + \gamma\, v_\pi(s_4).$$

Solving from the last equation to the first gives $v_\pi(s_4) = v_\pi(s_3) = v_\pi(s_2) = \frac{1}{1-\gamma}$ and $v_\pi(s_1) = \frac{\gamma}{1-\gamma}$.

What do we do after calculating the state values? Be patient: we will use them to calculate action values and improve the policy.

Exercise

Consider the policy shown in the figure (omitted):

• Write out the Bellman equations for each state.

• Solve the state values from the Bellman equations.

• Compare with the policy in the last example.

Solve the above equations one by one from the last to the first, then substitute $\gamma = 0.9$ to obtain the numerical state values (the concrete equations and numbers are in the omitted figure). Comparing with the previous policy, this one is worse: its state values are smaller.

4.Bellman equation: Matrix-vector form 

Matrix-vector form of the Bellman equation

Why consider the matrix-vector form? Because we need to solve the state values from it!

• One unknown relies on another unknown. How to solve the unknowns?

• Elementwise form: the Bellman equation above is valid for every state s ∈ S. That means there are |S| equations like it!

• Matrix-vector form: If we put all the equations together, we have a set of linear equations, which can be concisely written in a matrix-vector form. The matrix-vector form is very elegant and important.

Recall that

$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right]$$

Rewrite the Bellman equation as

$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s' \mid s)\, v_\pi(s')$$

where

$$r_\pi(s) \doteq \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r, \qquad p_\pi(s' \mid s) \doteq \sum_a \pi(a \mid s)\, p(s' \mid s, a)$$

Suppose the states are indexed as $s_i$ ($i = 1, \dots, n$). For state $s_i$, the Bellman equation is

$$v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{j} p_\pi(s_j \mid s_i)\, v_\pi(s_j)$$

Putting these equations for all states together yields the matrix-vector form

$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$

where $v_\pi = [v_\pi(s_1), \dots, v_\pi(s_n)]^\top$, $r_\pi = [r_\pi(s_1), \dots, r_\pi(s_n)]^\top$, and $[P_\pi]_{ij} = p_\pi(s_j \mid s_i)$ is the state transition matrix.

Illustrative examples

(The original slides show $r_\pi$ and $P_\pi$ written out for specific grid-world policies; those figures are omitted here.)
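As a stand-in for the omitted figures, the deterministic policy from the earlier exercise gives the following matrix-vector form; the entries below are transcribed from its four equations, not from the omitted figures:

$$\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix} + \gamma \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}$$

Every row of $P_\pi$ sums to 1; with a deterministic policy and deterministic transitions, each row contains a single 1.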

5.Bellman equation: Solve the state values 

Why solve state values?

• Given a policy, finding the corresponding state values is called policy evaluation! It is a fundamental problem in RL and the foundation for finding better policies.

• It is important to understand how to solve the Bellman equation.

The Bellman equation in matrix-vector form is

$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$

• The closed-form solution is

$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$$

In practice, we still need to use numerical tools to calculate the matrix inverse.

Can we avoid the matrix inverse operation? Yes, by iterative algorithms. 

• An iterative solution is

$$v_{k+1} = r_\pi + \gamma P_\pi v_k, \qquad k = 0, 1, 2, \dots$$

This algorithm generates a sequence $\{v_0, v_1, v_2, \dots\}$, where $v_0$ is an arbitrary initial guess. We can show that

$$v_k \to v_\pi = (I - \gamma P_\pi)^{-1} r_\pi \quad \text{as } k \to \infty.$$
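A minimal sketch of this iteration in code (reusing $r_\pi$ and $P_\pi$ from the 4-state example above; the stopping tolerance and zero initialization are my choices):

```python
import numpy as np

def iterative_policy_evaluation(r_pi: np.ndarray, P_pi: np.ndarray,
                                gamma: float, tol: float = 1e-10) -> np.ndarray:
    """Iterate v_{k+1} = r_pi + gamma * P_pi @ v_k until the update is below tol."""
    v = np.zeros_like(r_pi)                 # arbitrary initial guess v_0
    while True:
        v_next = r_pi + gamma * P_pi @ v    # one iteration of the Bellman equation
        if np.max(np.abs(v_next - v)) < tol:
            return v_next
        v = v_next

r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([[0.0, 0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0],
                 [0.0, 0.0, 0.0, 1.0],
                 [0.0, 0.0, 0.0, 1.0]])
print(iterative_policy_evaluation(r_pi, P_pi, gamma=0.9))
# converges to [9, 10, 10, 10], matching the closed-form solution
```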

The following are two “good” policies and their state values (figures omitted). The two policies differ at the top two states in the fourth column.

The following are two “bad” policies and their state values (figures omitted). Their state values are smaller than those of the good policies.

6.Action value 

Action value

From state value to action value:

• State value: the average return the agent can get starting from a state.

• Action value: the average return the agent can get starting from a state and taking an action.

Why do we care about action values? Because we want to know which action is better. This point will become clearer in the following lectures. We will use action values frequently.

Definition:

$$q_\pi(s, a) \doteq \mathbb{E}[G_t \mid S_t = s, A_t = a]$$

It follows from the properties of conditional expectation that

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a) \qquad (2)$$

Hence, comparing (2) with the Bellman equation for $v_\pi(s)$,

$$q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \qquad (4)$$

(2) and (4) are the two sides of the same coin:

• (2) shows how to obtain state values from action values.

• (4) shows how to obtain action values from state values. 
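Both directions in code, continuing the hypothetical dictionary-based model structures from the earlier sketch:

```python
from typing import Dict, List, Tuple

def action_value(s: str, a: str,
                 v: Dict[str, float],
                 rewards: Dict[Tuple[str, str], List[Tuple[float, float]]],
                 transitions: Dict[Tuple[str, str], List[Tuple[str, float]]],
                 gamma: float) -> float:
    """Equation (4): q(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v(s')."""
    immediate = sum(p * r for r, p in rewards[(s, a)])
    future = sum(p * v[s_next] for s_next, p in transitions[(s, a)])
    return immediate + gamma * future

def state_value_from_q(s: str,
                       policy: Dict[str, Dict[str, float]],
                       q: Dict[Tuple[str, str], float]) -> float:
    """Equation (2): v(s) = sum_a pi(a|s) q(s,a)."""
    return sum(pi_a * q[(s, a)] for a, pi_a in policy[s].items())
```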

Illustrative example for action value

Write out the action values for state s1. (The grid-world figure, the resulting expressions, and the follow-up questions are omitted here.)

Highlights:

• Action value is important since we care about which action to take.

• We can first calculate all the state values and then calculate the action values.

• We can also directly calculate the action values with or without models. 

7.Summary

Key concepts and results:

• State value: $v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$.

• Action value: $q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$.

• The Bellman equation describes the relationships among state values; it has an elementwise form and a matrix-vector form $v_\pi = r_\pi + \gamma P_\pi v_\pi$.

• Solving the Bellman equation for a given policy (in closed form or iteratively) is policy evaluation, the foundation for finding better policies.
