Reinforcement Learning 1: Fundamental Concepts


1. Understanding the basics through a grid world

A robot moves around a grid world consisting of four kinds of elements: accessible cells, forbidden cells, target cells, and the boundary.

The task: find a "good" way to reach the target.

What counts as "good"? Reaching the target along the shortest path, without hitting the boundary or entering forbidden cells.
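To make the setup concrete, the following is a minimal Python sketch of such a grid world. The 3×3 size matches the nine states introduced below; which cells are forbidden and which cell is the target are assumptions made only for illustration.

```python
# A 3x3 grid-world sketch. Which cells are forbidden and which is the
# target are illustrative assumptions, not taken from the lecture.
STATES = list(range(1, 10))   # nine states s1..s9, one per cell
FORBIDDEN = {6, 7}            # assumed forbidden cells
TARGET = 9                    # assumed target cell

def cell_type(s):
    """Classify state s as 'forbidden', 'target', or 'accessible'."""
    if s in FORBIDDEN:
        return "forbidden"
    return "target" if s == TARGET else "accessible"

for s in STATES:
    print(f"s{s}: {cell_type(s)}")
```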

2. State

State: The status of the agent with respect to the environment.

In the grid world, the robot has nine states, one per cell: s1, …, s9.

State space: the set of all states 

For a walking biped robot, the states could be divided into four: hip raised, knee bent, hip lowered, and knee straight.

3. Action

Action: For each state, there are five possible actions: a1, …, a5

• a1: move upward;

• a2: move rightward;

• a3: move downward;

• a4: move leftward;

• a5: stay unchanged;

 Action space of a state: the set of all possible actions of a state.

For the biped robot, in the hip-raised state there are two possible actions: bend the knee or lower the hip.

Question: can different states have different sets of actions? 
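The answer is yes: the action space may depend on the state, which is why the MDP section later writes A(s). A tiny sketch using the biped example; only the two actions at the hip-raised state come from the text, the rest are assumptions for illustration.

```python
# Action spaces A(s) can differ across states.
A = {
    "hip_raised":    ["bend_knee", "lower_hip"],   # two actions, as in the text
    "knee_bent":     ["lower_hip"],                # assumed for illustration
    "hip_lowered":   ["straighten_knee"],          # assumed for illustration
    "knee_straight": ["raise_hip"],                # assumed for illustration
}

for s, actions in A.items():
    print(f"A({s}) = {actions}")
```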

4. State transition

When taking an action, the agent may move from one state to another. Such a process is called state transition.

Example: At state s1, if we choose action a2 (move rightward), the next state is s2.

Example: At state s1 (a top-row cell), if we choose action a1 (move upward), the agent is bounced back, so the next state is still s1.

State transition describes the interaction with the environment.

Question: can we define the state transition in other ways? Compare simulation with physics:

In simulation, we can define the state transition however we like.

In the physical world, we cannot: the transition is dictated by the real dynamics.

Pay attention to forbidden areas. Example: at state s5, if we choose action a2, then what is the next state?

Case 1: the forbidden area is accessible, but entering it incurs a penalty. Then the next state is the forbidden cell.

Case 2: the forbidden area is inaccessible (e.g., surrounded by a wall). Then the agent is bounced back and the next state is still s5.

We consider the first case, which is more general and challenging.

Tabular representation: we can use a table to describe the state transition, with one row per state and one column per action; each entry gives the next state. A table can only represent deterministic cases.

State transition probability: use probability to describe state transition!

Intuition: At state s1, if we choose action a2, the next state is s2.

Math: p(s2 | s1, a2) = 1 and p(si | s1, a2) = 0 for all i ≠ 2.

Here it is a deterministic case. The state transition could also be stochastic (for example, under wind gusts).
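As a concrete illustration, a deterministic state-transition function for the grid world might look like the sketch below. It assumes the nine states are numbered row by row with s1 in the top-left corner, and encodes only the boundary-bounce rule from the examples above.

```python
# Deterministic state transition on an assumed 3x3 grid, states 1..9
# numbered row by row (s1 top-left). Attempting to cross the boundary
# leaves the state unchanged, as in the examples above.
MOVES = {"a1": (-1, 0),   # move upward
         "a2": (0, 1),    # move rightward
         "a3": (1, 0),    # move downward
         "a4": (0, -1),   # move leftward
         "a5": (0, 0)}    # stay unchanged

def next_state(s, a):
    """Return the next state for state s in 1..9 and action a."""
    row, col = divmod(s - 1, 3)
    dr, dc = MOVES[a]
    r2, c2 = row + dr, col + dc
    if not (0 <= r2 < 3 and 0 <= c2 < 3):  # would leave the boundary:
        return s                            # bounced back
    return r2 * 3 + c2 + 1

print(next_state(1, "a2"))  # 2: s1 + move rightward -> s2
print(next_state(1, "a1"))  # 1: s1 + move upward -> bounced back to s1
```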

5. Policy

Policy tells the agent what actions to take at a state.

That is, it tells the agent which action to take at each state. For instance, in the biped-robot example, the policy says that after raising the hip the agent should bend the knee.

Intuitive representation: We use arrows to describe a policy.

Based on this policy, we get the following trajectories with different starting points.

A policy prescribes how the agent moves at every state. For example, for the biped robot in the hip-raised state, the next action should be bending the knee rather than straightening it.

Mathematical representation: using conditional probability. For example, for state s1: π(a1|s1) = 0, π(a2|s1) = 1, π(a3|s1) = 0, π(a4|s1) = 0, π(a5|s1) = 0.

It is a deterministic policy: every probability above is either 0 or 1.

There are stochastic policies. For example:

In this policy, for s1: π(a2|s1) = 0.5 and π(a3|s1) = 0.5, with π(a|s1) = 0 for the other actions; the agent moves rightward or downward with equal probability.

Tabular representation of a policy: one row per state and one column per action; each entry is the probability π(a|s) of choosing that action at that state.

This table can represent either deterministic or stochastic cases.
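A minimal sketch of such a tabular policy and how to use it: sampling an action from π(a|s). The 0.5/0.5 row for s1 mirrors the stochastic example above; the deterministic row is an illustrative assumption.

```python
import random

# Tabular policy: pi[s][a] = pi(a|s); each row sums to 1. A deterministic
# policy is the special case where one entry is 1 and the rest are 0.
pi = {
    1: {"a2": 0.5, "a3": 0.5},  # s1: move rightward or downward, equally likely
    2: {"a3": 1.0},             # s2: deterministic row (assumed for illustration)
}

def sample_action(s):
    """Draw an action a with probability pi(a|s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(1))  # 'a2' or 'a3', each with probability 0.5
```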

6. Reward

Reward is one of the most unique concepts of RL.

Reward: a real number we get after taking an action.

• A positive reward represents encouragement to take such actions.

• A negative reward represents punishment to take such actions.

Note: the reward is attached to the action just taken at the current state.

Questions:

• What about a zero reward? It means no punishment (and no encouragement).

• Can a positive reward mean punishment? Yes: rewards are relative, so a small positive reward can still act as punishment when other actions earn larger ones.

In the grid-world example, the rewards are designed as follows: 

• If the agent attempts to get out of the boundary, let r_bound = −1.

• If the agent attempts to enter a forbidden cell, let r_forbid = −1.

• If the agent reaches the target cell, let r_target = +1.

• Otherwise, the agent gets a reward of r = 0.

Reward can be interpreted as a human-machine interface, with which we can guide the agent to behave as we expect.

For example, with the above designed rewards, the agent will try to avoid getting out of the boundary or stepping into the forbidden cells.

Tabular representation of reward transition: one row per state and one column per action; each entry is the reward obtained for that state-action pair.

Can only represent deterministic cases. 

Mathematical description: conditional probability        

• Intuition: At state s1, if we choose action a1, the reward is −1.

• Math: p(r = −1 | s1, a1) = 1 and p(r ≠ −1 | s1, a1) = 0.

Remarks:

Here it is a deterministic case. The reward transition could also be stochastic. For example, if you study hard you will get a reward, but how much is uncertain.
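A sketch of the reward distribution p(r|s, a): a deterministic reward puts all probability mass on a single value, while a stochastic one spreads it over several. The (s1, a1) entry matches the example above; the "study hard" entry is an illustrative assumption.

```python
import random

# p(r|s,a) as a table mapping (state, action) -> {reward: probability}.
reward_dist = {
    (1, "a1"): {-1: 1.0},                 # p(r=-1|s1,a1) = 1, as above
    ("study", "hard"): {1: 0.5, 2: 0.5},  # reward amount uncertain (assumed)
}

def sample_reward(s, a):
    """Draw a reward r with probability p(r|s,a)."""
    rewards, probs = zip(*reward_dist[(s, a)].items())
    return random.choices(rewards, weights=probs, k=1)[0]

print(sample_reward(1, "a1"))          # always -1
print(sample_reward("study", "hard"))  # 1 or 2, equally likely
```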

7. Trajectory and return

A trajectory is a state-action-reward chain, for example (one possible path in the grid): s1 →(a2, r=0) s2 →(a3, r=0) s5 →(a3, r=0) s8 →(a2, r=1) s9.

The return of this trajectory is the sum of all the rewards collected along it (a small computational sketch follows at the end of this section):

                return = 0 + 0 + 0 + 1 = 1

A different policy gives a different trajectory, for example one that cuts through a forbidden cell:

The return of this path is: return = 0 − 1 + 0 + 1 = 0

Which policy is better?

• Intuition: the first is better, because it avoids the forbidden areas.

• Mathematics: the first one is better, since it has a greater return!

• Return could be used to evaluate whether a policy is good or not (see details in the next lecture)! 
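The computational sketch promised above: the return is just the sum of the rewards along the trajectory, so comparing the two policies is one line each.

```python
def trajectory_return(rewards):
    """Return = sum of all rewards collected along the trajectory."""
    return sum(rewards)

print(trajectory_return([0, 0, 0, 1]))   # first policy  -> 1 (better)
print(trajectory_return([0, -1, 0, 1]))  # second policy -> 0
```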

8. Discounted return

A trajectory may be infinite, e.g., when the agent keeps acting after reaching the target and collects r = +1 at every subsequent step:

The return is

return = 0 + 0 + 0 + 1 + 1 + 1 + … = ∞

The definition is invalid since the return diverges!

Need to introduce a discount rate γ ∈ [0, 1). For the trajectory above, the discounted return is

discounted return = 0 + γ·0 + γ²·0 + γ³·1 + γ⁴·1 + γ⁵·1 + … = γ³(1 + γ + γ² + …) = γ³ · 1/(1 − γ)

Roles: 1) the sum becomes finite; 2) it balances the near-future and far-future rewards:

• If γ is close to 0, the value of the discounted return is dominated by the rewards obtained in the near future.

• If γ is close to 1, the value of the discounted return is dominated by the rewards obtained in the far future. (A numerical check follows below.)
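The numerical check mentioned above: truncating the infinite trajectory after many steps, the discounted return with rewards 0, 0, 0, 1, 1, 1, … approaches the closed form γ³/(1 − γ).

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a (truncated) trajectory."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

gamma = 0.9
rewards = [0, 0, 0] + [1] * 1000          # truncate the infinite tail of +1 rewards
print(discounted_return(rewards, gamma))  # ~7.29
print(gamma**3 / (1 - gamma))             # closed form gamma^3/(1-gamma) = 7.29
```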

9. Episode

When interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).

Example: a trajectory that stops once the agent reaches the target is an episode.

An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks. 

Some tasks may have no terminal states, meaning the interaction with the environment will never end. Such tasks are called continuing tasks.

In the grid-world example, should we stop after arriving at the target?

In fact, we can treat episodic and continuing tasks in a unified mathematical way by converting episodic tasks to continuing tasks.

• Option 1: Treat the target state as a special absorbing state. Once the agent reaches an absorbing state, it never leaves, and all subsequent rewards are r = 0.

• Option 2: Treat the target state as a normal state with a policy. The agent can still leave the target state and gains r = +1 whenever it enters the target state. If the policy is good, the agent will stay at the target; if the policy is poor, it may jump out again. (See the sketch below.)

We consider option 2 in this course so that we don’t need to distinguish the target state from the others and can treat it as a normal state.
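A sketch contrasting the two options, reusing the assumed 3×3 grid with s9 as the target; only the behavior at the target differs.

```python
# Option 1 vs option 2 for the target state (s9 and the 3x3 layout are
# illustrative assumptions, as before).
MOVES = {"a1": (-1, 0), "a2": (0, 1), "a3": (1, 0), "a4": (0, -1), "a5": (0, 0)}

def move(s, a):
    """Deterministic 3x3 grid move with boundary bounce."""
    row, col = divmod(s - 1, 3)
    dr, dc = MOVES[a]
    r2, c2 = row + dr, col + dc
    return s if not (0 <= r2 < 3 and 0 <= c2 < 3) else r2 * 3 + c2 + 1

def step_option1(s, a):
    """Option 1: absorbing target -- once at s9, never leave, r = 0."""
    if s == 9:
        return 9, 0
    s2 = move(s, a)
    return s2, (1 if s2 == 9 else 0)

def step_option2(s, a):
    """Option 2: s9 is a normal state; entering it yields r = +1."""
    s2 = move(s, a)
    return s2, (1 if s2 == 9 else 0)

print(step_option1(9, "a4"))  # (9, 0): absorbed, the agent stays put
print(step_option2(9, "a4"))  # (8, 0): the agent may leave the target
```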

10. Markov decision process (MDP)

Key elements of MDP:

• Sets:

        • State: the set of states S

        • Action: the set of actions A(s) associated with each state s ∈ S.

        • Reward: the set of rewards R(s, a).

• Probability distribution (or called system model):

        • State transition probability: at state s, taking action a, the probability of transitioning to state s′ is p(s′|s, a).

        • Reward probability: at state s, taking action a, the probability of getting reward r is p(r|s, a).

• Policy: at state s, the probability to choose action a is π(a|s) 

• Markov property: the memoryless property, i.e., the next state and reward depend only on the current state and action: p(s_{t+1} | a_{t+1}, s_t, a_t, …, s_0, a_0) = p(s_{t+1} | a_{t+1}, s_t), and similarly p(r_{t+1} | a_{t+1}, s_t, …) = p(r_{t+1} | a_{t+1}, s_t).

All the concepts introduced in this lecture can be put into the framework of MDPs.

The grid world could be abstracted as a more general model, a Markov process: once the policy is fixed, the MDP reduces to a Markov process.

The circles represent states and the links with arrows represent the state transition.
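Pulling the key elements together, a minimal sketch of interacting with an MDP: given p(s′|s, a), p(r|s, a), and π(a|s), the agent generates a state-action-reward chain. The tiny two-state model is an illustrative assumption.

```python
import random

# Key MDP elements as tables: p(s'|s,a), p(r|s,a), and pi(a|s).
P  = {(1, "a2"): {2: 1.0}, (2, "a5"): {2: 1.0}}  # state transition probabilities
R  = {(1, "a2"): {0: 1.0}, (2, "a5"): {1: 1.0}}  # reward probabilities
pi = {1: {"a2": 1.0}, 2: {"a5": 1.0}}            # policy

def draw(dist):
    """Sample a key of `dist` with the probability given by its value."""
    keys, probs = zip(*dist.items())
    return random.choices(keys, weights=probs, k=1)[0]

def rollout(s, steps):
    """Generate a state-action-reward chain of the given length."""
    chain = []
    for _ in range(steps):
        a = draw(pi[s])        # choose an action from the policy
        r = draw(R[(s, a)])    # reward depends on (s, a)
        chain.append((s, a, r))
        s = draw(P[(s, a)])    # move to the next state
    return chain

print(rollout(1, 3))  # e.g. [(1, 'a2', 0), (2, 'a5', 1), (2, 'a5', 1)]
```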

11. Summary

By using grid-world examples, we demonstrated the following key concepts:

• State

• Action

• State transition, state transition probability p(s′|s, a)

• Reward, reward probability p(r|s, a)

• Trajectory, episode, return, discounted return

• Markov decision process 

