Reinforcement-Learning

news2025/1/11 0:34:02

Contents

  • Reinforcement-Learning
        • 1. Overview of RL method categories
        • 2. Q-Learning
        • 3. SARSA
        • 4. SARSA(λ)

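1. Overview of RL method categories

(1) Model-free RL (does not model the environment): the agent makes no attempt to understand the environment and takes whatever it returns as given. It can only proceed step by step, waiting for feedback from the real world and choosing its next action based on that feedback.
Model-based RL (models the environment): the agent learns a model that simulates the environment. It can imagine what may happen next, pick the best of those imagined outcomes, and choose its next step accordingly.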

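(2) Policy-based RL: the agent analyzes its observation of the environment and directly outputs a probability for each possible next action, then samples an action from those probabilities. Every action can therefore be selected, only with different likelihood. For continuous actions, it picks a specific action from a probability distribution.

Value-based RL: the agent analyzes its observation and outputs a value for every action, then picks the action with the highest value. This does not carry over directly to continuous actions, because there is no finite set of actions to take the maximum over.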
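The difference between the two families can be sketched in a few lines. This is a minimal illustration, not code from the maze example: the probability and value vectors are made-up numbers for one state with four discrete actions, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outputs for one state with four discrete actions.
action_probs = np.array([0.1, 0.6, 0.2, 0.1])    # policy-based: probabilities
action_values = np.array([0.3, 1.2, -0.5, 0.9])  # value-based: estimated values


def policy_based_action(probs, rng):
    """Sample an action index in proportion to its probability."""
    return int(rng.choice(len(probs), p=probs))


def value_based_action(values):
    """Pick the action index with the highest estimated value."""
    return int(np.argmax(values))


sampled = [policy_based_action(action_probs, rng) for _ in range(1000)]
greedy = value_based_action(action_values)
print("distinct sampled actions:", sorted(set(sampled)))
print("greedy action:", greedy)
```

Sampling visits several actions over many draws, while the value-based choice is always the same action for the same values.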

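(3) Monte-Carlo update (per-episode update): think of reinforcement learning as playing a game. After the game starts, the agent waits until it ends, then reviews the whole episode and updates its behavior.

Temporal-Difference update (per-step update): the agent updates at every step while the game is in progress, without waiting for it to end, so it learns as it plays.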
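As a rough sketch of what each kind of update needs, the two helpers below compute a Monte-Carlo return (which needs the full reward sequence of a finished episode) and a one-step TD target (which needs a single transition plus a value estimate). The function names, the reward list and the value 0.5 are assumptions made for illustration.

```python
def monte_carlo_return(rewards, gamma):
    """Discounted return of a finished episode, computed after it ends."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, next_value, gamma):
    """One-step target, available right after a single transition."""
    return reward + gamma * next_value


episode_rewards = [0, 0, 1]  # reward arrives only at the end of the episode
mc = monte_carlo_return(episode_rewards, gamma=0.9)
td = td_target(reward=0, next_value=0.5, gamma=0.9)
print(mc, td)
```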

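(4) On-policy learning: the agent must be present and must learn from its own play.

Off-policy learning: the agent can play itself or watch others play and learn their behavior from watching. It still learns from past experience, but that experience need not be its own.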

2. Q-Learning



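Note: although max Q(s2) is used to estimate the value of the next state s2, no action has yet been taken in s2. The decision in s2 is made separately, after the update is complete, by running the action-selection step again.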
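For reference, the tabular Q-learning update is usually written as below. This is the textbook form, given as a sketch: s' and a' denote the next state and a candidate next action, r is the reward, and α and γ are the learning rate and discount factor described in this section.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

The bracketed term is the difference between the target r + γ max Q(s', a') and the current estimate Q(s, a).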




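ϵ-greedy is a strategy used for action selection. With ϵ = 0.9, for example, the agent picks the best action according to the Q table 90% of the time and a random action the remaining 10% of the time. (In this convention ϵ is the probability of acting greedily, which is how the code below uses it.)

α is the learning rate: it determines how much of the current error is learned, with α < 1.

γ is the discount factor applied to future rewards.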
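The three parameters can be seen at work in a small standalone sketch. It assumes the convention used in this post (ϵ is the probability of the greedy choice); the function names and the numbers are hypothetical and independent of the maze code.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Act greedily with probability epsilon, uniformly at random otherwise."""
    if rng.uniform() < epsilon:
        best = np.flatnonzero(q_values == np.max(q_values))  # ties broken at random
        return int(rng.choice(best))
    return int(rng.integers(len(q_values)))


def q_learning_update(q_sa, reward, max_q_next, alpha, gamma):
    """Move Q(s, a) a fraction alpha of the way towards the one-step target."""
    target = reward + gamma * max_q_next
    return q_sa + alpha * (target - q_sa)


rng = np.random.default_rng(0)
action = epsilon_greedy(np.array([0.0, 2.0, 1.0]), epsilon=1.0, rng=rng)
new_q = q_learning_update(q_sa=0.0, reward=0.0, max_q_next=1.0, alpha=0.1, gamma=0.9)
print(action, new_q)
```

With ϵ = 1.0 the choice is always greedy, and with α = 0.1 the estimate moves one tenth of the way from 0 towards the target 0.9.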

"""
Reinforcement learning maze example.
Red rectangle:          explorer.
Black rectangles:       hells       [reward = -1].
Yellow bin circle:      paradise    [reward = +1].
All other states:       ground      [reward = 0].
This script is the environment part of this example. The RL is in RL_brain.py.
"""


import numpy as np
import time
import sys
if sys.version_info.major == 2:
    import Tkinter as tk
else:
    import tkinter as tk


UNIT = 40   # pixels
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width


class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        self.action_space = ['u', 'd', 'r', 'l']  # index order matches step(): 0 up, 1 down, 2 right, 3 left
        self.n_actions = len(self.action_space)
        self.title('maze')
        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))  # geometry is width x height
        self._build_maze()

    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                           height=MAZE_H * UNIT,
                           width=MAZE_W * UNIT)

        # create grids
        for c in range(0, MAZE_W * UNIT, UNIT):
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(0, MAZE_H * UNIT, UNIT):
            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)

        # create origin
        origin = np.array([20, 20])

        # hell
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - 15, hell1_center[1] - 15,
            hell1_center[0] + 15, hell1_center[1] + 15,
            fill='black')
        # hell
        hell2_center = origin + np.array([UNIT, UNIT * 2])
        self.hell2 = self.canvas.create_rectangle(
            hell2_center[0] - 15, hell2_center[1] - 15,
            hell2_center[0] + 15, hell2_center[1] + 15,
            fill='black')

        # create oval
        oval_center = origin + UNIT * 2
        self.oval = self.canvas.create_oval(
            oval_center[0] - 15, oval_center[1] - 15,
            oval_center[0] + 15, oval_center[1] + 15,
            fill='yellow')

        # create red rect
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')

        # pack all
        self.canvas.pack()

    def reset(self):
        self.update()
        time.sleep(0.5)
        self.canvas.delete(self.rect)
        origin = np.array([20, 20])
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')
        # return observation
        return self.canvas.coords(self.rect)

    def step(self, action):
        s = self.canvas.coords(self.rect)
        base_action = np.array([0, 0])
        if action == 0:   # up
            if s[1] > UNIT:
                base_action[1] -= UNIT
        elif action == 1:   # down
            if s[1] < (MAZE_H - 1) * UNIT:
                base_action[1] += UNIT
        elif action == 2:   # right
            if s[0] < (MAZE_W - 1) * UNIT:
                base_action[0] += UNIT
        elif action == 3:   # left
            if s[0] > UNIT:
                base_action[0] -= UNIT

        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent

        s_ = self.canvas.coords(self.rect)  # next state

        # reward function
        if s_ == self.canvas.coords(self.oval):
            reward = 1
            done = True
            s_ = 'terminal'
        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:
            reward = -1
            done = True
            s_ = 'terminal'
        else:
            reward = 0
            done = False

        return s_, reward, done

    def render(self):
        time.sleep(0.1)
        self.update()


def update():
    for t in range(10):
        s = env.reset()
        while True:
            env.render()
            a = 1
            s, r, done = env.step(a)
            if done:
                break

if __name__ == '__main__':
    env = Maze()
    env.after(100, update)
    env.mainloop()

# RL_brain.py
import numpy as np
import pandas as pd


class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions  # list of action indices
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def choose_action(self, observation):
        self.check_state_exist(observation)  # make sure the state has a row in the table

        # action selection
        # np.random.uniform() draws a float from [0, 1) by default
        if np.random.uniform() < self.epsilon:
            # choose best action
            state_action = self.q_table.loc[observation, :]
            # some actions may have the same value; choose randomly among them
            action = np.random.choice(state_action[state_action == np.max(state_action)].index)
        else:
            # choose random action
            action = np.random.choice(self.actions)
        return action

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':  # next state is not terminal
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()
        else:  # next state is terminal: the target is just the reward
            q_target = r
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # add the new state as an all-zero row
            # (DataFrame.append was removed in pandas 2.0, so use .loc enlargement)
            self.q_table.loc[state] = [0.0] * len(self.actions)


         
from maze_env import Maze
from RL_brain import QLearningTable

def update():
    for episode in range(100):
        observation=env.reset() #initial observation

        while True:
            env.render() #refresh the environment display
            action=RL.choose_action(str(observation))

            observation_,reward,done=env.step(action)

            RL.learn(str(observation),action,reward,str(observation_))

            observation=observation_

            if done:
                break
    #end of the game
    print('game over')
    env.destroy()

if __name__=="__main__":
    env=Maze()
    RL=QLearningTable(actions=list(range(env.n_actions)))

    env.after(100,update)
    env.mainloop()

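3. SARSA

In SARSA, the action estimated at s2 is also the action that will actually be taken next. The target therefore changes: max Q is dropped and replaced by the Q value of the action actually chosen.

SARSA: says it and does it; the behavior policy and the target policy are the same.

Q-Learning: says it but does not necessarily do it; the behavior policy and the target policy differ.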
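The only difference between the two update targets can be shown side by side. This is a minimal sketch with made-up numbers: `q_next` stands for a hypothetical row Q(s', ·) of the table and `a_next` for the action the agent has already selected in s'.

```python
import numpy as np

gamma = 0.9
reward = 0.0
q_next = np.array([0.2, 0.8, -0.1, 0.5])  # hypothetical Q(s', .) for four actions
a_next = 3                                # action the agent will actually take in s'

q_learning_target = reward + gamma * q_next.max()  # best action in s', taken or not
sarsa_target = reward + gamma * q_next[a_next]     # the action that will be taken

print(q_learning_target, sarsa_target)
```

The Q-learning target can never be smaller than the SARSA target for the same transition, which is one way to see why SARSA tends to behave more cautiously.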

# maze_env1.py: the same Maze environment as in section 2, without the demo update() loop and __main__ block.

# RL_brain1.py: the parts shared by Q-learning and SARSA live in class RL, which both inherit from.
import numpy as np
import pandas as pd


class RL(object):
    def __init__(self, action_space, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = action_space  # a list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy

        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # add the new state to the q table as an all-zero row
            # (DataFrame.append was removed in pandas 2.0, so use .loc enlargement)
            self.q_table.loc[state] = [0.0] * len(self.actions)

    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection
        if np.random.rand() < self.epsilon:
            # choose best action
            state_action = self.q_table.loc[observation, :]
            # some actions may have the same value, randomly choose on in these actions
            action = np.random.choice(state_action[state_action == np.max(state_action)].index)
        else:
            # choose random action
            action = np.random.choice(self.actions)
        return action

    def learn(self, *args):
        # Q-learning and SARSA differ here and take different arguments
        pass


# off-policy
class QLearningTable(RL):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(QLearningTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update


# on-policy: learns while acting; a more cautious algorithm than Q-learning
class SarsaTable(RL):

    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(SarsaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)

    def learn(self, s, a, r, s_, a_):  # takes one more argument, a_, than Q-learning
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, a_]  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update


from maze_env1 import Maze
from RL_brain1 import SarsaTable


def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()

        # RL choose action based on observation
        # (SARSA picks its first action before the loop; Q-learning picks inside the loop)
        action = RL.choose_action(str(observation))

        while True:
            # fresh env
            env.render()

            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)

            # RL choose action based on next observation
            action_ = RL.choose_action(str(observation_))

            # RL learn from this transition (s, a, r, s', a') ==> Sarsa
            RL.learn(str(observation), action, reward, str(observation_), action_)

            # swap observation and action
            observation = observation_
            action = action_

            # break while loop when end of this episode
            if done:
                break

    # end of game
    print('game over')
    env.destroy()

if __name__ == "__main__":
    env = Maze()
    RL = SarsaTable(actions=list(range(env.n_actions)))

    env.after(100, update)
    env.mainloop()

4. SARSA(λ)

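λ is a decay value. It reflects the idea that steps far from the reward may not be the ones that got you to the reward fastest. Imagine standing where the treasure is and looking back along the path you took: the steps nearest the treasure are clear, while the distant ones are small and hard to make out. So we simply treat the steps closer to the treasure as more important and update them more strongly. Like the reward discount γ introduced earlier, λ is a step (trace) decay value, and both are numbers between 0 and 1.

λ = 0: Sarsa(0) reduces to single-step SARSA; each update only touches the most recent step.

λ = 1: Sarsa(1) becomes a per-episode update in which every step of the episode is updated. The trace then decays by γ alone, so with γ = 1 all steps are updated with equal strength.

0 < λ < 1: all steps leading to the treasure are updated, each with a different strength; the closer a step is to the treasure, the stronger its update. A larger λ makes the trace decay more slowly, so the update reaches further back.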
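To make the three cases concrete, the sketch below lists the trace left on a state-action pair that was visited once, k steps before the current update, assuming the trace is multiplied by γλ after every step. The function name and the choice of four steps are illustrative assumptions.

```python
def trace_weights(n_steps, gamma, lambda_):
    """Trace on a pair visited once, k steps ago (k = 0 is the latest step)."""
    decay = gamma * lambda_
    return [decay ** k for k in range(n_steps)]


for lam in (0.0, 0.5, 1.0):
    weights = trace_weights(4, gamma=0.9, lambda_=lam)
    print(lam, [round(w, 3) for w in weights])
```

With λ = 0 only the latest step keeps a non-zero weight, and larger values of λ keep earlier steps in play for longer.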

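Pseudocode for SARSA(λ):

    Initialize Q(s, a) arbitrarily for all s, a
    Repeat (for each episode):
        E(s, a) = 0 for all s, a
        Initialize S, choose A
        Repeat (for each step of the episode):
            Take action A, observe R, S'
            Choose A' from S' using the policy derived from Q (e.g. ϵ-greedy)
            δ ← R + γ Q(S', A') − Q(S, A)
            E(S, A) ← E(S, A) + 1
            For all s, a:
                Q(s, a) ← Q(s, a) + α δ E(s, a)
                E(s, a) ← γ λ E(s, a)
            S ← S'; A ← A'
        until S is terminal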

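SARSA(λ) looks backward, so every step the agent has taken needs to be marked. There are two ways to mark them:

Method 1 (accumulating trace): add 1 each time a state is visited and let the trace decay while it is not visited. The trace has no cap, so repeated visits can push it above 1.

Method 2 (replacing trace): set the trace to its peak value of 1 each time the state is visited and let it decay while it is not visited. Once at the peak, further visits cannot raise it; it simply stays at 1.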
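The two marking methods can be compared on a single state-action pair. This is a simplified sketch, not the table-based code of this post: `run_trace` is a hypothetical helper, `visits` is a made-up visit pattern, and the trace is decayed by γλ after each step.

```python
def run_trace(visits, gamma, lambda_, replacing):
    """Return the trace value used for the update at each step."""
    trace = 0.0
    history = []
    for visited in visits:
        if visited:
            # replacing trace resets to 1; accumulating trace adds 1
            trace = 1.0 if replacing else trace + 1.0
        history.append(trace)     # value used for the update at this step
        trace *= gamma * lambda_  # decay after the update
    return history


visits = [True, True, False, True]
accumulating = run_trace(visits, 0.9, 0.9, replacing=False)
replacing = run_trace(visits, 0.9, 0.9, replacing=True)
print([round(t, 3) for t in accumulating])
print([round(t, 3) for t in replacing])
```

The accumulating trace climbs above 1 on repeated visits, while the replacing trace never exceeds 1.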

# maze_env2.py: the same Maze environment as in section 2, without the demo update() loop and __main__ block; render() sleeps 0.05 s instead of 0.1 s.

# RL_brain2.py
import numpy as np
import pandas as pd 

class RL(object):
    def __init__(self,action_space,learning_rate=0.01,reward_decay=0.9,e_greedy=0.9):
        self.actions=action_space #a list
        self.lr=learning_rate
        self.gamma=reward_decay
        self.epsilon=e_greedy

        self.q_table=pd.DataFrame(columns=self.actions,dtype=np.float64)
    
    def check_state_exist(self,state):
        if state not in self.q_table.index:
            #add the new state as an all-zero row
            #(DataFrame.append was removed in pandas 2.0, so use .loc enlargement)
            self.q_table.loc[state]=[0.0]*len(self.actions)

    def choose_action(self,observation):
        self.check_state_exist(observation)
        if np.random.rand()<self.epsilon:
            #choose best action
            state_action=self.q_table.loc[observation,:]
            action=np.random.choice(state_action[state_action==np.max(state_action)].index)
        else:
            #choose random action
            action=np.random.choice(self.actions)
        return action

    def learn(self,*args):
        pass


# backward-view eligibility traces
class SarsaLambdaTable(RL):
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
        super(SarsaLambdaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)
        # besides the inherited parameters, SARSA(lambda) has its own:
        # backward view, eligibility trace
        self.lambda_ = trace_decay  # trace decay, between 0 and 1
        # same layout as q_table (rows are states, columns are actions);
        # visiting a state and taking an action marks the matching cell
        self.eligibility_trace = self.q_table.copy()

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # all-zero row matching the q_table columns
            # (DataFrame.append was removed in pandas 2.0, so use .loc enlargement)
            zeros = [0.0] * len(self.actions)
            # add the row to q_table
            self.q_table.loc[state] = zeros
            # also add it to the eligibility trace
            self.eligibility_trace.loc[state] = zeros

    def learn(self, s, a, r, s_, a_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, a_]
        else:
            q_target = r
        error = q_target - q_predict  # TD error, propagated back along the trace

        # increase trace amount for the visited state-action pair

        # Method 1: accumulating trace, no cap, add 1 on each visit
        self.eligibility_trace.loc[s, a] += 1

        # Method 2: replacing trace, capped at 1
        # self.eligibility_trace.loc[s, :] *= 0  # zero every action of this state
        # self.eligibility_trace.loc[s, a] = 1   # mark the action actually taken

        # Q update for SARSA(lambda): the error is scaled by the eligibility trace
        self.q_table += self.lr * error * self.eligibility_trace

        # decay the trace after the update: lambda_ decays the trace, gamma decays the reward
        self.eligibility_trace *= self.gamma * self.lambda_





from maze_env2 import Maze
from RL_brain2 import SarsaLambdaTable


def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()

        # RL choose action based on observation
        action = RL.choose_action(str(observation))

        # initial all zero eligibility trace
        RL.eligibility_trace *= 0

        while True:
            # fresh env
            env.render()

            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)

            # RL choose action based on next observation
            action_ = RL.choose_action(str(observation_))

            # RL learn from this transition (s, a, r, s', a') ==> Sarsa(lambda)
            RL.learn(str(observation), action, reward, str(observation_), action_)

            # swap observation and action
            observation = observation_
            action = action_

            # break while loop when end of this episode
            if done:
                break

    # end of game
    print('game over')
    env.destroy()

if __name__ == "__main__":
    env = Maze()
    RL = SarsaLambdaTable(actions=list(range(env.n_actions)))

    env.after(100, update)
    env.mainloop()
