Question: "OpenAI Gym: When is a reset required?"
Background:
Although I can get the examples and my own code to run, I am more curious about the real semantics and expectations behind the OpenAI Gym API, in particular Env.reset().
When is reset expected/required? At the end of each episode? Or only after creating an environment?
I rather think it makes sense to call it before each episode, but I have not found that stated explicitly anywhere!
Answer:
You typically call reset at the end of an episode. That could be after you reach a terminal state in the MDP, or after you hit your maximum number of time steps (set by you). I also typically call it once at the very start of training, before the first step.
So if you are at your starting state 'A' and you want to reach state 'Z', you would run your time steps going from 'A' -> 'B' -> 'C' ..., and when you reach the terminal state 'Z', you start a new episode by calling reset, which takes you back to 'A'.
# Assumes env, iterations, and take_action are defined elsewhere.
for episode in range(iterations):
    state = env.reset()  # first state of the new episode
    for time_step in range(1000):  # maximum number of steps per episode
        action = take_action(state)
        state, reward, done, _ = env.step(action)
        if done:
            break  # episode over; the next outer iteration resets the environment
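
To make the 'A' to 'Z' walk concrete without installing anything, here is a minimal sketch of a toy environment that mimics the classic Gym interface used above (reset returns the state, step returns a 4-tuple). ChainEnv and run_episode are hypothetical names made up for this illustration and are not part of Gym. The toy raises an error if you step without resetting, which is a deliberate choice for this sketch; real environments may only warn or behave in undefined ways.

```python
class ChainEnv:
    """Toy environment (not part of Gym): walk from 'A' to the terminal state 'Z'."""

    def __init__(self):
        self.state = None  # no valid state until reset() is called

    def reset(self):
        self.state = "A"  # every episode starts from 'A'
        return self.state

    def step(self, action):
        # Stepping before the first reset, or after the terminal state, is an error here.
        if self.state is None or self.state == "Z":
            raise RuntimeError("call reset() before step()")
        # Hypothetical action space: the number of letters to advance (0 or 1).
        self.state = chr(min(ord(self.state) + action, ord("Z")))
        done = self.state == "Z"
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}


def run_episode(env, max_steps):
    """Run one episode; return (final_state, steps_taken, reached_terminal)."""
    state = env.reset()  # always reset at the start of an episode
    for t in range(max_steps):
        state, reward, done, _ = env.step(1)  # always move one letter forward
        if done:
            return state, t + 1, True
    return state, max_steps, False  # step limit hit before reaching 'Z'


env = ChainEnv()
print(run_episode(env, 1000))  # terminal state reached well before the step limit
print(run_episode(env, 10))    # cut off by the step limit; the next episode still resets
```

Both ways of ending an episode (terminal state or step limit) lead to the same thing: the next episode begins with reset. Note that newer Gym releases (0.26 and later) and Gymnasium changed the signatures, so reset returns an (observation, info) pair and step returns five values with separate terminated and truncated flags; the reset-per-episode pattern is the same.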