Description
Hi,
thanks for the thorough implementation and making this code available, it really helps to understand the internal mechanisms of the SAC algorithm.
I have a question regarding the code in sac/sac/envs/gym_env.py -
In the file's header you comment: "Rllab implementation with a HACK. See comment in GymEnv.init().", and then in the `__init__()` method you write:
```python
# HACK: Gets rid of the TimeLimit wrapper that sets 'done = True' when
# the time limit specified for each environment has been passed and
# therefore the environment is not Markovian (terminal condition depends
# on time rather than state).
```
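For context, my current guess is that the hack simply peels the `TimeLimit` wrapper off the env returned by `gym.make()`, something like the sketch below. The `FakeEnv` and `TimeLimit` classes here are my own minimal stand-ins for illustration, not the actual Gym or SAC code:

```python
class FakeEnv:
    """Minimal stand-in for an inner Gym environment (mock, for illustration)."""
    def step(self, action):
        # Never signals done on its own: termination is left to the caller.
        return 0.0, 0.0, False, {}

class TimeLimit:
    """Minimal stand-in for gym.wrappers.TimeLimit."""
    def __init__(self, env, max_episode_steps):
        self.env = env
        self._max_episode_steps = max_episode_steps

wrapped = TimeLimit(FakeEnv(), max_episode_steps=1000)

# The "HACK" as I understand it: strip the wrapper so that step()
# no longer forces done=True when the time limit is hit.
env = wrapped
while isinstance(env, TimeLimit):
    env = env.env

print(type(env).__name__)  # FakeEnv
```

Is that roughly what the code is doing, i.e. reaching through `.env` to the unwrapped environment?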
I understand the point here, but I'm not sure I follow the implementation, since it seems to refer to internal Gym code and I could not find it in the SAC code in this repository.
Can you explain exactly what you are doing with the TimeLimit wrapper?
If you omit the done flag, do you still terminate the episode?
Specifically, in Gym's registration.py the env is wrapped with:

```python
if env.spec.max_episode_steps is not None:
    from gym.wrappers.time_limit import TimeLimit
    env = TimeLimit(env, max_episode_steps=env.spec.max_episode_steps)
```
Furthermore, in time_limit.py:

```python
def step(self, action):
    assert self._elapsed_steps is not None, "Cannot call env.step() before calling reset()"
    observation, reward, done, info = self.env.step(action)
    self._elapsed_steps += 1
    if self._elapsed_steps >= self._max_episode_steps:
        info['TimeLimit.truncated'] = not done
        done = True
    return observation, reward, done, info
```
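My working guess is that once the wrapper is stripped, the sampling loop itself enforces a maximum path length, so the episode still ends at the cap but that ending is treated as truncation rather than a true terminal for bootstrapping. A rough sketch of what I mean (the `run_episode` helper, `max_path_length` handling, and toy env are my own illustration, not this repo's actual code):

```python
def run_episode(env_step, env_reset, max_path_length=1000):
    """Roll out one episode with an external step cap.

    env_step/env_reset are stand-ins for env.step/env.reset on an
    UNWRAPPED env; the max_path_length cap replaces the TimeLimit
    wrapper's bookkeeping, without ever forcing done=True.
    """
    obs = env_reset()
    transitions = []
    for t in range(max_path_length):
        obs, reward, done, info = env_step(None)
        # Only a genuine environment terminal is recorded as terminal;
        # hitting the step cap is truncation, not termination.
        transitions.append((reward, done))
        if done:
            break
    return transitions

# Toy env that never terminates on its own:
episode = run_episode(
    env_step=lambda a: (0, 1.0, False, {}),
    env_reset=lambda: 0,
    max_path_length=5,
)
print(len(episode))                # 5: the cap ended the episode
print(any(t for _, t in episode))  # False: no transition marked terminal
```

Is this the intended behavior, i.e. the episode is cut off by the sampler but the final transition is stored as non-terminal?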
If you omit these lines of code, how does the environment reset itself when max_episode_steps is reached?
Thanks!
Lior