Conversation
I only roughly skimmed over this PR. In general, I think it's not clear that most of these changes need to be included in ALF. From my understanding, the best scenario would be to separate the LBTQ algorithm from the other misc stuff in a clean way with a general API, and only integrate that into ALF. All the other stuff can be put in your personal repo, which would contain all the experiment details, helper functions, etc. So far I think the generality of some ALF APIs has been compromised by the PR.
alf/algorithms/data_transformer.py (Outdated)

    result = result._replace(discount=discount)
    result = result._replace(step_type=step_type)
    relabeled_rewards = suite_socialbot.transform_reward_tensor(
My understanding is that HindsightExperienceTransformer is a general data transformer independent of the environment. So it feels strange to include suite_socialbot here.
This transform_reward_tensor function is used to translate -1/0 reward into 0/1 reward. We also depend on the -1/0 reward assumption to determine the relabeled LAST (episode end) steps (also relabel discount 0).
Removed the dependency here, added it in the task gin or alf configs. This runs the risk of people forgetting to set it in a new task config.
For episodic tasks (sparse_reward==True), the Hindsight transformer does depend on the original reward being -1/0. In this case, maybe I should move transform_reward_tensor out of suite_socialbot into a more general place like reward_utils.py? Would you prefer that over the current solution of relying on the task's alf config to set up the transform_reward_tensor function?
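For concreteness, here is a minimal sketch of what such a shared helper could look like (illustrative only; this is not necessarily the PR's actual transform_reward_tensor):

    import torch

    def transform_reward_tensor(reward: torch.Tensor,
                                positive_reward: bool = True) -> torch.Tensor:
        """Map a sparse -1/0 reward tensor to 0/1 when ``positive_reward`` is True."""
        return reward + 1. if positive_reward else reward

    # Under the -1/0 convention assumed by the relabeler, the (relabeled) goal is
    # reached exactly at the step whose original reward is 0, so that step can be
    # marked as the episode end (LAST step) with discount 0.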
alf/algorithms/data_transformer.py (Outdated)

    if HindsightExperienceTransformer in data_transformer_ctor:
        assert HindsightExperienceTransformer == data_transformer_ctor[0], \
            "Hindsight relabeling should happen before all other transforms."
This is no longer true after Break's recent PR on SequentialDataTransformer. Basically UntransformedTimeStep could appear before any other data transformer. I suggest either justifying why HindsightExperienceTransformer has to appear before UntransformedTimeStep or removing this assertion here.
Good point. Removed.
alf/algorithms/data_transformer.py (Outdated)

    final_relabeled_rewards = relabeled_rewards
    if result.reward.ndim > 2:
        # multi dimensional env reward, first dim is goal related reward.
Is this reward assumption too strict and easily violated by a user?
True. Made a note about the assumption upfront in the class description.
The proper way is maybe to divide the current multi-dim reward into a nested reward field. Added TODO.
alf/algorithms/data_transformer.py (Outdated)

            exp nest.
        desired_goal_field (str): path to the desired_goal field in the
            exp nest.
        sparse_reward (bool): Whether to transform reward from -1/0 to 0/1.
This transform from -1/0 to 0/1 seems specific to certain environments and is not a general thing.
Environment rewards can be 0/1 or -1/0 or any form. But because the Hindsight relabeler needs to know whether goal is achieved to be able to relabel discount and step_type correctly, here, we assume the reward_fn always returns -1/0 reward, and it needs to be transformed into 0/1 form when needed.
Since it relabels discount to 0, this also relabels infinite horizon tasks to episodic. Updated comment.
From the docstring, the observation is assumed to have a field "achieved_goal" indicating what goal has been achieved. If this is true, then there is no need to decide goal achieving from the reward, which is not a general setup. (on the other hand, -1/0 is as sparse as 0/1 so the name 'sparse_reward' could be improved).
Good point about goal achieved. It's still hard to find a general way, though. The reward function is general and may not return anything about goal achievement, e.g. when it is a dense reward based on distance to goal.
Renaming the variable to relabel_with_episodic_rewards, since -1/0 is for continuing tasks and 0/1 is for episodic ones.
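To illustrate why goal achievement cannot always be read off the reward, here are two hypothetical reward_fn's (the sparse one mirrors the l2_dist_close_reward_fn referenced elsewhere in this PR; the dense one is purely for illustration):

    import torch

    def l2_dist_close_reward_fn(achieved_goal, desired_goal, threshold=0.05):
        """Sparse -1/0 reward: 0 if within ``threshold`` of the goal, else -1."""
        dist = torch.norm(achieved_goal - desired_goal, dim=-1)
        return torch.where(dist < threshold, torch.zeros_like(dist),
                           -torch.ones_like(dist))

    def neg_l2_dist_reward_fn(achieved_goal, desired_goal):
        """Dense reward: negative distance to the goal; it carries no explicit
        signal about whether the goal was actually achieved."""
        return -torch.norm(achieved_goal - desired_goal, dim=-1)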
    observation_spec,
    stack_size=4,
    stack_axis=0,
    convert_only_minibatch_to_device=False,
What does this argument mean? The docstring should explain its actual meaning.
    lb_loss_scale: bool = False,
    reward_multiplier: float = 1.,
    positive_reward: bool = True,
    use_retrace: bool = False,
To me, it's not clear which of these arguments are really essential to TDLoss in a general sense. Some of them seem only applicable to specific scenarios.
Good point. Created a subclass to do lower bounding. Kept a few general functions like clip and retrace in td_loss.
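A rough sketch of the shape of that refactor (class and argument names here are assumed, not ALF's actual ones): generic options stay in the base TD loss, while the lower bounding, e.g. by the observed discounted return, lives in a subclass.

    import torch

    class SimpleTDLoss(object):
        """Illustrative base class: plain one-step TD target."""

        def __init__(self, gamma: float = 0.99):
            self._gamma = gamma

        def compute_td_target(self, reward, discount, next_value):
            return reward + self._gamma * discount * next_value


    class LowerBoundedTDLoss(SimpleTDLoss):
        """Additionally lower-bounds the TD target by the observed discounted
        return (when available) before the TD error is computed."""

        def compute_td_target(self, reward, discount, next_value,
                              discounted_return=None):
            target = super().compute_td_target(reward, discount, next_value)
            if discounted_return is not None:
                target = torch.maximum(target, discounted_return)
            return target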
alf/bin/train_play_test.py (Outdated)

    'TrainerConfig.replay_buffer_length=64',
    # If replay_buffer_length is above 6 (3 iters * 2 unroll length), whole replay buffer training
    # algorithms (e.g. mbrl_pendulum) will not get enough training data, and will not start training.
    # This fails during playing, when the config file isn't generated in root_dir.
Why won't a longer replay buffer give enough training data?
because the replay buffer is cleared before it's fully filled. So training never starts.
Whole replay buffer training doesn't require the buffer to be full? It just uses all the data in the buffer even if the buffer isn't full yet.
Just curious: this number of 64 had no issue before? Did you run into this problem with the new test cases?
Ah, you are right. Turns out it was due to my earlier change in rl_algorithm which required buffer to be full even when clear_replay_buffer is False. That change was already reverted. This change is also reverted.
alf/environments/suite_robotics.py (Outdated)

    env,
    reward_cap=1.,
    positive_reward=True,
    append_reward_dim=False):
Docstrings are needed for the new arguments. The name of reward_cap doesn't correctly reflect its usage which is actually a scaling factor. Why do you need append_reward_dim (it should be put in another wrapper if needed)?
Added docstrings. Renamed reward_cap to reward_weight. Removed append_reward_dim.
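A minimal sketch of what the renamed wrapper arguments could do (hypothetical code; it assumes the native -1/0 sparse reward of the Fetch/HandManipulate environments):

    import gym

    class SparseRewardWrapper(gym.RewardWrapper):
        """Illustrative wrapper: optionally map -1/0 to 0/1, then scale."""

        def __init__(self, env, reward_weight: float = 1.,
                     positive_reward: bool = True):
            """
            Args:
                reward_weight: scaling factor applied to the output reward.
                positive_reward: if True, map the native -1/0 reward to 0/1.
            """
            super().__init__(env)
            self._reward_weight = reward_weight
            self._positive_reward = positive_reward

        def reward(self, reward):
            if self._positive_reward:
                reward = reward + 1.  # -1/0 -> 0/1
            return self._reward_weight * reward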
    assert (
        environment_name.startswith("Fetch")
        or environment_name.startswith("HandManipulate")
        or environment_name.startswith("Ant")
Not sure if it's appropriate to include multiworld Ant in this suite? Note that "Fetch" and "HandManipulate" were introduced at the same time by OpenAI with very similar implementations.
It's probably Ok? The code here is quite simple. The if conditions mark clearly where Ant is the env.
I mean it's hard for someone else to imagine that metaworld Ant is included in this suite_robotics.py file. The reason for having "robotics" in the name is that OpenAI Gym originally had a "robotics" category which contains Fetch and HandManipulate.
I see. How about moving this code to a separate suite_mujoco.py, since all these Ant envs are under: https://github.com/openai/gym/tree/master/gym/envs/mujoco
    return y.sign() * ((y / self._alpha).abs().exp() - 1)

    @alf.configurable
These hindsight related functions shouldn't be put in a general file like math_ops.py.
Both data_transformer and suite_robotics use these functions, so util is perhaps a reasonable place? Should I create a utils/reward_functions.py?
These calculations are very simple (basically one line) and don't have to be factored into functions.
The hindsight data_transformer requires an input reward function, and the reward functions in these two places need to be exactly the same; otherwise bugs can happen. Plus, it is already not that simple due to the multi-dim reward.
That's why I keep them as alf.configurable functions.
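Illustrative only (hypothetical conf-file lines, assuming the functions stay where this PR currently keeps them): the point is that the environment side and the hindsight relabeler are configured to use the very same function, so the two reward computations cannot drift apart.

    import alf
    from alf.utils import math_ops  # current location of the reward fns in this PR

    # One threshold, one function, used by both the env suite and the transformer.
    alf.config('l2_dist_close_reward_fn', threshold=0.05)
    alf.config('HindsightExperienceTransformer',
               reward_fn=math_ops.l2_dist_close_reward_fn)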
le-horizon
left a comment
Thanks for the insightful comments @hnyu .
Tried to address all. Replies or questions inlined. PTAL.
Thanks,
Le
    output=noisy_action,
    state=state,
    info=DdpgInfo(action=noisy_action, action_distribution=action))
    info=DdpgInfo(action=noisy_action, action_distribution=()))
This is to fail early and clearly. action is a tensor, not a distribution. Putting action directly there could cause confusion when debugging. See the comment 3 lines above.
…ield "improve_w_nstep_bootstrap".
emailweixu
left a comment
This PR still contains a lot of separable parts. It's better to separate them and describe the motivation for each of them.
    output=noisy_action,
    state=state,
    info=DdpgInfo(action=noisy_action, action_distribution=action))
    info=DdpgInfo(action=noisy_action, action_distribution=()))
It is intended to return action_distribution here so that some other algorithm can use it (e.g. TracAlgorithm). Do you find it causing a problem?
alf/algorithms/sac_algorithm.py (Outdated)

    reward_weights=None,
    epsilon_greedy=None,
    rollout_epsilon_greedy=1.0,
    use_epsilon_schedule=0,
It's better to change epsilon_greedy to accept both a float and a Scheduler. For float, you can use schedulers.as_scheduler to convert it to a scheduler. Similar to TrainerConfig.priority_replay_alpha
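A small sketch of the suggested pattern (schedulers.as_scheduler is the helper named in the comment; the surrounding class is illustrative):

    from alf.utils import schedulers

    class EpsilonGreedyMixin(object):
        """Illustrative only: accept either a float or a Scheduler."""

        def __init__(self, rollout_epsilon_greedy=1.0):
            # A plain float is wrapped into a constant scheduler, mirroring how
            # TrainerConfig.priority_replay_alpha is handled.
            self._rollout_epsilon_greedy = schedulers.as_scheduler(
                rollout_epsilon_greedy)

        def current_rollout_epsilon(self):
            # Schedulers are callable and return the currently scheduled value.
            return self._rollout_epsilon_greedy()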
alf/algorithms/sac_algorithm.py (Outdated)

    def _init_log_alpha():
        return nn.Parameter(torch.tensor(float(initial_log_alpha)))
    alpha = torch.tensor(float(initial_log_alpha))
    if alpha_optimizer is None:
I think if you want to fix alpha, you can simply set the learning rate for it to be zero. ???_optimizer==None usually means using a default optimizer.
Oh, I misunderstood the param. lr=0 sounds good. Removed.
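For reference, a minimal sketch of the suggestion (the optimizer setup shown here is illustrative, not the ALF way of attaching optimizers):

    import torch
    import torch.nn as nn

    # Keep log_alpha a learnable Parameter, but freeze it by giving its optimizer
    # a zero learning rate, instead of special-casing alpha_optimizer is None.
    log_alpha = nn.Parameter(torch.tensor(0.0))
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=0.0)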
alf/algorithms/sac_algorithm.py (Outdated)

        epsilon_greedy=self._epsilon_greedy,
        eps_greedy_sampling=True)
    info = SacInfo(action_distribution=action_dist)
    if (alf.summary.render.is_rendering_enabled()
Agree with @hnyu. The current situation is that different people want to render different things, so it's hard to have an agreed-upon set of figures for rendering. This is different from tensorboard summary, where space is not that precious.
alf/algorithms/ddpg_algorithm.py (Outdated)

        "_improve_w_nstep_bootstrap") and \
            self._critic_losses[0]._improve_w_nstep_bootstrap:
        # Ignore 2nd - nth step actor losses.
        actor_loss.loss[1:] = 0
Why do you need to ignore the actor loss for those steps? It seems to me it doesn't hurt to keep them.
Good point. Since we compare with n-step return methods anyways, it may be more fair to keep the other steps' action and alpha losses.
alf/algorithms/td_loss.py (Outdated)

        value itself is not normalized.
    use_retrace: turn on retrace loss
        :math:`\mathcal{R} Q(x, a):=Q(x, a)+\mathbb{E}_{\mu}\left[\sum_{t \geq 0} \gamma^{t}\left(\prod_{s=1}^{t} c_{s}\right)\left(r_{t}+\gamma \mathbb{E}_{\pi} Q\left(x_{t+1}, \cdot\right)-Q\left(x_{t}, a_{t}\right)\right)\right]`
        copied from PR #695.
    return importance_ratio, importance_ratio_clipped

    def generalized_advantage_estimation(rewards,
This should be put side-by-side with the original generalized_advantage_estimation for comparison.
alf/utils/value_ops.py (Outdated)

    importance_ratio,
    use_retrace=False,
    td_lambda=1.0,
    time_major=True):
Also should cite the Retrace paper.
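The paper in question is Munos et al. 2016, "Safe and Efficient Off-Policy Reinforcement Learning". For reference, a tiny sketch of the truncated importance weights c_s = lambda * min(1, pi(a_s|x_s) / mu(a_s|x_s)) used in the formula above (function name is illustrative):

    import torch

    def retrace_coefficients(importance_ratio: torch.Tensor,
                             td_lambda: float = 1.0) -> torch.Tensor:
        """Truncated importance ratios c_s used to weight off-policy TD errors."""
        return td_lambda * torch.clamp(importance_ratio, max=1.0)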
alf/environments/suite_robotics.py (Outdated)

    def __init__(self, env, reward_weight=1., positive_reward=True):
        """
        Args:
            reward_weight (float): weight of output reward.
Use type hints at the function signature instead:
    def __init__(self, env, reward_weight: float = 1., positive_reward: bool = True)

    stack_size (int): stack so many frames
    stack_axis (int): the dimension to stack the observation.
    convert_only_minibatch_to_device (bool): whether to convert only the
        minibatch or the whole batch of data to the default device.
Variable name and docstring inconsistent with the code? FrameStacker will always affect the entire batch?
alf/algorithms/data_transformer.py (Outdated)

    reward_fn=l2_dist_close_reward_fn):
    sparse_reward=False,
    add_noise_to_goals=False,
    threshold=.05,
Can merge add_noise_to_goals and threshold into just one relabeled_goal_noise, and when it's 0 there is no noise added.
Good point. Done.
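A tiny sketch of the merged argument (hypothetical helper name; Gaussian noise is just an example):

    import torch

    def maybe_add_goal_noise(goal: torch.Tensor,
                             relabeled_goal_noise: float = 0.) -> torch.Tensor:
        """relabeled_goal_noise == 0 means no noise is added to relabeled goals."""
        if relabeled_goal_noise == 0.:
            return goal
        return goal + relabeled_goal_noise * torch.randn_like(goal)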
alf/algorithms/data_transformer.py (Outdated)

    "replayer/" + buffer._name + ".mean_steps_to_episode_end",
    torch.mean(dist.type(torch.float32)))

    def _add_noise(t):
This function transform_experience() is already over 200 lines, and it's really difficult to read. I suggest splitting it into several member functions to make it more modular.
Good point. Cleaned up a bit.
alf/algorithms/ddpg_algorithm.py (Outdated)

    "DdpgInfo", [
        "reward", "step_type", "discount", "action", "action_distribution",
        "actor_loss", "critic", "discounted_return"
        "actor_loss", "critic", "discounted_return", "future_distance", "her"
Better not to put DDPG-irrelevant concepts into this Info structure. HER and DDPG are two orthogonal concepts and should be disentangled in the code. Otherwise, when we have a third algorithm in the future, this Info structure will get even more fields.
    return reward

    class SparseReward(gym.Wrapper):
You can reuse the one in suite_robotics.py?
This one also assumes that the first reward dimension is the goal reward when the reward is multi-dim.
So perhaps a bit of code duplication is actually better, to keep the env-specific reward logic separate.
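Illustrative only (hypothetical code, not this PR's SparseReward): with a vector-valued reward, only the first component (the goal-related reward) would get the sparse treatment, which is what makes the logic environment-specific.

    import gym
    import numpy as np

    class SparseGoalReward(gym.Wrapper):
        """Hypothetical sketch: sparsify only the goal component of a multi-dim reward."""

        def __init__(self, env, threshold: float = 0.):
            super().__init__(env)
            self._threshold = threshold

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            reward = np.array(reward, dtype=np.float32, copy=True)
            if reward.ndim > 0:
                # First dim is assumed to be the goal reward; other dims
                # (e.g. shaping or penalty terms) are left untouched.
                reward[0] = 0. if reward[0] > self._threshold else -1.
            return obs, reward, done, info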
    @alf.configurable
    def transform_reward_tensor(reward, reward_cap=1., positive_reward=True):
Where is this function used? If it's only in your conf file, then it should be put there instead.
Both suite_robotics and suite_socialbot and their conf files could use this.
    @alf.configurable
    def transform_reward(reward, reward_cap=1., positive_reward=True):
Should be a class member of SparseReward below, as it is not used by other files / code.
Cannot do that. positive_reward needs to be consistent with transform_reward_tensor, which is typically ensured in the config file or via command-line config. It needs to be alf.configurable.
le-horizon
left a comment
Addressed all comments. Please hold off from reviewing, and let me further split this PR.
Thanks,
Le
    output=noisy_action,
    state=state,
    info=DdpgInfo(action=noisy_action, action_distribution=action))
    info=DdpgInfo(action=noisy_action, action_distribution=()))
Yes, when used with e.g. Retrace, a distribution is needed, but action is not a distribution.
alf/algorithms/sac_algorithm.py (Outdated)

    "reward", "step_type", "discount", "action", "action_distribution",
    "actor", "critic", "alpha", "log_pi", "discounted_return"
    "actor", "critic", "alpha", "log_pi", "discounted_return",
    "future_distance", "her"
Good point. For now, added a comment that these three fields are optional, and a TODO for future work to add an HER algorithm wrapper.
alf/algorithms/sac_algorithm.py (Outdated)

            rollout_epsilon_greedy.
        max_target_action (bool): whether to use the action with the highest
            target value as the target action for computing bootstrapped value.
            To mimic the DQN algorithm, set this to True.
Gave it a try; how about this new dqn_algorithm.py?
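A rough sketch of the max_target_action idea from the docstring above (illustrative helper, not the new dqn_algorithm.py): among candidate actions, bootstrap from the one with the highest target critic value.

    import torch

    def select_max_target_action(target_q_values: torch.Tensor,
                                 candidate_actions: torch.Tensor) -> torch.Tensor:
        """target_q_values: [num_candidates, B]; candidate_actions: [num_candidates, B, action_dim]."""
        best = target_q_values.argmax(dim=0)                    # [B]
        idx = best.view(1, -1, 1).expand(1, *candidate_actions.shape[1:])
        return candidate_actions.gather(0, idx).squeeze(0)      # [B, action_dim]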
alf/algorithms/sac_algorithm.py (Outdated)

    alpha = torch.tensor(float(initial_log_alpha))
    if alpha_optimizer is None:
        return alpha
    return nn.Parameter(alpha)
Got it. Removed.
alf/environments/suite_robotics.py (Outdated)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, reward, done, {}
Split into #1330 and a few others.
This is the main change for the lower bound target Q experiments. Still quite large; let me know if you'd prefer even smaller pieces.