Add the deep dyna-q agent #322
base: dev
)
batch = self.preprocess_update_batch(batch)

self._model_optimizer.zero_grad()
I think this `zero_grad()` call can be moved down to right above line 435.
:py:class:`~hive.replays.circular_replay.CircularReplayBuffer`.
discount_rate (float): A number between 0 and 1 specifying how much
    future rewards are discounted by the agent.
n_step (int): The horizon used in n-step returns to compute TD(n) targets.
Question: Is `n_step` the length of the horizon used while planning to tune the policy?
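Going only by the docstring wording, `n_step` controls how many rewards are summed before bootstrapping in the TD target. The sketch below illustrates that reading; `n_step_target` is a hypothetical helper written for this comment, not code from the PR, and it does not settle whether the same horizon is used during planning.

```python
def n_step_target(rewards, bootstrap_value, discount_rate, n_step):
    """Hypothetical helper: n-step TD target from the first n_step rewards.

    target = r_0 + g*r_1 + ... + g^(n-1)*r_(n-1) + g^n * bootstrap_value
    """
    if len(rewards) < n_step:
        raise ValueError("need at least n_step rewards")
    target = 0.0
    for k in range(n_step):
        # Each reward is discounted by how many steps ahead it occurs.
        target += (discount_rate ** k) * rewards[k]
    # Bootstrap from the value estimate of the state reached after n_step steps.
    return target + (discount_rate ** n_step) * bootstrap_value


# Three rewards of 1.0, discount 0.5, bootstrap value 4.0:
# 1 + 0.5 + 0.25 + 0.125 * 4 = 2.25
example = n_step_target(
    [1.0, 1.0, 1.0], bootstrap_value=4.0, discount_rate=0.5, n_step=3
)
```

With `n_step=1` this reduces to the usual one-step target `r + discount_rate * bootstrap_value`.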
    stack_size=stack_size,
    gamma=discount_rate,
)
self._planning_buffer = planning_buffer(
Question: Why are there separate replay buffers for planning and learning?
):
    self._logger.log_scalar("train_qval", torch.max(qvals), self._timescale)
agent_traj_state = {}
return action, agent_traj_state
Minor comment: Defining agent_traj_state might not be necessary.
"observation": update_info["observation"],
"action": update_info["action"],
"reward": update_info["reward"],
"done": update_info["terminated"],
Why is `or update_info["truncated"]` not added for this replay buffer?
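To make the two options concrete, here is a minimal sketch of both ways of setting `done`. The helper `make_transition` and the flag `include_truncated` are hypothetical names used only for illustration; the keys mirror the ones in the diff. Which variant is appropriate likely depends on whether the buffer uses `done` to mark episode boundaries or to cut off bootstrapping.

```python
def make_transition(update_info, include_truncated):
    """Hypothetical helper building the dict that is added to a replay buffer."""
    done = update_info["terminated"]
    if include_truncated:
        # Also treat a time-limit cutoff as the end of the episode.
        done = done or update_info["truncated"]
    return {
        "observation": update_info["observation"],
        "action": update_info["action"],
        "reward": update_info["reward"],
        "done": done,
    }


# A step that hit the time limit without reaching a terminal state.
truncated_step = {
    "observation": [0.0],
    "action": 1,
    "reward": 0.5,
    "terminated": False,
    "truncated": True,
}
terminated_only = make_transition(truncated_step, include_truncated=False)
with_truncation = make_transition(truncated_step, include_truncated=True)
```

For the truncated step above, the first variant stores `done=False` and the second stores `done=True`.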
return

(
    preprocessed_learning_update_info,
Why have two replay buffers? From what I understood, both replay buffers store the same transitions; only the batch size for planning and for model learning might differ. That could be passed as a separate argument instead. Also, having two buffers increases the memory required by the agent.
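A rough sketch of the single-buffer alternative, assuming both uses really do need the same stored transitions. `SharedReplayBuffer` is a hypothetical stand-in written for this comment, not the PR's `CircularReplayBuffer`; if the two buffers need different settings (for example different n-step horizons), this would not apply directly.

```python
import random


class SharedReplayBuffer:
    """Hypothetical minimal circular buffer sampled with different batch sizes."""

    def __init__(self, capacity, seed=0):
        self._capacity = capacity
        self._storage = []
        self._next = 0  # index that the next transition is written to
        self._rng = random.Random(seed)

    def add(self, transition):
        if len(self._storage) < self._capacity:
            self._storage.append(transition)
        else:
            # Buffer is full: overwrite the oldest entry.
            self._storage[self._next] = transition
        self._next = (self._next + 1) % self._capacity

    def sample(self, batch_size):
        # Uniform sampling without replacement within one batch.
        return self._rng.sample(self._storage, batch_size)


buffer = SharedReplayBuffer(capacity=100)
for t in range(50):
    buffer.add({"observation": t, "action": 0, "reward": 0.0, "done": False})

# One set of stored transitions, two batch sizes.
learning_batch = buffer.sample(batch_size=32)
planning_batch = buffer.sample(batch_size=8)
```

Here each transition is stored once, and the learning and planning updates differ only in the `batch_size` they request.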
# Observations
obs_pred_list = []
for a in range(self._act_dim):
Curious question: Isn't there a better way to do it without the for loop?
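One possible loop-free approach is to repeat each encoded observation once per action and run the predictor on the whole batch in a single call. The sketch below uses NumPy and a plain linear map `weights` as a stand-in for the predictor network; both function names are hypothetical and the actual model in the PR may differ. The same idea should carry over to PyTorch with `torch.repeat_interleave` and `Tensor.repeat`.

```python
import numpy as np


def predict_loop(encoded, weights, act_dim):
    """Loop version: one predictor call per action."""
    preds = []
    for a in range(act_dim):
        action_col = np.full((encoded.shape[0], 1), float(a))
        preds.append(np.concatenate([encoded, action_col], axis=1) @ weights)
    return np.stack(preds, axis=1)  # (batch, act_dim, out_dim)


def predict_batched(encoded, weights, act_dim):
    """Batched version: a single predictor call over all (observation, action) pairs."""
    batch = encoded.shape[0]
    # Each encoded row is repeated act_dim times: e0, e0, ..., e1, e1, ...
    tiled = np.repeat(encoded, act_dim, axis=0)
    # Action indices cycle 0..act_dim-1 for every observation.
    actions = np.tile(np.arange(act_dim, dtype=float), batch)[:, None]
    out = np.concatenate([tiled, actions], axis=1) @ weights
    return out.reshape(batch, act_dim, -1)


rng = np.random.default_rng(0)
encoded = rng.normal(size=(4, 3))      # 4 observations, 3 encoded features
weights = rng.normal(size=(3 + 1, 2))  # stand-in predictor: (features + action) -> 2 outputs
loop_out = predict_loop(encoded, weights, act_dim=5)
batched_out = predict_batched(encoded, weights, act_dim=5)
```

Both versions return an array of shape `(batch, act_dim, out_dim)` with matching values; the batched one trades extra memory for a single forward pass.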
# Observations
self._obs_encoder = observation_encoder_net(in_dim)
obs_predictor_in_dim = (
    np.prod(calculate_output_dim(self._obs_encoder, in_dim)) + 1
Question: Is the extra dimension (the `+ 1`) added for the action? I thought actions are generally one-hot encoded for discrete action spaces.
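To show what the difference would mean for the predictor input size, here is a small sketch comparing a scalar action index (encoder output + 1) with a one-hot action (encoder output + `act_dim`). The function `predictor_input` and its `one_hot` flag are hypothetical and only illustrate the shapes; they are not taken from the PR.

```python
import numpy as np


def predictor_input(encoded, actions, act_dim, one_hot):
    """Hypothetical helper: concatenate encoded observations with an action feature."""
    if one_hot:
        # One column per action: adds act_dim input dimensions.
        action_feat = np.eye(act_dim)[actions]
    else:
        # Raw action index as a single float column: adds 1 input dimension.
        action_feat = actions.astype(float)[:, None]
    return np.concatenate([encoded, action_feat], axis=1)


encoded = np.zeros((2, 6))   # batch of 2, encoder output of size 6
actions = np.array([0, 3])   # discrete actions out of act_dim = 4
scalar_in = predictor_input(encoded, actions, act_dim=4, one_hot=False)
onehot_in = predictor_input(encoded, actions, act_dim=4, one_hot=True)
```

With an encoder output of size 6 and 4 actions, the scalar variant gives inputs of width 7 and the one-hot variant gives width 10.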