Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extended Debugger reductions and fixed some bugs #523

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

NRauschmayr
Copy link
Contributor

Description of changes:

Extended smdebug's reductions to check for nan- and inf-values and to compute quantiles for PT tensors. Tensors are now also written out in Tensorboard format such that users can display all reductions for a specific tensor within the same visualization and visualizations will be grouped by Debugger collections.
Here is an example visualization:

Screen Shot 2021-11-14 at 2 52 42 PM

Style and formatting:

I have run pre-commit install && pre-commit run --all-files to ensure that auto-formatting happens with every commit.

Issue number, if available

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@codecov-commenter
Copy link

codecov-commenter commented Nov 14, 2021

Codecov Report

Merging #523 (66999aa) into master (b4dd4c1) will decrease coverage by 6.36%.
The diff coverage is 57.44%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #523      +/-   ##
==========================================
- Coverage   77.60%   71.24%   -6.37%     
==========================================
  Files         127      117      -10     
  Lines       11111    10614     -497     
==========================================
- Hits         8623     7562    -1061     
- Misses       2488     3052     +564     
Impacted Files Coverage Δ
smdebug/pytorch/utils.py 48.14% <23.52%> (-32.81%) ⬇️
smdebug/core/locations.py 85.71% <60.00%> (-5.96%) ⬇️
smdebug/core/reduction_config.py 86.58% <77.77%> (-9.52%) ⬇️
smdebug/core/hook.py 86.90% <81.25%> (-0.53%) ⬇️
smdebug/mxnet/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
smdebug/mxnet/singleton_utils.py 0.00% <0.00%> (-100.00%) ⬇️
...debug/profiler/analysis/notebook_utils/__init__.py 0.00% <0.00%> (-100.00%) ⬇️
smdebug/mxnet/hook.py 0.00% <0.00%> (-84.85%) ⬇️
smdebug/mxnet/utils.py 0.00% <0.00%> (-78.13%) ⬇️
smdebug/rules/action/message_action.py 13.25% <0.00%> (-75.91%) ⬇️
... and 61 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b4dd4c1...66999aa. Read the comment docs.

Comment on lines +523 to +524
if subfolder == None:
subfolder = self.mode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use subfolder = os.path.join(self.mode)?
It makes the intentions of this line much clearer to the reader.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subfolder is just a string and not a filepath.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.path.join returns an object of type str. If I understand this part correctly, the subfolder variable contains the path to sub directory for tensorboard data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no subfolder is just the name of the reduction. each reduction will be its own subfolder in the tensorboard directory.

if subfolder == None:
subfolder = self.mode

if subfolder in self.tb_writers:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should rename the name of this map. Maybe something more explicit like self.tb_writer_to_dir_map ?

@@ -660,16 +663,32 @@ def export_collections(self):
collection_file_name = f"{self.worker}_collections.json"
self.collection_manager.export(self.out_dir, collection_file_name)

def _get_reduction_tensor_name(self, tensor_name, reduction_name, abs):
return get_reduction_tensor_name(tensor_name, reduction_name, abs, remove_colon_index=True)
def _get_reduction_tensor_name(self, tensor_name, reduction_name, abs, collection_name=""):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we accept an empty string collection name?
Should we use the DEFAULT collection key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extended this part, which is needed for Tensorboard to group visualizations by Debugger collections. To keep it consistent with previous code, the collection name will be empty per default.

reduction_name = "abs_" + reduction_name
tb_writer = self._maybe_get_tb_writer(subfolder=reduction_name)
if tb_writer:
scalar = self._make_numpy_array(tensor_data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the value of scalar if tb_writer = None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per default Debugger writes reductions like normal tensors into debug-output folder and users can retrieve the data via the smdebug API. I extended this part, so that reductions are also written in Tensorboard format (in case user provided a Tensorboard configuration)

if hasattr(torch.Tensor, reduction_name):
f = getattr(torch.Tensor, reduction_name)
op = f(tensor_data.float())
if reduction_name == "isnan" or reduction_name == "isinf":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we manage reduction_name values with Enums?

@NihalHarish NihalHarish self-requested a review November 23, 2021 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants