Don't get deterministic results with NGC 20.09 #29
That's a very thorough report, @ornellamarciano; thank you. Having scanned your code, what stands out to me is your use of an embedding layer. Deeper information: the forward path of selecting word embeddings for a given batch is usually implemented with a gather op, and its gradient is computed with a segment reduction, which is nondeterministic on GPU. Thanks to my colleague @wenscarl for helping to isolate the source of nondeterminism and develop the patch for it.
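To make the mechanism concrete, here is a minimal plain-Python sketch (not the actual TensorFlow kernels) of the segment-sum reduction that backpropagation through an embedding lookup performs: gradient rows that share an embedding index are accumulated into a single row. On GPU this accumulation is often done with atomic adds in whatever order threads happen to run; because floating-point addition is not associative, a different order can yield a slightly different result. The reference below is deterministic simply because it fixes the accumulation order.

```python
def segment_sum(data, segment_ids, num_segments):
    """Deterministic reference segment sum: accumulate in input order."""
    out = [0.0] * num_segments
    for value, seg in zip(data, segment_ids):
        # A GPU kernel may perform this step with atomic adds, in an
        # order that can vary from run to run.
        out[seg] += value
    return out

# Gradients for three lookups; the first two hit the same embedding row.
grads = [0.5, 0.25, 0.125]
ids = [0, 0, 1]
print(segment_sum(grads, ids, 2))  # [0.75, 0.125]
```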
@duncanriach @wenscarl Thanks a lot for your quick and detailed answer!
Right, I would expect you to see a much smaller amount of noise accumulating when using float64, which seems to be what you witnessed.
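The float64 point can be illustrated with a tiny example: floating-point addition is not associative, so the accumulation order chosen by a parallel reduction changes the rounded result. A wider type shrinks the per-step rounding error but does not eliminate the order dependence.

```python
# Same three inputs, two accumulation orders, two different sums.
a, b, c = 1e16, 1.0, -1e16

left = (a + b) + c   # the small term is absorbed by the large one first
right = (a + c) + b  # the large terms cancel before the small one is added

print(left, right)   # 0.0 1.0
```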
Hi @duncanriach, do you know when the patch solving the non-determinism of embedding layers will be released? Thanks for your help!
I don't have an ETA for you, but I can tell you that we're actively working on it and that it's top priority. I'm hoping for a release in the next few weeks.
Hi @duncanriach, do you have any update? Thanks a lot!
Hi @ornellamarciano, thanks for checking in. From stock TensorFlow version 2.5 onwards, using TensorFlow's segment reduction ops on a GPU with the expectation of reproducibility (i.e. when TF_DETERMINISTIC_OPS is set) will raise an exception rather than silently produce nondeterministic results. @benbarsdell has developed deterministic GPU implementations of all of the TensorFlow segment reduction ops.
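For reference, the determinism expectation mentioned above is declared via an environment variable; a minimal sketch (the variable name comes from this thread; setting it before importing TensorFlow is the usual convention, since parts of the library read it during initialization):

```python
import os

# Request deterministic op implementations. Conventionally set before
# TensorFlow is imported so that all initialization code sees it.
os.environ["TF_DETERMINISTIC_OPS"] = "1"

# import tensorflow as tf  # imported afterwards in the real training script
```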
You might also want to try applying the unreleased patch from the cloned repo. See the discussion in issue 19. |
Prerequisites
I am using the latest TensorFlow NGC container, nvcr.io/nvidia/tensorflow:20.09-tf2-py3.
Issue
When running my deep-learning model on 1 GPU (Tesla T4) within the NGC 20.09 container, I don't get reproducible results across consecutive runs (the predictions differ), although all the seeds are set correctly. Running the same code on CPU gives reproducible results across consecutive runs.
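For context, "all the seeds" in a TF 2.x setup usually means something like the following sketch (the seed value and the commented framework calls are illustrative, not taken from the reporter's script):

```python
import os
import random

SEED = 42  # illustrative value

# Note: PYTHONHASHSEED only takes effect for interpreters started after
# it is set, so in practice it is exported before launching the script.
os.environ["PYTHONHASHSEED"] = str(SEED)

random.seed(SEED)             # Python's own RNG
# np.random.seed(SEED)        # NumPy, if used
# tf.random.set_seed(SEED)    # TensorFlow's global seed
```

As this issue shows, seeding alone is not sufficient on GPU: nondeterministic kernel implementations can still produce run-to-run differences.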
On a small dataset (138k samples), I managed to get reproducible results within the NGC container by setting the default float type to float64 (instead of float32).
But with a bigger dataset (2.6M samples), the predictions start to differ across runs after about 1 million samples.
The more I increase the dataset size, the larger the difference between the predictions of consecutive runs becomes.
My datasets are tfrecord files of hashed values, serialized in batches of 10,000 samples.
Here is the small_dataset (size 13M):
small_dataset
And here is the big_dataset (size 251M). Can you send me your email so I can share it with you via Google Drive, so that you can reproduce the issue?
You need to unzip them before running the script below.
Command lines:
Script
Could you help me understand why I don't get the exact same predictions across consecutive runs?