
inference.py fails on different complexes on CPU vs. GPU and with different batch sizes #229

Open
srilekha1993 opened this issue May 27, 2024 · 4 comments



srilekha1993 commented May 27, 2024

Hi,
I tried running inference.py on the 363 complexes listed in testset_csv.csv, but I am getting failed cases: 21 complexes fail on CPU (linalg.svd: The algorithm failed to converge because the input matrix contained non-finite values) and 16 complexes fail on GPU (CUDA out of memory).
In addition, 6hlb is not available in the test data.
Our aim is to compare CPU and GPU runtimes on a set of complexes that produces no failed cases.
So we reran on CPU after removing the 21 failing complexes, but got another 14 failed complexes; after removing those 14, a further 16 complexes failed.
On GPU, after removing the 16 failing complexes, the remaining complexes all ran successfully.
The experiments above used batch_size=10.

With batch_size=1, 29 complexes failed on GPU and 30 failed on CPU, with errors like the one below:
"Failed on ['6mjj'] linalg.svd: (Batch element 2): The algorithm failed to converge because the input matrix contained non-finite values."

Please let us know the reason behind this, as the failed cases differ between CPU and GPU and across batch sizes.
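Since every failure reported here traces back to torch.linalg.svd receiving non-finite values, a small diagnostic helper can show which batch elements carry NaN/Inf before the decomposition runs. This is a sketch, not DiffDock code; the helper names and the assumption that H is a batch of 3x3 Kabsch cross-covariance matrices are mine.

```python
import torch

def nonfinite_batch_elements(H: torch.Tensor) -> list:
    """Return indices of batch elements of H that contain NaN or Inf."""
    finite = torch.isfinite(H).reshape(H.shape[0], -1).all(dim=1)  # (B,) bool mask
    return torch.nonzero(~finite, as_tuple=False).flatten().tolist()

def safe_svd(H: torch.Tensor):
    """Run torch.linalg.svd, but report non-finite batch elements first."""
    bad = nonfinite_batch_elements(H)
    if bad:
        raise ValueError(f"Non-finite cross-covariance for batch elements {bad}")
    return torch.linalg.svd(H)

# Example: element 1 of this batch of 3x3 matrices contains a NaN.
H = torch.randn(3, 3, 3)
H[1, 0, 0] = float("nan")
print(nonfinite_batch_elements(H))  # -> [1]
```

Mapping the reported indices back to the complex names in the batch would show whether the non-finite values always come from the same ligands or only appear when certain complexes are batched together, which could also explain why the failing set changes with batch size.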

@srilekha1993 (Author)

@gcorso Could you please clarify the issue above?

jsilter (Collaborator) commented Jun 12, 2024

Could you try again with the most recent version?

@srilekha1993 (Author)

Sure.

srilekha1993 (Author) commented Jun 22, 2024

Hi @jsilter,
As you suggested, we ran the most recent version, DiffDock 1.1.2.
When executing inference.py on GPU we get 0 failed cases and 5 skipped complexes.

For CPU execution, however, we get a different number of failed cases on each run. The error is pasted below:
--- Logging error ---
Traceback (most recent call last):
  File "/home/hgx/omics/srilekhx/Diffdock_1.1.2_mod/inference.py", line 260, in main
    data_list, confidence = sampling(data_list=data_list, model=model,
  File "/home/hgx/omics/srilekhx/Diffdock_1.1.2_mod/utils/sampling.py", line 190, in sampling
    modify_conformer_batch(complex_graph_batch['ligand'].pos, complex_graph_batch, tr_perturb, rot_perturb,
  File "/home/hgx/omics/srilekhx/Diffdock_1.1.2_mod/utils/diffusion_utils.py", line 73, in modify_conformer_batch
    R, t = rigid_transform_Kabsch_3D_torch_batch(flexible_new_pos, rigid_new_pos)
  File "/home/hgx/omics/srilekhx/Diffdock_1.1.2_mod/utils/geometry.py", line 266, in rigid_transform_Kabsch_3D_torch_batch
    U, S, Vt = torch.linalg.svd(H)
torch._C._LinAlgError: linalg.svd: (Batch element 2): The algorithm failed to converge because the input matrix contained non-finite values.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/sudarsh2/miniconda3/envs/diff_L_new_rec/lib/python3.9/logging/__init__.py", line 1083, in emit
    msg = self.format(record)
  File "/data/sudarsh2/miniconda3/envs/diff_L_new_rec/lib/python3.9/logging/__init__.py", line 927, in format
    return fmt.format(record)
  File "/data/sudarsh2/miniconda3/envs/diff_L_new_rec/lib/python3.9/logging/__init__.py", line 663, in format
    record.message = record.getMessage()
  File "/data/sudarsh2/miniconda3/envs/diff_L_new_rec/lib/python3.9/logging/__init__.py", line 367, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
  File "/data/sudarsh2/miniconda3/envs/diff_L_new_rec/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/sudarsh2/miniconda3/envs/diff_L_new_rec/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/hgx/omics/srilekhx/Diffdock_1.1.2_mod/inference.py", line 318, in <module>
    main(_args)
  File "/home/hgx/omics/srilekhx/Diffdock_1.1.2_mod/inference.py", line 302, in main
    logger.warning("Failed on", orig_complex_graph["name"], e)
Message: 'Failed on'
Arguments: (['6seo'], _LinAlgError('linalg.svd: (Batch element 2): The algorithm failed to converge because the input matrix contained non-finite values.'))
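One thing that is clear from this traceback, independent of the SVD failure itself: the "--- Logging error ---" block is a secondary problem in how the failure is logged at inference.py line 302. Python's logging module treats extra positional arguments as %-format parameters for the message string, and "Failed on" contains no placeholders, hence the TypeError. A self-contained sketch (not a patch against the repository) that reproduces both the broken call and a placeholder-based one:

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("diffdock.inference")

# Stand-ins for the real values from the traceback above.
name, err = ["6seo"], RuntimeError("linalg.svd failed to converge")

# Reproduces the secondary error: logging treats the extra positional arguments
# as %-format args, but "Failed on" has no placeholders, so getMessage() raises
# "TypeError: not all arguments converted during string formatting".
logger.warning("Failed on", name, err)

# Placeholder-based formatting logs the same information cleanly.
logger.warning("Failed on %s: %s", name, err)
```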

The complexes that fail, when we run them individually, execute successfully most of the time, but sometimes they fail as well.

Can you please let us know the reason behind this variation in output?
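One possible factor, offered as an assumption rather than something confirmed in this thread: DiffDock's reverse-diffusion sampling starts from randomly drawn poses, so the numerical inputs to the Kabsch SVD differ from run to run unless the RNG state is fixed. A minimal sketch (the seed_everything helper is hypothetical, not part of DiffDock) for seeding before inference, to check whether the same complexes fail on every seeded run:

```python
import random

import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    """Pin every RNG that sampling might touch so repeated runs are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Prefer deterministic kernels where available; warn instead of erroring
    # for ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)

seed_everything(0)
# ...then run inference.py's sampling on the complexes that fail intermittently.
```

If the failures become reproducible under a fixed seed, the non-finite values are probably tied to specific complexes; if they still move around, the batching itself is a more likely trigger.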
