Error message when job killed for running out of slurm time is very misleading #5177

glennhickey · 2024-12-03T17:09:06Z

In my experience, if a job runs beyond the maximum time allowed for its Slurm partition, it gets killed (fine), but with a KeyboardInterrupt, which is super cryptic. It always makes me think I've somehow pressed Ctrl-C or something by accident. It's also confusing Cactus users on GitHub:

ComparativeGenomicsToolkit/cactus#1554

Would it be possible to patch Toil to give a more informative error for this?

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1685


adamnovak · 2024-12-09T21:56:13Z

This is set up here:

toil/src/toil/batchSystems/slurm.py

Lines 591 to 593 in 8e0d84a

    
           # container cleanup finally blocks can run. Ask for SIGINT so we
           # can get the default Python KeyboardInterrupt which third-party
           # code is likely to plan for. Make sure to send it to the batch

We could make it configurable or just change it to another signal, but we'd need to make sure that e.g. the WDL and CWL Docker container management code knows to trap that signal, clean up running containers, and then exit.
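To make the "trap that signal, clean up, and then exit" idea concrete, here is a minimal sketch in plain Python. It is not Toil code. It assumes the batch system were asked to send SIGUSR1 instead of SIGINT, and the names `SlurmTimeLimitExceeded`, `WARNING_SIGNAL`, and `run_with_cleanup` are all hypothetical, made up for illustration.

```python
import signal


class SlurmTimeLimitExceeded(Exception):
    """Hypothetical exception; not part of Toil or Slurm."""


# Assumption: the scheduler is configured to deliver this signal (rather than
# SIGINT) shortly before the job is killed for exceeding its time limit.
WARNING_SIGNAL = signal.SIGUSR1


def _raise_time_limit(signum, frame):
    # Turn the signal into an exception with a message that names the likely
    # cause, instead of the default KeyboardInterrupt that SIGINT produces.
    name = signal.Signals(signum).name
    raise SlurmTimeLimitExceeded(
        f"Received {name}: the job probably hit its Slurm time limit"
    )


def run_with_cleanup(work, cleanup):
    """Run work() with the handler installed; always call cleanup() afterwards."""
    # signal.signal() must be called from the main thread.
    previous = signal.signal(WARNING_SIGNAL, _raise_time_limit)
    try:
        return work()
    finally:
        # Stands in for stopping containers and similar teardown.
        cleanup()
        # Restore whatever handler was installed before.
        signal.signal(WARNING_SIGNAL, previous)
```

The catch is the one noted above: any third-party code that only plans for KeyboardInterrupt would not run its own cleanup for this exception unless it catches `Exception` or uses `finally` blocks, so every place that manages containers would need to be checked.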
