-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
utils: fix total job count to exclude restored jobs from workspace #64
utils: fix total job count to exclude restored jobs from workspace #64
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #64 +/- ##
======================================
Coverage 3.84% 3.84%
======================================
Files 6 6
Lines 182 182
======================================
Hits 7 7
Misses 175 175
|
@@ -17,7 +17,7 @@ def publish_workflow_start( | |||
workflow_uuid: str, publisher: WorkflowStatusPublisher, job: Job | |||
): | |||
"""Publish to MQ the start of the workflow.""" | |||
job_count = len([rule for rule in job.dag.rules if not rule.norun]) | |||
job_count = len([j for j in job.dag.needrun_jobs if not j.rule.norun]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at the source code of snakemake (source)
@property
def needrun_jobs(self):
"""Jobs that need to be executed."""
return filterfalse(self.finished, self._needrun)
It seems like needrun_jobs
returns the job that need to be run and that haven't finished yet. Maybe we should use _needrun
directly here, even though it's internal?
Otherwise what's happening is that we are sending the total multiple times, often with a different total number because some jobs have finished in the meantime. I have to say that it works even as it is now, but just because we have this check in r-w-controller (source)
if previous_status:
previous_total = previous_status.get("total") or 0
if status == "total":
if previous_total > 0:
continue
else:
job_progress["total"] = msg["progress"]["total"]
The first message brings the right total amount, so that is saved to the database and is never updated (at least, that's what I think it's happening).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch! It indeed appears that only the first received total is considered by REANA, probably for this exact reason.
I investigated a bit more and, if we choose to use the internal variables and not rely on this check from r-w-controller, it's probably better to use dag._finished | dag._needrun
instead of just dag._needrun
(as done by Snakemake to check the length of the DAG). This is because sometimes the update_needrun
method might be called (e.g. for dynamic workflows or whenever there is a change to the DAG topology) and the jobs that are already finished are removed from the _needrun
set.
c323835
to
1cf987b
Compare
Amend the total number of jobs in a workflow so that when Snakemake detects that one or more jobs can be restored from the workspace (same workflow caching) they are not considered in the total count, reflecting Snakemake's local behaviour. Closes reanahub#62.
1cf987b
to
bc9cfee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Amend the total number of jobs in a workflow so that when Snakemake
detects that one or more jobs can be restored from the workspace (same
workflow caching) they are not considered in the total count, reflecting
Snakemake's local behaviour.
Closes #62.
How to test
roofit
examplereana-client run -w roofit-snakemake -f ./reana-snakemake.yaml
. The total number of jobs should be 2/2 (as usual)reana-client restart -w roofit-snakemake
. The total number of jobs should be 0/0, because the workflow is entirely restored from the workspace.code/fitdata.C
(such as a comment) and upload it to the workspace withreana-client upload -w roofit-snakemake code/fitdata.C
reana-client restart -w roofit-snakemake
. The total number of jobs should be 1/1, because only thefitdata
step is executed (gendata
is restored from the "local cache").cms-h4l
examplereana-client run -w cms-h4l
. The total number of jobs should be 4/4 (as usual)reana-client restart -w cms-h4l
. The total number of jobs should be 0/0, because the workflow is entirely restored from the workspace.Higgs4L1file.root
(which is the input for themake_plot
step, generated byanalyze_mc
) withreana-client rm -w cms-h4l-snakemake.1.4 results/Higgs4L1file.root
and the final output plot:reana-client rm -w cms-h4l-snakemake.1.4 results/mass4l_combine_userlvl3.pdf
reana-client restart -w cms-h4l
. The total number of jobs should be 2/2, because theanalyze_mc
step should regenerate theHiggs4L1file.root
file and themake_plot
step should create the final output plot.