Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utils: fix total job count to exclude restored jobs from workspace #64

Merged
merged 1 commit into from
Sep 20, 2023

Conversation

giuseppe-steduto
Copy link
Member

@giuseppe-steduto giuseppe-steduto commented Sep 14, 2023

Amend the total number of jobs in a workflow so that when Snakemake
detects that one or more jobs can be restored from the workspace (same
workflow caching) they are not considered in the total count, reflecting
Snakemake's local behaviour.

Closes #62.

How to test

roofit example

  1. Run the workflow normally with reana-client run -w roofit-snakemake -f ./reana-snakemake.yaml. The total number of jobs should be 2/2 (as usual)
  2. Restart the workflow in the same workspace with reana-client restart -w roofit-snakemake. The total number of jobs should be 0/0, because the workflow is entirely restored from the workspace.
  3. Change something in code/fitdata.C (such as a comment) and upload it to the workspace with reana-client upload -w roofit-snakemake code/fitdata.C
  4. Restart the workflow in the same workspace with reana-client restart -w roofit-snakemake. The total number of jobs should be 1/1, because only the fitdata step is executed (gendata is restored from the "local cache").

cms-h4l example

  1. Run the workflow normally with reana-client run -w cms-h4l. The total number of jobs should be 4/4 (as usual)
  2. Restart the workflow in the same workspace with reana-client restart -w cms-h4l. The total number of jobs should be 0/0, because the workflow is entirely restored from the workspace.
  3. Remove the Higgs4L1file.root (which is the input for the make_plot step, generated by analyze_mc) with reana-client rm -w cms-h4l-snakemake.1.4 results/Higgs4L1file.root and the final output plot: reana-client rm -w cms-h4l-snakemake.1.4 results/mass4l_combine_userlvl3.pdf
  4. Restart the workflow in the same workspace with reana-client restart -w cms-h4l. The total number of jobs should be 2/2, because the analyze_mc step should regenerate the Higgs4L1file.root file and the make_plot step should create the final output plot.

@codecov
Copy link

codecov bot commented Sep 14, 2023

Codecov Report

Merging #64 (bc9cfee) into master (3a8f770) will not change coverage.
The diff coverage is 0.00%.

Additional details and impacted files

Impacted file tree graph

@@          Coverage Diff           @@
##           master     #64   +/-   ##
======================================
  Coverage    3.84%   3.84%           
======================================
  Files           6       6           
  Lines         182     182           
======================================
  Hits            7       7           
  Misses        175     175           
Files Changed Coverage Δ
reana_workflow_engine_snakemake/utils.py 0.00% <0.00%> (ø)

@@ -17,7 +17,7 @@ def publish_workflow_start(
workflow_uuid: str, publisher: WorkflowStatusPublisher, job: Job
):
"""Publish to MQ the start of the workflow."""
job_count = len([rule for rule in job.dag.rules if not rule.norun])
job_count = len([j for j in job.dag.needrun_jobs if not j.rule.norun])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at the source code of snakemake (source)

@property
def needrun_jobs(self):
    """Jobs that need to be executed."""
    return filterfalse(self.finished, self._needrun)

It seems like needrun_jobs returns the job that need to be run and that haven't finished yet. Maybe we should use _needrun directly here, even though it's internal?

Otherwise what's happening is that we are sending the total multiple times, often with a different total number because some jobs have finished in the meantime. I have to say that it works even as it is now, but just because we have this check in r-w-controller (source)

            if previous_status:
                previous_total = previous_status.get("total") or 0
            if status == "total":
                if previous_total > 0:
                    continue
                else:
                    job_progress["total"] = msg["progress"]["total"]

The first message brings the right total amount, so that is saved to the database and is never updated (at least, that's what I think it's happening).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! It indeed appears that only the first received total is considered by REANA, probably for this exact reason.

I investigated a bit more and, if we choose to use the internal variables and not rely on this check from r-w-controller, it's probably better to use dag._finished | dag._needrun instead of just dag._needrun (as done by Snakemake to check the length of the DAG). This is because sometimes the update_needrun method might be called (e.g. for dynamic workflows or whenever there is a change to the DAG topology) and the jobs that are already finished are removed from the _needrun set.

@giuseppe-steduto giuseppe-steduto force-pushed the fix-total-job-count branch 3 times, most recently from c323835 to 1cf987b Compare September 19, 2023 09:46
Amend the total number of jobs in a workflow so that when Snakemake
detects that one or more jobs can be restored from the workspace (same
workflow caching) they are not considered in the total count, reflecting
Snakemake's local behaviour.

Closes reanahub#62.
Copy link
Member

@mdonadoni mdonadoni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mdonadoni mdonadoni merged commit bc9cfee into reanahub:master Sep 20, 2023
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

improper progress indication for cached steps
2 participants