Race condition when creating run directory #3728

Open
stevenstetzler opened this issue Dec 28, 2024 · 2 comments
@stevenstetzler

Describe the bug
A race condition exists in the creation of a run directory for a Parsl workflow. If many workflows are started at once, it is possible for two or more workflows to assign themselves the same run directory when operating under the same runinfo prefix. This currently causes one of the workflows to crash.

Here is an example traceback when this occurs:

Traceback (most recent call last):
  File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_bps/g95c51951e8+10f5ce77f8/python/lsst/ctrl/bps/cli/cmd/commands.py", line 92, in submit
    submit_driver(*args, **kwargs)
  File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_bps/g95c51951e8+10f5ce77f8/python/lsst/ctrl/bps/drivers.py", line 463, in submit_driver
    workflow = submit(wms_workflow_config, wms_workflow, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/utils/g5476671ec8+b86e4b8053/python/lsst/utils/timer.py", line 300, in timeMethod_wrapper
    res = func(self, *args, **keyArgs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_bps/g95c51951e8+10f5ce77f8/python/lsst/ctrl/bps/submit.py", line 84, in submit
    workflow = wms_service.submit(wms_workflow, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/10000/stetzler/DEEP_processing/modules/ctrl_bps_parsl/python/lsst/ctrl/bps/parsl/service.py", line 76, in submit
    workflow.start()
  File "/scratch/10000/stetzler/DEEP_processing/modules/ctrl_bps_parsl/python/lsst/ctrl/bps/parsl/workflow.py", line 296, in start
    self.load_dfk()
  File "/scratch/10000/stetzler/DEEP_processing/modules/ctrl_bps_parsl/python/lsst/ctrl/bps/parsl/workflow.py", line 291, in load_dfk
    self.dfk = parsl.load(self.parsl_config)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/10000/stetzler/DEEP_processing/env/parsl/dataflow/dflow.py", line 1532, in load
    cls._dfk = DataFlowKernel(config)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
    retval = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/10000/stetzler/DEEP_processing/env/parsl/dataflow/dflow.py", line 93, in __init__
    self.run_dir = make_rundir(config.run_dir)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/10000/stetzler/DEEP_processing/env/parsl/dataflow/rundirs.py", line 34, in make_rundir
    os.makedirs(current_rundir)
  File "<frozen os>", line 225, in makedirs
FileExistsError: [Errno 17] File exists: 'runinfo/208'

The behavior can be traced to https://github.com/Parsl/parsl/blob/2024.12.23/parsl/dataflow/rundirs.py#L25-L34. In the window between

prev_rundirs = glob(os.path.join(path, "[0-9]*[0-9]"))

and

os.makedirs(current_rundir)

the filesystem state can change: another workflow can create the same numbered directory first, and the later os.makedirs call then fails with FileExistsError.
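
As a rough sketch (paraphrased from rundirs.py, not the exact source), the numbered-rundir scheme reads the directory listing and then acts on it, and anything can happen in between:

import os
from glob import glob

def make_rundir_racy(path: str) -> str:
    """Paraphrase of the numbered-rundir scheme: pick max(existing) + 1."""
    os.makedirs(path, exist_ok=True)

    # Step 1: read the current filesystem state.
    prev_rundirs = glob(os.path.join(path, "[0-9]*[0-9]"))
    next_num = max((int(os.path.basename(d)) for d in prev_rundirs), default=-1) + 1
    current_rundir = os.path.join(path, f"{next_num:03}")

    # Race window: a second workflow that globbed before this point computes
    # the same next_num and may win the makedirs below.

    # Step 2: act on the (possibly stale) state.
    os.makedirs(current_rundir)  # raises FileExistsError if we lost the race
    return current_rundir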

To Reproduce
Start many workflows at the same time.
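
A more targeted reproducer sketch (assuming the single-argument make_rundir(path) signature shown in the traceback above) is to call make_rundir from many processes at once:

from concurrent.futures import ProcessPoolExecutor

from parsl.dataflow.rundirs import make_rundir

def try_make(path: str) -> str:
    # Each worker races to claim the next numbered directory under `path`.
    try:
        return make_rundir(path)
    except FileExistsError as exc:
        return f"lost the race: {exc}"

if __name__ == "__main__":
    # With enough concurrent callers, at least one usually loses the race.
    with ProcessPoolExecutor(max_workers=16) as pool:
        for result in pool.map(try_make, ["runinfo"] * 16):
            print(result)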

Expected behavior
I expected all workflows to start.

@benclifford
Collaborator

I was experimenting with fixing that in #3409 because @QLeB encountered that too - it never rose back to the top of my stack to finish it off.

In comparison to your #3729: there's a maximum number of retries; there's an exponential backoff to reduce contention on retries; and it uses a while loop rather than recursion, which is less hungry on stack (although that doesn't matter so much if the number of retries is limited).
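
Not #3409 itself, just a sketch of the approach described above (bounded retries, randomized exponential backoff, a while loop instead of recursion); the retry count and delay constants here are illustrative, not the PR's values:

import os
import random
import time
from glob import glob

def make_rundir_with_retries(path: str, max_tries: int = 3) -> str:
    """Retry the glob-then-makedirs step with randomized exponential backoff."""
    os.makedirs(path, exist_ok=True)

    tries = 0
    while True:
        prev_rundirs = glob(os.path.join(path, "[0-9]*[0-9]"))
        next_num = max((int(os.path.basename(d)) for d in prev_rundirs), default=-1) + 1
        current_rundir = os.path.join(path, f"{next_num:03}")
        try:
            os.makedirs(current_rundir)
            return current_rundir
        except FileExistsError:
            tries += 1
            if tries >= max_tries:
                raise
            # Randomized exponential backoff to spread out contending workflows.
            time.sleep(random.uniform(0, 0.1 * 2 ** tries))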

@stevenstetzler
Author

Okay, #3409 seems like the better approach. I'll go ahead and close #3729.

I'm concerned about max_tries, since it implies workflows can still fail under the race condition if contention hasn't cleared after 3 retries. It seems like workflows should keep retrying until the filesystem itself raises a genuine error, but I can't think of what that error would be.

Also, to address your concern about using random in that PR: I looked at the distribution of delays produced by your exponential backoff, and it gives a very large spread in delays after 3 tries. My guess is that approach will work fine.
[attached plot: samples (distribution of backoff delays)]
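
For illustration, a quick way to eyeball that spread (the backoff formula here is a guess at the shape, not necessarily what #3409 uses):

import random

BASE = 0.1  # seconds; illustrative only

# Sample the delay a contending workflow would sleep on each attempt,
# assuming uniform jitter over an exponentially growing window.
for attempt in range(1, 6):
    delays = [random.uniform(0, BASE * 2 ** attempt) for _ in range(10_000)]
    print(f"attempt {attempt}: min={min(delays):.4f}s max={max(delays):.4f}s "
          f"mean={sum(delays) / len(delays):.4f}s")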
