Describe the bug
A race condition exists in the creation of a run directory for a Parsl workflow. If many workflows are started at once, it is possible for two or more workflows to assign themselves the same run directory when operating under the same runinfo prefix. This currently causes one of the workflows to crash.
Here is an example traceback when this occurs:
Traceback (most recent call last):
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_bps/g95c51951e8+10f5ce77f8/python/lsst/ctrl/bps/cli/cmd/commands.py", line 92, in submit
submit_driver(*args, **kwargs)
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_bps/g95c51951e8+10f5ce77f8/python/lsst/ctrl/bps/drivers.py", line 463, in submit_driver
workflow = submit(wms_workflow_config, wms_workflow, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/utils/g5476671ec8+b86e4b8053/python/lsst/utils/timer.py", line 300, in timeMethod_wrapper
res = func(self, *args, **keyArgs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_bps/g95c51951e8+10f5ce77f8/python/lsst/ctrl/bps/submit.py", line 84, in submit
workflow = wms_service.submit(wms_workflow, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/10000/stetzler/DEEP_processing/modules/ctrl_bps_parsl/python/lsst/ctrl/bps/parsl/service.py", line 76, in submit
workflow.start()
File "/scratch/10000/stetzler/DEEP_processing/modules/ctrl_bps_parsl/python/lsst/ctrl/bps/parsl/workflow.py", line 296, in start
self.load_dfk()
File "/scratch/10000/stetzler/DEEP_processing/modules/ctrl_bps_parsl/python/lsst/ctrl/bps/parsl/workflow.py", line 291, in load_dfk
self.dfk = parsl.load(self.parsl_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/scratch/10000/stetzler/DEEP_processing/env/parsl/dataflow/dflow.py", line 1532, in load
cls._dfk = DataFlowKernel(config)
^^^^^^^^^^^^^^^^^^^^^^
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/scratch/10000/stetzler/DEEP_processing/env/parsl/dataflow/dflow.py", line 93, in __init__
self.run_dir = make_rundir(config.run_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/10000/stetzler/DEEP_processing/env/parsl/dataflow/rundirs.py", line 34, in make_rundir
os.makedirs(current_rundir)
File "<frozen os>", line 225, in makedirs
FileExistsError: [Errno 17] File exists: 'runinfo/208'
The behavior can be traced to https://github.com/Parsl/parsl/blob/2024.12.23/parsl/dataflow/rundirs.py#L25-L34: the filesystem state can change in the window between scanning the existing run directories to determine the next run number and the os.makedirs call that creates the new directory, and that window is where the race condition occurs.
To Reproduce
Start many workflows at the same time.
Expected behavior
I expected all workflows to start.
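To make that window concrete, here is a minimal sketch of the pattern (a paraphrase for illustration, not the exact Parsl code): the next run directory number is derived from what is already on disk, and the directory is created in a separate step, so two processes that scan at the same moment pick the same number.

```python
import os
from glob import glob


def pick_and_make_rundir(path: str) -> str:
    """Sketch of the racy pattern: choose the next numbered run directory,
    then create it in a separate step."""
    os.makedirs(path, exist_ok=True)

    # Scan the existing numbered run directories to find the highest one.
    existing = sorted(
        int(os.path.basename(d)) for d in glob(os.path.join(path, "[0-9]*[0-9]"))
    )
    next_num = existing[-1] + 1 if existing else 0

    # Another process that scanned at the same time computes the same
    # next_num; the loser of this makedirs() call gets FileExistsError,
    # which is the crash shown in the traceback above.
    current_rundir = os.path.join(path, f"{next_num:03}")
    os.makedirs(current_rundir)
    return current_rundir
```

Launching several of these concurrently under the same path is enough to reproduce the FileExistsError.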
I was experimenting with fixing that in #3409 because @QLeB encountered it too - it never rose back to the top of my stack to finish it off.
In comparison to your #3729: there's a maximum number of retries; there's an exponential backoff to reduce contention on retries; and it uses a while loop rather than recursion, which is less hungry on the stack (although that matters less when the number of retries is limited).
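For readers following along, a rough sketch of that retry shape might look like the following (this is not the actual code from #3409 or #3729; the function name, the max_tries default, and the backoff constants are assumptions):

```python
import os
import random
import time
from glob import glob


def make_rundir_with_retries(path: str, max_tries: int = 3) -> str:
    """Sketch of a bounded retry loop: re-scan and re-attempt the makedirs()
    whenever another workflow wins the race, backing off between attempts."""
    os.makedirs(path, exist_ok=True)

    tries = 0
    while True:
        existing = sorted(
            int(os.path.basename(d)) for d in glob(os.path.join(path, "[0-9]*[0-9]"))
        )
        next_num = existing[-1] + 1 if existing else 0
        current_rundir = os.path.join(path, f"{next_num:03}")
        try:
            os.makedirs(current_rundir)
            return current_rundir
        except FileExistsError:
            tries += 1
            if tries >= max_tries:
                raise
            # Exponential backoff with random jitter spreads competing
            # workflows apart on the next attempt.
            time.sleep(random.uniform(0, 0.1 * 2**tries))
```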
Okay, #3409 seems like the better approach. I'll go ahead and close #3729.
I'm concerned about max_tries, since it means workflows can still fail under the race condition if contention hasn't eased after 3 retries. It seems like workflows should keep retrying until the filesystem itself raises some other error naturally, though I can't think of what that error would be.
Also, to address your concern about using random in that PR: I looked at the distribution of delays produced by your exponential backoff, and it gives a very large spread of delays after 3 tries. My guess is that approach will work fine.
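As a quick, self-contained way to eyeball that spread (assuming a full-jitter backoff of the form random.uniform(0, base * 2**attempt), which may not match the exact formula in #3409):

```python
import random
import statistics

base = 0.1  # hypothetical base delay in seconds
for attempt in range(1, 6):
    delays = [random.uniform(0, base * 2**attempt) for _ in range(10_000)]
    print(
        f"attempt {attempt}: min={min(delays):.4f}s "
        f"max={max(delays):.3f}s mean={statistics.mean(delays):.3f}s"
    )
```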