Describe the bug
A race condition exists in the creation of a run directory for a Parsl workflow. If many workflows are started at once, it is possible for two or more workflows to assign themselves the same run directory when operating under the same runinfo prefix. This currently causes one of the workflows to crash.
Here is an example traceback when this occurs:
Traceback (most recent call last):
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_bps/g95c51951e8+10f5ce77f8/python/lsst/ctrl/bps/cli/cmd/commands.py", line 92, in submit
submit_driver(*args, **kwargs)
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_bps/g95c51951e8+10f5ce77f8/python/lsst/ctrl/bps/drivers.py", line 463, in submit_driver
workflow = submit(wms_workflow_config, wms_workflow, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/utils/g5476671ec8+b86e4b8053/python/lsst/utils/timer.py", line 300, in timeMethod_wrapper
res = func(self, *args, **keyArgs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/share/eups/Linux64/ctrl_bps/g95c51951e8+10f5ce77f8/python/lsst/ctrl/bps/submit.py", line 84, in submit
workflow = wms_service.submit(wms_workflow, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/10000/stetzler/DEEP_processing/modules/ctrl_bps_parsl/python/lsst/ctrl/bps/parsl/service.py", line 76, in submit
workflow.start()
File "/scratch/10000/stetzler/DEEP_processing/modules/ctrl_bps_parsl/python/lsst/ctrl/bps/parsl/workflow.py", line 296, in start
self.load_dfk()
File "/scratch/10000/stetzler/DEEP_processing/modules/ctrl_bps_parsl/python/lsst/ctrl/bps/parsl/workflow.py", line 291, in load_dfk
self.dfk = parsl.load(self.parsl_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/scratch/10000/stetzler/DEEP_processing/env/parsl/dataflow/dflow.py", line 1532, in load
cls._dfk = DataFlowKernel(config)
^^^^^^^^^^^^^^^^^^^^^^
File "/work2/10000/stetzler/stampede3/opt_lsst/conda/envs/lsst-scipipe-8.0.0/lib/python3.11/site-packages/typeguard/__init__.py", line 1033, in wrapper
retval = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/scratch/10000/stetzler/DEEP_processing/env/parsl/dataflow/dflow.py", line 93, in __init__
self.run_dir = make_rundir(config.run_dir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/scratch/10000/stetzler/DEEP_processing/env/parsl/dataflow/rundirs.py", line 34, in make_rundir
os.makedirs(current_rundir)
File "<frozen os>", line 225, in makedirs
FileExistsError: [Errno 17] File exists: 'runinfo/208'
The behavior can be traced to https://github.com/Parsl/parsl/blob/2024.12.23/parsl/dataflow/rundirs.py#L25-L34: the filesystem state can change in the window between scanning the existing run directories to determine the next run number and the os.makedirs call that creates the new directory, and that window is where the race condition occurs.
To Reproduce
Start many workflows at the same time.
Expected behavior
I expected all workflows to start.
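To make that window concrete, here is a minimal sketch of the pattern (a paraphrase for illustration, not the exact Parsl code): the next run directory number is derived from what is already on disk, and the directory is created in a separate step, so two processes that scan at the same moment pick the same number.

```python
import os
from glob import glob


def pick_and_make_rundir(path: str) -> str:
    """Sketch of the racy pattern: choose the next numbered run directory,
    then create it in a separate step."""
    os.makedirs(path, exist_ok=True)

    # Scan the existing numbered run directories to find the highest one.
    existing = sorted(
        int(os.path.basename(d)) for d in glob(os.path.join(path, "[0-9]*[0-9]"))
    )
    next_num = existing[-1] + 1 if existing else 0

    # Another process that scanned at the same time computes the same
    # next_num; the loser of this makedirs() call gets FileExistsError,
    # which is the crash shown in the traceback above.
    current_rundir = os.path.join(path, f"{next_num:03}")
    os.makedirs(current_rundir)
    return current_rundir
```

Launching several of these concurrently under the same path is enough to reproduce the FileExistsError.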
I was experimenting with fixing that in #3409 because @QLeB encountered it too - it never rose back to the top of my stack to finish it off.
In comparison to your #3729: there's a maximum number of retries; there's an exponential backoff to reduce contention on retries; and it uses a while loop rather than recursion, which is less hungry on the stack (although that matters less when the number of retries is limited).
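For readers following along, a rough sketch of that retry shape might look like the following (this is not the actual code from #3409 or #3729; the function name, the max_tries default, and the backoff constants are assumptions):

```python
import os
import random
import time
from glob import glob


def make_rundir_with_retries(path: str, max_tries: int = 3) -> str:
    """Sketch of a bounded retry loop: re-scan and re-attempt the makedirs()
    whenever another workflow wins the race, backing off between attempts."""
    os.makedirs(path, exist_ok=True)

    tries = 0
    while True:
        existing = sorted(
            int(os.path.basename(d)) for d in glob(os.path.join(path, "[0-9]*[0-9]"))
        )
        next_num = existing[-1] + 1 if existing else 0
        current_rundir = os.path.join(path, f"{next_num:03}")
        try:
            os.makedirs(current_rundir)
            return current_rundir
        except FileExistsError:
            tries += 1
            if tries >= max_tries:
                raise
            # Exponential backoff with random jitter spreads competing
            # workflows apart on the next attempt.
            time.sleep(random.uniform(0, 0.1 * 2**tries))
```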
Okay, #3409 seems like the better approach. I'll go ahead and close #3729.
I'm concerned about max_tries, since it means workflows can still fail under the race condition if contention hasn't eased after 3 retries. It seems like workflows should keep retrying until the filesystem itself raises some other error naturally, though I can't think of what that error would be.
Also, to address your concern about using random in that PR: I looked at the distribution of delays produced by your exponential backoff, and it gives a very large spread of delays after 3 tries. My guess is that approach will work fine.
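As a quick, self-contained way to eyeball that spread (assuming a full-jitter backoff of the form random.uniform(0, base * 2**attempt), which may not match the exact formula in #3409):

```python
import random
import statistics

base = 0.1  # hypothetical base delay in seconds
for attempt in range(1, 6):
    delays = [random.uniform(0, base * 2**attempt) for _ in range(10_000)]
    print(
        f"attempt {attempt}: min={min(delays):.4f}s "
        f"max={max(delays):.3f}s mean={statistics.mean(delays):.3f}s"
    )
```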