Workflow assigned with TrustSitelists=True can fail at WorkQueueManager if files have not replicated #11501

Open
amaltaro opened this issue Mar 1, 2023 · 8 comments · May be fixed by #12212

@amaltaro
Contributor

amaltaro commented Mar 1, 2023

Impact of the bug
WMAgent

Describe the bug
When a workflow is assigned with TrustSitelists=True, data can be read from any Disk storage, and jobs don't necessarily need to run at the site associated with the RSE.
MSTransferor enforces one requirement, though: every single primary (and parent) input block needs to be available somewhere on Disk; otherwise, primary input data placement is required.

Here is a workflow/WQE that is failing to be acquired in vocms0255:
haozturk_Run2022C_ParkingDoubleElectronLowMass3_10Dec2022_230221_140043_3850

precisely this WQE (which will be removed once this workflow completes):
https://cmsweb.cern.ch/couchdb/workqueue/_design/WorkQueue/_rewrite/element/098bafb31af9d37863bad1286d43012b

and here is the error thrown in the WorkQueueManager component log [1].

Here is the current status of the replication rule for the relevant block [2].

How to reproduce it
Unsure. It could be that a rule got deleted and recreated after the workflow transitioned from assigned to staging.

Expected behavior
I don't think we can avoid this situation, given that data transfer problems can happen at any time.
However, we need to at least:

  • ensure that MSTransferor enforces input data to be FULLY available in at least one Disk endpoint (keep in mind that only T1 and T2 sites are used as valid input locations)
  • and we should implement an alert in WMAgent such that people get notified when this unusual situation happens.
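
The first point above can be sketched as a set intersection over per-file replica locations. This is only an illustration, not MSTransferor code: `rses_with_full_block` and `is_valid_disk_rse` are hypothetical names, the input is assumed to be a plain mapping of LFN to RSE names (however it is obtained), and the T1/T2 filter assumes that Tape endpoints carry a `_Tape` suffix.

```python
from typing import Dict, Iterable, Optional, Set


def is_valid_disk_rse(rse: str) -> bool:
    """Assumption: valid input locations are T1/T2 RSEs that are not Tape."""
    return rse.startswith(("T1_", "T2_")) and not rse.endswith("_Tape")


def rses_with_full_block(file_replicas: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the Disk RSEs that hold every file of the block.

    file_replicas maps each LFN to the RSEs where a replica is available.
    An empty result means no single Disk endpoint has the full block.
    """
    common: Optional[Set[str]] = None
    for rses in file_replicas.values():
        disk = {rse for rse in rses if is_valid_disk_rse(rse)}
        common = disk if common is None else common & disk
        if not common:
            # One file with no shared Disk location is enough to fail.
            return set()
    return common or set()
```

An empty return value would be the condition under which input data placement is still required, and also a natural trigger for the alert in the second point.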

Additional context and error message
[1]

2023-03-01 17:07:16,373:140522083251968:INFO:WMBSHelper:"haozturk_Run2022C_ParkingDoubleElectronLowMass3_10Dec2022_230221_140043_3850" Injecting block /ParkingDoubleElectronLowMass3/Run2022C-v1/RAW#fb8c28eb-9c46-439c-853a-21b00fd4309f (253 files) into wmbs.
2023-03-01 17:07:16,373:140522083251968:INFO:WMBSHelper:Adding files into WMBS for haozturk_Run2022C_ParkingDoubleElectronLowMass3_10Dec2022_230221_140043_3850 with PNNs: []
2023-03-01 17:07:16,455:140522083251968:INFO:WMBSHelper:Inserting 253 files in bulk into WMBS for haozturk_Run2022C_ParkingDoubleElectronLowMass3_10Dec2022_230221_140043_3850
2023-03-01 17:07:16,455:140522083251968:ERROR:File:File created in WMBS without locations!
File lfn: /store/data/Run2022C/ParkingDoubleElectronLowMass3/RAW/v1/000/357/440/00000/eb6367dc-cde6-4964-8ab3-2af9679e2557.root

2023-03-01 17:07:16,465:140522083251968:ERROR:WorkQueue:Failed to create subscription for haozturk_Run2022C_ParkingDoubleElectronLowMass3_10Dec2022_230221_140043_3850 with block name /ParkingDoubleElectronLowMass3/Run2022C-v1/RAW#fb8c28eb-9c46-439c-853a-21b00fd4309f
Error: File created in WMBS without locations!
File lfn: /store/data/Run2022C/ParkingDoubleElectronLowMass3/RAW/v1/000/357/440/00000/eb6367dc-cde6-4964-8ab3-2af9679e2557.root
Traceback (most recent call last):
  File "/data/srv/wmagent/v2.1.6.1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.6.1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 357, in getWork
    match['Subscription'] = self._wmbsPreparation(match,
  File "/data/srv/wmagent/v2.1.6.1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.6.1/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 458, in _wmbsPreparation
    sub, match['NumOfFilesAdded'] = wmbsHelper.createSubscriptionAndAddFiles(block=dbsBlock)
  File "/data/srv/wmagent/v2.1.6.1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.6.1/lib/python3.8/site-packages/WMCore/WorkQueue/WMBSHelper.py", line 423, in createSubscriptionAndAddFiles
    addedFiles = self.addFiles(block)
  File "/data/srv/wmagent/v2.1.6.1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.6.1/lib/python3.8/site-packages/WMCore/WorkQueue/WMBSHelper.py", line 463, in addFiles
    totalFiles = self.topLevelFileset.addFilesToWMBSInBulk(self.wmbsFilesToCreate,
  File "/data/srv/wmagent/v2.1.6.1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.6.1/lib/python3.8/site-packages/WMCore/WMBS/Fileset.py", line 246, in addFilesToWMBSInBulk
    files = addFilesToWMBSInBulk(self.id, workflowName, files,
  File "/data/srv/wmagent/v2.1.6.1/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.1.6.1/lib/python3.8/site-packages/WMCore/WMBS/File.py", line 558, in addFilesToWMBSInBulk
    raise RuntimeError(msg)
RuntimeError: File created in WMBS without locations!
File lfn: /store/data/Run2022C/ParkingDoubleElectronLowMass3/RAW/v1/000/357/440/00000/eb6367dc-cde6-4964-8ab3-2af9679e2557.root

[2]

amaltaro@lxplus8s11:~ $ rucio list-rules "cms:/ParkingDoubleElectronLowMass3/Run2022C-v1/RAW#fb8c28eb-9c46-439c-853a-21b00fd4309f"
ID                                ACCOUNT            SCOPE:NAME                                                                               STATE[OK/REPL/STUCK]    RSE_EXPRESSION                                                                                                                                                                                                                                                                                     COPIES    EXPIRES (UTC)    CREATED (UTC)

a1d876b4be1542f2ba72699ecd004431  wmcore_transferor  cms:/ParkingDoubleElectronLowMass3/Run2022C-v1/RAW#fb8c28eb-9c46-439c-853a-21b00fd4309f  REPLICATING[245/8/0]    T1_UK_RAL_Disk|T2_UK_SGrid_Bristol|T1_IT_CNAF_Disk|T2_HU_Budapest|T2_PL_Swierk|T2_CH_CSCS|T2_ES_CIEMAT|T2_BE_IIHE|T2_TW_NCHC|T2_IT_Bari|T2_BE_UCL|T2_DE_DESY|T2_IT_Legnaro|T1_DE_KIT_Disk|T2_RU_ITEP|T2_CH_CERN|T2_IT_Pisa|T2_UK_London_IC|T2_IT_Rome|T2_ES_IFCA|T2_EE_Estonia                     1                          2023-02-21 14:04:15
61fc37c07af248feb006b8e713fc6048  wmcore_transferor  cms:/ParkingDoubleElectronLowMass3/Run2022C-v1/RAW#fb8c28eb-9c46-439c-853a-21b00fd4309f  SUSPENDED[207/46/0]     T2_IT_Bari|T1_RU_JINR_Disk|T2_US_Nebraska|T2_HU_Budapest|T2_BE_UCL|T2_BE_IIHE|T2_US_MIT|T2_BR_UERJ|T2_IT_Legnaro|T2_US_Vanderbilt|T2_FR_GRIF_IRFU|T2_ES_CIEMAT|T1_UK_RAL_Disk|T2_UK_London_Brunel|T2_IT_Pisa|T2_US_Purdue|T1_DE_KIT_Disk|T2_EE_Estonia|T1_IT_CNAF_Disk|T2_US_Wisconsin|T2_ES_IFCA  1                          2023-01-21 08:58:33
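
For reference, the STATE column above packs three lock counts (OK/REPL/STUCK, per the header) into one field. Below is a minimal sketch of pulling them apart so that incomplete rules can be flagged. `RuleState` and `parse_rule_state` are hypothetical helper names, and the parser only assumes the `STATE[ok/repl/stuck]` text layout shown in the listing, not any Rucio client API.

```python
import re
from typing import NamedTuple


class RuleState(NamedTuple):
    state: str
    ok: int
    replicating: int
    stuck: int

    @property
    def pending(self) -> int:
        """Locks that are not yet OK (still replicating or stuck)."""
        return self.replicating + self.stuck


# Matches fields such as "REPLICATING[245/8/0]".
_STATE_RE = re.compile(r"^([A-Z_]+)\[(\d+)/(\d+)/(\d+)\]$")


def parse_rule_state(field: str) -> RuleState:
    """Split a STATE[OK/REPL/STUCK] field into its name and three counts."""
    match = _STATE_RE.match(field.strip())
    if match is None:
        raise ValueError(f"Unrecognized rule state field: {field!r}")
    state, ok, repl, stuck = match.groups()
    return RuleState(state, int(ok), int(repl), int(stuck))


if __name__ == "__main__":
    for field in ("REPLICATING[245/8/0]", "SUSPENDED[207/46/0]"):
        rule = parse_rule_state(field)
        print(rule.state, "pending:", rule.pending, "total:", rule.ok + rule.pending)
```

For the two rules above this gives 8 and 46 pending locks respectively, each out of 253, which matches the 253 files reported for the block in [1].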

@amaltaro
Contributor Author

amaltaro commented Mar 1, 2023

@jhonatanamado Jhonatan, once you have some spare cycles, would you mind looking into the rules above and trying to get those last files finally replicated?
@haozturk FYI

@jhonatanamado
Contributor

Hi Alan, I updated the rules with some priority. Let's see if that helps. But I'm puzzled about those rules: why were two rules replicating the same data created one month apart? Could it be the aborted campaign versus the new one?

@amaltaro
Contributor Author

amaltaro commented Mar 2, 2023

Thanks @jhonatanamado. Yes, there were two workflows processing this data. The first one was aborted at the beginning of February; the second one came in later in February, thus triggering the rule creation.

@jhonatanamado
Contributor

Hi @amaltaro, @haozturk, the rules are OK now. Is this issue still relevant? Or, even better, are all the rules from this new campaign in state OK?

@amaltaro
Contributor Author

amaltaro commented Mar 8, 2023

Thank you Jhonatan!
For now, I think we should keep this issue open as there is no real solution on how to deal with this from the WMCore perspective. We need to come up with a plan on how to avoid such failures in WorkQueueManager.

@haozturk

haozturk commented Mar 9, 2023

Hi @amaltaro, thanks for spotting this. In general, we will start processing many workflows with partial input in production, since staging is the bottleneck. We should therefore start focusing on how to achieve this in the optimal way.

Concerning this issue: I wasn't aware of this bug. As an exception this year, we kept all the 2022 data at CERN and ran jobs in Eurasia, counting on AAA [1]. That covers more than 40 workflows.

Our expectation is that this bug gets fixed and that jobs are not allowed to run unless the corresponding input is on disk. I know that this works when AAA is off. We should make it work when AAA is on, too. Can you please explain why it doesn't work when AAA is on? Roughly how hard is it to fix this bug?
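
To make the expectation concrete, one possible shape for such a guard is sketched below. It is an assumption about how this could look, not existing WMCore behavior: `split_by_location` is a hypothetical helper, and the input is assumed to be a mapping of LFN to the set of known locations at acquisition time.

```python
from typing import Dict, List, Set, Tuple


def split_by_location(files: Dict[str, Set[str]]) -> Tuple[List[str], List[str]]:
    """Separate files that have at least one known location from those with none.

    Returns (ready, missing): LFNs that could be injected now, and LFNs
    that would have to wait until a replica shows up on disk.
    """
    ready = sorted(lfn for lfn, locations in files.items() if locations)
    missing = sorted(lfn for lfn, locations in files.items() if not locations)
    return ready, missing
```

With something like this, the agent could either inject only the `ready` files or postpone the whole block while `missing` is non-empty; which of the two is preferable is exactly the open question here.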

Note that fixing this bug is key, as we'll start using partial pileup soon with secondary AAA on. As I understand it, we'll have thousands of failures there unless we fix this.

How to handle these workflows when they are stuck in running-open because files are missing due to staging problems is another story. We're developing tools to handle that in P&R.

Hope it was clear. Please let me know what you think.

[1] https://its.cern.ch/jira/browse/CMSTRANSF-477

@haozturk

haozturk commented Mar 9, 2023

Btw, Alan, can you please explain the impact of this failure? As I understand it, it doesn't lead to job failures. What harm does it really cause?

@amaltaro
Contributor Author

As I was about to report another workflow having this issue, I noticed I've never replied to your question, sorry.

This should be a fairly contained issue: it causes LQ -> WMBS work acquisition to fail for that specific workflow, but it should not block the rest of the work in the agent. The relevant code is around this try/except block:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/WorkQueue.py#L356-L369

Nonetheless, judging from the submit10 logs, the WorkQueueManager thread seems to stop processing the remaining "matches" in the queue...
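
The behavior one would want instead, where a failing element is recorded and the loop moves on to the next match, can be sketched as follows. This is a simplified illustration and not the code in WorkQueue.py: `prepare_matches` and the `prepare` callable are hypothetical stand-ins for the real WMBS preparation step, and `RequestName` is assumed as the key identifying the workflow.

```python
import logging
from typing import Any, Callable, Dict, List, Tuple

Match = Dict[str, Any]


def prepare_matches(
    matches: List[Match],
    prepare: Callable[[Match], Any],
) -> Tuple[List[Match], List[Tuple[Match, str]]]:
    """Run the preparation step for each match, isolating failures.

    A failure in one match is logged and collected, and processing
    continues with the remaining matches instead of stopping.
    """
    done: List[Match] = []
    failed: List[Tuple[Match, str]] = []
    for match in matches:
        try:
            match["Subscription"] = prepare(match)
        except Exception as ex:  # broad on purpose: any failure stays per-match
            logging.exception(
                "Failed to create subscription for %s", match.get("RequestName")
            )
            failed.append((match, str(ex)))
            continue
        done.append(match)
    return done, failed
```

The `failed` list is also where an alert could be hooked in, so that operators get notified about the affected workflows.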

For the record, here is a further log excerpt from WMAgent 2.3.7.4:

2024-12-20 19:08:48,587:140064165144320:INFO:WorkQueueBackend:Accepting workflow: cmsunified_task_GEN-Run3Summer23BPixNanoAODv13-00007__v1_T_241206_133910_3564, with prio: 110110, element id: 2445cac635e54454648b1c1d93bd7813, for site: T1_US_FNAL
2024-12-20 19:08:48,587:140064165144320:INFO:WorkQueueBackend:And 10 elements passed location and siteJobCounts restrictions for: http://cmsgwms-submit10.fnal.gov:5984
2024-12-20 19:08:48,588:140064165144320:INFO:WorkQueueBackend:Total of 10 elements passed location and siteJobCounts restrictions for: http://cmsgwms-submit10.fnal.gov:5984
2024-12-20 19:08:48,588:140064165144320:INFO:WorkQueue:Got 10 elements matching the constraints
2024-12-20 19:14:06,871:140064165144320:INFO:WorkQueue:Running WMBS preparation for cmsunified_task_GEN-Run3Summer23BPixNanoAODv13-00004__v1_T_241205_130555_814 with ParentQueueId 13871493ab56a6949b2ff7afe3ef88c0, with common location ['T2_CH_CERN', 'T1_US_FNAL_Disk', 'T3_CH_CERNBOX']
2024-12-20 19:14:06,875:140064165144320:INFO:Fileset:Fileset created: cmsunified_task_GEN-Run3Summer23BPixNanoAODv13-00004__v1_T_241205_130555_814-GEN-Run3Summer23BPixNanoAODv13-00004_0-/TTto2L2Nu-2Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer23BPixMiniAODv4-130X_mcRun3_2023_realistic_postBPix_v2-v2/MINIAODSIM
2024-12-20 19:14:06,878:140064165144320:INFO:Workflow:Workflow id 48875 created for cmsunified_task_GEN-Run3Summer23BPixNanoAODv13-00004__v1_T_241205_130555_814
2024-12-20 19:14:06,909:140064165144320:INFO:WMBSHelper:"cmsunified_task_GEN-Run3Summer23BPixNanoAODv13-00004__v1_T_241205_130555_814" Injecting block /TTto2L2Nu-2Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer23BPixMiniAODv4-130X_mcRun3_2023_realistic_postBPix_v2-v2/MINIAODSIM (1798 files) into wmbs.
2024-12-20 19:14:06,910:140064165144320:INFO:WMBSHelper:Adding files into WMBS for cmsunified_task_GEN-Run3Summer23BPixNanoAODv13-00004__v1_T_241205_130555_814 with PNNs: []
2024-12-20 19:14:07,169:140064165144320:INFO:WMBSHelper:Inserting 1798 files in bulk into WMBS for cmsunified_task_GEN-Run3Summer23BPixNanoAODv13-00004__v1_T_241205_130555_814
2024-12-20 19:14:07,169:140064165144320:ERROR:File:File created in WMBS without locations!
File lfn: /store/mc/Run3Summer23BPixMiniAODv4/TTto2L2Nu-2Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/MINIAODSIM/130X_mcRun3_2023_realistic_postBPix_v2-v2/70000/e11cfa18-634b-44c2-98a2-2074550075ca.root
2024-12-20 19:14:07,170:140064165144320:ERROR:WorkQueue:Failed to create subscription for cmsunified_task_GEN-Run3Summer23BPixNanoAODv13-00004__v1_T_241205_130555_814 with block name /TTto2L2Nu-2Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer23BPixMiniAODv4-130X_mcRun3_2023_realistic_postBPix_v2-v2/MINIAODSIM
Error: File created in WMBS without locations!
File lfn: /store/mc/Run3Summer23BPixMiniAODv4/TTto2L2Nu-2Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/MINIAODSIM/130X_mcRun3_2023_realistic_postBPix_v2-v2/70000/e11cfa18-634b-44c2-98a2-2074550075ca.root
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 357, in getWork
    match['Subscription'] = self._wmbsPreparation(match,
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 463, in _wmbsPreparation
    sub, match['NumOfFilesAdded'] = wmbsHelper.createSubscriptionAndAddFiles(block=dbsBlock)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WMBSHelper.py", line 422, in createSubscriptionAndAddFiles
    addedFiles = self.addFiles(block)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WMBSHelper.py", line 462, in addFiles
    totalFiles = self.topLevelFileset.addFilesToWMBSInBulk(self.wmbsFilesToCreate,
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMBS/Fileset.py", line 246, in addFilesToWMBSInBulk
    files = addFilesToWMBSInBulk(self.id, workflowName, files,
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMBS/File.py", line 558, in addFilesToWMBSInBulk
    raise RuntimeError(msg)
RuntimeError: File created in WMBS without locations!
File lfn: /store/mc/Run3Summer23BPixMiniAODv4/TTto2L2Nu-2Jets_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/MINIAODSIM/130X_mcRun3_2023_realistic_postBPix_v2-v2/70000/e11cfa18-634b-44c2-98a2-2074550075ca.root

@amaltaro amaltaro linked a pull request Dec 21, 2024 that will close this issue