Workflow assigned with TrustSitelists=True can fail at WorkQueueManager if files have not replicated #11501
Comments
@jhonatanamado Jhonatan, once you have some spare cycles, would you mind looking into the rules above and trying to get those last files finally replicated?
Hi Alan, I updated the rules with some priority. Let's see if that helps. But I'm puzzled about those rules: why were two rules (replicating the same data) created one month apart? Could it be the aborted vs. the new campaign?
Thanks @jhonatanamado, yes, there were two workflows processing this data. The first one was aborted at the beginning of February; the second one came in later in February, thus triggering the new rule creation.
Thank you Jhonatan!
Hi @amaltaro thanks for spotting this. In general, we will start the processing of many workflows with partial input in production, since staging is the bottleneck. In this regard, we should start focusing on how we can achieve this in the optimal way.

Concerning this issue: I didn't know about this bug. As an exception this year, we kept all the 2022 data at CERN and ran jobs in Eurasia by counting on AAA [1]. It's more than 40 workflows. Our expectation is to fix this bug and not allow jobs to run unless the corresponding input is on disk. I know that this works when AAA is off; we should make it work when AAA is on, too. Can you please explain why it doesn't work when AAA is on? And how hard is it to fix this bug, roughly?

Note that fixing this bug is key, as we'll start using partial pileup soon with secondary AAA on. There, we'll have thousands of failures unless we fix this, as I understand it. Also open: how to handle these workflows while they are stuck in ...

Hope it was clear. Please let me know what you think.
Btw, Alan, can you please explain the impact of this failure? As I understand it, it doesn't lead to job failures. What harm does it really cause?
As I was about to report another workflow having this issue, I noticed I never replied to your question, sorry. This should be a fairly contained issue: it will cause the LQ -> WMBS work acquisition to fail for that specific workflow, but it should not be a blocker for the rest of the work in the agent. The relevant code is around this try/except block:

Nonetheless, I do see a behavior where the WorkQueueManager thread seems to stop processing the remaining "matches" in the queue, judging from the submit10 logs. For the record, here is a further log from WMAgent 2.3.7.4:
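To illustrate the "contained failure" behavior described above, here is a minimal sketch, not the actual WMCore try/except block referenced in the comment, of per-element error handling during work acquisition; the names `matches` and `acquire_element` are hypothetical stand-ins for the real workqueue element list and the WMBS injection call:

```python
# Minimal sketch (hypothetical names, not the actual WMCore implementation)
# of per-element error handling during LQ -> WMBS work acquisition: a failure
# on one workqueue element is logged and skipped, so the remaining matches
# keep being processed instead of the whole thread stopping.
import logging

logger = logging.getLogger(__name__)

def acquire_matches(matches, acquire_element):
    """Try to acquire every matched element, collecting failures instead of raising."""
    acquired, failed = [], []
    for element in matches:
        try:
            acquire_element(element)   # may raise if input files/blocks are not replicated yet
            acquired.append(element)
        except Exception as exc:       # broad catch so one bad workflow stays isolated
            logger.error("Failed to acquire element %s: %s", element, exc)
            failed.append(element)
    return acquired, failed
```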
Impact of the bug
WMAgent
Describe the bug
When a workflow is assigned with TrustSitelists=True, it means that data can be read from any Disk storage and jobs don't necessarily need to run at the site associated with the RSE. There is one requirement enforced in MSTransferor though: every single primary (and parent) input block needs to be available on Disk somewhere, otherwise primary input data placement is required.
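As a rough illustration (not the actual MSTransferor code), the requirement described above boils down to the following check; the block names, RSE names and the way the replica map is obtained are all made up for this sketch:

```python
# Hypothetical sketch of the MSTransferor requirement described above:
# with TrustSitelists=True, every primary/parent input block must have at
# least one Disk replica somewhere, otherwise data placement is required.

def blocks_needing_placement(input_blocks, disk_replicas_by_block):
    """Return the blocks that have no Disk replica at all.

    disk_replicas_by_block maps block name -> list of Disk RSEs holding a
    copy; how that map is built (e.g. via Rucio) is out of scope here.
    """
    return [blk for blk in input_blocks if not disk_replicas_by_block.get(blk)]


# Example with made-up block and RSE names:
blocks = ["/PrimaryDS/Run2022C-v1/RAW#block1", "/PrimaryDS/Run2022C-v1/RAW#block2"]
replicas = {"/PrimaryDS/Run2022C-v1/RAW#block1": ["T1_US_FNAL_Disk"],
            "/PrimaryDS/Run2022C-v1/RAW#block2": []}
print(blocks_needing_placement(blocks, replicas))
# -> ['/PrimaryDS/Run2022C-v1/RAW#block2'], i.e. placement is needed for block2
```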
Here is a workflow/WQE that is failing to be acquired in vocms0255:
haozturk_Run2022C_ParkingDoubleElectronLowMass3_10Dec2022_230221_140043_3850
precisely this WQE (which will be removed once this workflow completes):
https://cmsweb.cern.ch/couchdb/workqueue/_design/WorkQueue/_rewrite/element/098bafb31af9d37863bad1286d43012b
and here is the error thrown in the WorkQueueManager component log [1].
Here is the current status of the replication rule for the relevant block [2].
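For reference, the rule and the block's Disk availability can be inspected with the Rucio Python client roughly as below; the rule ID, scope and block name are placeholders rather than the ones referenced in [2], and the exact fields returned may vary slightly between Rucio versions:

```python
# Sketch of how to inspect a replication rule and block replicas with the
# Rucio client; rule ID, scope and block name below are placeholders.
from rucio.client import Client

client = Client()

# Overall rule state plus per-lock counters (OK / replicating / stuck)
rule = client.get_replication_rule("0123456789abcdef0123456789abcdef")
print(rule["state"], rule["locks_ok_cnt"], rule["locks_replicating_cnt"], rule["locks_stuck_cnt"])

# Replicas of the block (a Rucio dataset in CMS), available vs. total files per RSE
for rep in client.list_dataset_replicas(scope="cms", name="/PrimaryDS/Run2022C-v1/RAW#block1"):
    print(rep["rse"], rep["available_length"], "/", rep["length"])
```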
How to reproduce it
Unsure; it could be that a rule got deleted and recreated after the workflow transitioned from the assigned to the staging status.
Expected behavior
I don't think we can avoid this situation, given that data transfer problems can happen at any time.
However, we need to at least:
Additional context and error message
[1]
[2]