Enhancement: handle infinite-requeuing of jobs with pod-stuck-terminating #599
Comments
If a complete solution from MCAD is not possible, one partial solution supporting external automation for handling this problem would be MCAD emitting a warning or error (and/or an AppWrapper state transition) when pods that were thought to have been terminated as part of the requeuing process are detected to still exist.
Good, we agree this is not an MCAD bug (per se, as you put it). The name of the pod itself is not what matters here, in fact; it is the label that attaches the pod to the AppWrapper that matters. MCAD queries all the nodes and gets all the pods matching the AppWrapper label. This IS controller-independent behavior and doesn't care how the generic items create the pods (or what names they give them, for that matter) as long as the pods have the proper labels.

I guess I want to understand exactly what it means to "rename" a job because, as I stated, names don't matter, labels do. I suppose what you are actually doing is resubmitting the AppWrapper with a different name, which causes the …. Ok, that is how it works, now what? I honestly doubt there is anything MCAD can do here, as it is working as it was designed: a terminating pod could be terminating because of preemption or because of something else. MCAD doesn't know why, so it treats all the cases the same and re-queues the job as many times as it needs (which can be controlled with the …).

Now that I think of it, there might actually be a very straightforward solution to this. When the job is requeued, we change the labels and append the number of times the job has been requeued to the label. This would guarantee that the new requeuing looks like a "renaming" of the AppWrapper without having to actually do that.
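For illustration only, a minimal sketch of that label idea under stated assumptions (the label key and helper function below are invented for the example, not MCAD's actual identifiers): pods created for a given dispatch carry a label value that embeds the requeue count, so stuck pods left over from an earlier dispatch no longer match the selector used to count `Running` members.

```go
// Sketch only: derive a per-dispatch label value that embeds the requeue count.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

// appWrapperLabelKey is an assumed key; the real key used by MCAD may differ.
const appWrapperLabelKey = "appwrapper.mcad.ibm.com/appwrapper-name"

// labelValueFor returns e.g. "train-job" on the first dispatch and
// "train-job-requeue-2" after the second requeue.
func labelValueFor(awName string, requeueCount int) string {
	if requeueCount == 0 {
		return awName
	}
	return fmt.Sprintf("%s-requeue-%d", awName, requeueCount)
}

func main() {
	// Selector used when listing pods for the current dispatch attempt; pods
	// from a previous dispatch (e.g. stuck Terminating) simply do not match it.
	sel := labels.SelectorFromSet(labels.Set{
		appWrapperLabelKey: labelValueFor("train-job", 2),
	})
	fmt.Println(sel.String()) // appwrapper.mcad.ibm.com/appwrapper-name=train-job-requeue-2
}
```

As the thread notes below, the trade-off is that this effectively mutates the labels of the wrapped resources on every requeue.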
Excellent distinction, and idea, thank you!
I'll work on this today and make a PR as soon as I can.
I don't think changing the spec of the object is a good solution. I guess we can open an ADR for this to be implemented.
If the pod is stuck in the terminating state, then I suspect it is waiting on some finalizer. A real fix could be that MCAD should not re-queue until the finalizer is satisfied, or should delete the object without the finalizer being honored. This could be a user intent represented by a flag where we do or do not honor the finalizer.
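As a rough sketch of that flag idea (not MCAD's implementation; the function name, flag, and patch below are assumptions): if a pod has a deletion timestamp and finalizers, either leave it alone until the finalizer's owner acts, or strip the finalizers so the API server can complete the deletion.

```go
// Sketch only: flag-controlled handling of a pod stuck Terminating on a finalizer.
package stuckpods

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// reconcileStuckPod decides what to do with a pod that has been deleted but
// has not gone away. honorFinalizers would be the user-intent flag.
func reconcileStuckPod(ctx context.Context, client kubernetes.Interface, pod *corev1.Pod, honorFinalizers bool) error {
	if pod.DeletionTimestamp == nil || len(pod.Finalizers) == 0 {
		return nil // not blocked on a finalizer
	}
	if honorFinalizers {
		// Honor the finalizer: do nothing (and do not re-dispatch) until
		// whoever owns the finalizer removes it.
		return nil
	}
	// User opted out of honoring finalizers: clear them so deletion can finish.
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{"finalizers": []string{}},
	})
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Pods(pod.Namespace).Patch(
		ctx, pod.Name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```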
They need this ASAP. The ADR process would take longer. |
We can try to move ADRs quickly. Here is the link to the ADR template; let's try to keep the ADR short, with details about the final design: https://github.com/project-codeflare/adr/blob/main/PCF-ADR-0000-template.md
We need to agree on this, so I'll hold off on the implementation until we have an agreement. @asm582 is right in that changing the …
After doing some testing locally with my own finalizers, simulating what happens when the node dies, I realized something. The problem is indeed the naming of the pods that are created. Let's take, for example, the training operator. Whenever a …
This shouldn't need an ADR, as the pods are just modified to remove the finalizer so they can be deleted. The AppWrapper stays the same. I guess if we want to disable this finalizer deletion, then the user can control that in the …. Opinions? @asm582 @MEllis-github
@MEllis-github not really. I tried deleting nodes in the cluster, but this doesn't do what you see. I imagine that, internally, minikube does the proper cleanup before deleting a node, so I need a different strategy. I just want a way to see this behavior in my local cluster so I can understand why the pod stays in infinite termination. It seems that it is not about a finalizer, so if that is the case, then what is it? Also, for the record, even though I haven't been able to reproduce the test yet, I know it is not a finalizer because calling …
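For reference, one way to see, for a given stuck pod, whether a finalizer is even in play versus a pod bound to a node whose kubelet can no longer confirm the deletion (sketch only; the namespace and pod name are placeholders):

```go
// Diagnostic sketch (not part of MCAD): inspect a pod that will not finish terminating.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pod, err := client.CoreV1().Pods("default").Get(context.TODO(), "pod-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("deletionTimestamp:", pod.DeletionTimestamp) // non-nil once deletion was requested
	fmt.Println("finalizers:       ", pod.Finalizers)        // non-empty => something must remove them first
	fmt.Println("node:             ", pod.Spec.NodeName)     // check whether this node is NotReady or gone
}
```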
Building on this idea and this concept, does MCAD internally represent and track the expectation that a pod deleted while "rewrapping" and requeuing an AppWrapper should be gone before the next dispatch attempt? If the requeuing cycle can be stopped in the scenario where expectations are not met, that may be a first step.
From the discussion, the consensus is to address this in MCAD by introducing support for monitoring resource deletion, i.e. MCAD will not only issue deletion of resources as it does now, but will also ensure the deletion is complete before the next attempt to dispatch. @tardieu please add any comments or clarifications you think would be helpful here. I just wanted to leave a quick status note.
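For illustration only, a hedged sketch of what "monitor resource deletion" could look like; this is not the agreed design, and the label key, function name, and polling parameters below are assumptions. The idea is simply to hold back the next dispatch attempt until no pods carrying the AppWrapper label remain.

```go
// Sketch only: gate re-dispatch on the disappearance of all pods that carry
// the AppWrapper label. The label key below is assumed, not MCAD's real key.
package requeue

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

const appWrapperLabelKey = "appwrapper.mcad.ibm.com/appwrapper-name" // assumed label key

// waitForPodsGone returns nil once every pod labeled for the AppWrapper has
// actually disappeared, or an error if that does not happen before timeout.
func waitForPodsGone(ctx context.Context, client kubernetes.Interface, namespace, awName string, timeout time.Duration) error {
	selector := fmt.Sprintf("%s=%s", appWrapperLabelKey, awName)
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
			if err != nil {
				return false, err
			}
			// Only allow the next dispatch attempt once nothing is left.
			return len(pods.Items) == 0, nil
		})
}
```

A pod stuck in `Terminating` would then surface as a timeout (and could be reported as an event or AppWrapper condition) instead of silently feeding the requeue loop.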
WHY

Scenario (observed in practice):

Initial state:

- `pod-1` is `Running` on node `x` and is 1 of the minimum number of members required, as set in the MCAD scheduling spec. The pod may be listed directly in the `genericitem` list or it may be embedded in a CRD such as a `PyTorchJob`. The important point is that the pod name does not change between MCAD's initial dispatch and subsequent dispatches.

Problematic observed sequence of transitions and states:

- Node `x` becomes unschedulable*.
- `pod-1` enters a `Terminating` state.
- `pod-1` remains in a `Terminating` state despite attempts to force-terminate the pod (automated within MCAD or manual with `... delete --force=true --grace-period=0 ...`). We have observed pods stuck like this for over 24hrs.
- Since MCAD observes that the number of `Running` pods is less than `minMembers`, it removes the group and requeues the AppWrapper. This is expected.
- When the AppWrapper is dispatched again, `pod-1` is still in a `Terminating` state.
- Because `pod-1` can still not be transitioned out of the `Terminating` state and is still considered a member of the group, MCAD removes the group and requeues the AppWrapper for the same reason as before (insufficiently many members are `Running`).
- This cycle continues for as long as the pod remains in the `Terminating` state (which, again, we have observed to be hours, even over a day) and the job fails to make progress.

The cost:
Existing workaround
Call for solution ideas:
- This is not an MCAD “bug” per se; it is an issue for any job/pod controller that keeps pod names constant between job restarts (of which there are multiple).
- It is an opportunity for improved job resilience from MCAD in a job-controller-independent manner.
- Note, we have observed a variety of reasons why a node might become unschedulable in practice, from unexpected hardware failures to intentional cordoning. The sought-for solution here is independent of the underlying cause of failure.
WHAT
What is being asked for?
Ideas/proposals for a MCAD-level (pod/job controller independent) solution.
HOW
Suggestions for how this may be solved. [Optional]
TESTS
List of related tests
The above description could be captured in even a single pod, single node test.
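One possible shape for that single-pod reproduction, assuming a dummy finalizer is an acceptable stand-in for a dead node (the names below are placeholders, not an agreed test): create a pod carrying a finalizer that is never removed, then delete it. The pod sits in `Terminating` indefinitely, which is the condition that feeds the requeue loop described above.

```go
// Sketch only: simulate a pod stuck in Terminating by giving it a finalizer
// that nothing ever removes.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.TODO()

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:       "stuck-terminating-repro",
			Finalizers: []string{"example.com/simulate-dead-node"}, // dummy finalizer, never removed
		},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{
				{Name: "sleep", Image: "busybox", Command: []string{"sleep", "3600"}},
			},
		},
	}
	if _, err := client.CoreV1().Pods("default").Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
	// Deleting the pod now leaves it Terminating indefinitely, because the
	// finalizer is never cleared; the requeue behavior can be observed from here.
	if err := client.CoreV1().Pods("default").Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("pod deleted; it should now sit in Terminating until the finalizer is removed")
}
```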
DONE
Bullet point items for what should be completed