Allow user to drop MV or sink during recovery #12999
We had a discussion before: #12203 (and PR #12317), and the temporary conclusion is that "safe mode" is sufficient. The only potential problem is that "safe mode" still requires the executors to be initialized correctly: if some executor triggers an OOM before awaiting the first barrier, then there's still no way to drop the job, as we rely on barriers to drop jobs.
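To make the barrier dependency concrete, here is a minimal, self-contained sketch (illustrative only; `Message`, `actor_loop`, and the `stop` flag are stand-ins, not RisingWave's real executor API): an actor only learns it should stop from a mutation carried on a barrier, so an actor that OOMs or blocks before awaiting its first barrier can never receive the stop signal.

```rust
// Illustrative sketch only; not RisingWave's real executor code.
use tokio::sync::mpsc;

enum Message {
    Chunk(Vec<i64>),
    // In the real system a barrier carries a `Mutation` (e.g. Stop);
    // here a plain bool stands in for it.
    Barrier { stop: bool },
}

async fn actor_loop(mut input: mpsc::Receiver<Message>) {
    while let Some(msg) = input.recv().await {
        match msg {
            Message::Chunk(data) => {
                // Process data. A bug here (e.g. unbounded buffering -> OOM)
                // can keep the loop from ever reaching the next barrier.
                let _ = data;
            }
            Message::Barrier { stop } => {
                if stop {
                    // The only exit point: a Stop mutation on a barrier.
                    // An actor that never awaits its first barrier can
                    // therefore never be dropped this way.
                    return;
                }
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(16);
    let actor = tokio::spawn(actor_loop(rx));
    tx.send(Message::Chunk(vec![1, 2, 3])).await.unwrap();
    tx.send(Message::Barrier { stop: true }).await.unwrap();
    actor.await.unwrap();
}
```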
This is true, but it would require implementing a new code path. I agree that this could be a beneficial improvement to consider in the future.
The "potential problem" actually sounds acceptable to me. Instead, my major motivation is to make this available to users by simply Is it possible to leverage the existent implementation to make |
I guess first we need to define what "during recovery" means. From our perspective, as long as it's not the case of "insufficient parallel units", recovery should be fast enough, as the first barrier should always be propagated instantly. So in most cases, the user only needs to wait briefly for recovery to finish before the drop can go through.

As a result, even if we had implemented the "metadata cleaning" approach, it might not be hit most of the time, since the state of "during recovery" does not last long. And if the recovery already "succeeds", another recovery might be necessary for the cleaning to take effect.

From the users' perspective, things are much simpler: they just want the job dropped as fast as possible, no matter what we do behind the scenes. Therefore, I guess the ultimate goal of this issue is to make dropping more responsive, especially when there is a performance issue or a failure.
Totally agree with you. And I believe the "metadata cleaning" approach may not be a good idea. So do you have any ideas about this question?
By "existent implementation" I mean:
By "to make drop materialized view work during recovery" I mean: automate these things above, so that user can simply run a |
So here is my naive idea 🤣 I am thinking: instead of returning the "cluster is under recovery" error to users, can we enhance this code branch along the following lines?
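A minimal, self-contained sketch of the idea — waiting for the in-flight recovery to finish instead of erroring out — assuming a watch-style flag that the recovery loop clears when it completes. All names here (`DdlService`, `recovering`, `drop_streaming_job`) are hypothetical, not the actual RisingWave code path.

```rust
use tokio::sync::watch;

struct DdlService {
    // `true` while the cluster is recovering; flipped to `false` by the
    // recovery loop once recovery completes.
    recovering: watch::Receiver<bool>,
}

impl DdlService {
    async fn drop_streaming_job(&self, job_id: u32) -> Result<(), String> {
        let mut recovering = self.recovering.clone();
        // Previously this branch would bail out with "cluster is under
        // recovery"; instead, wait until the recovery flag is cleared.
        while *recovering.borrow() {
            recovering
                .changed()
                .await
                .map_err(|_| "recovery signal sender dropped".to_string())?;
        }
        // Normal drop path: inject a barrier that stops the job's actors.
        println!("dropping job {job_id} via barrier");
        Ok(())
    }
}

#[tokio::main]
async fn main() -> Result<(), String> {
    let (tx, rx) = watch::channel(true); // start "in recovery"
    let svc = DdlService { recovering: rx };

    // Simulate recovery finishing after a short delay.
    tokio::spawn(async move {
        tokio::time::sleep(std::time::Duration::from_millis(100)).await;
        let _ = tx.send(false);
    });

    svc.drop_streaming_job(42).await
}
```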
See updated discussion here: #12203 (comment)
Is your feature request related to a problem? Please describe.
We have seen, a couple of times, that a user created a problematic MV, the cluster entered an endless crash loop, and the user could hardly drop the MV.
Our temporary solution is to let users:

1. Run `ALTER SYSTEM SET pause_on_next_bootstrap TO true`, so that the cluster starts paused after the next recovery and the problematic MV can be dropped.
2. Use `risectl` (the `ctl` tool) to resume the cluster afterwards.

However, most users don't know about this, and they feel very frustrated by the error message "cluster is under recovery".
Is it possible to allow users to drop an MV or sink during recovery?
Describe the solution you'd like
From my understanding, during cluster recovery we can just delete the metadata of the specified MV/sink without caring about its actors, because the actors will naturally disappear the next time scheduling happens.
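A minimal sketch of this "metadata cleaning" idea, with a toy in-memory `MetaStore` standing in for the real meta store (all names hypothetical): the drop removes the job from catalog and fragment metadata only, and relies on the next recovery to rebuild actors solely from the remaining metadata.

```rust
// Toy model of the idea; not RisingWave's meta-service API.
use std::collections::HashMap;

#[derive(Default)]
struct MetaStore {
    catalog: HashMap<u32, String>,     // job id -> definition
    fragments: HashMap<u32, Vec<u32>>, // job id -> actor ids
}

impl MetaStore {
    // Drop a job purely at the metadata level, skipping actor cleanup.
    fn drop_job_metadata_only(&mut self, job_id: u32) -> Result<(), String> {
        self.catalog
            .remove(&job_id)
            .ok_or_else(|| format!("job {job_id} not found"))?;
        // Deliberately no RPC to compute nodes: on the next recovery all
        // actors are torn down and rebuilt from `fragments`, which no
        // longer contains this job, so its actors never come back.
        self.fragments.remove(&job_id);
        Ok(())
    }
}

fn main() {
    let mut meta = MetaStore::default();
    meta.catalog.insert(1, "CREATE MATERIALIZED VIEW mv AS ...".into());
    meta.fragments.insert(1, vec![100, 101]);
    meta.drop_job_metadata_only(1).unwrap();
    assert!(meta.catalog.is_empty() && meta.fragments.is_empty());
}
```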
Describe alternatives you've considered
If the above approach is too hard to work out, we could instead document the "temporary solution" in the official docs, or even surface it in the error message.