Allow user to drop MV or sink during recovery #12999

Closed
fuyufjh opened this issue Oct 23, 2023 · 6 comments · Fixed by #12317

fuyufjh (Member) commented Oct 23, 2023

Is your feature request related to a problem? Please describe.

We have run into this a couple of times: a user creates a problematic MV, the cluster enters an endless crash loop, and the user can hardly drop the MV.

Our temporary solution is to have users do the following (a concrete command sketch follows the list):

  1. Set system parameter: ALTER SYSTEM SET pause_on_next_bootstrap to true
  2. Restart the meta service.
  3. Drop the relevant mviews.
  4. Restart again or use ctl to resume.
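
For reference, a sketch of what these steps look like in practice, assuming a SQL session against the frontend; problematic_mv is a placeholder name, and the exact ctl subcommand for resuming may differ by version:

```sql
-- 1. Make the next bootstrap start with the streaming graph paused.
ALTER SYSTEM SET pause_on_next_bootstrap TO true;

-- 2. Restart the meta service (done outside SQL, e.g. by restarting the meta node/pod).

-- 3. While the graph is paused, drop the problematic materialized view(s).
DROP MATERIALIZED VIEW problematic_mv;

-- 4. Restart again, or resume via ctl (outside SQL), e.g. something like:
--    risectl meta resume
```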

However, most users don't know about this, and they find the error message "cluster is under recovery" very frustrating.

Is it possible to allow users to drop an MV or sink during recovery?

Describe the solution you'd like

From my understanding, during cluster recovery we can simply delete the metadata of the specified MV/sink without caring about its actors, because the actors will naturally disappear the next time scheduling happens.

Describe alternatives you've considered

If the approach above is too hard to work out, we could also put the "temporary solution" into the official docs or even into the error message.

Additional context

No response

github-actions bot added this to the release-1.4 milestone Oct 23, 2023
BugenZhao (Member) commented Oct 26, 2023

We had a discussion before: #12203 (and PR #12317), and the tentative conclusion was that "safe mode" is sufficient. The only potential problem is that "safe mode" still requires the executor to be initialized correctly: if some executor triggers an OOM before awaiting the first barrier, there's still no way to drop it, as we rely on the barrier to drop a job.

we can simply delete the metadata of the specified MV/sink without caring about its actors, because the actors will naturally disappear the next time scheduling happens

This is true, but it would require implementing a new code path. I agree that this could be a beneficial improvement to consider in the future.

fuyufjh (Member, Author) commented Nov 14, 2023

...and the tentative conclusion was that "safe mode" is sufficient. The only potential problem is that "safe mode" still requires the executor to be initialized correctly: if some executor triggers an OOM before awaiting the first barrier, there's still no way to drop it, as we rely on the barrier to drop a job.

The "potential problem" actually sounds acceptable to me.

Instead, my main motivation is to make this available to users via a plain drop materialized view, instead of the secret ALTER SYSTEM SET pause_on_next_bootstrap TO true.

Is it possible to leverage the existing implementation to make drop materialized view work during recovery?

BugenZhao (Member) commented:

I guess we first need to define what "during recovery" means.

From our perspective, as long as we're not in the "insufficient parallel units" case, recovery should be fast enough, as the first barrier is always propagated almost instantly. So in most cases, DROP is actually called on an already-recovered pipeline. It's not that we don't allow users to drop an MV during recovery; rather, the Stop mutation is executed slowly on a problematic pipeline.

As a result, even if we implemented the "metadata cleaning" approach, it would rarely be hit, since the "during recovery" state does not last long. And if the recovery has already "succeeded", another recovery might be necessary for the cleaning to take effect.

From the users' perspective, things are much simpler: they just want the job dropped as fast as possible, no matter what we do behind the scenes. Therefore, I guess the ultimate goal of this issue is to make dropping more responsive, especially when there's a performance issue or a failure.

fuyufjh (Member, Author) commented Nov 15, 2023

From the users' perspective, things are much simpler: they just want the job dropped as fast as possible, no matter what we do behind the scenes. Therefore, I guess the ultimate goal of this issue is to make dropping more responsive, especially when there's a performance issue or a failure.

Totally agree with you. And I believe the "metadata cleaning" approach may not be a good idea.

So do you have any ideas about this question?

Is it possible to leverage the existing implementation to make drop materialized view work during recovery?

By "existent implementation" I mean:

  1. Set system parameter: ALTER SYSTEM SET pause_on_next_bootstrap to true
  2. Restart the meta service.
  3. Drop the relevant mviews.
  4. Restart again or use ctl to resume.

By "to make drop materialized view work during recovery" I mean: automate these things above, so that user can simply run a drop mv and we do these dark work for them.

fuyufjh (Member, Author) commented Nov 15, 2023

So here is my naive idea 🤣 I am thinking that, instead of returning the "cluster is under recovery" error to users, we could enhance this code branch as follows (a user-side sketch comes after the list):

  • Set a boolean flag pause_on_next_recovery = true
  • Once the cluster recovers again, after emitting the first barrier and before emitting any data events, perform the drop, i.e. emit the mutation barrier
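
For illustration, here is what this would look like from the user's side; my_bad_mv is a placeholder name, and the behaviour described in the comments is the proposal above, not what the system does today:

```sql
-- Proposed UX (sketch): during an endless crash/recovery loop, the user runs only
DROP MATERIALIZED VIEW my_bad_mv;
-- and, instead of failing with "cluster is under recovery", the meta service would
-- set pause_on_next_recovery = true, and on the next recovery, after the first
-- barrier and before any data events, emit the drop mutation for my_bad_mv and resume.
```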

BugenZhao (Member) commented:

See updated discussion here: #12203 (comment)
