temporal filter can emit a large number of messages downstream in one barrier, causing OOM #12715
I think that if a cluster is down for too long and then recovered, it will definitely hit the situation where the first barrier's data from the temporal filter is too large, which may lead to OOM. The only solution to this case is write anytime (that is, the solution using gap epoch mentioned before). We previously thought about using high-frequency barriers to achieve passive spill (#12393), but now it seems that write anytime is indeed needed.
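For intuition, here is a minimal sketch of the write-anytime / spill-within-epoch idea: buffer changes as usual, but flush them to the state store whenever a size threshold is crossed instead of only at the barrier. All names here (SpillBuffer, StateStoreHandle, Change) are hypothetical illustrations, not RisingWave's actual executor or state store APIs.

```rust
/// Hypothetical change record; not RisingWave's actual row representation.
struct Change {
    key: Vec<u8>,
    value: Vec<u8>,
}

/// Stand-in for a state store writer (illustrative only).
struct StateStoreHandle;

impl StateStoreHandle {
    fn write(&mut self, _change: Change) {
        // A real implementation would stage the change into shared storage
        // (e.g. an immutable in-memory buffer or an SST builder).
    }
}

/// Buffers changes within an epoch and spills once a byte threshold is hit,
/// so memory stays bounded even if one barrier carries a huge set of changes.
struct SpillBuffer {
    pending: Vec<Change>,
    pending_bytes: usize,
    spill_threshold_bytes: usize,
}

impl SpillBuffer {
    fn push(&mut self, change: Change, store: &mut StateStoreHandle) {
        self.pending_bytes += change.key.len() + change.value.len();
        self.pending.push(change);
        // Key difference from "write only at barrier": any prefix of the
        // epoch's changes may be persisted early ("write anytime").
        if self.pending_bytes >= self.spill_threshold_bytes {
            self.flush(store);
        }
    }

    fn flush(&mut self, store: &mut StateStoreHandle) {
        for change in self.pending.drain(..) {
            store.write(change);
        }
        self.pending_bytes = 0;
    }
}
```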
I will implement spill anytime later to see whether this issue can be resolved.
However, you are correct that this "big epoch" will eventually occur once everything starts running.
What's the current workaround for this? Is it just to recreate the MV?
Actually there is no good workaround other than spilling within an epoch. This OOM issue occurs in the mirror cluster rather than the real cluster, and I'm working on the spill to solve this issue.
When @zwang28 was trying to repro a performance issue by backing up and restoring an existing cluster, we accidentally hit an OOM case caused by the temporal filter emitting excessive changes downstream in one barrier. Here is what happened:
- The restored cluster C2 is bootstrapped with pause_barrier_on_next_bootstrap, meaning that the source will only emit barriers, not data.
- @zwang28 manually changed the code and confirmed that the temporal filter yields row_count 831960 and rows_total_size 317640446 in one barrier.

We think that the reason C1 (the always-running cluster) doesn't OOM while C2 does is that the deletions caused by the temporal filter are amortized over time across barriers if the cluster is always up, whereas they all accumulate into one barrier if the cluster has been paused for some time.
This is a potential risk and a motivation to implement spill within one barrier.
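To make the amortization argument concrete, here is a back-of-envelope sketch. The expiry rate below is an assumed number purely for illustration; only the 831960 rows / ~317 MB figures come from this issue.

```rust
fn main() {
    // Assumed steady rate at which rows fall out of the temporal filter's
    // window; pick any value, the shape of the argument is the same.
    let expiry_rate_rows_per_sec: f64 = 100.0;
    // Average row size implied by the reported figures:
    // 317_640_446 bytes / 831_960 rows ~= 382 bytes per row.
    let avg_row_bytes: f64 = 317_640_446.0 / 831_960.0;

    // While the cluster is up, barriers arrive roughly every second, so each
    // barrier carries only about one second's worth of deletions.
    println!(
        "running: ~{:.0} delete rows (~{:.0} KB) per barrier",
        expiry_rate_rows_per_sec,
        expiry_rate_rows_per_sec * avg_row_bytes / 1e3
    );

    // While the cluster is paused, expirations keep accruing logically; on
    // recovery they are all emitted in the first barrier.
    let rows_in_first_barrier: f64 = 831_960.0;
    println!(
        "after ~{:.1} h pause: ~{:.0} delete rows (~{:.0} MB) in one barrier",
        rows_in_first_barrier / expiry_rate_rows_per_sec / 3600.0,
        rows_in_first_barrier,
        rows_in_first_barrier * avg_row_bytes / 1e6
    );
}
```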