temporal filter can emit a large number of messages to downstream in one barrier causing OOM #12715

Closed
hzxa21 opened this issue Oct 10, 2023 · 4 comments
Labels
type/bug Something isn't working

Comments

@hzxa21
Collaborator

hzxa21 commented Oct 10, 2023

When @zwang28 is trying to repro an performance issue through backup and restore an existing cluster, we accidentally meet a OOM case caused by temporal filter emitting excessive changes to downstream in one barrier. Here are what happened:

  1. Cluster C1 has an MV with a temporal filter doing 1-day retention:
CREATE MATERIALIZED VIEW C1 AS (SELECT * FROM source_1 WHERE proc_time > now() - INTERVAL '1 day')
  2. We back up cluster C1 at T and restore it into a new cluster C2 at T+24hr.
  3. C2 is started with pause_barrier_on_next_bootstrap, meaning that the source only emits barriers, not data.
  4. C2 OOMed constantly, and the only changes flowing in the graph are the ones emitted by the temporal filter.

@zwang28 manually changed the code and confirmed that the temporal filter yields row_count 831960, rows_total_size 317640446 in one barrier.

We think the reason C1 doesn't OOM but C2 does is that the deletions caused by the temporal filter are amortized across barriers over time if the cluster is always up, whereas they all accumulate into one barrier if the cluster has been paused for some time.

This is a potential risk and a motivation to spill within one barrier.
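For a rough sense of scale, here is a back-of-the-envelope sketch using the numbers reported above; the one-barrier-per-second cadence is an assumption, not a measured value:

```rust
fn main() {
    // Numbers reported above for the restored cluster C2.
    let rows: u64 = 831_960;
    let bytes: u64 = 317_640_446;

    // Assumption: roughly one barrier per second while the cluster is up.
    let barriers_per_day: u64 = 24 * 60 * 60;

    // Cluster C1 (always up): the day's deletions are spread across many barriers.
    println!(
        "amortized: ~{} rows (~{} bytes) per barrier",
        rows / barriers_per_day,
        bytes / barriers_per_day
    );

    // Cluster C2 (restored after 24h): the same deletions land in a single barrier.
    println!(
        "accumulated: {} rows (~{} MiB) in one barrier",
        rows,
        bytes / (1024 * 1024)
    );
}
```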

@github-actions github-actions bot added this to the release-1.4 milestone Oct 10, 2023
@wcy-fdu
Contributor

wcy-fdu commented Oct 10, 2023

I think that if a cluster is down for too long and then recovered, it will definitely hit the situation where the temporal filter's data in the first barrier is too large, which may lead to OOM. The only solution to this case is write-anytime (that is, the solution using gap epochs mentioned before).

We previously thought about using high-frequency barriers to achieve passive spill (#12393), but now it seems that write-anytime is indeed needed, which can:

  • Solve OOM caused by join amplification
  • Solve OOM caused by the data in one epoch of the temporal filter being too large

I will implement spill anytime later to see whether this issue can be resolved.
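To make the idea concrete, here is a minimal sketch of spilling within one barrier: cap how much output is buffered between barriers and release chunks early once the cap is exceeded. All names and thresholds here are illustrative, not the actual RisingWave API.

```rust
use std::collections::VecDeque;

/// Illustrative only: a bounded output buffer that releases chunks early
/// instead of holding everything produced within one barrier in memory.
struct BoundedOutputBuffer {
    chunks: VecDeque<Vec<u8>>, // pretend each Vec<u8> is an encoded stream chunk
    buffered_bytes: usize,
    spill_threshold_bytes: usize,
}

impl BoundedOutputBuffer {
    fn new(spill_threshold_bytes: usize) -> Self {
        Self {
            chunks: VecDeque::new(),
            buffered_bytes: 0,
            spill_threshold_bytes,
        }
    }

    /// Buffer a chunk; once the buffer grows past the threshold, drain it
    /// immediately instead of waiting for the next barrier.
    fn push(&mut self, chunk: Vec<u8>, emit: &mut impl FnMut(Vec<u8>)) {
        self.buffered_bytes += chunk.len();
        self.chunks.push_back(chunk);
        if self.buffered_bytes > self.spill_threshold_bytes {
            self.flush(emit);
        }
    }

    /// Called on every barrier (and on spill) to release buffered chunks.
    fn flush(&mut self, emit: &mut impl FnMut(Vec<u8>)) {
        while let Some(chunk) = self.chunks.pop_front() {
            self.buffered_bytes -= chunk.len();
            emit(chunk);
        }
    }
}

fn main() {
    let mut emitted = 0usize;
    let mut buf = BoundedOutputBuffer::new(64);
    // Simulate one "big epoch": many deletions arrive before the next barrier,
    // but the buffer never holds much more than the 64-byte threshold.
    for _ in 0..100 {
        buf.push(vec![0u8; 16], &mut |c| emitted += c.len());
    }
    buf.flush(&mut |c| emitted += c.len()); // barrier arrives
    println!("emitted {} bytes without buffering them all at once", emitted);
}
```

The write-anytime approach discussed above would presumably flush such changes to the state store via gap epochs rather than straight to downstream; the sketch only shows the memory-bounding idea.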

@BugenZhao
Member

> 3. C2 is started with pause_barrier_on_next_bootstrap, meaning that the source only emits barriers, not data.

It turns out the Now executor was not paused correctly, causing the DynamicFilter executor to see the inner-side value keep increasing. #12716 is a fix for this.

However, you are correct that this "big epoch" will eventually occur once everything starts running.
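For readers unfamiliar with the pause mechanism, here is a conceptual sketch (not the actual Now executor code; the types are made up for illustration) of the intended behavior: while paused, the executor forwards barriers but emits no data, so downstream operators like DynamicFilter never see now() advance.

```rust
/// Conceptual sketch only; message and executor types do not mirror the
/// real RisingWave interfaces.
#[derive(Debug)]
enum Message {
    Data(&'static str),
    Barrier,
}

struct NowLikeExecutor {
    paused: bool,
}

impl NowLikeExecutor {
    /// Handle one barrier, which may carry a pause or resume mutation.
    fn on_barrier(&mut self, pause: bool, resume: bool) -> Vec<Message> {
        if pause {
            self.paused = true;
        }
        if resume {
            self.paused = false;
        }
        let mut out = Vec::new();
        // Only advance and emit the new `now()` value when not paused.
        if !self.paused {
            out.push(Message::Data("now() advanced"));
        }
        out.push(Message::Barrier);
        out
    }
}

fn main() {
    let mut exec = NowLikeExecutor { paused: false };
    println!("{:?}", exec.on_barrier(true, false));  // pause barrier: barrier only
    println!("{:?}", exec.on_barrier(false, false)); // still paused: barrier only
    println!("{:?}", exec.on_barrier(false, true));  // resume barrier: data + barrier
}
```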

@kwannoel
Contributor

What's the current workaround for this? Is it just to recreate the MV?

@wcy-fdu
Contributor

wcy-fdu commented Oct 12, 2023

Actually, there is no good workaround other than spilling within an epoch. This OOM issue occurs in the mirror cluster rather than the real cluster, and I'm working on spilling to solve it.

@fuyufjh fuyufjh removed this from the release-1.4 milestone Nov 8, 2023
@fuyufjh fuyufjh added type/bug Something isn't working and removed type/feature labels Nov 8, 2023
@wcy-fdu wcy-fdu closed this as completed Nov 29, 2023