
perf(stream): concurrently fetch row from storage and refill cache #19629

Draft · wants to merge 28 commits into base: 12-13-add_bench_with_join_type_and_cache_workload
Conversation

@kwannoel (Contributor) commented Dec 1, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Currently we read all rows matching a key, encode them, and store them in an in-memory data structure, which we add to the cache. Then we iterate over each row in that data structure and decode it. This means we incur the IO latency of reading all rows up front, and we also incur the memory cost of buffering every matching record.

Instead, we can iterate over the rows and concurrently refill the cache with them. Then we don't have to wait for IO to finish for ALL rows. We also avoid OOM: while refilling the cache, if we notice the number of rows exceeds some threshold, we can stop refilling.
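To make the change concrete, here's a minimal sketch of the two strategies, assuming a hypothetical `Row` type, a `process_match` stand-in for join handling, and a `Vec` standing in for the per-key cache entry (not the actual executor code):

```rust
use futures::{Stream, StreamExt};

const CACHE_REFILL_THRESHOLD: usize = 30_000; // hypothetical per-key row limit

#[derive(Clone)]
struct Row(Vec<u8>);

fn process_match(_row: Row) {
    // join the probe-side row against this matched build-side row
}

// Old strategy: buffer ALL matching rows, then add them to the cache.
// Full IO latency and full memory cost are paid before any match is handled.
async fn refill_then_process(rows: impl Stream<Item = Row>, cache: &mut Vec<Row>) {
    let all: Vec<Row> = rows.collect().await;
    cache.extend(all.iter().cloned());
    for row in all {
        process_match(row);
    }
}

// New strategy: handle each row as it arrives from storage, refilling the
// cache on the side and abandoning the refill once the threshold is exceeded.
async fn process_while_refilling(
    mut rows: impl Stream<Item = Row> + Unpin,
    cache: &mut Vec<Row>,
) {
    let mut seen = 0usize;
    while let Some(row) = rows.next().await {
        seen += 1;
        if seen <= CACHE_REFILL_THRESHOLD {
            cache.push(row.clone());
        } else {
            cache.clear(); // too many rows: leave this key uncached to bound memory
        }
        process_match(row); // no waiting for the remaining rows' IO
    }
}
```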

Further, if we tolerate inconsistency, we revert to the old strategy of reading all records into the cache first, so that logic is preserved.

This is done for all join types in the hash join executor.

Handling degree table matches and updates

Before this PR we:

  1. Read from the match side degree table when doing cache refill (holding an immutable reference).
  2. Updated the match side degree table when doing cache refill (requiring a mutable reference).

However, in this PR we refill the cache and handle matches concurrently, so 1 and 2 happen concurrently: we would hold an immutable and a mutable reference to the underlying match side degree table at the same time. The borrow checker rejects this.

To solve this, we first read from the match side degree table, keeping all the degrees in memory. Only after that is complete do we start concurrently refilling the cache for the match side and updating the match side degrees. This way, the lifetimes of the mutable and immutable references to the match side degree table never overlap, as sketched below.
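A minimal sketch of the borrow problem and the workaround, using assumed types rather than the real `StateTable` API:

```rust
// Stand-in for the match side degree table; not the real StateTable API.
struct DegreeTable {
    degrees: Vec<u64>,
}

impl DegreeTable {
    fn iter_degrees(&self) -> impl Iterator<Item = u64> + '_ {
        self.degrees.iter().copied()
    }
    fn update_degree(&mut self, idx: usize, delta: u64) {
        self.degrees[idx] += delta;
    }
}

fn handle_matches(table: &mut DegreeTable) {
    // Rejected by the borrow checker: the iterator holds `&table` for the
    // whole loop while `update_degree` needs `&mut table`:
    //
    // for (idx, _degree) in table.iter_degrees().enumerate() {
    //     table.update_degree(idx, 1); // error[E0502]
    // }

    // Step 1: buffer the degrees; the immutable borrow ends at this statement.
    let degrees: Vec<u64> = table.iter_degrees().collect();

    // Step 2: only the mutable borrow is live now, so degrees can be updated
    // while (in the real executor) the cache refill runs concurrently.
    for (idx, _degree) in degrees.iter().enumerate() {
        table.update_degree(idx, 1);
    }
}
```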

Benchmark results

With a 30,000 record limit for each key, here's the comparison (using in-memory statestore, inner join):

| amplification | optimized peak memory (bytes) | unoptimized peak memory (bytes) | optimized runtime (ms) | unoptimized runtime (ms) |
|---------------|-------------------------------|---------------------------------|------------------------|--------------------------|
| 10K           | 1,687,682                     | 1,696,824                       | 7.9825 (-13.791%)      | 10.188                   |
| 20K           | 3,260,364                     | 3,279,096                       | 16.432 (-4.3679%)      | 17.146                   |
| 30K           | 4,832,972                     | 4,861,368                       | 25.263 (-4.0490%)      | 26.360                   |
| 40K           | 4,863,221                     | 6,444,552                       | 29.717 (-16.242%)      | 35.388                   |
| 100K          | 4,863,221                     | 15,942,456                      | 56.554 (-38.688%)      | 92.481                   |
| 200K          | 4,863,221                     | 31,769,735                      | 99.493 (-47.094%)      | 188.06                   |
| 400K          | 4,863,221                     | 63,427,368                      | 187.22 (-51.652%)      | 387.23                   |

You can observe that once amplification exceeds the 30,000-row threshold, we stop refilling the cache, so memory usage is capped and we avoid OOM.

Do also note that because this uses an in-memory statestore, we see significant runtime improvements. I expect similar improvements with the block cache enabled, though perhaps not as pronounced.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features; see Sqlsmith: SQL feature generation #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@kwannoel (Contributor, Author) commented Dec 1, 2024

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

@kwannoel kwannoel changed the title refactor perf(stream): concurrently fetch row from storage and refill cache Dec 1, 2024
@kwannoel kwannoel marked this pull request as ready for review December 1, 2024 07:58
@kwannoel kwannoel marked this pull request as draft December 1, 2024 07:59
@kwannoel kwannoel marked this pull request as ready for review December 2, 2024 01:29
@kwannoel kwannoel marked this pull request as draft December 2, 2024 01:30
@fuyufjh (Member) commented Dec 4, 2024

The motivation is similar to #10979 / #12615, but this is much better. Love it!

IIRC, Nexmark has some queries with large join state; we may use it as a benchmark.

@kwannoel (Contributor, Author) commented Dec 6, 2024

Benchmarking this in RWC does not show a decrease in memory utilization. There's probably something wrong with the implementation. Writing a memory benchmark first.

@kwannoel (Contributor, Author) commented Dec 9, 2024

Added benchmark: #19712. Continuing investigation.

@kwannoel kwannoel changed the base branch from main to kwannoel/join-bench December 9, 2024 05:54
@kwannoel (Contributor, Author) commented Dec 9, 2024

Works well, see the PR description. Memory utilization plateaus after the set threshold.

@kwannoel kwannoel force-pushed the kwannoel/join-bench branch from efda035 to 5d72e75 Compare December 9, 2024 07:22
@kwannoel kwannoel force-pushed the 11-29-refactor branch 2 times, most recently from 8766e99 to 3f8c1a6 Compare December 9, 2024 11:24
@kwannoel kwannoel changed the base branch from kwannoel/join-bench to graphite-base/19629 December 10, 2024 01:25
@kwannoel kwannoel force-pushed the graphite-base/19629 branch from f5188c8 to bd82fe3 Compare December 10, 2024 06:42
@kwannoel kwannoel changed the base branch from graphite-base/19629 to main December 10, 2024 06:42
@kwannoel (Contributor, Author) commented:

We don't have any issues supporting this for INNER JOIN; however, joins which require a degree table need more consideration. Carving out this optimization just for INNER JOIN is possible, but it would add a new code branch and more complexity.


Currently we:

  1. Read all the degrees and matched rows into memory.
  2. Iterate through them, updating the degree table for the build side.

So the upfront IO cost is max(read_degree_latency, read_matched_rows_latency). Typically read_matched_rows_latency should be higher, since the degree table holds less data (only key + degree).

Then the total latency for a cache miss will be max(read_degree_latency, read_matched_rows_latency) + the CPU latency of processing the matched rows.

After this PR:
We concurrently read matched rows and handle them. However, we can't do the same for degrees, because we need to iterate over the degrees and update them at the same time, which means holding an immutable and a mutable reference to the StateTable simultaneously.

Currently I work around this by reading all the degrees into memory first. The size shouldn't be too bad, since we only need to store the degrees themselves. This means the upfront IO cost is read_degree_latency, and we then pay the IO cost of read_matched_rows_latency in an amortized way, since we concurrently stream and handle each matched row.

Then the total latency for a cache miss will be read_degree_latency + the combined CPU and IO latency of processing the matched rows. So cache refill might take longer.
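To make the trade-off concrete with made-up numbers: suppose read_degree_latency = 2 ms, read_matched_rows_latency = 10 ms, and CPU processing of the matches takes 5 ms. Before this PR, a cache miss costs max(2, 10) + 5 = 15 ms. After this PR, it costs 2 ms up front plus the 10 ms of row IO interleaved with the 5 ms of CPU, i.e. anywhere from ~12 ms (perfect overlap) to ~17 ms (no overlap).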

However, given that cache misses are expected to be rare, I think this is a reasonable trade-off to make.

Will benchmark this approach against a nexmark query with LEFT OUTER JOIN.

Another approach is to concurrently do a point get and write to the degree state table, since the point get's reference to the degree table is short-lived. However, point get does not have a prefetch interface yet.
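For reference, a hedged sketch of what the point-get approach could look like; `get_degree`/`put_degree` are hypothetical names, not the real state table API:

```rust
use std::collections::HashMap;

// Stand-in for the degree state table.
struct DegreeStateTable {
    degrees: HashMap<u64, u64>,
}

impl DegreeStateTable {
    async fn get_degree(&self, key: u64) -> u64 {
        *self.degrees.get(&key).unwrap_or(&0)
    }
    fn put_degree(&mut self, key: u64, degree: u64) {
        self.degrees.insert(key, degree);
    }
}

async fn bump_degrees(table: &mut DegreeStateTable, matched_keys: &[u64]) {
    for &key in matched_keys {
        let degree = table.get_degree(key).await; // short-lived immutable borrow
        table.put_degree(key, degree + 1); // mutable borrow begins after it ends
    }
}
```

The per-key point get would add round trips compared to a range scan, which is why a prefetch interface would be needed to make it competitive.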


The second issue is tolerating inconsistency. Previously we buffered all matched rows into memory; if there was a mismatch between the number of rows in the degree table and the number in the matched table, we would compare their pks.

I think it's possible to support this with the point get approach. Alternatively, if tolerating inconsistency is enabled, we can always fall back to the old approach of refilling the cache first.

@kwannoel kwannoel requested review from yuhao-su and st1page December 11, 2024 04:28
@kwannoel (Contributor, Author) commented Dec 11, 2024

Will add a micro-bench for left join with a cache-hit workload before proceeding.

@kwannoel kwannoel force-pushed the 12-13-add_bench_with_join_type_and_cache_workload branch from 63656aa to f21aa13 Compare December 27, 2024 06:35