perf(stream): concurrently fetch row from storage and refill cache #19629
Base branch: 12-13-add_bench_with_join_type_and_cache_workload
Conversation
Force-pushed from 120103d to b547dc8.
Benchmarking this in RWC does not show a decrease in memory utilization. There's probably something wrong with the implementation. Writing a memory benchmark first.
Added benchmark: #19712. Continuing investigation.
Force-pushed from b547dc8 to fb55b51.
Works well, see the PR description. Memory utilization plateaus after the configured threshold is reached.
Force-pushed from efda035 to 5d72e75.
Force-pushed from 8766e99 to 3f8c1a6.
Force-pushed from 3f8c1a6 to bf1ac94.
Force-pushed from f5188c8 to bd82fe3.
We don't have any issues supporting this for … Currently we read all matched rows (and their degrees) from storage into memory before producing any output.

So the upfront IO cost is reading every matched row, and the total latency for a cache miss is the time to fetch all of them.

After this PR, cache refill and degree updates happen concurrently, which runs into the borrow-checker conflict sketched below. Currently I work around this by reading all the degrees into memory first. The size of the degrees shouldn't be too bad, since we only need to store the degree counters themselves. The total latency for a cache miss then becomes the degree table read plus the streaming fetch of the matched rows. However, given that cache misses are expected to be rare, I think this is a reasonable trade-off. Will benchmark this approach against a nexmark query with …

Another approach is to do the point get concurrently with the write to the degree state table, since the point get's reference to the degree table is short-lived. However, point get does not have a prefetch interface yet.

The second issue is tolerating inconsistency. Previously we buffered all matched rows into memory; if there was a mismatch between the number of rows in the degree table and in the matched rows, we would compare their primary keys. I think it's possible to support this with the point get approach too. Alternatively, if tolerating inconsistency is enabled, we can always fall back to the old approach of refilling the cache first.
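For reference, a minimal reproduction of the borrow conflict being worked around here — `DegreeTable` and its fields are hypothetical stand-ins, not the real state table API:

```rust
// Why the naive "iterate and update in one pass" version is rejected:
// iterating holds an immutable borrow of `degrees` for the whole loop,
// so any mutation of the same table inside the loop cannot compile.
struct DegreeTable {
    degrees: Vec<u64>,
}

fn naive_match_and_update(table: &mut DegreeTable) {
    for degree in table.degrees.iter() {
        println!("handle match with degree {degree}");
        // table.degrees[0] += 1;
        // ^ error[E0502]: cannot borrow `table.degrees` as mutable
        //   because it is also borrowed as immutable
    }
}

fn main() {
    let mut table = DegreeTable { degrees: vec![1, 2, 3] };
    naive_match_and_update(&mut table);
}
```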
Will add a micro-bench for …
Force-pushed from af75e8e to 08ac933.
Force-pushed from 63656aa to f21aa13.
Force-pushed from c82596b to 05946a0.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Currently we read all rows matching a key, encode them, and store them in an in-memory data structure, which we then add to the cache. We then iterate over the rows in that structure, decoding each one. This means we must incur the IO latency of reading all rows up front, as well as the memory cost of buffering every matching record.
Instead, we can iterate over the rows as they arrive and concurrently refill the cache with them (see the sketch below). Then we don't have to wait for IO to finish for ALL rows. We also avoid OOM: while refilling the cache, if we notice the number of rows exceeds some threshold, we simply stop refilling.
Further, if we are configured to tolerate inconsistency, we revert to the old strategy of reading all records into the cache first, preserving that logic.
This is done for all join types in the hash join executor.
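A minimal, self-contained sketch of the new strategy — all names here (`JoinCache`, `CACHE_REFILL_THRESHOLD`, the mock row stream) are hypothetical stand-ins, not the actual executor code:

```rust
// Hypothetical sketch of streaming rows while concurrently refilling the
// cache, with a cap on how many rows get cached. Not the real executor API.
use futures::{stream, StreamExt};

const CACHE_REFILL_THRESHOLD: usize = 30_000; // illustrative per-key cap

#[derive(Default)]
struct JoinCache {
    rows: Vec<Vec<i64>>, // cached rows for a single join key
}

async fn match_rows_for_key(matched: Vec<Vec<i64>>, cache: &mut JoinCache) {
    // Old approach: collect `matched` fully, cache it, then iterate — paying
    // the IO latency and memory cost of ALL rows up front.
    // New approach: consume the (storage-backed) stream row by row.
    let mut row_stream = stream::iter(matched);
    while let Some(row) = row_stream.next().await {
        // Emit the join output for this row immediately; no need to wait
        // for the remaining rows' IO to finish.
        println!("emit matched row {row:?}");
        // Concurrently refill the cache, but stop past the threshold so
        // memory stays capped even for very hot keys.
        if cache.rows.len() < CACHE_REFILL_THRESHOLD {
            cache.rows.push(row);
        }
    }
}

#[tokio::main]
async fn main() {
    let mut cache = JoinCache::default();
    match_rows_for_key(vec![vec![1, 10], vec![1, 20]], &mut cache).await;
    println!("cached {} rows", cache.rows.len());
}
```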
Handling degree table matches and updates
Before this PR we:
1. Buffered all matched rows and their degrees from the match side into memory.
2. Updated the degrees in the match side degree table.

Since step 1 finished before step 2 started, the immutable and mutable borrows never overlapped.
However, in this PR we refill the cache and handle matches concurrently, so steps 1 and 2 happen at the same time. That means holding an immutable and a mutable reference to the underlying match side degree table simultaneously, which the borrow checker rejects.
To solve this, we first read from the match side degree table, keeping all the degrees in memory. Only once that completes do we start concurrently refilling the cache from the match side and updating its degrees. This way, the lifetimes of the immutable and mutable references to the match side degree table never overlap.
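A sketch of that two-phase ordering, with hypothetical types standing in for the state tables:

```rust
// Two-phase sketch of the fix (hypothetical types). Phase 1 holds only an
// immutable borrow of the degree table and ends before phase 2 starts, so
// the mutable borrow in phase 2 never overlaps with it.
struct DegreeTable {
    degrees: Vec<u64>,
}

struct JoinCache {
    rows: Vec<u64>,
}

fn refill_and_update(degree_table: &mut DegreeTable, cache: &mut JoinCache) {
    // Phase 1: read all degrees into memory. Cheap relative to the matched
    // rows themselves, since each degree is just a counter.
    let buffered_degrees: Vec<u64> = degree_table.degrees.clone();
    // The immutable borrow of `degree_table` ends here.

    // Phase 2: concurrently refill the cache and update degrees, using the
    // buffered copies; only the mutable borrow of `degree_table` is live.
    for (i, degree) in buffered_degrees.into_iter().enumerate() {
        cache.rows.push(degree); // stand-in for refilling the row cache
        degree_table.degrees[i] = degree + 1; // stand-in for the degree update
    }
}

fn main() {
    let mut table = DegreeTable { degrees: vec![0, 2, 5] };
    let mut cache = JoinCache { rows: Vec::new() };
    refill_and_update(&mut table, &mut cache);
    println!("degrees: {:?}", table.degrees);
}
```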
Benchmark results
With a 30,000-record limit for each key, here's the comparison (using an in-memory state store, inner join):
You can observe that past 30,000 rows per key we no longer refill the cache, so memory usage is capped and we avoid OOM.
Also note that because this uses an in-memory state store, we see significant runtime improvements. I expect that with the block cache enabled we would see similar improvements, though perhaps not as large.
Checklist
- I have run `./risedev check` (or its alias, `./risedev c`).
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.