feat(storage): support online cache resize via risectl #19677
Conversation
Signed-off-by: MrCroxx <[email protected]>
```
@@ -42,13 +50,53 @@ impl ConfigService for ConfigServiceImpl {
        };
        Ok(Response::new(show_config_response))
    }

    async fn resize_cache(
```
`get_meta_cache_memory_usage_ratio` and `get_block_cache_memory_usage_ratio` will be inaccurate after this PR. We should change `HummockMemoryCollector` to use `cache.capacity()` instead.
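A minimal sketch of the idea behind this suggestion (assumed trait and method names, not the actual `HummockMemoryCollector` code): compute the ratio against the cache's current capacity, which tracks an online resize, rather than the capacity read from the TOML config at startup.

```rust
/// Stand-in for the in-memory side of the cache; `usage()` and `capacity()`
/// are assumed accessors mirroring the `cache.capacity()` mentioned above,
/// where `capacity()` reflects the most recent resize.
trait MemCacheStats {
    fn usage(&self) -> usize;    // bytes currently cached
    fn capacity(&self) -> usize; // current (possibly resized) capacity in bytes
}

/// Usage ratio based on the live capacity, so the metric stays accurate
/// after `resize_cache` changes the capacity at runtime.
fn memory_usage_ratio(cache: &impl MemCacheStats) -> f64 {
    if cache.capacity() == 0 {
        return 0.0;
    }
    cache.usage() as f64 / cache.capacity() as f64
}
```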
```
if let Some(meta_cache) = &self.meta_cache
    && req.meta_cache_capacity > 0
{
    match meta_cache.memory().resize(req.meta_cache_capacity as _) {
```
IIUC, the resize operation is not persistent, which means that if the CN restarts, the cache capacity will go back to the values configured in the TOML config. Will this be a problem? For example, suppose we have 2 CNs in the cluster and, after `resize_cache`, 1 CN restarts for some reason while the other one doesn't. I think it depends on when we are going to use `resize_cache`. If it is mainly for testing or perf tuning, and we update the configs after tuning, this is fine. But if we rely on `resize_cache` in production, that would not be ideal.
I think it depends on how we define resize.
Is this a permanent operation or a temporary one?
In fact, I also think that cache config inconsistencies across multiple CNs are a concern, and they are not easy to detect. If we allow resizing the cache online, then we need to make sure that the operation succeeds on all machines and is persistent; otherwise this is a risk.
> IIUC, the resize operation is not persistent, which means that if the CN restarts, the cache capacity will go back to the values configured in the TOML config. Will this be a problem?

This feature just aims to resize the in-memory meta/data block cache without downtime. If there are restarts, it is fine to modify the persistent configuration directly. Modifying the per-node configuration can be achieved (and is perhaps better done) in the cloud control panel.
Rest LGTM, thanks for the efforts
```
        }
    };

    let futures = worker_nodes.iter().map(|worker| async {
```
Just a question: I'm considering partial success. Do we need to provide a retry capability to retry the RPC for the failed worker nodes, to avoid inconsistencies in the configs of multiple CNs as much as possible? (I believe this is an idempotent operation.)
IMO, it is okay to make the user responsible for retrying it.
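For illustration, a rough sketch of the fan-out-and-report pattern discussed here (names and the RPC stub are assumed, not the actual meta-service code): issue the resize RPC to every worker, collect per-node results, and surface failures so the operator can rerun the command for just the failed nodes. Since setting the cache to a fixed capacity is idempotent, retrying the failed subset is safe.

```rust
use futures::future::join_all;

struct WorkerNode {
    id: u32,
}

/// Placeholder for the real per-node resize-cache RPC.
async fn resize_one(worker: &WorkerNode, meta_cap: u64, block_cap: u64) -> Result<(), String> {
    let _ = (worker, meta_cap, block_cap);
    Ok(())
}

/// Fan out the resize request to all workers and return (worker id, result)
/// pairs; callers can report the failed ids and retry only those.
async fn resize_all(
    workers: &[WorkerNode],
    meta_cap: u64,
    block_cap: u64,
) -> Vec<(u32, Result<(), String>)> {
    let futures = workers
        .iter()
        .map(|w| async move { (w.id, resize_one(w, meta_cap, block_cap).await) });
    join_all(futures).await
}
```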
LGTM
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Usage:
Related foyer-side PRs: foyer-rs/foyer#794 (already merged two weeks ago; not related to the version bump in this PR).
Checklist
`./risedev check` (or alias, `./risedev c`)
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.