Parallelize creating cluster snapshot #7630
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: macsko. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Wow, cool feature!
cc: @x13n
Branch force-pushed from 7368ae8 to f621876.
Branch force-pushed from f621876 to 76eb858.
Added. If the flag is > 1, the snapshot is created using the new parallelized path.
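A hypothetical sketch of what that gating could look like; the flag name and both helper functions are assumptions for illustration, not the actual autoscaler code:

```go
// Hypothetical sketch of the flag gating described above: the parallel path is
// taken only when the flag value is greater than 1.
package snapshot

import "flag"

// parallelism stands in for the flag added in this PR; the real flag name may differ.
var parallelism = flag.Int("parallelize-set-cluster-state", 1,
	"number of workers used while building the cluster snapshot")

// setClusterState picks between the original sequential path and the new
// parallel one based on the flag value.
func setClusterState() {
	if *parallelism > 1 {
		buildSnapshotInParallel(*parallelism)
		return
	}
	buildSnapshotSequentially()
}

func buildSnapshotInParallel(workers int) { /* parallel per-node work */ }
func buildSnapshotSequentially()          { /* original single-threaded path */ }
```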
Overall LGTM. @x13n Could you take a look?
Thanks for the PR! Out of curiosity, did you test it only on the Go benchmarks or in an actual cluster? I wonder how much it improves in practice.
Branch force-pushed from 76eb858 to 60477f7.
Branch force-pushed from 60477f7 to b53877b.
Yes, I tested it but forgot to mention it. Before the change, in a real cluster at 1k nodes / 40k pods scale, a single …
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Creating cluster snapshots using SetClusterState can take a long time, especially in large clusters with many pods. Fortunately, the work done for each node can be parallelized. This PR does so by calling workqueue.ParallelizeUntil, capped at 16 workers. I verified the performance and it gives nearly 7x faster execution for the 1k nodes / 40k pods scenario, though the exact speedup depends on the machine's CPU.
In the future, the basic store used for scale-down can be changed in a similar way.
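For context, a minimal sketch of how per-node snapshot work can be parallelized with workqueue.ParallelizeUntil from k8s.io/client-go. This is not the actual autoscaler code: the nodeSnapshot type, the buildSnapshotParallel helper, and the pre-grouping of pods by node name are assumptions for illustration; only the 16-worker cap comes from the description above.

```go
// Sketch only: parallelizing per-node snapshot construction with a capped
// worker count, assuming pods have already been grouped by node name.
package snapshot

import (
	"context"
	"runtime"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

// maxParallelism mirrors the 16-worker cap mentioned in the PR description.
const maxParallelism = 16

// nodeSnapshot is a placeholder for the per-node data built during SetClusterState.
type nodeSnapshot struct {
	node *apiv1.Node
	pods []*apiv1.Pod
}

// buildSnapshotParallel builds per-node snapshot entries concurrently.
func buildSnapshotParallel(nodes []*apiv1.Node, podsByNode map[string][]*apiv1.Pod) []nodeSnapshot {
	result := make([]nodeSnapshot, len(nodes))
	workers := maxParallelism
	if n := runtime.GOMAXPROCS(0); n < workers {
		workers = n
	}
	workqueue.ParallelizeUntil(context.Background(), workers, len(nodes), func(i int) {
		// Each index is handled by exactly one worker, so writing to
		// result[i] needs no extra locking.
		result[i] = nodeSnapshot{
			node: nodes[i],
			pods: podsByNode[nodes[i].Name],
		}
	})
	return result
}
```

Capping the worker count bounds goroutine overhead on large machines while still using all cores on smaller ones.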
Which issue(s) this PR fixes:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: