K8SSAND-1042 ⁃ Feature request: Add option to allow all pods to start in parallel #230

Closed
rchernobelskiy opened this issue Nov 8, 2021 · 14 comments
Assignees
Labels
done Issues in the state 'done' enhancement New feature or request zh:Assess/Investigate

Comments

@rchernobelskiy
Contributor

rchernobelskiy commented Nov 8, 2021

Currently, when resuming a stopped cluster, all the Cassandra pods start up sequentially because the pods' IPs change and Cassandra can only join one node at a time.

When using static IPs however, there is no concern about the IPs changing and therefore all the pods can start up in parallel.

An option to start all pods in parallel will significantly reduce the time to resume a large stopped cluster.

┆Issue is synchronized with this Jira Task by Unito
┆friendlyId: K8SSAND-1042
┆priority: Medium

@rchernobelskiy rchernobelskiy added the enhancement New feature or request label Nov 8, 2021
@sync-by-unito sync-by-unito bot changed the title Feature request: Add option to allow all pods to start in parallel K8SSAND-1042 ⁃ Feature request: Add option to allow all pods to start in parallel Nov 8, 2021
@jimdickinson
Contributor

I think we'll have to commit something here that lets us toggle on an implementation using static IPs, so that this feature can be tested?

@bradfordcp
Member

I'm curious how we could detect if the cluster is using static IPs or not. Just a boolean in the spec? I assume there is a sidecar or something that handles setting up the appropriate addresses and routing.

@rchernobelskiy
Contributor Author

rchernobelskiy commented Apr 12, 2022

> I'm curious how we could detect if the cluster is using static IPs or not. Just a boolean in the spec? I assume there is a sidecar or something that handles setting up the appropriate addresses and routing.

Yep, that's what I was thinking: something like parallelResume: true. And yeah, a sidecar is handling the IP and route configuration.

Alternatively, we could add a flag, something like useVirtualNetwork: true, which would add the sidecars that enable the virtual network in addition to starting pods in parallel. That kind of addition to the operator would be somewhat more involved, though.
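
To make the difference between the two proposed flags concrete, here is a minimal sketch of how they might relate. The field names simply mirror the flag names floated in this comment; the class and both helper functions are hypothetical and do not exist in cass-operator.

```python
from dataclasses import dataclass


@dataclass
class NetworkingOptions:
    # Both fields are hypothetical; they mirror the flag names proposed above.
    parallel_resume: bool = False      # parallelResume: true
    use_virtual_network: bool = False  # useVirtualNetwork: true


def start_pods_in_parallel(opts: NetworkingOptions) -> bool:
    """Either flag would turn on parallel startup."""
    return opts.parallel_resume or opts.use_virtual_network


def inject_network_sidecars(opts: NetworkingOptions) -> bool:
    """Only the virtual-network flag would make the operator add sidecars."""
    return opts.use_virtual_network
```

With parallelResume alone, the user stays responsible for providing static IPs (for example through their own sidecar); with useVirtualNetwork, the operator would take on that work as well.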

@jsanda
Contributor

jsanda commented Apr 12, 2022

Let me ask the obvious: what are the risks of starting in parallel if static IPs are not used?

@jsanda
Contributor

jsanda commented Apr 19, 2022

Please add your planning poker estimate with ZenHub @burmanm

@bradfordcp
Member

I assume this would fall under the spec.networking key.

@bradfordcp
Member

bradfordcp commented Apr 20, 2022

Do we still need to start seed nodes first before parallel starting the rest of the nodes?

@adejanovski
Contributor

> Do we still need to start seed nodes first before parallel starting the rest of the nodes?

If we start the seed nodes first (one by one), it should allow us to start other nodes in parallel even if we're not using static IPs. These nodes will then be able to connect to the cluster through the seeds and broadcast their new IP address.
The scenario that Cassandra doesn't deal well with is concurrent range movements, which will not be the case here.
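
The ordering described here could be sketched roughly as follows. This is only an illustration of "seeds one by one, then everyone else at once" for nodes that are already part of the ring; the function name and the pod names are made up for the example and are not cass-operator code.

```python
from typing import List, Set


def plan_start_batches(pods: List[str], seeds: Set[str]) -> List[List[str]]:
    """Return start batches; pods inside one batch are started concurrently."""
    # Seeds go first, one per batch, so each is up before the next one starts.
    batches = [[pod] for pod in pods if pod in seeds]
    # All remaining pods share a single batch and start in parallel,
    # reaching the cluster through the seeds that are already running.
    rest = [pod for pod in pods if pod not in seeds]
    if rest:
        batches.append(rest)
    return batches


if __name__ == "__main__":
    pods = ["dc1-r1-0", "dc1-r2-0", "dc1-r1-1", "dc1-r2-1"]
    for batch in plan_start_batches(pods, seeds={"dc1-r1-0", "dc1-r2-0"}):
        print(batch)
```

The sketch assumes none of the pods is bootstrapping for the first time, since that is where concurrent range movements would come into play.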

@adejanovski
Contributor

@bradfordcp, can we move the ticket to the product backlog or does it require a design session?

@burmanm
Contributor

burmanm commented Mar 5, 2024

@rchernobelskiy Is this still a necessary feature?

@adejanovski adejanovski added the assess Issues in the state 'assess' label Mar 5, 2024
@rchernobelskiy
Contributor Author

From my personal perspective I still believe it would be a good feature to have.

@adejanovski
Contributor

I agree. There have been multiple incidents caused by nodes that are already part of the ring being blocked from starting by cass-operator because another node was bootstrapping (which can take a while).

What we need to identify is whether a node has previously bootstrapped and, if so, allow it to start concurrently with other nodes as long as at least one seed node is available.
We should spell this process out in more detail so we can list precisely the conditions that need to be met to enable this behavior.
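
As a starting point for that list, the two conditions mentioned so far could be expressed like this. It is a sketch under assumptions: NodeState and its fields are hypothetical, and how "previously bootstrapped" and "seed available" would actually be detected is still open.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class NodeState:
    name: str
    previously_bootstrapped: bool  # hypothetical: how this is tracked is undecided
    is_seed: bool = False
    is_ready: bool = False


def can_start_concurrently(node: NodeState, cluster: Iterable[NodeState]) -> bool:
    """True if the node may start without waiting for other starting nodes."""
    # Condition 1: the node already joined the ring at some point,
    # so starting it does not trigger a bootstrap.
    if not node.previously_bootstrapped:
        return False
    # Condition 2: at least one other node that is a seed is up and ready.
    return any(
        other.is_seed and other.is_ready and other.name != node.name
        for other in cluster
    )
```

A node that fails either check would fall back to the existing one-at-a-time startup.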

@burmanm burmanm moved this from Assess/Investigate to In Progress in K8ssandra Jul 9, 2024
@adejanovski adejanovski added in-progress Issues in the state 'in-progress' and removed assess Issues in the state 'assess' labels Jul 9, 2024
@burmanm burmanm self-assigned this Jul 9, 2024
@burmanm
Contributor

burmanm commented Jul 11, 2024

Solved in #673

@burmanm burmanm closed this as completed Jul 11, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in K8ssandra Jul 11, 2024
@adejanovski adejanovski added done Issues in the state 'done' and removed in-progress Issues in the state 'in-progress' labels Jul 11, 2024
@rchernobelskiy
Contributor Author

Nice. cc @berndocklin and @Liwanshi: we should look at adding this to Astra, since it'll significantly reduce the time to resume a large stopped cluster.

6 participants