Add MinReadySeconds to 5 in StatefulSets to prevent sudden double restarts #647
Conversation
Issue: Sadly this doesn't prevent the multiple restarts at once.
What probably happens is consecutive updates to the sts, with the second one happening after the first pod has been restarted.
We end up with one pod finishing its restart from the first change, and another one going through the second change.
I'm unsure what we can do against this, but minReadySeconds doesn't help.
Thanks @burmanm & @adejanovski for these experiments after our exchange in k8ssandra/k8ssandra#1432: we were about to submit a Pull Request ;) I believe the ability to configure MinReadySeconds is still valuable even if it doesn't solve our issue, but maybe default it to 0 (the k8s default when not specified) instead of 5s?

Concerning the issue, we first tried to work around it by disabling Reaper & Medusa on our clusters before a K8ssandra Operator / Cass Operator upgrade (Helm-chart based) to minimize the sources of changes to the StatefulSet, and we enable them again after the rollout... but it's not bulletproof.

Interestingly, I noticed our other k8s stateful workloads use different strategies than the one used by cass-operator (StatefulSet +
Maybe a way to solve this issue would be for cass-operator to switch to
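For illustration, here is a minimal Go sketch of how a configurable MinReadySeconds defaulting to 0 could be resolved. The pointer field and helper name are assumptions made for this sketch, not the actual cass-operator API.

```go
package main

import "fmt"

// resolveMinReadySeconds is a hypothetical helper: a pointer field on the
// datacenter spec would let the operator distinguish "not set" (nil, fall
// back to the Kubernetes default of 0) from an explicitly requested value.
func resolveMinReadySeconds(fromSpec *int32) int32 {
	if fromSpec == nil {
		return 0 // Kubernetes default when minReadySeconds is not specified
	}
	return *fromSpec
}

func main() {
	explicit := int32(5)
	fmt.Println(resolveMinReadySeconds(nil))       // 0: fall back to the k8s default
	fmt.Println(resolveMinReadySeconds(&explicit)) // 5: user-provided value
}
```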
Well, there are two (three) things here.

First of all, preventing a single StatefulSet (per rack - not per cluster / dc, those should be guarded by cass-operator, see below) from restarting multiple pods at once is a relatively simple PR:

```diff
diff --git a/pkg/reconciliation/construct_statefulset.go b/pkg/reconciliation/construct_statefulset.go
index e48f3ce..a5443a7 100644
--- a/pkg/reconciliation/construct_statefulset.go
+++ b/pkg/reconciliation/construct_statefulset.go
@@ -142,7 +142,7 @@ func newStatefulSetForCassandraDatacenter(
 			},
 			Replicas:             &replicaCountInt32,
 			ServiceName:          dc.GetAllPodsServiceName(),
-			PodManagementPolicy:  appsv1.ParallelPodManagement,
+			PodManagementPolicy:  appsv1.OrderedReadyPodManagement,
 			Template:             *template,
 			VolumeClaimTemplates: volumeClaimTemplates,
 		},
```

The bigger issue is how this is happening in the first place when the StatefulSets are not being edited directly (as no one should). I haven't been able to trigger this issue with just CassandraDatacenter changes (and neither does the reproducer by @adejanovski). Editing the StS directly bypasses the reconcile logic in cass-operator, which makes it odd. MinReadySeconds itself should have prevented that, since the controller-runtime cache would not be updated instantly in any case and we should be requeueing when an update happens. I don't think MinReadySeconds = 0 is necessarily even a good setting, since all of these can cause small concurrency issues. And to be fair, I don't even know that MinReadySeconds fails to stop this behavior. Perhaps it works correctly in this case and the "reproducer" is simply wrong (since it's not testing real behavior but something else). Sort of like we can't stop a user from deleting all the pods at the same time if they really want to.

The third problem is our use of PodDisruptionBudget, which does not know about the replication factors or availability requirements, so it can't actually enforce a sane cluster structure. If it knew things are always quorum, for example, it could enforce that. Also, it only targets the cluster-wide labeling, not for example per-rack settings. That's a much more complex issue to solve.
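For readers less familiar with these knobs, a minimal self-contained sketch (not the actual newStatefulSetForCassandraDatacenter code) of the two StatefulSet fields under discussion, assuming the values proposed in this PR; everything else is omitted for brevity:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// buildSpec sketches the two fields discussed in this thread:
// OrderedReady pod management (pods are replaced one at a time, in order) and
// MinReadySeconds (a pod must stay Ready this long before the controller
// considers it available and moves on to the next one).
func buildSpec(replicas int32) appsv1.StatefulSetSpec {
	return appsv1.StatefulSetSpec{
		Replicas:            &replicas,
		PodManagementPolicy: appsv1.OrderedReadyPodManagement,
		MinReadySeconds:     5, // the default proposed in this PR
	}
}

func main() {
	spec := buildSpec(3)
	fmt.Println(spec.PodManagementPolicy, spec.MinReadySeconds)
}
```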
Hi! About the first point:
I believe that However, there seems to be
Maybe a good test would be to set
Just to state it, we encounter the issue with our single DC / single rack clusters, RF 3, which are always requested with quorum CL :/
Well, that's what the documentation says, but when I tested it I got different behavior.
The documentation says in this one that it requires a feature flag to be enabled (it defaults to false), thus making it kinda useless.
The current setup is only setting minAvailable.
And adding to this, PDB is for eviction policy; I don't think the StS controller cares about it.
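For context, a small sketch contrasting the two mechanisms mentioned above; the names, labels, and values are purely illustrative. A PodDisruptionBudget minAvailable only guards voluntary evictions (e.g. node drains), while the StatefulSet rolling-update maxUnavailable governs rollouts but only takes effect when the MaxUnavailableStatefulSet feature gate is enabled (off by default).

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// PDB: the eviction API refuses voluntary disruptions that would leave
	// fewer than minAvailable matching pods. The StatefulSet controller's
	// own rolling updates do not consult it.
	minAvailable := intstr.FromInt(2)
	_ = policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pdb"}, // illustrative name
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "example"}, // illustrative label
			},
		},
	}

	// StatefulSet rolling update: maxUnavailable caps how many pods the
	// controller takes down per update step, but it is ignored unless the
	// MaxUnavailableStatefulSet feature gate is enabled (default: false).
	maxUnavailable := intstr.FromInt(1)
	_ = appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.RollingUpdateStatefulSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
			MaxUnavailable: &maxUnavailable,
		},
	}
}
```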
In case you still wanted to test this PR, the easiest way is to use So And then
There is an issue & a PR open on the k8s repo
Awesome, so I wasn't going crazy when reading the statefulset controller code. In that case, my latest commit should hopefully prevent the double change from happening. There are also updates for the k8ssandra-operator upgrade process in here that would further block any accidental restarts:
Nice! I'll try to find some time this week to test all these changes :)
Yeah, I'll need to fix some more tomorrow; I made a mistake with the canary upgrade process breaking these rules, so I need to find another way for the StS prevention. As for OrderedReady, I didn't feel it added a lot of extra time to expanding clusters (seconds per node and not more than that, so irrelevant). I'll probably investigate it more to see if we should also apply that, since if there's no real downside, then we could prevent the behavior at the StS level as well.
I've moved everything other than the MinReadySeconds / OrderedPodManagement change to another PR: #654
What this PR does:
Adds a MinReadySeconds check for the StatefulSet.
Which issue(s) this PR fixes:
Fixes #648
Fixes #382