Add MinReadySeconds to 5 in StatefulSets to prevent sudden double restarts #647
Conversation
Issue: Sadly this doesn't prevent the multiple restarts at once.
What probably happens is consecutive updates to the sts, with the second one happening after the first pod has been restarted.
We end up with one pod finishing its restart from the first change, and another one going through the second change.
I'm unsure what we can do against this, but minReadySeconds doesn't help.
Thanks @burmanm & @adejanovski for these experiments after our exchange in k8ssandra/k8ssandra#1432: we were about to submit a Pull Request ;) I believe the ability to configure MinReadySeconds is still valuable even if it doesn't solve our issue, but maybe default it to 0 (the k8s default when not specified) instead of 5s?

Concerning the issue, we first tried to work around it by disabling Reaper & Medusa on our clusters before a K8ssandra Operator / Cass Operator upgrade (Helm-chart based) to minimize the sources of changes to the StatefulSet, and we enable them again after the rollout... but it's not bulletproof.

Interestingly, I noticed our other k8s stateful workloads use different strategies than the one used by cass-operator (StatefulSet +
Maybe a way to solve this issue would be for cass-operator to switch to
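For illustration, here is a minimal Go sketch of how a configurable MinReadySeconds defaulting to 0 could be resolved. The pointer field and helper name are assumptions made for this sketch, not the actual cass-operator API.

```go
package main

import "fmt"

// resolveMinReadySeconds is a hypothetical helper: a pointer field on the
// datacenter spec would let the operator distinguish "not set" (nil, fall
// back to the Kubernetes default of 0) from an explicitly requested value.
func resolveMinReadySeconds(fromSpec *int32) int32 {
	if fromSpec == nil {
		return 0 // Kubernetes default when minReadySeconds is not specified
	}
	return *fromSpec
}

func main() {
	explicit := int32(5)
	fmt.Println(resolveMinReadySeconds(nil))       // 0: fall back to the k8s default
	fmt.Println(resolveMinReadySeconds(&explicit)) // 5: user-provided value
}
```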
Well, there are two (three) things here.

First of all, preventing a single StatefulSet (per rack - not per cluster / dc, those should be guarded by cass-operator, see below) from restarting multiple pods at once is a relatively simple PR:

```diff
diff --git a/pkg/reconciliation/construct_statefulset.go b/pkg/reconciliation/construct_statefulset.go
index e48f3ce..a5443a7 100644
--- a/pkg/reconciliation/construct_statefulset.go
+++ b/pkg/reconciliation/construct_statefulset.go
@@ -142,7 +142,7 @@ func newStatefulSetForCassandraDatacenter(
 			},
 			Replicas:             &replicaCountInt32,
 			ServiceName:          dc.GetAllPodsServiceName(),
-			PodManagementPolicy:  appsv1.ParallelPodManagement,
+			PodManagementPolicy:  appsv1.OrderedReadyPodManagement,
 			Template:             *template,
 			VolumeClaimTemplates: volumeClaimTemplates,
 		},
```

The bigger issue is how this is happening in the first place when the StatefulSets are not being edited directly (as no one should). I haven't been able to trigger this issue with just CassandraDatacenter changes (and neither does the reproducer by @adejanovski). Editing the StS directly bypasses the reconcile logic in cass-operator, which makes it odd. MinReadySeconds itself should have prevented that, since the controller-runtime cache would not be updated instantly in any case and we should be requeueing when an update happens. I don't think MinReadySeconds = 0 is necessarily even a good setting, since all of these can cause small concurrency issues. And to be fair, I don't even know that MinReadySeconds fails to stop this behavior. Perhaps it works correctly in this case and the "reproducer" is simply wrong (since it's not testing real behavior but something else). Sort of like we can't stop a user from deleting all the pods at the same time if they really want to.

The third problem is our use of PodDisruptionBudget, which does not know about the replication factors or availability requirements, so it can't actually enforce a sane cluster structure. If it knew things are always quorum, for example, it could enforce that. Also, it only targets the cluster-wide labeling, not for example per-rack settings. That's a much more complex issue to solve.
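For readers less familiar with these knobs, a minimal self-contained sketch (not the actual newStatefulSetForCassandraDatacenter code) of the two StatefulSet fields under discussion, assuming the values proposed in this PR; everything else is omitted for brevity:

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// buildSpec sketches the two fields discussed in this thread:
// OrderedReady pod management (pods are replaced one at a time, in order) and
// MinReadySeconds (a pod must stay Ready this long before the controller
// considers it available and moves on to the next one).
func buildSpec(replicas int32) appsv1.StatefulSetSpec {
	return appsv1.StatefulSetSpec{
		Replicas:            &replicas,
		PodManagementPolicy: appsv1.OrderedReadyPodManagement,
		MinReadySeconds:     5, // the default proposed in this PR
	}
}

func main() {
	spec := buildSpec(3)
	fmt.Println(spec.PodManagementPolicy, spec.MinReadySeconds)
}
```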
Hi! About the first point:
I believe that However, there seems to be
Maybe a good test would be to set
Just to state it, we encounter the issue with our single DC / single rack clusters, RF 3, which are always requested with quorum CL :/
Well, that's what the documentation says, but when I tested it I got different behavior.
The documentation says in this one that it requires a feature flag to be enabled (it defaults to false), thus making it kinda useless.
The current setup is only setting minAvailable.
And adding to this, PDB is for eviction policy; I don't think the StS controller cares about it.
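For context, a small sketch contrasting the two mechanisms mentioned above; the names, labels, and values are purely illustrative. A PodDisruptionBudget minAvailable only guards voluntary evictions (e.g. node drains), while the StatefulSet rolling-update maxUnavailable governs rollouts but only takes effect when the MaxUnavailableStatefulSet feature gate is enabled (off by default).

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// PDB: the eviction API refuses voluntary disruptions that would leave
	// fewer than minAvailable matching pods. The StatefulSet controller's
	// own rolling updates do not consult it.
	minAvailable := intstr.FromInt(2)
	_ = policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pdb"}, // illustrative name
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "example"}, // illustrative label
			},
		},
	}

	// StatefulSet rolling update: maxUnavailable caps how many pods the
	// controller takes down per update step, but it is ignored unless the
	// MaxUnavailableStatefulSet feature gate is enabled (default: false).
	maxUnavailable := intstr.FromInt(1)
	_ = appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.RollingUpdateStatefulSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
			MaxUnavailable: &maxUnavailable,
		},
	}
}
```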
In case you still wanted to test this PR, the easiest way is to use So And then
There is an issue & a PR open on the k8s repo
Awesome, so I wasn't going crazy when reading the statefulset controller code. In that case, my latest commit should hopefully prevent the double change from happening. There are also updates for the k8ssandra-operator upgrade process in here that would further block any accidental restarts:
Nice! I'll try to find some time this week to test all these changes :)
Yeah, I'll need to fix some more tomorrow; I made a mistake with the canary upgrade process breaking these rules, so I need to find another way for the StS prevention. As for OrderedReady, I didn't feel it added a lot of extra time to expanding clusters (seconds per node and not more than that, so irrelevant). I'll probably investigate it more to see if we should also apply that, since if there's no real downside, then we could prevent the behavior at the StS level as well.
I've moved everything other than the MinReadySeconds / OrderedPodManagement change to another PR: #654
What this PR does:
Adds a MinReadySeconds check for the StatefulSet.
Which issue(s) this PR fixes:
Fixes #648
Fixes #382