
Deploying your own VPA with leader election enabled in GKE conflicts with the GKE system component #7461

Open
raywainman opened this issue Nov 4, 2024 · 25 comments
Labels
area/vertical-pod-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments

@raywainman
Contributor

raywainman commented Nov 4, 2024

Which component are you using?:

vertical-pod-autoscaler

What version of the component are you using?:

Component version: vertical-pod-autoscaler-1.2.0+

Leader election functionality was added in #6985 and is turned off by default

What k8s version are you using (kubectl version)?:

Any version 1.27+

What environment is this in?:

GKE

What did you expect to happen?:

Both the self-deployed VPA recommender and the GKE implementation of HPA should continue working.

What happened instead?:

Both the self-deployed VPA recommender and the GKE version use a lease called vpa-recommender in kube-system.

If you deploy your own VPA recommender, it might "steal" the lease and prevent the GKE implementation of HPA from working.

How to reproduce it (as minimally and precisely as possible):

  • Create a cluster
  • Deploy your own vpa-recommender (e.g. using vpa-up.sh). Make sure leader election is enabled (--leader-elect=true).
  • Do something to disrupt the control plane (for example upgrade the version).
  • Check whether your self-deployed vpa-recommender has grabbed the lease (see the check below); if so, HPA could stop working.
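
A rough way to check which pod has grabbed the lease (a sketch; it assumes the default lease name vpa-recommender and the kube-system namespace used above):

# Show the current holder of the contested lease (spec.holderIdentity is the leading pod)
% kubectl -n kube-system get lease vpa-recommender -o jsonpath='{.spec.holderIdentity}{"\n"}'

# Or dump the full Lease object, including acquire/renew timestamps
% kubectl -n kube-system get lease vpa-recommender -o yaml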

Anything else?

This is due to the unfortunate naming collision between GKE's system controller (also called vpa-recommender) and the one provided here.

@raywainman raywainman added the kind/bug Categorizes issue or PR as related to a bug. label Nov 4, 2024
@raywainman
Contributor Author

cc @adrianmoisey

Some thoughts on mitigating this:

  • Document the flags very carefully to encourage folks not to use the default when deploying on GKE.
  • Change the default here, which could possibly break some users.

@adrianmoisey
Member

I see two paths forward to fixing this:

  1. GKE stops using vpa-recommender as a lease name
  2. VPA stops using vpa-recommender as a lease name

I can't speak to what that lease is being used for in GKE, but I can only assume that changing it there is difficult or impossible.

Given that the lease(s) in VPA are only used for VPA components, and running multiple recommenders and updaters for a brief period isn't the worst thing in the world, my vote is that we change the default lease name in the VPA.

Any VPA configured with the lease enabled will only be running multiple pods for a short period of time, which should be fine.

It's obviously not an amazing path forward, but may be worth doing.

I'm curious what @voelzmo and @kwiesmueller think, as they may be the ones approving that controversial PR.

@adrianmoisey
Member

adrianmoisey commented Nov 5, 2024

my vote is that we change the default lease name in the VPA.

If we do go down this path, I suggest we also make PRs into 3rd-party Helm charts to ensure they support the new default name. Some of them hardcode the lease name.

@kwiesmueller
Member

I'm not sure if changing the default in OSS is the right approach because GKE claims it too.
Having documentation (GKE should maybe have a docs page for that too) seems like it should be enough.
Is there anything preventing GKE users from deploying a custom VPA with a different lease name?

cc. @raywainman

@adrianmoisey
Member

I'm not sure if changing the default in OSS is the right approach because GKE claims it too.

My concern is that if someone follows the happy path of deployment and enables the lease without reading the documentation, then GKE breaks in ways that are not obvious.

It took us 4 days and countless messages with GCP support before we found out what GCP was doing with the vpa-recommender lease (namely that the GKE HPA uses it).
It's also not possible (from what I understand) for a user to self-troubleshoot this problem (i.e. GKE users can't see the logs of the HPA controller).

The pain of renaming the default OSS lease is pretty small.

@adrianmoisey
Member

/area vertical-pod-autoscaler

@raywainman
Contributor Author

Just thinking out loud - what is the worst that can happen when changing a lease name?

On upgrade, we would have two vpa-recommender pods (one from the old version, one from the new version) that both believe they are the leader.

For a short period of time, both of them will try to compute and apply VPA recommendations. Off the top of my head, these should all generally be the same recommendations; we will just have the possibility of racing/duplication.

Is there any other risk anyone else can think of?

@adrianmoisey
Member

Just thinking out loud - what is the worst that can happen when changing a lease name?

Yup! That is my thought too.

For a short period of time, both of them will try to compute and apply VPA recommendations. Off the top of my head, these should all generally be the same recommendations; we will just have the possibility of racing/duplication.

Correct! My understanding is that the worst that could happen is that 2 VPA recommenders would be giving roughly the same recommendation for a very short period of time, after which only 1 will be active.

And the VPA recommender's --recommender-interval flag is set to 1 minute by default, lowering that risk. There may be a few other parts I'm forgetting, such as the OOM detector, but I don't think that's a big deal either.

@raywainman
Contributor Author

@ialidzhikov what do you think?

My vote is to change it now before the next release (we could even consider a cherrypick here) given that:

  • This feature is fairly new, we have an opportunity to fix this before it gains wider adoption.
  • Leader election is off by default.
  • We can try and be loud in the release notes about this change to warn users in case they see strange behavior.
  • The theoretical impact of having two leaders running temporarily seems quite minimal.

@raywainman
Contributor Author

cc @voelzmo as well

@raywainman
Contributor Author

raywainman commented Nov 6, 2024

Could something maybe go wrong with the checkpoint?

Perhaps if the new-version leader writes a checkpoint that the old-version leader doesn't understand?

EDIT: Looks like we have support for this case.

@adrianmoisey
Member

@ialidzhikov what do you think?

My vote is to change it now before the next release (we could even consider a cherrypick here) given that:

  • This feature is fairly new, we have an opportunity to fix this before it gains wider adoption.
  • Leader election is off by default.
  • We can try and be loud in the release notes about this change to warn users in case they see strange behavior.
  • The theoretical impact of having two leaders running temporarily seems quite minimal.

I agree with all these points.
Also of note: the VPA documentation isn't very good about telling people how to run the VPA. The upside is that no documentation was really updated to explain the lease feature (besides the update to the parameters: 211ecdc). I recently added #7463, but that's only a few days old.

If we're going to change the lease name, I think now is the perfect time to do so.

@ialidzhikov
Contributor

ialidzhikov commented Nov 7, 2024

Hi folks,

% docker run registry.k8s.io/autoscaling/vpa-recommender:1.2.1 --help

      --leader-elect                                         Start a leader election client and gain leadership before executing the main loop. Enable this when running replicated components for high availability.
      --leader-elect-lease-duration duration                 The duration that non-leader candidates will wait after observing a leadership renewal until attempting to acquire leadership of a led but unrenewed leader slot. This is effectively the maximum duration that a leader can be stopped before it is replaced by another candidate. This is only applicable if leader election is enabled. (default 15s)
      --leader-elect-renew-deadline duration                 The interval between attempts by the acting master to renew a leadership slot before it stops leading. This must be less than the lease duration. This is only applicable if leader election is enabled. (default 10s)
      --leader-elect-resource-lock string                    The type of resource object that is used for locking during leader election. Supported options are 'leases', 'endpointsleases' and 'configmapsleases'. (default "leases")
      --leader-elect-resource-name string                    The name of resource object that is used for locking during leader election. (default "vpa-recommender")
      --leader-elect-resource-namespace string               The namespace of resource object that is used for locking during leader election. (default "kube-system")
      --leader-elect-retry-period duration                   The duration the clients should wait between attempting acquisition and renewal of a leadership. This is only applicable if leader election is enabled. (default 2s)

The leader election resource name and namespace are configurable. If you are installing a custom VPA to a GKE cluster, please use the --leader-elect-resource-namespace and --leader-elect-resource-name flags to avoid using the same lease that GKE's VPA is using.
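
For example, here is a hedged sketch of wiring those flags into a self-deployed recommender. The deployment name/namespace and the my-vpa-recommender lease name are illustrative and depend on how you installed the VPA (e.g. via vpa-up.sh); it also assumes the container already passes its flags via args:

# Append the lease override flags to the recommender's existing args (illustrative names)
% kubectl -n kube-system patch deployment vpa-recommender --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--leader-elect-resource-name=my-vpa-recommender"},
  {"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--leader-elect-resource-namespace=kube-system"}
]'

Note that if the recommender's RBAC restricts lease access via resourceNames (as some Helm charts do), the new lease name needs to be added there as well.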

@ialidzhikov
Contributor

I am also open to renaming the default lease name in the VPA repo to avoid this issue.

@kwiesmueller
Member

Alright, as it's still new and it's an easy way to avoid issues before too many people use it, I guess changing it now is fine.
I am a little worried about breaking people who already use it, so I'm not sure about backporting it (it doesn't seem correct under semantic versioning).

@raywainman
Contributor Author

We'll need to update this: https://github.com/cowboysysop/charts/blob/master/charts/vertical-pod-autoscaler/templates/recommender/clusterrole.yaml#L141-L142 which was added in cowboysysop/charts#705.

If we combine it with a release then this should be less painful.

@raywainman
Contributor Author

Opened cowboysysop/charts#781 to update the Helm chart.

adrianmoisey added a commit to adrianmoisey/helm-charts that referenced this issue Dec 3, 2024
…se" to avoid conflicts

See kubernetes/autoscaler#7461.

The default lock name can sometimes conflict with a managed deployment of VPA/HPA which can cause issues on your cluster.

Here we are proposing changing the flag value in the Helm chart to a new name (`vpa-recommender-lease`). We are keeping the old resource name in the RBAC to avoid disruptions or issues during upgrade.
@adrianmoisey
Member

A summary of where we are with this:

  1. The lease name default has been changed - Change default lease name for vpa-recommender to "vpa-recommender-lease" #7498
  2. A PR for cowboysysop has been made, but not yet merged - Change default lease name for vpa-recommender to "vpa-recommender-lease" to avoid conflicts cowboysysop/charts#781 ⚠️
  3. A PR for stevehipwell has been made, but not yet merged - Change default lease name for vpa-recommender to "vpa-recommender-lease" to avoid conflicts stevehipwell/helm-charts#1090 ⚠️

We should try to get the bottom two items done before cutting a release of the VPA.
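
Once the renamed default ships in a release, a rough way for anyone running a self-deployed VPA to confirm which lease their recommender is actually holding (a sketch; assumes the components use the default kube-system namespace):

# List VPA-related leases and their current holders
% kubectl -n kube-system get leases | grep -i vpa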

@voelzmo
Contributor

voelzmo commented Dec 3, 2024

We should try to get the bottom two items done before cutting a release of the VPA.

I don't think we need to. Helm charts decide when/how to pick up new upstream releases on their own, right? Is there any reason why a VPA release depends on changes made in Helm charts?

@adrianmoisey
Member

We should try to get the bottom two items done before cutting a release of the VPA.

I don't think we need to. Helm charts decide when/how to pick up new upstream releases on their own, right? Is there any reason why a VPA release depends on changes made in Helm charts?

There is certainly no hard dependency, but I worry that if someone decides to bump the Helm chart before merging in the lease rename PRs, then users of that Helm chart may have a broken VPA.

I don't think we have to wait, but I think it would be a nice thing to do (with some friendly nudges to the Helm chart maintainers if PRs aren't merged).

Basically, we're breaking a promise we had with them, so let's try to handle the name change as gracefully as possible.

@stevehipwell
Contributor

@adrianmoisey speaking as the maintainer of the stevehipwell Helm chart: the charts shouldn't be modified until the release is out. As long as the CHANGELOG documents the lease name change (and maybe includes a warning about requiring both for the update), it will be captured.

On to the wider issue here: is there a reason why the lease name wasn't made an optional arg? This would have the least impact on all of the existing VPA users, most of whom are unlikely to be on GCP. And if the change was being made, is there a reason why it wasn't made consistently across all the components? Nothing makes long-term maintenance harder than unexpected inconsistency.

@adrianmoisey
Member

speaking as the maintainer of the stevehipwell Helm chart: the charts shouldn't be modified until the release is out. As long as the CHANGELOG documents the lease name change (and maybe includes a warning about requiring both for the update), it will be captured.

Good to know, thanks!

On to the wider issue here: is there a reason why the lease name wasn't made an optional arg?

The lease functionality was defaulted to "off", and the lease name was an arg, with a default set.

And also is there a reason why if the change was being made that it wasn't made consistently across all the components?

What do you mean? The lease was rolled out for all 3 VPA components.

@stevehipwell
Contributor

@adrianmoisey sorry I missed the original state of the system regarding existing args, but that leads me to question why a code change was required at all? I can see why support for using an alternate lease name would be useful, but it looks to me like this could be a manifest only change? And I'm not sure I understand why the default needs to change rather than document the edge case? My Helm chart can (and will) easily support modifying the default lease names.

What do you mean? The lease was rolled out for all 3 VPA components.

I can only see a change to the default name for the recommender in #7498, are there other PRs for the other components to add the -lease suffix?

@adrianmoisey
Member

sorry I missed the original state of the system regarding existing args, but that leads me to question why a code change was required at all? I can see why support for using an alternate lease name would be useful, but it looks to me like this could be a manifest only change? And I'm not sure I understand why the default needs to change rather than document the edge case? My Helm chart can (and will) easily support modifying the default lease names.

There were a few factors in the decision:

  1. When using this lease name on GKE, the side effects are very surprising and not easy to debug or troubleshoot.
  2. The lease is currently off by default and a fairly new feature, so we assume not many people are using it yet.
  3. Changing the default doesn't cause harm for the recommender. Worst case, if multiple recommenders run, there's a small possibility that you get subtly different recommendations. We believe this won't happen during a deployment rollout.

We basically concluded that the pain of changing the lease name now was lower than the pain of someone having troubles when running on GKE.

I can only see a change to the default name for the recommender in #7498, are there other PRs for the other components to add the -lease suffix?

Those names don't conflict with GKE, so there's no need to change them.

@stevehipwell
Contributor

I can see that it's easier, but is it correct? This looks like a GKE issue, and what's to stop them from modifying their deployment to change their lease name if they're tracking the changes in this repo blindly? For managed components they should, at a minimum, have a unique lease defined for this very reason.

Those names don't conflict with GKE, so there's no need to change them.

I think this is a mistake. If you're making a change, skipping the other components lowers the complexity today (a one-off write cost) but increases it for the life of the project (a perpetual read cost). I would suggest that the lease names be aligned across all components.

Anyway, and please don't take my feedback the wrong way, I'll make sure to update the Helm chart when the next VPA release is out. I'll likely make the lease name configurable, so even if GKE takes the new name there will be a way to work around it.
