Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added webhook-annotations flag to VPA admission-controller #7472

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

wcarlsen
Copy link
Contributor

@wcarlsen wcarlsen commented Nov 7, 2024

What type of PR is this?

/kind bug
/kind feature

What this PR does / why we need it:

The VPA admission controller when deployed is bundled with a certificate. The responsibility of certificate renewal can be automatically handled with cert-manager, upon renewal the admission controller reloads the certificate, but the caBundle on the mutatingwebhook managed by the admission controller isn't updated.

Cert-manager comes with a ca-injector that exactly handles updating the CA certificates for mutating webhooks via an annotation cert-manager.io/inject-ca-from: kube-system/vpa-cert-by-certmanager see ca-injector docs. Thanks to @abstrask for coming up with the idea.

Which issue(s) this PR fixes:

Related to PR #7454 and issue #6665

Special notes for your reviewer:

I still think the PR #7454 is in the end the right approach, but this at least offer a low hanging alternative solution.

Does this PR introduce a user-facing change?

Added webhook-annotations flag to VPA admission-controller. This is specially useful for having cert-managers ca-injector automatically updating  the CA certificates on the mutating webhook.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. labels Nov 7, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 7, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @wcarlsen. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 7, 2024
@adrianmoisey
Copy link
Member

/ok-to-test

Forgive me for not understanding, but if cert-manager is handling the CA bundle in the mutating webhook, doesn't the VPA also need to reload when the CA bundle changes?

Or is this handling the case when the CA stays static, and only the certificates themselves are rotating?

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 7, 2024
@wcarlsen
Copy link
Contributor Author

wcarlsen commented Nov 7, 2024

/ok-to-test

Forgive me for not understanding, but if cert-manager is handling the CA bundle in the mutating webhook, doesn't the VPA also need to reload when the CA bundle changes?

Or is this handling the case when the CA stays static, and only the certificates themselves are rotating?

It already has this capability as far as I understand

or am I missing something here?

@adrianmoisey
Copy link
Member

At the moment it only reloads the cert and server key. If the CA bundle changes, my understanding is that it won't reload.

If cert-manager is rotating the CA too (I don't know if it does), then this PR depends on #7454.

If cert-manager is not rotating the CA, and is only rotating the server cert and key, then this doesn't depend on #7454

But, I'm not too familiar with this flow, so I could be wrong.

@adrianmoisey
Copy link
Member

Ok, I had a quick look at the code.

The VPA uses clientCaFile in order to register the webhook.
It uses client-ca-file and tls-cert-file to server HTTPS traffic.

This PR will allow cert-manager to manage both CA and certificate/key. But cert-manager will be responsible for writing it into the webhook's configuration and VPA will be responsible for reloading the certificate/key for itself.

That makes me think that this change will conflict with #7454, since setting --reload-cert=true will mean that CA bundle and certificate/key will be handled by VPA.
I guess the VPA needs a reload-cert and a reload-ca option

@wcarlsen
Copy link
Contributor Author

wcarlsen commented Nov 7, 2024

Ok, I had a quick look at the code.

The VPA uses clientCaFile in order to register the webhook. It uses client-ca-file and tls-cert-file to server HTTPS traffic.

This PR will allow cert-manager to manage both CA and certificate/key. But cert-manager will be responsible for writing it into the webhook's configuration and VPA will be responsible for reloading the certificate/key for itself.

That makes me think that this change will conflict with #7454, since setting --reload-cert=true will mean that CA bundle and certificate/key will be handled by VPA. I guess the VPA needs a reload-cert and a reload-ca option

At least now as soon as a certificate is renewed, by e.g. cert-manager, you get TLS handshake error and all pods with resources managed by a VPA resource if restarted will end up restarting endlessly. And according to my testing setting that annotation fixes that the admission controller do not really fully manage the lifecycle of the webhook.

My take is that #7454 is the long term goal and the right solution in the end, but annotating the webhook offer an easier reviewable solution right now.

But sure if one wanted split responsibility of updating the webhook it could make sense to introduce reload-ca option too.

@adrianmoisey
Copy link
Member

It seems like #7454 will solve this issue, since when cert-manager rotates the cert and ca, VPA will update the webhook and it will reload the certs.

Any reason to not wait for that change?

@wcarlsen
Copy link
Contributor Author

wcarlsen commented Nov 7, 2024

It seems like #7454 will solve this issue, since when cert-manager rotates the cert and ca, VPA will update the webhook and it will reload the certs.

Any reason to not wait for that change?

You are right and I'm totally cool with that. It is just hard for me to predict where we are at in terms of progress on that PR and a release. The experience is not great once you finally hit the edge case and it is kind of hard to debug. But it can be mitigated by manually adding the ca-injector annotation, since it won't be reconciled by the admission controller and then wait for an upstream fix.

You are welcome to close the PR unless we can up with another argument on why it would be nice with an annotation flag.

@adrianmoisey
Copy link
Member

You are right and I'm totally cool with that. It is just hard for me to predict where we are at in terms of progress on that PR and a release.

I'm unsure about a release, but I don't think this PR will shortcut the other one and be released first.
That other PR is next on my list to try get merged.

You are welcome to close the PR unless we can up with another argument on why it would be nice with an annotation flag.

I agree that an annotation may be useful, I just don't see this use case as being one of them

@wcarlsen
Copy link
Contributor Author

wcarlsen commented Nov 7, 2024

You are right and I'm totally cool with that. It is just hard for me to predict where we are at in terms of progress on that PR and a release.

I'm unsure about a release, but I don't think this PR will shortcut the other one and be released first. That other PR is next on my list to try get merged.

That is awesome news.

You are welcome to close the PR unless we can up with another argument on why it would be nice with an annotation flag.

I agree that an annotation may be useful, I just don't see this use case as being one of them

I will let that being up to you to decide and I'm totally fine with updating the release notes section too. No pressure and thanks 👍

@voelzmo
Copy link
Contributor

voelzmo commented Dec 3, 2024

@wcarlsen A side-note to unblock you with this use-case: Start up the admission-controller with the additional flag --register-webhook=false and manage the webhook registration yourself. This way, you can add all the necessary annotations and just have cert-manager do its thing.

We're doing something similar in our platform and it works like a charm.

For context: I consider this self-registration as some kind of odd feature that tries to improve the batteries-included experience for people who don't have a way to automatically generate certs (especially in combination with our hack/vpa-up.sh script. I wasn't really thinking these two things would/should be used in production envs.

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. area/balancer area/cluster-autoscaler area/provider/alicloud Issues or PRs related to the AliCloud cloud provider implementation area/provider/aws Issues or PRs related to aws provider area/provider/azure Issues or PRs related to azure provider area/provider/cluster-api Issues or PRs related to Cluster API provider and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Dec 4, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 6, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wcarlsen
Once this PR has been reviewed and has the lgtm label, please assign raywainman for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 6, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 6, 2024
@wcarlsen
Copy link
Contributor Author

wcarlsen commented Dec 6, 2024

@wcarlsen A side-note to unblock you with this use-case: Start up the admission-controller with the additional flag --register-webhook=false and manage the webhook registration yourself. This way, you can add all the necessary annotations and just have cert-manager do its thing.

We're doing something similar in our platform and it works like a charm.

For context: I consider this self-registration as some kind of odd feature that tries to improve the batteries-included experience for people who don't have a way to automatically generate certs (especially in combination with our hack/vpa-up.sh script. I wasn't really thinking these two things would/should be used in production envs.

Brilliant. Thanks blowing life into this idea!

@adrianmoisey
Copy link
Member

/remove-area balancer
/remove-area cluster-autoscaler
/remove-area provider/alicloud
/remove-area provider/aws
/remove-area provider/azure
/remove-area provider/cluster-api
/remove-area provider/digitalocean
/remove-area provider/equinixmetal
/remove-area provider/externalgrpc
/remove-area provider/gce
/remove-area provider/hetzner
/remove-area provider/ionoscloud
/remove-area provider/kwok
/remove-area provider/linode
/remove-area provider/magnum
/remove-kind bug

@k8s-ci-robot k8s-ci-robot removed area/balancer area/cluster-autoscaler area/provider/alicloud Issues or PRs related to the AliCloud cloud provider implementation area/provider/aws Issues or PRs related to aws provider area/provider/azure Issues or PRs related to azure provider area/provider/cluster-api Issues or PRs related to Cluster API provider area/provider/digitalocean Issues or PRs related to digitalocean provider area/provider/equinixmetal Issues or PRs related to the Equinix Metal cloud provider for Cluster Autoscaler area/provider/externalgrpc Issues or PRs related to the External gRPC provider area/provider/gce area/provider/hetzner Issues or PRs related to Hetzner provider area/provider/ionoscloud area/provider/kwok Issues or PRs related to the kwok cloud provider for Cluster Autoscaler area/provider/linode Issues or PRs related to linode provider area/provider/magnum Issues or PRs related to the Magnum cloud provider for Cluster Autoscaler kind/bug Categorizes issue or PR as related to a bug. labels Dec 6, 2024
@adrianmoisey
Copy link
Member

/remove-area provider/oci
/remove-area provider/rancher

@k8s-ci-robot k8s-ci-robot removed area/provider/oci Issues or PRs related to oci provider area/provider/rancher labels Dec 6, 2024
@wcarlsen
Copy link
Contributor Author

Since #7454 seems to be somewhat stuck in valid discussions, this offers a relatively easy to review alternative solution to the issue.

Yes I can prevent the admission controller from registering the webhook with --register-webhook=false and just roll it myself, but I would anyways end up with setting an annotation for cert-manager to update the caBundle with cert-manager.io/inject-ca-from and I now need to be aware of potential drift related to the webhook in future releases.

I would very much prefer #7454, but until then can we please have webhook annotations to mitigate this.

@adrianmoisey
Copy link
Member

Since #7454 seems to be somewhat stuck in valid discussions

Progress is happening over there, and should hopefully be concluded soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants