Wrong VPA is attempting to update pod resources and failing #7499
Comments
/area vertical-pod-autoscaler
I believe this issue would happen with the VPA deployed from OSS as well. The way VPA knows which pods to target is that it fetches the CronJob object and then reads the labels from the pod template inside its jobTemplate, using those labels as the selector for matching pods. From your example above, you are not setting any labels in your CronJob templates, so that selector ends up empty and matches everything. So now the two VPAs are essentially racing and competing against each other. To fix this, you'll need to set some labels in your two CronJob templates to make sure you are differentiating them from each other. Hope that helps :)
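For illustration, a minimal sketch of what that could look like; the names, schedules, and label values below are made up, and the only point is that each CronJob's pod template carries a label the other one doesn't:

```yaml
# Hypothetical example: each CronJob's pod template gets a distinct label,
# so the selector each VPA derives from its target CronJob only matches
# that CronJob's own pods.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cronjob-a
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: cronjob-a   # distinct from cronjob-b's label
        spec:
          containers:
            - name: main
              image: busybox:1.28
              command: ["sh", "-c", "echo from cronjob-a"]
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cronjob-b
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: cronjob-b   # distinct from cronjob-a's label
        spec:
          containers:
            - name: main
              image: busybox:1.28
              command: ["sh", "-c", "echo from cronjob-b"]
          restartPolicy: OnFailure
```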
@raywainman Yep, that seems to be the issue! In the real environment where we were doing experimentation, the CronJob templates didn't set any labels either. Not sure if y'all are considering this a bug, though. If not, at a minimum it would be helpful if this behavior was documented somewhere clearly so that it doesn't confuse others in the same way.
That's a great idea; it's an easy pitfall. I thought we had something documenting this, but I actually couldn't find anything. Let me see if I can put together a PR with this. Thanks @WesCossick!
/kind documentation
@raywainman I can make a PR for this, but are we sure it's not a bug? Shouldn't we log a warning to let users know their VPA will target everything?
That reminds me... #6460 was opened a long time ago and fixes this problem.
I'm not sure if this qualifies as a bug, but the issue is that the function at https://github.com/kubernetes/autoscaler/blob/master/vertical-pod-autoscaler/pkg/utils/vpa/api.go#L163 is designed to identify the parent of a pod. However, the behavior is different for CronJobs. In the case of CronJobs, they create Jobs, which in turn create Pods. As a result, the pod's direct owner is the Job rather than the CronJob, so the CronJob is one level removed from the pods it ultimately controls.
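To make the ownership chain concrete, this is roughly what the metadata looks like for the objects a CronJob produces (a hypothetical sketch with made-up names, trimmed to the relevant fields; real ownerReferences also carry a uid):

```yaml
# The Pod's direct owner is the Job; the CronJob is one level further up.
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-28391220          # created by the CronJob "hello"
  ownerReferences:
    - apiVersion: batch/v1
      kind: CronJob
      name: hello
      controller: true
---
apiVersion: v1
kind: Pod
metadata:
  name: hello-28391220-abcde    # created by the Job above
  ownerReferences:
    - apiVersion: batch/v1
      kind: Job
      name: hello-28391220
      controller: true
```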
I ran into this while using VPA 1.2.0, trying to provide recommendations for CronJobs, and using the following manifest:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/3 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            app: hello
        spec:
          containers:
            - name: hello
              image: busybox:1.28
              imagePullPolicy: IfNotPresent
              command:
                - /bin/sh
                - -c
                - |
                  tries=0
                  while [ $tries -lt 30 ]; do
                    echo "Hello, World $tries!"
                    tries=$((tries+1))
                    sleep 6
                  done
          restartPolicy: OnFailure
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: hello
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: "CronJob"
    name: "hello"
  updatePolicy:
    updateMode: "Initial"
    minReplicas: 1
```

I had to specify the Job labels like @raywainman said in order for it to start working :/. Seems like a bug/docs issue to me if reproducible.
You're right, thanks for clarifying that. The VPA is definitely expecting a CronJob to have labels set on its jobTemplate's pod template.
If that's the case, we should at least emit a log (and maybe throw an error?).
Oh, this is interesting (and a bit unexpected, tbh): the CronJobController doesn't use the labels to get the Jobs to reconcile, but instead uses just the ownerReferences. Nevertheless, I would also expect some mention of this in the official CronJob docs, and I guess we can also add a reference to it in the VPA docs, doubling down on disjoint labels for separate CronJobs. WDYT?
I tried to reproduce this error, and I couldn't. The relevant part of the code seems to be: https://github.com/kubernetes/kubernetes/blob/8a39b6062060ffb4fba52f53c011bd00428db194/pkg/registry/batch/job/strategy.go#L223-L226. Based on this, it seems that the CronJob <-> Job relationship in Kubernetes is sound (as you had mentioned with the OwnerRef), and the Job <-> Pod relationship is also sound, due to these labelSelectors being auto-generated. (It does seem that a user can set their own selector instead of the auto-generated one, though.) This makes me think that, to grab the labels for a CronJob, the VPA should follow the same approach that Kubernetes itself is using, which makes me think that this is a bug.
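For reference, this is roughly what those auto-generated labels and selector look like on a Job created by a CronJob on recent Kubernetes versions (the UID and Job name below are made up; older versions use the unprefixed controller-uid / job-name label keys):

```yaml
# Sketch of an auto-generated Job selector: when no manual selector is set,
# the Job gets a controller-uid based selector plus matching pod template
# labels, independent of whatever labels the user put on the CronJob template.
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-28391220
  labels:
    batch.kubernetes.io/controller-uid: 1f1e9ad2-0000-0000-0000-000000000000
    batch.kubernetes.io/job-name: hello-28391220
spec:
  selector:
    matchLabels:
      batch.kubernetes.io/controller-uid: 1f1e9ad2-0000-0000-0000-000000000000
  template:
    metadata:
      labels:
        batch.kubernetes.io/controller-uid: 1f1e9ad2-0000-0000-0000-000000000000
        batch.kubernetes.io/job-name: hello-28391220
```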
Oooh, great find on the auto-generated labels and selectors on Job! There is even good documentation about this on the Job page. So that means that from the CronJob docs' perspective it makes sense to allow users to leave the template labels empty, but this doesn't work for VPA. If they are empty, you will at least get a condition in the VPA status telling you that no pods were matched.
And if users then configure the same labels on different CronJobs, this is still fine from the CronJob and Job controllers' perspective; it is just VPA that requires unique labels per CronJob. So this is definitely a VPA topic. I'm still wondering why #6460 wouldn't help here, though. Shouldn't it detect the case where the labels match, but following the ownerReferences leads to a different CronJob?
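For what it's worth, the condition referenced above looks roughly like this in the VPA status (a sketch only; the exact wording may differ between VPA versions):

```yaml
# Approximate shape of the VPA status when the derived selector matches no pods.
status:
  conditions:
    - type: NoPodsMatched
      status: "True"
      message: No pods match this VPA object
```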
I think I've figured out what's going on. This piece of code grabs the labels from a CronJob's Job template, assuming that the Pods will be discoverable by using those labels. Since there may not be any labels on the CronJob's spec, the labelSelector can be empty. Later on, when matching Pods to VPAs, this code then uses those labelSelectors to find a match: https://github.com/bskiba/autoscaler/blob/35359b902ba00931fcad5be7c72a03bd7be39f7f/vertical-pod-autoscaler/pkg/utils/vpa/api.go#L99-L110. Since the labelSelector that came from the CronJob is empty, nothing is matched. I think we could fix this if we found an active Job and used its labelSelector to find Pods. I assume the implementation may end up being a little gross, though.
Still, after #6460, we also check the ownerReferences, don't we?
Not to my understanding. Since we're matching a Pod to the labels discovered by the VPA (which are empty), no match happens. I think the purpose of #6460 is to filter out Pods that were matched but shouldn't have been matched. I think the CronJob situation is the opposite: no pods are being matched, since the labelSelector can't be discovered.
Which component are you using?:
vertical-pod-autoscaler
What version of the component are you using?:
Component version: Not sure; using the one auto-installed by GKE.
What k8s version are you using (kubectl version)?:
What environment is this in?:
Google Cloud Platform's GKE.
What did you expect to happen?:
For the resources to be updated by the correct VPA.
What happened instead?:
The wrong VPA is updating pod resources and failing.
How to reproduce it (as minimally and precisely as possible):
Create two CronJobs and two VPAs, one VPA targeting each CronJob (a hypothetical example of such a pair is sketched after these steps).
Using kubectl describe pod, see annotations showing that the wrong VPA has updated one of the CronJob's pods. When this happens, the other CronJob's pods look correct and have their CPU and memory correctly set by the VPA. Which one is mixed up seems to change based on the order in which you create the VPAs.
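Since the original manifests aren't shown above, a hypothetical pair that reproduces the symptom could look like the following: two CronJobs whose pod templates carry no labels at all, each targeted by its own VPA, so both VPAs derive an empty selector (names and schedules made up):

```yaml
# Hypothetical reproduction: neither pod template has labels, so the selector
# each VPA derives from its CronJob is empty and can match the other
# CronJob's pods as well.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: job-a
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: main
              image: busybox:1.28
              command: ["sh", "-c", "sleep 60"]
          restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: job-b
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: main
              image: busybox:1.28
              command: ["sh", "-c", "sleep 60"]
          restartPolicy: OnFailure
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-a
spec:
  targetRef:
    apiVersion: batch/v1
    kind: CronJob
    name: job-a
  updatePolicy:
    updateMode: "Initial"
---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: vpa-b
spec:
  targetRef:
    apiVersion: batch/v1
    kind: CronJob
    name: job-b
  updatePolicy:
    updateMode: "Initial"
```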
Anything else we need to know?:
I pared down the two CronJob resources to simplify them.