DRA: consolidate pre-submit jobs #33942
Conversation
The -kubetest2 variants were added as part of kubernetes#33550 without considering whether it makes sense to run them automatically. They shouldn't run automatically, because doing so adds no test coverage and is expensive.
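For context, in Prow terms "not running automatically" means the presubmit keeps always_run: false, has no run_if_changed, and is typically marked optional so it can still be started on demand with /test. A minimal sketch of what that could look like for the kubetest2 variant (values are illustrative, not the actual job definition from this PR):

```yaml
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2  # experimental alternative
    always_run: false   # never triggered automatically
    optional: true      # not required for merge; start on demand with /test <job-name>
    # no run_if_changed: the job only runs when explicitly requested
```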
@@ -4322,11 +4327,12 @@ presubmits:
      limits:
        cpu: 4
        memory: 6Gi
- name: pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2
- name: pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2 # experimental alternative to pull-kubernetes-node-e2e-crio-cgrpv1-dra
@elieser1101: what is the status of these kubetest2 variants?
As far as I can see they don't differ much from the corresponding kubernetes_e2e.py variants. I'd propose dropping the kubernetes_e2e.py jobs.
At the moment the kubetest2 variants are not working because --label-filter is not supported. I started to look at a fix in kubetest2: kubernetes-sigs/kubetest2#285
@elieser1101 Both kubetest2 jobs are "green" in the test grid:
- https://testgrid.k8s.io/sig-node-cri-o#pr-node-kubelet-crio-cgrpv1-dra-kubetest2
- https://testgrid.k8s.io/sig-node-cri-o#pr-node-kubelet-crio-cgrpv2-dra-kubetest2
Do you mean they're not working correctly / as expected?
Yeah, both are green, but that's because the tests are not actually running; we can see that from the build.log:
Ran 0 of 718 Specs in 0.023 seconds
SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 718 Skipped
PASS
@@ -4276,7 +4278,10 @@ presubmits:
    skip_branches:
    - release-\d+\.\d+ # per-release image
    always_run: false
    run_if_changed: /dra/|/dynamicresources/|/resourceclaim/|/resourceclass/|/podscheduling/|/resourceclaimtemplate/|/dynamic-resource-allocation/|/pkg/apis/resource/|/api/resource/|/test/e2e_node/dra_test.go
    # Automatically testing with one container runtime is sufficient.
    # containerd was picked because it's apparently more common.
@bart0sh: does choosing containerd make sense?
Which runtime is more stable in your experience?
cri-o has been just a little bit more stable lately, but that can and most probably will change if we drop one of the runtimes as you're suggesting.
Can we keep this job as is and stop automatically triggering pull-kubernetes-node-e2e-crio-cgrpv1-dra instead?
cri-o has been just a little bit more stable lately, but that can and most probably will change if we drop one of the runtimes as you're suggesting.
Why would dropping testing of one runtime change how stable the other runtime is?
Sufficient for what?
Sufficient for knowing that kubelet's interaction with a runtime isn't completely broken.
Can we keep this job as is and stop automatically triggering pull-kubernetes-node-e2e-crio-cgrpv1-dra instead?
Sure, if that one was more stable in the past. The goal is to avoid flakes that aren't caused by our code changes, so whatever runtime is more stable is better.
When I added those jobs, my idea was to cover all available runtimes and distros.
That makes perfect sense for periodic testing. That's how we gain assurance that what we will release will work. But we don't need that level of confidence for every single PR that we merge.
But we don't need that level of confidence for every single PR that we merge.
What's wrong with having the ability to see a failure before merging the code that causes it? And why don't we need to test with containerd, COS, and Ubuntu, but do need it for cri-o and Fedora? Where is the border line between PR and CI jobs in this sense?
For pre-submit: what's wrong is that this causes real, significant costs for the project without being justified by the benefit that it brings. We cannot afford to run everything because "there might be failures".
For periodic: here it is more reasonable to expand the scope of platform coverage because we control how often we test. But also here the question arises whether we expect problems on one platform that don't occur on another and whether it should be Kubernetes that tests that or the downstream consumer of Kubernetes. Even cloud providers are expected to run their own testing of Kubernetes these days, not the Kubernetes project itself.
It's hard to argue about costs without having any information about them. However, having one more job triggered on changes to a tiny subset of the codebase wouldn't make a lot of difference in my opinion. So, I'm still for keeping both jobs active.
Let's see what other reviewers say about it.
It's hard to argue about costs without having any information about them
Then just trust Ben when he says that automatically triggered pre-submit jobs shouldn't exist at all or only under very special circumstances. He is SIG testing and infra lead, he knows about the costs.
having one more job triggered on changes to a tiny subset of the codebase
I think you haven't seen all PRs where this got triggered. I got asked several times about them in the context of PRs where they shouldn't have been started.
It's not necessary to cover all possible runtime variants during pre-submit for all PRs. One of them should be sufficient, the rest can be covered during periodic testing. This cuts down CI costs and noise in unrelated PRs that happen to trigger the run_if_changed.
These pre-submit checks are more useful when someone actually modifies source code and less useful for other changes (Go dependencies, Markdown).
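To make that concrete: the consolidated setup keeps a single runtime auto-triggered via run_if_changed scoped to DRA-related paths, while the other variants stay defined but only run on request. A hedged sketch; the containerd job name below is assumed for illustration, and the actual paths are in the diff above:

```yaml
# Auto-triggered only for DRA-related changes, single runtime (containerd).
- name: pull-kubernetes-node-e2e-containerd-dra   # illustrative name, not taken from this PR
  always_run: false
  run_if_changed: /dra/|/dynamic-resource-allocation/|/test/e2e_node/dra_test.go
# Other runtime variants stay available, but only via an explicit /test comment.
- name: pull-kubernetes-node-e2e-crio-cgrpv1-dra
  always_run: false
  optional: true   # assumption: not merge-blocking when triggered manually
```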
/assign @klueska
/priority important-soon
We keep running these extra jobs unnecessarily (kubernetes/kubernetes#129213, kubernetes/kubernetes#129195, kubernetes/kubernetes#129103, etc.).
I am gonna land this, we can iterate if needed. Want to stop the bleeding! /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dims, pohly
The full list of commands accepted by this bot can be found here. The pull request process is described here.
This reduces the number of jobs which get triggered automatically to save costs and avoid noise in unrelated PRs. See kubernetes/community#8196 for a suggested policy and rationale.