
DRA: consolidate pre-submit jobs #33942

Merged
merged 3 commits into from
Dec 14, 2024

Conversation

pohly
Contributor

@pohly pohly commented Dec 13, 2024

This reduces the number of jobs which get triggered automatically to save costs and avoid noise in unrelated PRs. See kubernetes/community#8196 for a suggested policy and rationale.

The -kubetest2 variants were added as part of
kubernetes#33550 without considering whether
it makes sense to run them automatically. They shouldn't run automatically,
because doing so doesn't add test coverage and is expensive.
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 13, 2024
@k8s-ci-robot k8s-ci-robot added area/config Issues or PRs related to code in /config area/jobs sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Dec 13, 2024
@@ -4322,11 +4327,12 @@ presubmits:
limits:
cpu: 4
memory: 6Gi
- name: pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2
- name: pull-kubernetes-node-e2e-crio-cgrpv1-dra-kubetest2 # experimental alternative to pull-kubernetes-node-e2e-crio-cgrpv1-dra
Contributor Author

@elieser1101: what is the status of these kubetest2 variants?

Contributor

As far as I can see, they don't differ much from the corresponding kubernetes_e2e.py variants. I'd propose dropping the kubernetes_e2e.py jobs.

Contributor

At the moment the kubetest2 variants are not working because --label-filter is not supported. I started looking at a fix in kubetest2: kubernetes-sigs/kubetest2#285

Contributor

Yeah, both are green, but that's because the tests are not actually running. We can see that from the build.log:

Ran 0 of 718 Specs in 0.023 seconds
SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 718 Skipped
PASS
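The summary above is exactly what a false-positive run looks like: the suite exits green even though every spec was skipped. As a hedged sketch (the `specs_ran` helper below is hypothetical, not part of any existing tooling), a small check on the build log could catch this class of failure:

```python
import re

def specs_ran(build_log: str) -> int:
    """Return the number of specs actually executed, parsed from a
    ginkgo summary line like 'Ran 0 of 718 Specs in 0.023 seconds'."""
    m = re.search(r"Ran (\d+) of (\d+) Specs", build_log)
    if m is None:
        raise ValueError("no ginkgo summary line found in build log")
    return int(m.group(1))

log = (
    "Ran 0 of 718 Specs in 0.023 seconds\n"
    "SUCCESS! -- 0 Passed | 0 Failed | 0 Pending | 718 Skipped\n"
    "PASS\n"
)

# A green job that ran zero specs tested nothing; treat it as suspect.
if specs_ran(log) == 0:
    print("false positive: 0 specs ran")
```

Such a guard would have flagged these kubetest2 runs as vacuous instead of letting them report success.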

@@ -4276,7 +4278,10 @@ presubmits:
skip_branches:
- release-\d+\.\d+ # per-release image
always_run: false
run_if_changed: /dra/|/dynamicresources/|/resourceclaim/|/resourceclass/|/podscheduling/|/resourceclaimtemplate/|/dynamic-resource-allocation/|/pkg/apis/resource/|/api/resource/|/test/e2e_node/dra_test.go
# Automatically testing with one container runtime is sufficient.
# containerd was picked because it's apparently more common.
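Prow evaluates run_if_changed as an unanchored regular expression against each changed file path in a PR. A short sketch (Python used purely for illustration; the example paths are hypothetical) shows both an intended trigger and the kind of unrelated match, such as a vendored dependency update, that produces the noise described in this PR:

```python
import re

# The run_if_changed value from the presubmit above.
RUN_IF_CHANGED = re.compile(
    r"/dra/|/dynamicresources/|/resourceclaim/|/resourceclass/"
    r"|/podscheduling/|/resourceclaimtemplate/|/dynamic-resource-allocation/"
    r"|/pkg/apis/resource/|/api/resource/|/test/e2e_node/dra_test.go"
)

def job_triggers(changed_files):
    """True if any changed path matches, i.e. the presubmit would run."""
    return any(RUN_IF_CHANGED.search(path) for path in changed_files)

# A real DRA change triggers the job, as intended.
print(job_triggers(["pkg/kubelet/cm/dra/manager.go"]))  # True
# A dependency bump touching a vendored DRA package also triggers it,
# which is the "noise in unrelated PRs" this PR tries to reduce.
print(job_triggers(["vendor/k8s.io/dynamic-resource-allocation/go.mod"]))  # True
# Unrelated changes do not.
print(job_triggers(["README.md"]))  # False
```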
Contributor Author

@bart0sh: does choosing containerd make sense?

Which runtime is more stable in your experience?

Contributor

@bart0sh bart0sh Dec 13, 2024

cri-o is just a little bit more stable lately, but that can and most probably will change if we drop one of the runtimes as you're suggesting.

Contributor

@pohly

Automatically testing with one container runtime is sufficient.

Sufficient for what?

Contributor

Can we keep this job as is and stop automatically triggering pull-kubernetes-node-e2e-crio-cgrpv1-dra instead?

Contributor Author

cri-o is just a little bit more stable lately, but that can and most probably will change if we drop one of the runtimes as you're suggesting.

Why would dropping testing of one runtime change how stable the other runtime is?

Sufficient for what?

Sufficient for knowing that kubelet's interaction with a runtime isn't completely broken.

Can we keep this job as is and stop automatically triggering pull-kubernetes-node-e2e-crio-cgrpv1-dra instead?

Sure, if that one was more stable in the past. The goal is to avoid flakes that aren't caused by our code changes, so whatever runtime is more stable is better.

Contributor Author

When I added those jobs my idea was to cover all available runtimes and distros

That makes perfect sense for periodic testing. That's how we reach assurance that what we will release will work. But we don't need that level of confidence for every single PR that we merge.

Contributor

But we don't need that level of confidence for every single PR that we merge.

What's wrong with having the ability to see a failure before merging the code that causes it? And why don't we need to test with containerd, COS, and Ubuntu, but do need it for CRI-O and Fedora? Where is the border line between PR and CI jobs in this sense?

Contributor Author

For pre-submit: what's wrong is that this causes real, significant costs for the project without being justified by the benefit that it brings. We cannot afford to run everything because "there might be failures".

For periodic: here it is more reasonable to expand the scope of platform coverage because we control how often we test. But also here the question arises whether we expect problems on one platform that don't occur on another and whether it should be Kubernetes that tests that or the downstream consumer of Kubernetes. Even cloud providers are expected to run their own testing of Kubernetes these days, not the Kubernetes project itself.

Contributor

It's hard to argue about costs without having any information about them. However, having one more job triggered on changes to a tiny subset of the codebase wouldn't make a lot of difference in my opinion. So, I'm still for keeping both jobs active.

Let's see what other reviewers say about it.

Contributor Author

It's hard to argue about costs without having any information about them

Then just trust Ben when he says that automatically triggered pre-submit jobs shouldn't exist at all, or only under very special circumstances. He is the SIG Testing and infra lead; he knows about the costs.

having one more job triggered on the changes to a tiny subset of codebase

I think you haven't seen all the PRs where this got triggered. I was asked several times about these jobs in the context of PRs where they shouldn't have been started.

It's not necessary to cover all possible runtime variants during pre-submit for
all PRs. One of them should be sufficient; the rest can be covered during
periodic testing. This cuts down CI costs and noise in unrelated PRs that
happen to trigger the run_if_changed.

These pre-submit checks are more useful when someone actually modifies source
code and less useful for other changes (Go dependencies, Markdown).
@pohly pohly force-pushed the dra-job-consolidation branch from cb517c8 to 229f05c Compare December 13, 2024 13:51
@pohly
Contributor Author

pohly commented Dec 14, 2024

/assign @klueska

@pohly
Contributor Author

pohly commented Dec 14, 2024

/priority important-soon

We keep running these extra jobs unnecessarily (kubernetes/kubernetes#129213, kubernetes/kubernetes#129195, kubernetes/kubernetes#129103, etc.).

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Dec 14, 2024
@dims
Member

dims commented Dec 14, 2024

I am gonna land this; we can iterate if needed. Want to stop the bleeding!

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 14, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 14, 2024
@k8s-ci-robot k8s-ci-robot merged commit e1d92e6 into kubernetes:master Dec 14, 2024
6 of 7 checks passed
@k8s-ci-robot
Contributor

@pohly: Updated the job-config configmap in namespace default at cluster test-infra-trusted using the following files:

  • key sig-node-presubmit.yaml using file config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml

In response to this:

This reduces the number of jobs which get triggered automatically to save costs and avoid noise in unrelated PRs. See kubernetes/community#8196 for a suggested policy and rationale.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
