OTA-541: enhancements/update/do-not-block-on-degraded: New enhancement proposal #1719

wking · 2024-11-25T19:56:16Z

The cluster-version operator (CVO) uses an update-mode when transitioning between releases, where the manifest operands are sorted into a task-node graph, and the CVO walks the graph reconciling. Since 4.1, the cluster-version operator has blocked during update and reconcile modes (but not during install mode) on Degraded=True ClusterOperator. This enhancement proposes ignoring Degraded when deciding whether to block on a ClusterOperator manifest.

The goal of blocking on manifests with sad resources is to avoid further destabilization. For example, if we have not reconciled a namespace manifest or ServiceAccount RoleBinding, there's no point in trying to update the consuming operator Deployment. Or if we are unable to update the Kube-API-server operator, we don't want to inject unsupported kubelet skew by asking the machine-config operator to update nodes.

However, blocking the update on a sad resource has the downside that later manifest-graph task-nodes are not reconciled, while the CVO waits for the sad resource to return to happiness. We maximize safety by blocking when progress would be risky, while continuing when progress would be safe, and possibly helpful.

Our experience with Degraded=True blocks has turned up cases where blocking is not helpful, so this enhancement proposes no longer blocking on that condition. We will continue to block on Available=False ClusterOperators, or when the ClusterOperator versions have not yet reached the values requested by the ClusterOperator's release manifest.
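As a rough illustration of the trade-off described above, here is a minimal Python sketch (not the CVO's actual Go implementation). It assumes a simplified linear task list rather than the real task-node graph, and the names `Task`, `blocks`, and `walk` are hypothetical. It shows how dropping the Degraded check lets later tasks proceed, while Available=False and lagging versions still block.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical stand-in for one ClusterOperator manifest task."""
    name: str
    available: bool = True         # Available=True on the ClusterOperator
    degraded: bool = False         # Degraded=True on the ClusterOperator
    versions_reached: bool = True  # status.versions match the release manifest


def blocks(task: Task, block_on_degraded: bool) -> bool:
    """Decide whether the walk must wait on this task."""
    if not task.available:
        return True
    if not task.versions_reached:
        return True
    # The proposal amounts to always passing block_on_degraded=False.
    return block_on_degraded and task.degraded


def walk(tasks: list[Task], block_on_degraded: bool) -> list[str]:
    """Return the names of tasks passed before the first blocking task."""
    passed = []
    for task in tasks:
        if blocks(task, block_on_degraded):
            break
        passed.append(task.name)
    return passed


tasks = [Task("a"), Task("b", degraded=True), Task("c")]
print(walk(tasks, block_on_degraded=True))   # stops at the degraded task
print(walk(tasks, block_on_degraded=False))  # continues past it
```

With the Degraded check enabled, task `c` is never reached while `b` stays degraded; with it disabled, the walk completes.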

openshift-ci-robot · 2024-11-25T19:56:20Z

@wking: This pull request references OTA-541 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.


wking · 2024-11-25T21:31:16Z

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md

+## Proposal
+
+The cluster-version operator currently has [a mode switch][cvo-degraded-mode-switch] that makes `Degraded` ClusterOperator a non-blocking condition that is still proagated through to `Failing`.
+This enhancement proposes making that an unconditional `UpdateEffectReport`, regardless of the CVO's current mode (installing, updating, reconciling, etc.).


openshift/cluster-version-operator#482 is in flight with this change, if folks want to test pre-merge.

petr-muller · 2024-11-26T12:30:17Z

/cc

dev-guide/cluster-version-operator/user/reconciliation.md

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md

jiajliu · 2024-12-03T01:52:19Z

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md

+
+### Goals
+
+ClusterVersion updates will no longer block on ClusterOperators solely based on `Degraded=True`.


Does this mean that if no operator is unavailable, the upgrade should always complete?

ClusterOperators aren't the only CVO-manifested resources, and if something else breaks like we fail to reconcile a RoleBinding or whatever, that will block further update progress. And for ClusterOperators, we'll still block on status.versions not being as far along as the manifest claimed, in addition to blocking if Available isn't True. Personally, status.versions seems like the main thing that's relevant, e.g. a component coming after the Kube API server knows it can use 4.18 APIs if the Kube API server has declared 4.18 versions. As an example of what the 4.18 Kube API server asks the CVO to wait on:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.18.0-rc.0-x86_64
Extracted release payload from digest sha256:054e75395dd0879e8c29cd059cf6b782742123177a303910bf78f28880431d1c created at 2024-12-02T21:11:00Z
$ yaml2json <manifests/0000_20_kube-apiserver-operator_07_clusteroperator.yaml | jq -c '.status.versions[]'
{"name":"operator","version":"4.18.0-rc.0"}
{"name":"raw-internal","version":"4.18.0-rc.0"}
{"name":"kube-apiserver","version":"1.31.3"}

A recent example of this being useful is openshift/machine-config-operator#4637, which got the CVO to block until the MCO had rolled out a single-arch -> multi-arch transition, without the MCO needing to touch its Degraded or Available conditions to slow the CVO down.
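To make the status.versions check concrete, here is a small Python sketch of the comparison. It is a simplification, not the CVO's code. It assumes exact string equality between the versions the manifest requests and the versions the in-cluster ClusterOperator reports. The function name `lagging_versions` and the in-cluster status values are hypothetical; the manifest values are the ones extracted above.

```python
import json


def lagging_versions(manifest_versions, status_versions):
    """Return the names whose reported version is not the one the manifest requests."""
    reported = {v["name"]: v["version"] for v in status_versions}
    return sorted(
        v["name"]
        for v in manifest_versions
        if reported.get(v["name"]) != v["version"]
    )


# Versions requested by the 4.18.0-rc.0 kube-apiserver ClusterOperator manifest.
manifest = [
    json.loads(line)
    for line in (
        '{"name":"operator","version":"4.18.0-rc.0"}',
        '{"name":"raw-internal","version":"4.18.0-rc.0"}',
        '{"name":"kube-apiserver","version":"1.31.3"}',
    )
]

# Hypothetical in-cluster status part-way through a rollout (invented values).
status = [
    {"name": "operator", "version": "4.18.0-rc.0"},
    {"name": "raw-internal", "version": "4.18.0-rc.0"},
    {"name": "kube-apiserver", "version": "1.30.0"},
]

print(lagging_versions(manifest, status))
```

Under these assumptions, the CVO would keep waiting on this ClusterOperator as long as the returned list is non-empty, regardless of the Degraded condition.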

So could I say that if Failing=True for an upgrade, the reason should not be ClusterOperatorDegraded?

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md


openshift-ci · 2024-12-03T20:51:52Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: petr-muller
Once this PR has been reviewed and has the lgtm label, please assign pratikmahajan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

~~dev-guide/cluster-version-operator/OWNERS~~ [petr-muller]
enhancements/update/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2024-12-03T21:16:13Z

@wking: all tests passed!


DavidHurta · 2024-12-06T14:00:55Z

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md

+
+## Proposal
+
+The cluster-version operator currently has [a mode switch][cvo-degraded-mode-switch] that makes `Degraded` ClusterOperator a non-blocking condition that is still proagated through to `Failing`.


nit:

Suggested change

The cluster-version operator currently has [a mode switch][cvo-degraded-mode-switch] that makes `Degraded` ClusterOperator a non-blocking condition that is still proagated through to `Failing`.

The cluster-version operator currently has [a mode switch][cvo-degraded-mode-switch] that makes `Degraded` ClusterOperator a non-blocking condition that is still propagated through to `Failing`.

DavidHurta · 2024-12-06T14:00:59Z

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md

+However, blocking the update on a sad resource has the downside that later manifest-graph task-nodes are not reconciled, while the CVO waits for the sad resource to return to happiness.
+We maximize safety by blocking when progress would be risky, while continuing when progress would be safe, and possibly helpful.
+
+Our expirience with `Degraded=True` blocks turns up cases like:


nit:

Suggested change

Our expirience with `Degraded=True` blocks turns up cases like:

Our experience with `Degraded=True` blocks turns up cases like:

DavidHurta · 2024-12-06T14:01:44Z

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md

+
+## Upgrade / Downgrade Strategy
+
+This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatability issues.


nit:

Suggested change

This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatability issues.

This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatibility issues.

DavidHurta · 2024-12-06T14:02:01Z

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md

+
+## Version Skew Strategy
+
+This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatability issues.


nit:

Suggested change

This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatability issues.

This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatibility issues.

DavidHurta · 2024-12-06T14:12:28Z

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md

+## Support Procedures
+
+This enhancement is a small pivot in how the cluster-version operator processes ClusterOperator manifests during updates.
+As discussed in [the *Drawbacks* section](#drawbacks), we do not expect cluster admins open support cases related to this change.


nit:

Suggested change

As discussed in [the *Drawbacks* section](#drawbacks), we do not expect cluster admins open support cases related to this change.

As discussed in [the *Drawbacks* section](#drawbacks), we do not expect cluster admins to open support cases related to this change.

DavidHurta · 2024-12-06T14:20:51Z

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md

+
+As discussed in [the *Risks* section](#risks-and-mitigations), the main drawback is changing behavior that we've had in place for many years.
+But we do not expect much customer pushback based on "hey, my update completed?!  I expected it to stick on this sad component...".
+We do expect it to reduce customer frustration when they want the update to complete, but for reasons like administrative siloes do no have the ability to recover a component from minor degradation themselves.


nit:

Suggested change

We do expect it to reduce customer frustration when they want the update to complete, but for reasons like administrative siloes do no have the ability to recover a component from minor degradation themselves.

We do expect it to reduce customer frustration when they want the update to complete, but for reasons like administrative siloes do not have the ability to recover a component from minor degradation themselves.

DavidHurta · 2024-12-06T14:46:50Z

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md

+
+## Test Plan
+
+**Note:** *Section not required until targeted at a release.*


The enhancement and the tracking card OTA-541 are not targeted at a release. However, changes in the dev-guide/cluster-version-operator/user/reconciliation.md file suggest that the enhancement is targeted at the 4.19 release, and thus the Test Plan section should be addressed.

openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) Nov 25, 2024

openshift-ci bot requested review from DavidHurta and PratikMahajan November 25, 2024 19:56

wking force-pushed the do-not-block-updates-on-ClusterOperator-degraded branch from b0c8d2e to 69eca53 November 25, 2024 20:55

wking commented Nov 25, 2024

openshift-ci bot requested a review from petr-muller November 26, 2024 12:30

petr-muller approved these changes Nov 26, 2024

dev-guide/cluster-version-operator/user/reconciliation.md (resolved)

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md (outdated, resolved)

wking force-pushed the do-not-block-updates-on-ClusterOperator-degraded branch from 69eca53 to 11f8243 November 26, 2024 18:29

jiajliu reviewed Dec 3, 2024

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md (outdated, resolved)

jiajliu reviewed Dec 3, 2024

enhancements/update/do-not-block-on-degraded-true-clusteroperators.md (outdated, resolved)

wking force-pushed the do-not-block-updates-on-ClusterOperator-degraded branch from 11f8243 to e10df2a December 3, 2024 20:51

DavidHurta reviewed Dec 6, 2024
