
Check status of all the core pods for microshift #4009

Open · wants to merge 1 commit into main

Conversation

praveenkumar (Member):
In the past we observed that having kube API access doesn't mean all the required service pods are running and the cluster is working as expected. This PR adds a list of core namespaces for the microshift preset and makes sure all the pods in those namespaces are running before telling the user that the cluster is ready to consume.

Fixes: Issue #3852

@openshift-ci openshift-ci bot requested review from cfergeau and gbraad February 1, 2024 07:58

openshift-ci bot commented Feb 1, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from praveenkumar. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pkg/crc/cluster/cluster.go (outdated):
}

func podRunningForNamespace(ocConfig oc.Config, namespace string) bool {
stdout, stderr, err := ocConfig.WithFailFast().RunOcCommand("get", "pods", "-n", namespace, "--field-selector=status.phase!=Running")
cfergeau (Contributor):

Why !=Running? I would have expected =Running?

praveenkumar (Member, Author):

@cfergeau If I use ==Running, it will show all the running pods. What we want is the pods which are not in the Running phase in that specific namespace, so we can re-check them in the retry function.

$ kubectl get pod -n kube-system --field-selector=status.phase!=Running 
NAME                                       READY   STATUS    RESTARTS   AGE
csi-snapshot-controller-85cc4fd76b-xznzw   1/1     Pending   0          45h
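The check hinges on the inverted field selector: when every pod is Running, the command returns nothing. A minimal sketch of that interpretation (the helper name and parsing are hypothetical, not the PR's actual code, which gets the output from ocConfig.WithFailFast().RunOcCommand):

```go
package main

import (
	"fmt"
	"strings"
)

// allPodsRunning interprets the stdout of
// `oc get pods -n <ns> --field-selector=status.phase!=Running`:
// an empty result (or a "No resources found" message, which some
// client versions emit) means no pod is outside the Running phase.
func allPodsRunning(stdout string) bool {
	out := strings.TrimSpace(stdout)
	return out == "" || strings.HasPrefix(out, "No resources found")
}

func main() {
	notRunning := `NAME                                       READY   STATUS    RESTARTS   AGE
csi-snapshot-controller-85cc4fd76b-xznzw   1/1     Pending   0          45h`
	fmt.Println(allPodsRunning(notRunning)) // false: a pod is still Pending
	fmt.Println(allPodsRunning(""))         // true: nothing matched the selector
}
```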

pkg/crc/cluster/cluster.go (outdated):
if !podRunningForNamespace(ocConfig, namespace) {
logging.Debugf("Pods in %s namespace are not running", namespace)
return &errors.RetriableError{Err: fmt.Errorf("pods in %s namespace are not running", namespace)}
}
cfergeau (Contributor):

Fwiw, this is a bit wasteful, as we'll retry the same namespaces again and again even if we already found running pods. Maybe this can be done with a map? Map keys are namespaces; iterate over the keys, and when there are running pods in a namespace, remove it from the map.
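The suggested map approach could look roughly like this (a sketch with hypothetical names, not code from the PR): namespaces that have already passed are deleted from the map, so later retries only re-check the stragglers.

```go
package main

import "fmt"

// waitForCorePods tracks the namespaces still to be verified in a map
// and drops each one once its pods are all Running, so subsequent
// attempts skip namespaces that already passed.
func waitForCorePods(namespaces []string, check func(ns string) bool, maxAttempts int) bool {
	pending := map[string]struct{}{}
	for _, ns := range namespaces {
		pending[ns] = struct{}{}
	}
	for attempt := 0; attempt < maxAttempts && len(pending) > 0; attempt++ {
		for ns := range pending {
			if check(ns) {
				delete(pending, ns) // deleting during range is safe in Go
			}
		}
	}
	return len(pending) == 0
}

func main() {
	// Simulate kube-system becoming ready only on the second attempt.
	calls := map[string]int{}
	check := func(ns string) bool {
		calls[ns]++
		return ns != "kube-system" || calls[ns] > 1
	}
	ok := waitForCorePods([]string{"kube-system", "openshift-dns"}, check, 5)
	fmt.Println(ok, calls["openshift-dns"]) // true 1 — openshift-dns was checked only once
}
```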

praveenkumar (Member, Author):

Yes, but the current approach has a bit of a benefit in case some pod goes into a reconciliation state (like in one iteration it is Running but in the second it is Pending). It is not a foolproof solution (in a k8s context it never will be), but it should be good enough for initial feedback on whether the core pods are running.

cfergeau (Contributor):

In the OpenShift case, once the oc get co check succeeds once, we retry 2 more times and we only decide the cluster is good when the oc get co check succeeds 3 times in a row. If you want to handle " in one iteration it is running but in second it is in pending state" it would be nice to have a consistent approach.

praveenkumar (Member, Author):

> In the OpenShift case, once the oc get co check succeeds once, we retry 2 more times and we only decide the cluster is good when the oc get co check succeeds 3 times in a row. If you want to handle " in one iteration it is running but in second it is in pending state" it would be nice to have a consistent approach.

Yes, in the OpenShift case we can iterate over all the clusteroperators at once because those are not namespace-specific resources. Here we don't have a single call that gives us the status of all the pods across the core namespaces, otherwise I would have used the same logic. So for now we iterate namespace by namespace and check the pod status.

cfergeau (Contributor):

I'm not questioning the way the iterations are done, I was reacting to

> it has bit of benefit in case some pod goes to reconciliation state (like in one iteration it is running but in second it is in pending state.)

For an OpenShift cluster, we roughly iterate over an isClusterReady() function until it returns true. Once it returns true, we still run it 2 more times in case the cluster looked ready but was in a transient/reconciliation state.
If reconciliation is something you want to handle better, I would use the same approach as for OpenShift for consistency: the cluster is not ready until isClusterReady() has succeeded 3 times in a row.
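The "3 times in a row" pattern described above can be sketched as follows (a hypothetical helper, not the actual crc code): the cluster only counts as ready after the check has succeeded a required number of consecutive times, and any failure resets the streak.

```go
package main

import "fmt"

// readyConsecutive reports success only once check() has returned true
// `needed` times in a row; a transient failure invalidates earlier
// successes, which is how a reconciling cluster gets caught.
func readyConsecutive(check func() bool, needed, maxAttempts int) bool {
	streak := 0
	for i := 0; i < maxAttempts; i++ {
		if check() {
			streak++
			if streak == needed {
				return true
			}
		} else {
			streak = 0 // reset: the cluster was not stably ready
		}
	}
	return false
}

func main() {
	// Simulated results: ready once, a transient blip, then stable.
	results := []bool{true, false, true, true, true}
	i := 0
	check := func() bool { r := results[i]; i++; return r }
	fmt.Println(readyConsecutive(check, 3, len(results))) // true: the last 3 checks succeeded in a row
}
```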

pkg/crc/cluster/cluster.go:
In past we observed having kube api access doesn't mean all the required
service pods are running and cluster is working as expected. This PR
adds a list of core namespace for microshift preset and make sure all
the pods in that namespace is running before letting user to know to
consume the cluster.
return errors.Retry(ctx, 2*time.Minute, waitForPods, 2*time.Second)
}

func podRunningForNamespace(ocConfig oc.Config, namespace string) bool {
cfergeau (Contributor):

allPodsRunning(ocConfig oc.Config, namespace string) bool or checkAllPodsRunning is more descriptive/accurate.

praveenkumar (Member, Author):

It should be in a namespace context, so checkAllPodsRunningInNamespace or allPodsRunningForNamespace?

cfergeau (Contributor):

There is a namespace argument, and we don't have a non-namespaced function this could be confused with, so I don't think it's really useful to mention Namespace in the function name. That's more something for an API doc comment, if you think it's important to inform API users that it only iterates over a single namespace.


@praveenkumar praveenkumar added the has-to-be-in-release This PR need to go in coming release. label Feb 12, 2024
@praveenkumar praveenkumar removed the has-to-be-in-release This PR need to go in coming release. label Feb 21, 2024
gbraad (Contributor) commented Apr 10, 2024:

> This PR adds a list of core namespace for microshift preset

This is not microshift-specific; it might also help solve issues with the readiness of the OCP and OKD presets.

gbraad (Contributor) commented Apr 10, 2024:

> has-to-be-in-release

What was the reasoning behind this label? The fixed issue, #3852, is merely an enhancement.

praveenkumar (Member, Author):

/hold
