Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: block operator upgrade when detecting outdated vms #3171

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

dasionov
Copy link
Contributor

@dasionov dasionov commented Nov 20, 2024

This PR introduces the logic necessary for blocking hco-operator upgrade when we detect that there are outdated vms in the cluster

step 1: list all virt-controller pods
step 2: loop over the pods and perform http request to the /metrics endpoint to extract all the metrics data
step 3: use regex to make sure no one is using rhel8 machine types
step 4: set upgradeable false when we do detect rhel8 machines

the logic right now is simple and straight forward, and it can be further improved

Release note:

None

@dasionov dasionov marked this pull request as draft November 20, 2024 15:49
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M labels Nov 20, 2024
@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tiraboschi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dasionov
Copy link
Contributor Author

@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 20, 2024
@dasionov dasionov force-pushed the block_upgrade_for_rhel8_vms branch 2 times, most recently from 74747cc to d4ec71b Compare November 20, 2024 15:57
@coveralls
Copy link
Collaborator

coveralls commented Nov 20, 2024

Pull Request Test Coverage Report for Build 11952461865

Details

  • 14 of 83 (16.87%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.5%) to 71.412%

Changes Missing Coverage Covered Lines Changed/Added Lines %
controllers/hyperconverged/hyperconverged_controller.go 14 83 16.87%
Files with Coverage Reduction New Missed Lines %
controllers/operands/quickStart.go 1 82.14%
Totals Coverage Status
Change from base Build 11951479920: -0.5%
Covered Lines: 6045
Relevant Lines: 8465

💛 - Coveralls

@dasionov dasionov force-pushed the block_upgrade_for_rhel8_vms branch from d4ec71b to ed2b6ed Compare November 20, 2024 16:03
Copy link
Collaborator

@nunnatsa nunnatsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's a POC, but still have several comments inline.

@@ -1330,6 +1340,66 @@ func (r *ReconcileHyperConverged) deleteObj(req *common.HcoRequest, obj client.O
return removed, nil
}

func (r *ReconcileHyperConverged) EvaluateUpgradeEligibility(req *common.HcoRequest) error {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is called only locally and should not be exposed.

Please rename:

Suggested change
func (r *ReconcileHyperConverged) EvaluateUpgradeEligibility(req *common.HcoRequest) error {
func (r *ReconcileHyperConverged) evaluateUpgradeEligibility(req *common.HcoRequest) error {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +1344 to +1361
podList := &corev1.PodList{}
listOpts := []client.ListOption{
client.InNamespace(req.Namespace),
client.MatchingLabels{"kubevirt.io": "virt-controller"},
}

if err := r.client.List(req.Ctx, podList, listOpts...); err != nil {
req.Logger.Info("Failed to list virt-controller pods", "namespace", req.Namespace, "error", err)
return fmt.Errorf("failed to list virt-controller pods: %w", err)
}

if len(podList.Items) == 0 {
req.Logger.Info("No virt-controller pods found", "namespace", req.Namespace)
return nil
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: does each pod collects only several VM info? In another words: can we query the service instead of the pod itself?

Copy link
Contributor Author

@dasionov dasionov Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have a service for virt-controller as far as i know, only for virt-api
but i assume i can query just one of the pods since they all should have the same metrics data

Comment on lines 1366 to 1431
for _, pod := range podList.Items {
if pod.Status.PodIP == "" {
continue
}
metricsURL := fmt.Sprintf("https://%s:%d/metrics", pod.Status.PodIP, 8443)

resp, err := httpClient.Get(metricsURL)
if err != nil {
req.Logger.Info("Failed to query metrics from pod", "pod", pod.Name, "error", err)
continue
}

if resp.StatusCode != http.StatusOK {
req.Logger.Info("Metrics endpoint returned non-200 status", "pod", pod.Name, "status", resp.StatusCode)
resp.Body.Close()
continue
}

body, err := io.ReadAll(resp.Body)
resp.Body.Close()
if err != nil {
req.Logger.Info("Failed to read metrics response from pod", "pod", pod.Name, "error", err)
continue
}

virtMetricsData := string(body)
rhel8Regex := regexp.MustCompile(`.*rhel8.*`)
if rhel8Regex.MatchString(virtMetricsData) {
req.Logger.Info("Detected outdated machine type in metrics", "pod", pod.Name, "matched", rhel8Regex.FindString(virtMetricsData))
req.Upgradeable = false
return nil
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, we'll need to do it a-synchronized

}

virtMetricsData := string(body)
rhel8Regex := regexp.MustCompile(`.*rhel8.*`)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never create regex within a loop. creating a regex variable is an expensive operation. The value of this regex will never changed during runtime. Please take it out of this method as a regular variable, that will be initialized only once.

just do:

var rhel8Regex = regexp.MustCompile(`.*rhel8.*`)

just above the method.

Copy link
Contributor Author

@dasionov dasionov Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the regex will be created each reconcile loop then, that's ok?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it will be defined in the package scope, out of the method, then no. It will only be initialized once, on boot.

Copy link
Collaborator

@nunnatsa nunnatsa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's a POC, but still have several comments inline.

}

if resp.StatusCode != http.StatusOK {
req.Logger.Info("Metrics endpoint returned non-200 status", "pod", pod.Name, "status", resp.StatusCode)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the term "non-200" in an error message.

WDYT about "Metrics endpoint returned error" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good


if resp.StatusCode != http.StatusOK {
req.Logger.Info("Metrics endpoint returned non-200 status", "pod", pod.Name, "status", resp.StatusCode)
resp.Body.Close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use defer instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i did use that at first, but got a warning said that it's not safe to use defer in a loop, and it might cause a resources leek

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. so we can move the loop content into a new method. That will work.

}

body, err := io.ReadAll(resp.Body)
resp.Body.Close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use defer instead

continue
}

body, err := io.ReadAll(resp.Body)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to use a scanner. The body can be big.

You can do something like

scanner := bufio.NewScanner(resp.Body)
...
for scanner.Scan() {
	if rhel8Regex.MatchString(scanner.Text()) {
		req.Logger.Info("Detected outdated machine type in metrics", "pod", pod.Name, "matched", rhel8Regex.FindString(virtMetricsData))
		req.Upgradeable = false
		return nil
	}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@dasionov dasionov force-pushed the block_upgrade_for_rhel8_vms branch from ed2b6ed to a15db2d Compare November 21, 2024 11:06
@nunnatsa
Copy link
Collaborator

Another comment I forgot to write: The current HTTP client code does not use any type of timeout. Use a context (derived from req.Context) with 3-seconds timeout. Sending HTTP request w/o timeout may be risky.

@dasionov dasionov force-pushed the block_upgrade_for_rhel8_vms branch from a15db2d to eefac66 Compare November 21, 2024 11:33
@dasionov dasionov force-pushed the block_upgrade_for_rhel8_vms branch from eefac66 to dd7f177 Compare November 21, 2024 11:38
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants