fix(#11571): add health script for batch/CronJob #15272

Open
wants to merge 3 commits into
base: master

@polarAli polarAli commented Aug 29, 2023

Fixes [ISSUE #11571]

A failing CronJob does not cause the Argo CD app to show as unhealthy, so an overview of apps does not reveal which apps need attention because of their CronJobs.
I use the CronJob status fields with the following algorithm:

  • Progressing: when the active field is set
  • Healthy: when lastScheduleTime is null
  • Healthy: when lastScheduleTime and lastSuccessfulTime are both set and lastSuccessfulTime is newer than lastScheduleTime
  • Degraded: when lastScheduleTime is set but lastSuccessfulTime is null or older than lastScheduleTime
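The four rules above can be sketched as a small decision function. This is an illustrative Python rendering only; the script in this PR is written in Lua and is not reproduced here. The function names `cronjob_health` and `parse_time` are hypothetical, and the timestamp format (UTC, trailing `Z`, whole seconds) is an assumption about how the status fields are serialized. The description does not say what happens when the two timestamps are equal; this sketch treats that case as Degraded.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_time(value: Optional[str]) -> Optional[datetime]:
    """Parse a timestamp such as '2023-08-29T09:00:00Z' (assumed format)."""
    if value is None:
        return None
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def cronjob_health(status: Optional[dict]) -> str:
    """Map a CronJob status dict to a health status string."""
    status = status or {}

    # Progressing: the active field is set (at least one running job).
    if status.get("active"):
        return "Progressing"

    last_schedule = parse_time(status.get("lastScheduleTime"))
    last_success = parse_time(status.get("lastSuccessfulTime"))

    # Healthy: the CronJob has never been scheduled.
    if last_schedule is None:
        return "Healthy"

    # Healthy: the last success is newer than the last schedule.
    if last_success is not None and last_success > last_schedule:
        return "Healthy"

    # Degraded: scheduled, but no success or only an older success.
    # Equal timestamps also land here (an assumption of this sketch).
    return "Degraded"
```

For example, a status with `lastScheduleTime` set and no `lastSuccessfulTime` maps to `Degraded`, while an empty status maps to `Healthy`.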

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this does not need to be in the release notes.
  • The title of the PR states what changed and the related issues number (used for the release note).
  • The title of the PR conforms to the Toolchain Guide
  • I've included "Closes [ISSUE #]" or "Fixes [ISSUE #]" in the description to automatically close the associated issue.
  • I've updated both the CLI and UI to expose my feature, or I plan to submit a second PR with them.
  • Does this PR require documentation updates?
  • I've updated documentation as required by this PR.
  • Optional. My organization is added to USERS.md.
  • I have signed off all my commits as required by DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My build is green (troubleshooting builds).
  • My new feature complies with the feature status guidelines.
  • I have added a brief description of why this PR is necessary and/or what this PR solves.

@codecov

codecov bot commented Aug 29, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (b60861b) 49.87% compared to head (99d253a) 49.87%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #15272   +/-   ##
=======================================
  Coverage   49.87%   49.87%           
=======================================
  Files         263      263           
  Lines       45193    45193           
=======================================
  Hits        22538    22538           
  Misses      20437    20437           
  Partials     2218     2218           

see 1 file with indirect coverage changes


@polarAli polarAli force-pushed the patch-1 branch 3 times, most recently from 12316bf to 5ddb613 Compare August 29, 2023 09:11
@sboschman
Contributor

I wonder whether 'job is running' (hs.status = "Progressing") should be surfaced in the overall Application health status, or whether a running job should simply be 'healthy'. Jobs (or, for example, an Argo Workflow) can be long-running processes, and triggering an unhealthy alert for a long-running job is not ideal.

@juzov-billie

I am getting this error: attempt to index a non-table object(nil) with key 'match'

@polarAli
Author

> I wonder whether 'job is running' (hs.status = "Progressing") should be surfaced in the overall Application health status, or whether a running job should simply be 'healthy'. Jobs (or, for example, an Argo Workflow) can be long-running processes, and triggering an unhealthy alert for a long-running job is not ideal.

I understand what you mean, and I can set the app status to healthy when there is an active job. I would prefer to make this configurable if possible, for example by overriding the behavior with an annotation on the CronJob.

@polarAli
Author

polarAli commented Sep 23, 2023

> getting attempt to index a non-table object(nil) with key 'match'

I think that's because the Lua standard libraries are disabled by default. I hit the same issue when I used this script as a resource customization in the Argo CD ConfigMap, and enabling those libraries solved it.
I need the standard libraries to parse the ISO-formatted datetime values in the object status.

@sboschman
Contributor

> > I wonder whether 'job is running' (hs.status = "Progressing") should be surfaced in the overall Application health status, or whether a running job should simply be 'healthy'. Jobs (or, for example, an Argo Workflow) can be long-running processes, and triggering an unhealthy alert for a long-running job is not ideal.
>
> I understand what you mean, and I can set the app status to healthy when there is an active job. I would prefer to make this configurable if possible, for example by overriding the behavior with an annotation on the CronJob.

Perhaps an annotation with a duration as its value would suffice? The duration defines how long a job may run before the CronJob is marked as progressing. A missing annotation is equivalent to a duration of 0s, which means the CronJob is reported as progressing immediately if it has an active job. If a longer-running job triggers an unhealthy alert, you can hopefully establish a baseline for the job duration and set the annotation accordingly.
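A rough sketch of how such a duration annotation might be evaluated, written in Python for illustration rather than in the Lua used by Argo CD health scripts. Everything named here is an assumption: the annotation key `example.com/progressing-after` is hypothetical, the duration syntax is limited to a whole number followed by `s`, `m`, or `h`, and `lastScheduleTime` stands in for the start time of the active job. Reporting Healthy before the duration has elapsed is also an assumption drawn from the earlier comment, not something the proposal states.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical annotation key; the discussion does not name one.
GRACE_ANNOTATION = "example.com/progressing-after"

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: Optional[str]) -> timedelta:
    """Parse '90s', '5m' or '2h'. A missing annotation means 0s."""
    if not text:
        return timedelta(0)
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=int(match.group(1)) * _UNIT_SECONDS[match.group(2)])


def active_job_health(annotations: dict, last_schedule_time: str, now: datetime) -> str:
    """Health of a CronJob that currently has an active job."""
    grace = parse_duration(annotations.get(GRACE_ANNOTATION))
    # Assumption: lastScheduleTime approximates when the active job started.
    started = datetime.strptime(last_schedule_time, "%Y-%m-%dT%H:%M:%SZ")
    started = started.replace(tzinfo=timezone.utc)
    # With no annotation the grace period is 0s, so the job is
    # reported as Progressing right away.
    if now - started >= grace:
        return "Progressing"
    return "Healthy"
```

With a 15m annotation, a job scheduled ten minutes ago would still be reported as Healthy in this sketch; without the annotation it would be Progressing immediately.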

@h-mavrodiev

I have an issue where the Argo CD health check runs before the Kubernetes status update arrives. So even after a successful job execution, the application goes degraded for a second and then returns to healthy.
I also removed the status extraction and do the comparison directly with
`elseif obj.status.lastSuccessfulTime ~= nil and obj.status.lastScheduleTime <= obj.status.lastSuccessfulTime then`.
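The direct comparison quoted above works on the raw timestamp strings, with no date parsing. A Python analogue, offered as a sketch only: it assumes both fields are UTC timestamps with a trailing `Z` and whole-second precision, in which case text order and chronological order agree. The names `is_healthy_direct`, `parse_utc`, and `text_order_matches_time_order` are hypothetical.

```python
from datetime import datetime, timezone


def is_healthy_direct(status: dict) -> bool:
    """Analogue of the quoted Lua condition, comparing raw strings."""
    last_schedule = status.get("lastScheduleTime")
    last_success = status.get("lastSuccessfulTime")
    if last_schedule is None or last_success is None:
        return False
    # '<=' also accepts equal timestamps, unlike a strict 'newer' check.
    return last_schedule <= last_success


def parse_utc(value: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def text_order_matches_time_order(a: str, b: str) -> bool:
    """Check that comparing strings gives the same answer as comparing times."""
    return (a <= b) == (parse_utc(a) <= parse_utc(b))
```

If the two fields ever used different offsets or precisions, the string comparison would no longer be reliable and the values would need to be parsed first.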

@adrianmiron

I would also be interested in this feature going live, so that I no longer have to "hack" the behaviour for CronJobs in the values file.


5 participants