I'm thinking it would be better (and more in line with Prometheus metric and label conventions) to expose job status and node states as labels instead of separate metrics, since we are measuring the same thing.
From the Prometheus documentation:
Use labels to differentiate the characteristics of the thing that is being measured:
api_http_requests_total - differentiate request types: type="create|update|delete"
api_request_duration_seconds - differentiate request stages: stage="extract|transform|load"
It would also make it easier to show totals. Right now we can't easily show totals because we don't know all the metric names in advance; in my case, the failed/error metrics are not present because none of my nodes are in those states yet.
Thinking about it, it would also be good to default metrics to 0 when a metric has no value or doesn't exist. From the Prometheus documentation:
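To make the labeled layout concrete, here is a minimal sketch in Python using only the standard library. The metric name `slurm_nodes`, the `state` label, and the sample counts are assumptions for illustration, not the exporter's actual names; the output follows the Prometheus text exposition format.

```python
def render_gauge(name, help_text, label, samples):
    """Render one gauge with a single label in Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    # One sample line per label value, sorted for stable output.
    for value, count in sorted(samples.items()):
        lines.append(f'{name}{{{label}="{value}"}} {count}')
    return "\n".join(lines) + "\n"


# Hypothetical node counts per state.
node_states = {"idle": 12, "alloc": 30, "down": 1}
print(render_gauge("slurm_nodes", "Number of nodes per state.", "state", node_states), end="")
```

With this layout a total would be `sum(slurm_nodes)` and a breakdown `sum by (state) (slurm_nodes)`, without having to know a list of metric names in advance.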
Avoid missing metrics
Time series that are not present until something happens are difficult to deal with, as the usual simple operations are no longer sufficient to correctly handle them. To avoid this, export 0 (or NaN, if 0 would be misleading) for any time series you know may exist in advance.
Most Prometheus client libraries (including Go, Java, and Python) will automatically export a 0 for you for metrics with no labels.
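The automatic 0 only applies to metrics without labels, so with a `state` label the exporter would have to fill in the zeros itself. A rough sketch of that idea in Python follows; the state list and the `slurm_nodes` name are assumptions, and the real set of states depends on the Slurm version.

```python
# Assumed list of states known in advance (hypothetical, not taken from the exporter).
KNOWN_NODE_STATES = ("alloc", "down", "drain", "error", "fail", "idle", "mix")


def with_zero_defaults(observed, known=KNOWN_NODE_STATES):
    """Return counts for every known state, using 0 where nothing was observed."""
    counts = {state: 0 for state in known}
    counts.update(observed)  # states outside `known` are kept as well
    return counts


counts = with_zero_defaults({"idle": 12, "alloc": 30})
for state, count in sorted(counts.items()):
    print(f'slurm_nodes{{state="{state}"}} {count}')
```

Every known state then shows up on each scrape, so a query on the error or fail states returns 0 rather than no data.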
It would also be good to have an 'up' metric, something like slurm_up, with a value of 0 if the scrape of any of the Slurm commands is unsuccessful (Prometheus documentation). One could then set an alert: if slurm_up == 0; alert('Slurm is not responding').
Nothing critical, I just thought I would let you know.
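One way the `slurm_up` value could be computed, sketched in Python with the standard library only. Which commands the exporter actually runs is an assumption here (`sinfo -h` and `squeue -h` are used as examples), and `slurm_up()` is a hypothetical helper, not existing exporter code.

```python
import subprocess


def slurm_up(commands, timeout=10):
    """Return 1 if every command exits with status 0, otherwise 0."""
    for argv in commands:
        try:
            subprocess.run(argv, check=True, capture_output=True, timeout=timeout)
        except (OSError, subprocess.SubprocessError):
            # Missing binary, non-zero exit status, or timeout all count as "down".
            return 0
    return 1


if __name__ == "__main__":
    # Assumed commands; replace with whatever the exporter really calls.
    print(f"slurm_up {slurm_up([['sinfo', '-h'], ['squeue', '-h']])}")
```

A single failing command flips the value to 0, which matches the "any of the Slurm commands" wording above and keeps the alert condition a plain `slurm_up == 0`.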
Thanks for the great exporter!
This suggestion and my addition below would mean a new version: a break from the old metrics and dashboard. I am willing to work on this and could fork from here, or enhance for a new release...
You can easily aggregate, slice, and dice in PromQL. In fact, I think you might be able to get by with just those three metrics, as everything else could be aggregated and calculated from them.
We run a test cluster too, and I differentiate between the two using Prometheus job name. I've added a custom variable to the dashboard to select/set job.