
Queue and nodes status as labels #5

Open
matejzero opened this issue Feb 16, 2018 · 1 comment
matejzero commented Feb 16, 2018

I'm thinking it would be better (and more in line with Prometheus metrics/labels conventions) to have job status and node states as labels instead of separate metrics, since we are measuring the same thing.

From Prometheus page:

Use labels to differentiate the characteristics of the thing that is being measured:

api_http_requests_total - differentiate request types: type="create|update|delete"
api_request_duration_seconds - differentiate request stages: stage="extract|transform|load"

It would also make it easier to show totals (right now we can't easily show totals, because we don't know all the metric names up front; in my case, the failed/error metrics are not present because none of my nodes are in that state yet).

Thinking about it, it would also be good to export a default value of 0 for any metric that doesn't currently have a value / doesn't exist. From the Prometheus page:

Avoid missing metrics

Time series that are not present until something happens are difficult to deal with, as the usual simple operations are no longer sufficient to correctly handle them. To avoid this, export 0 (or NaN, if 0 would be misleading) for any time series you know may exist in advance.

Most Prometheus client libraries (including Go, Java, and Python) will automatically export a 0 for you for metrics with no labels.
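As a rough sketch of the "avoid missing metrics" advice applied to node states (the state names and metric name here are illustrative, not the exporter's actual code), the exporter could pre-initialise every known state to 0 and only then overlay the observed counts:

```python
# Hypothetical sketch: expose node counts as one metric with a "state"
# label, pre-initialised to 0 so no time series is ever missing.

KNOWN_STATES = ["alloc", "idle", "drain", "fail", "error", "mix"]

def node_state_samples(observed_counts):
    """Return {state: count} with every known state present.

    observed_counts: states actually reported by slurm, e.g.
    {"idle": 53, "alloc": 12}. States with no nodes are exported
    as 0 instead of being omitted entirely.
    """
    samples = {state: 0 for state in KNOWN_STATES}
    samples.update(observed_counts)
    return samples

# Print in the Prometheus exposition format, e.g.
#   slurm_nodes{state="fail"} 0
for state, count in sorted(node_state_samples({"idle": 53, "alloc": 12}).items()):
    print(f'slurm_nodes{{state="{state}"}} {count}')
```

With this, a query like `sum(slurm_nodes)` works from the first scrape, and alerts on e.g. `state="fail"` never silently match an absent series.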

It would also be good to have an 'up' metric, something like slurm_up, with a value of 0 if scraping any of the slurm commands fails (see the Prometheus documentation). One could then set an alert on slurm_up == 0: alert('Slurm is not responding').
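A minimal sketch of such a probe (the command and metric name are assumptions for illustration; the real exporter would wire this into its collector):

```python
import subprocess

def slurm_up(cmd=("sinfo", "--version"), timeout=5):
    """Return 1 if the scrape command succeeds, else 0.

    Exported as a gauge (e.g. slurm_up), this lets you alert on
    slurm_up == 0 instead of alerting on metrics going missing.
    """
    try:
        subprocess.run(cmd, check=True, capture_output=True, timeout=timeout)
        return 1
    except (OSError, subprocess.SubprocessError):
        # Command not found, non-zero exit, or timeout: slurm is not responding.
        return 0
```

This mirrors the convention used by other exporters (e.g. mysqld_exporter's `mysql_up`): the scrape itself still succeeds, but the gauge tells you the backend is down.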

Nothing critical, I just thought I would let you know.

Thanks for the great exporter!

@mtds mtds self-assigned this Feb 16, 2018
@bmcgough

Agreed - thank you for this useful exporter!

This suggestion and my addition below would mean a new version: a break from the old metrics and dashboard. I am willing to work on this and could fork from here, or enhance it for a new release.

I would make user and account labels as well.

Job/CPU example:

slurm_jobs{state="pending",partition="default",account="alice",user="bob"} 23
slurm_cpus{state="idle",partition="default"} 53

Node state example:

slurm_nodes{state="drain",partition="default"} 7

You can easily aggregate/slice/dice in PromQL. In fact, I think you might be able to get away with just those three metrics, since everything else could be aggregated or calculated from them.
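To illustrate why three labelled metrics suffice, here is a small sketch of the kind of aggregation PromQL's `sum by (...)` performs, over samples shaped like the examples above (sample values are made up for illustration):

```python
from collections import defaultdict

# Labelled samples as (labels, value) pairs, mirroring e.g.
#   slurm_jobs{state="pending",partition="default",user="bob"} 23
samples = [
    ({"state": "pending", "partition": "default", "user": "bob"}, 23),
    ({"state": "pending", "partition": "gpu", "user": "alice"}, 4),
    ({"state": "running", "partition": "default", "user": "bob"}, 10),
]

def sum_by(samples, *keys):
    """Rough analogue of PromQL's `sum by (keys) (metric)`."""
    out = defaultdict(int)
    for labels, value in samples:
        out[tuple(labels[k] for k in keys)] += value
    return dict(out)

# Totals per state regardless of partition/user:
print(sum_by(samples, "state"))
```

In actual PromQL the equivalent would be `sum by (state) (slurm_jobs)` for per-state totals, or `sum(slurm_jobs)` for the grand total.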

We run a test cluster too, and I differentiate between the two clusters using the Prometheus job name. I've added a custom variable to the dashboard to select the job.
