
Queue and nodes status as labels #5

Open
matejzero opened this issue Feb 16, 2018 · 1 comment
matejzero commented Feb 16, 2018

I'm thinking it would be better (and more in line with Prometheus metrics/labels conventions) to have job status and node states as labels instead of separate metrics, since we are measuring the same thing.

From Prometheus page:

Use labels to differentiate the characteristics of the thing that is being measured:

api_http_requests_total - differentiate request types: type="create|update|delete"
api_request_duration_seconds - differentiate request stages: stage="extract|transform|load"

It would also make it easier to show totals (right now we can't easily show totals, because we don't know all the metric names up front; in my case, the failed/error metrics are not present because none of my nodes are in that state yet).

Thinking about it, it would also be good to export a default value of 0 for any metric that doesn't currently have a value / doesn't exist. From the Prometheus page:

Avoid missing metrics

Time series that are not present until something happens are difficult to deal with, as the usual simple operations are no longer sufficient to correctly handle them. To avoid this, export 0 (or NaN, if 0 would be misleading) for any time series you know may exist in advance.

Most Prometheus client libraries (including Go, Java, and Python) will automatically export a 0 for you for metrics with no labels.
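As a rough sketch of the "avoid missing metrics" advice applied to node states (the state names and metric name here are illustrative, not the exporter's actual code), the exporter could pre-initialise every known state to 0 and only then overlay the observed counts:

```python
# Hypothetical sketch: expose node counts as one metric with a "state"
# label, pre-initialised to 0 so no time series is ever missing.

KNOWN_STATES = ["alloc", "idle", "drain", "fail", "error", "mix"]

def node_state_samples(observed_counts):
    """Return {state: count} with every known state present.

    observed_counts: states actually reported by slurm, e.g.
    {"idle": 53, "alloc": 12}. States with no nodes are exported
    as 0 instead of being omitted entirely.
    """
    samples = {state: 0 for state in KNOWN_STATES}
    samples.update(observed_counts)
    return samples

# Print in the Prometheus exposition format, e.g.
#   slurm_nodes{state="fail"} 0
for state, count in sorted(node_state_samples({"idle": 53, "alloc": 12}).items()):
    print(f'slurm_nodes{{state="{state}"}} {count}')
```

With this, a query like `sum(slurm_nodes)` works from the first scrape, and alerts on e.g. `state="fail"` never silently match an absent series.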

It would also be good to have an 'up' metric, something like slurm_up, with a value of 0 if scraping any of the slurm commands fails (see the Prometheus documentation). One could then set an alert on slurm_up == 0: alert('Slurm is not responding').
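A minimal sketch of such a probe (the command and metric name are assumptions for illustration; the real exporter would wire this into its collector):

```python
import subprocess

def slurm_up(cmd=("sinfo", "--version"), timeout=5):
    """Return 1 if the scrape command succeeds, else 0.

    Exported as a gauge (e.g. slurm_up), this lets you alert on
    slurm_up == 0 instead of alerting on metrics going missing.
    """
    try:
        subprocess.run(cmd, check=True, capture_output=True, timeout=timeout)
        return 1
    except (OSError, subprocess.SubprocessError):
        # Command not found, non-zero exit, or timeout: slurm is not responding.
        return 0
```

This mirrors the convention used by other exporters (e.g. mysqld_exporter's `mysql_up`): the scrape itself still succeeds, but the gauge tells you the backend is down.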

Nothing critical, I just thought I would let you know.

Thanks for the great exporter!

@mtds mtds self-assigned this Feb 16, 2018
@bmcgough

Agreed - thank you for this useful exporter!

This suggestion and my addition below would mean a new version: a break from the old metrics and dashboard. I am willing to work on this and could fork from here, or enhance it for a new release.

I would make user and account labels as well.

Job/CPU example:

slurm_jobs{state="pending",partition="default",account="alice",user="bob"} 23
slurm_cpus{state="idle",partition="default"} 53

Node state example:

slurm_nodes{state="drain",partition="default"} 7

You can easily aggregate/slice/dice in PromQL. In fact, I think you might be able to get away with just those three metrics, since everything else could be aggregated or calculated from them.
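To illustrate why three labelled metrics suffice, here is a small sketch of the kind of aggregation PromQL's `sum by (...)` performs, over samples shaped like the examples above (sample values are made up for illustration):

```python
from collections import defaultdict

# Labelled samples as (labels, value) pairs, mirroring e.g.
#   slurm_jobs{state="pending",partition="default",user="bob"} 23
samples = [
    ({"state": "pending", "partition": "default", "user": "bob"}, 23),
    ({"state": "pending", "partition": "gpu", "user": "alice"}, 4),
    ({"state": "running", "partition": "default", "user": "bob"}, 10),
]

def sum_by(samples, *keys):
    """Rough analogue of PromQL's `sum by (keys) (metric)`."""
    out = defaultdict(int)
    for labels, value in samples:
        out[tuple(labels[k] for k in keys)] += value
    return dict(out)

# Totals per state regardless of partition/user:
print(sum_by(samples, "state"))
```

In actual PromQL the equivalent would be `sum by (state) (slurm_jobs)` for per-state totals, or `sum(slurm_jobs)` for the grand total.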

We run a test cluster too, and I differentiate between the two clusters using the Prometheus job name. I've added a custom variable to the dashboard to select the job.
