Slurm exporter crashes on Slurm 20.11.8 #67

Open
DImuthuUpe opened this issue Jan 23, 2022 · 15 comments

@DImuthuUpe

When I try to run curl http://localhost:8080/metrics on the latest build of the exporter, I see the following error message. Is there a fix for this?

panic: runtime error: index out of range [4] with length 4

goroutine 12 [running]:
main.ParseNodeMetrics(0xc0003c6000, 0x1f9, 0x600, 0x0)
/opt/prometheus-slurm-exporter/node.go:56 +0x6d6
main.NodeGetMetrics(0x0)
/opt/prometheus-slurm-exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc0000ab710, 0xc0001a2660)
/opt/prometheus-slurm-exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/root/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:443 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/root/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/registry.go:454 +0x5ce
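
(For context: "index out of range [4] with length 4" means a fifth element was accessed on a slice that only has four. A reduced illustration of the failure mode, not the exporter's actual node.go code:)

package main

import (
	"fmt"
	"strings"
)

func main() {
	// A sinfo-style line that splits into only 4 fields instead of the
	// expected 5 (NodeList, AllocMem, Memory, CPUsState, StateLong).
	line := "node01 386016 386385 48/0/0/48"
	fields := strings.Fields(line)

	// len(fields) is 4, so fields[4] panics with
	// "index out of range [4] with length 4".
	fmt.Println(fields[4])
}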

@wmoore28

Are you using -gpus-acct? I found that gpus.go needs to be updated.

- --format=Allocgres
+ --format=AllocTRES

Here's the error if you run the sacct command as is:

> sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES
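
For reference, a minimal sketch of the changed invocation; the real code lives in gpus.go, and the function name and structure here are only illustrative:

package main

import (
	"fmt"
	"log"
	"os/exec"
)

// gpuAllocations runs the sacct query with the updated format string.
// Allocgres/AllocGRES is rejected by newer Slurm; AllocTRES is the replacement.
func gpuAllocations() string {
	cmd := exec.Command("sacct", "-a", "-X",
		"--format=AllocTRES", // previously --format=Allocgres
		"--state=RUNNING", "--noheader", "--parsable2")
	out, err := cmd.Output()
	if err != nil {
		log.Fatal(err)
	}
	return string(out)
}

func main() {
	fmt.Print(gpuAllocations())
}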

@itzsimpl

itzsimpl commented Mar 4, 2022

If you are using --gpus-acct, try #73.

@JaderGiacon

JaderGiacon commented Mar 22, 2022

Hi all,

I am having the same issue and tried both solutions (@wmoore28 and @itzsimpl).
My Slurm version is 21.08.5.

Does anyone have a workaround for the array-length panic?
The process/service is killed by this panic.

P.S.: go test runs fine for all the *.go tests.

Thanks

@itzsimpl

itzsimpl commented Mar 22, 2022

Looking at the error in the first post, this has nothing to do with gpus.go but with node.go. Try running the command sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong on your system and examine the output. This is what node.go executes and then parses. Compare your output with what is in sinfo_mem.txt.
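
If it helps, here is a small standalone sketch for inspecting what the parser sees; it is not the exporter's implementation, just a way to spot lines that would trip an unguarded access to the fifth field:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// The same command node.go runs.
	out, err := exec.Command("sinfo", "-h", "-N", "-O",
		"NodeList,AllocMem,Memory,CPUsState,StateLong").Output()
	if err != nil {
		fmt.Println("sinfo failed:", err)
		return
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Fields(line)
		// node.go expects 5 columns; any line that splits into fewer
		// (an empty column, columns running together, ...) would cause
		// an out-of-range panic if indexed without a length check.
		if len(fields) < 5 {
			fmt.Printf("short line (%d fields): %q\n", len(fields), line)
			continue
		}
		fmt.Println("node:", fields[0], "state:", fields[4])
	}
}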

@DImuthuUpe
Author

This fixed it in my environment. I can send a PR if it fixes the issue globally:
DImuthuUpe@3adffd8

@JaderGiacon

JaderGiacon commented Mar 23, 2022

Hi All,

Thanks for the quick reply.
Here is the output of the command sinfo -h -N -O NodeList,AllocMem,Memory,CPUsState,StateLong.
Just some entries, because we also have hundreds of instances in the cloud:

Please note that the separator rendered here as a space is actually a tab.

cloudsmp-r548-40 0 756878 0/48/0/48 idle~
cloudsmp-r548-41 0 756878 0/48/0/48 idle~
cloudsmp-r548-42 0 756878 0/48/0/48 idle~

node02 386016 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated
node03 386000 386385 48/0/0/48 allocated

Comparing with sinfo_mem.txt, it seems to be the same (5 columns).

As a second test, I ran @DImuthuUpe's command:
sinfo -h -N -O "NodeList: ,AllocMem: ,Memory: ,CPUsState: ,StateLong:"

The difference from the previous command is that the separator is a single space instead of a tab:

cloudsmp-r548-37 0 756878 0/48/0/48 idle~
cloudsmp-r548-38 0 756878 0/48/0/48 idle~
cloudsmp-r548-39 0 756878 0/48/0/48 idle~
cloudsmp-r548-40 0 756878 0/48/0/48 idle~

node01 135744 386385 37/11/0/48 mixed
node01 135744 386385 37/11/0/48 mixed
node01 135744 386385 37/11/0/48 mixed

I have updated node.go as suggested by @DImuthuUpe, compiled it, and ran it in the foreground:
bin/prometheus-slurm-exporter -gpus-acct
INFO[0000] Starting Server: :8080 source="main.go:59"
INFO[0000] GPUs Accounting: true source="main.go:60"
FATA[0026] exit status 1 source="gpus.go:101"

Then I amended the sacct command as suggested by @wmoore28 and it worked fine.

In summary, there are two changes:
1 - node.go, as suggested by @DImuthuUpe
2 - gpus.go, for those who use the -gpus-acct parameter, as suggested by @wmoore28

A good improvement could be an automated test of the sacct command (in this case for GPUs), as sketched after the output below:
sacct -a -X --format=Allocgres --state=RUNNING --noheader --parsable2
sacct: fatal: AllocGRES is deprecated, please use AllocTRES
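
Something along these lines, assuming sacct is available on the machine running the tests (only a sketch; the test name and placement are made up):

package main

import (
	"os/exec"
	"strings"
	"testing"
)

// TestSacctFormat runs the same sacct invocation the exporter uses and fails
// if Slurm rejects the format string, e.g.
// "sacct: fatal: AllocGRES is deprecated, please use AllocTRES".
func TestSacctFormat(t *testing.T) {
	cmd := exec.Command("sacct", "-a", "-X",
		"--format=AllocTRES", "--state=RUNNING", "--noheader", "--parsable2")
	out, err := cmd.CombinedOutput()
	if err != nil || strings.Contains(string(out), "deprecated") {
		t.Fatalf("sacct invocation rejected: %v\n%s", err, out)
	}
}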

Thanks everyone for the quick help.
I will keep an eye on new versions released here with the fixes applied.
I am pretty sure @vpenso will find a way to keep backward compatibility across Slurm versions.

Thanks again!

@itzsimpl

@JaderGiacon I have integrated the fix from @DImuthuUpe, and also patched gpus.go as there was an issue when gres does not use gpuType. Could you perhaps test if it works for you?

@DImuthuUpe
Author

Thanks, @itzsimpl, for integrating the fix.

@JaderGiacon

Hi @itzsimpl,

I have compiled the version from your GitHub (https://github.com/itzsimpl/prometheus-slurm-exporter) and it works fine.
The process is no longer being killed and the GPU information is coming through correctly. The go test also passes.

Thank you so much for it!

@mtds
Collaborator

mtds commented Mar 29, 2022

@ALL: I have merged the updated PR #73 into the development branch.

The master branch will be kept backward compatible with Slurm versions up to 18.x.

@Sebbo94BY

Hello!
Is there any news or plan for when this fix from the development branch will be released to the master/main branch?

@gbeyer3

gbeyer3 commented Jul 12, 2022

@Sebi94nbg -- @vpenso and @mtds seem to have gone offline. Another bug fix, for 20+ character node names, with a PR offered in May, has also not been merged into main. 🤷

@mtds
Collaborator

mtds commented Jul 14, 2022

Hello! Is there any news or plan for when this fix from the development branch will be released to the master/main branch?

Possibly choosing the name development for the other branch was not the right call. PR #73 was not merged
into master because we would like to keep it backward compatible with the Slurm version we are still using (18.x),
and this PR is not meant for old versions.

Additional PRs which require new and/or updated Slurm functionality will be merged only into the development
branch. If you are using a newer version of Slurm (from 20.x onwards), our suggestion is to skip the master branch
and only use development.

@Sebbo94BY

Thanks for your feedback and information.

Unfortunately, a proper branch name alone is not enough here.

We need a tagged release so that we can pin a specific version of this project and ensure the exporter is always the same version. Simply using the branch could lead to different installations and behaviours whenever someone pushes or merges changes into that branch.

That's why we currently use this branch, for example, but pin a specific commit to avoid deploying different versions.

Currently, you're simply incrementing your release version. What about release tags like these?

  • Releases 0.x => Slurm 18.x
  • Releases 1.x => Slurm 20.x

Then you would keep backward compatibility, and users could decide which version/release they need or want.

In addition, you may also want to rename the development branch to something like slurm-20.x.

@imoes

imoes commented May 10, 2023

I still have the same problem with Slurm 22.05.5.1. The node exporter crashes after a few moments.
