Slurm exporter crashes on Slurm 20.11.8 #67
Comments
Are you using -gpus-acct? I found that gpus.go needs to be updated.
Here's the error if you run the sacct command as is:
If you are using
Reviewing the error listed in the first post, this has nothing to do with
This fixed it in my environment. I can send a PR if it fixes the issue globally.
Hi All,

Thanks for the quick reply. Please consider the space rendered as a 'tab':

cloudsmp-r548-40 0 756878 0/48/0/48 idle~
node02 386016 386385 48/0/0/48 allocated

Compared with sinfo_mem.txt, it seems to be equal (5 columns).

The second test was to run the command from @DImuthuUpe:

The difference from the previous command is a single space instead of a tab:

cloudsmp-r548-37 0 756878 0/48/0/48 idle~
node01 135744 386385 37/11/0/48 mixed

I have updated node.go as suggested by @DImuthuUpe, compiled it and run it in the foreground:

Then I amended the sacct command as suggested by @wmoore28 and it worked fine.

In summary, there are two changes:

A good improvement could be an automated test of the sacct command (in this case for gpus):

Thanks everyone for the quick help. Thanks again!
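For illustration, here is a minimal Go sketch of why the tab-versus-space difference in the two sinfo samples above need not matter to a parser that splits on any run of whitespace. `splitColumns` is a hypothetical helper and not the exporter's code; the two rows are the samples quoted in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// splitColumns is a hypothetical helper: strings.Fields splits on any run of
// whitespace, so tab-separated and space-separated rows yield the same columns.
func splitColumns(line string) []string {
	return strings.Fields(line)
}

func main() {
	// Sample rows from this thread, one with tabs and one with single spaces.
	tabbed := "node02\t386016\t386385\t48/0/0/48\tallocated"
	spaced := "node01 135744 386385 37/11/0/48 mixed"

	for _, line := range []string{tabbed, spaced} {
		cols := splitColumns(line)
		fmt.Println(len(cols), cols)
	}
}
```

Both rows come back as five columns, which matches the "5 columns" observation above.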
@JaderGiacon I have integrated the fix from @DImuthuUpe, and also patched
Thanks, @itzsimpl, for integrating the fix.
Hi @itzsimpl, I have compiled the version present in your github (https://github.com/itzsimpl/prometheus-slurm-exporter) and it worked fine. Thank you so much for it!
@ALL : I have merged the updated PR #73 into the development branch. The master branch will be kept backward compatible with Slurm versions up to 18.x.
Hello! |
Possibly choosing the name development for the other branch was not the right one. The PR #73 was not merged Additional PRs, which requires new and/or updated functionalities of Slurm will be merged only into the development |
Thanks for your feedback and information. Unfortunately, it's not enough to only have a proper branch name here. We need a tagged release so that we can pick a specific version of this project and ensure that the exporter is always the same version. Simply using the branch could lead to different installations and behaviours when someone pushes or merges changes into it. That's why we currently use this branch, for example, but pin a specific commit to avoid deploying different versions. Currently, you're simply incrementing your release version. What about these release tags?
Then you would keep the backward compatibility and users could decide which version or release they need or want. In addition, you may also want to rename the
I still have the same problem with Slurm 22.05.5.1. The node exporter crashes after a few moments.
When I try to run curl http://localhost:8080/metrics on the latest build of the exporter, I see the following error message. Is there a fix for this?
panic: runtime error: index out of range [4] with length 4
goroutine 12 [running]:
main.ParseNodeMetrics(0xc0003c6000, 0x1f9, 0x600, 0x0)
/opt/prometheus-slurm-exporter/node.go:56 +0x6d6
main.NodeGetMetrics(0x0)
/opt/prometheus-slurm-exporter/node.go:40 +0x2a
main.(*NodeCollector).Collect(0xc0000ab710, 0xc0001a2660)
/opt/prometheus-slurm-exporter/node.go:128 +0x37
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
/root/go/pkg/mod/github.com/prometheus/client_golang@<version>/prometheus/registry.go:443 +0x12b
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
/root/go/pkg/mod/github.com/prometheus/client_golang@<version>/prometheus/registry.go:454 +0x5ce
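The `index out of range [4] with length 4` message means the code read a fifth element from a slice that held only four. Below is a minimal sketch of a guard against that, assuming the parser splits each sinfo line into whitespace-separated columns and expects five, as in the samples earlier in this thread. `nodeRow` and `parseNodeLine` are hypothetical names, and the actual code at node.go:56 may look different.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeRow holds the five columns seen in the sinfo samples in this thread.
// The type and its field names are hypothetical, chosen for illustration.
type nodeRow struct {
	Name     string
	AllocMem string
	TotalMem string
	CPUState string
	State    string
}

// parseNodeLine checks the column count before indexing, so a short line
// produces an error instead of an index-out-of-range panic.
func parseNodeLine(line string) (nodeRow, error) {
	f := strings.Fields(line)
	if len(f) < 5 {
		return nodeRow{}, fmt.Errorf("expected 5 columns, got %d: %q", len(f), line)
	}
	return nodeRow{Name: f[0], AllocMem: f[1], TotalMem: f[2], CPUState: f[3], State: f[4]}, nil
}

func main() {
	lines := []string{
		"node02 386016 386385 48/0/0/48 allocated", // sample row from this thread
		"node02 386016 386385 48/0/0/48",           // hypothetical row with a column missing
	}
	for _, line := range lines {
		row, err := parseNodeLine(line)
		if err != nil {
			fmt.Println("skipping line:", err)
			continue
		}
		fmt.Printf("%s is %s\n", row.Name, row.State)
	}
}
```

Skipping a malformed line keeps the collector running, at the cost of silently dropping that node from the metrics. Logging the raw line, as the error message here does, makes it easier to see which sinfo output triggered the problem.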