DEVELOPMENT: adjusted to merge gres_gpu branch
mtds committed Feb 3, 2021
2 parents 00a7dee + e9f7aef commit b6639a5
Showing 4 changed files with 174 additions and 18 deletions.
10 changes: 8 additions & 2 deletions DEVELOPMENT.md
@@ -30,7 +30,7 @@ go mod download
Build the exporter:

```bash
go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,nodes,partitions,queue,scheduler,users}.go
go build -o bin/prometheus-slurm-exporter {main,accounts,cpus,gpus,partitions,nodes,queue,scheduler,users}.go
```

Run all tests included in `_test.go` files:
@@ -44,6 +44,13 @@ Start the exporter (foreground), and query all metrics:
```bash
bin/prometheus-slurm-exporter
...

If you wish to run the exporter on a different port, or the default port (8080) is already in use, run with the following argument:

```bash
bin/prometheus-slurm-exporter --listen-address="0.0.0.0:<port>"
...
# query all metrics (default port)
curl http://localhost:8080/metrics
```
@@ -56,4 +63,3 @@ References:
* [Metric Types](https://prometheus.io/docs/concepts/metric_types/)
* [Writing Exporters](https://prometheus.io/docs/instrumenting/writing_exporters/)
* [Available Exporters](https://prometheus.io/docs/instrumenting/exporters/)

28 changes: 18 additions & 10 deletions README.md
@@ -11,9 +11,19 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/
* **Other**: CPUs which are unavailable for use at the moment.
* **Total**: total number of CPUs.

- [Information extracted from the SLURM **sinfo** command](https://slurm.schedmd.com/sinfo.html)
- Information extracted from the SLURM [**sinfo**](https://slurm.schedmd.com/sinfo.html) command.
- [Slurm CPU Management User and Administrator Guide](https://slurm.schedmd.com/cpu_management.html)

### State of the GPUs

* **Allocated**: GPUs which have been allocated to a job.
* **Idle**: GPUs which are not allocated to any job.
* **Total**: total number of GPUs.
* **Utilization**: total GPU utilization on the cluster.

- Information extracted from the SLURM [**sinfo**](https://slurm.schedmd.com/sinfo.html) and [**sacct**](https://slurm.schedmd.com/sacct.html) commands.
- [Slurm GRES scheduling](https://slurm.schedmd.com/gres.html)
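As a rough illustration of how the GPU values relate to each other, the sketch below derives the idle count and the utilization ratio from a total and an allocated count, in the same way `gpus.go` in this commit does. The type and function names are hypothetical, and the guard for a cluster without GPUs is an assumption added here to avoid a `NaN` result.

```go
package main

import "fmt"

// gpuSummary is an illustrative type holding the GPU values
// exported by the collector in this commit.
type gpuSummary struct {
	Alloc, Idle, Total, Utilization float64
}

// summarize derives idle GPUs and the utilization ratio (0 to 1)
// from the total and allocated counts.
func summarize(total, alloc float64) gpuSummary {
	s := gpuSummary{Alloc: alloc, Idle: total - alloc, Total: total}
	// Assumption: report 0 instead of NaN when no GPUs are configured.
	if total > 0 {
		s.Utilization = alloc / total
	}
	return s
}

func main() {
	fmt.Printf("%+v\n", summarize(8, 6))
}
```

Utilization here is a ratio, not a percentage, so a dashboard would multiply it by 100 for display.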

### State of the Nodes

* **Allocated**: nodes which have been allocated to one or more jobs.
@@ -29,7 +39,7 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/
* **Mixed**: nodes which have some of their CPUs ALLOCATED while others are IDLE.
* **Resv**: these nodes are in an advanced reservation and not generally available.

[Information extracted from the SLURM **sinfo** command](https://slurm.schedmd.com/sinfo.html)
- Information extracted from the SLURM [**sinfo**](https://slurm.schedmd.com/sinfo.html) command.

### Status of the Jobs

@@ -46,7 +56,7 @@ Prometheus collector and exporter for metrics extracted from the [Slurm](https:/
* **PREEMPTED**: Jobs terminated due to preemption.
* **NODE_FAIL**: Jobs terminated due to failure of one or more allocated nodes.

[Information extracted from the SLURM **squeue** command](https://slurm.schedmd.com/squeue.html)
- Information extracted from the SLURM [**squeue**](https://slurm.schedmd.com/squeue.html) command.

### State of the Partitions

@@ -62,7 +72,7 @@ The following information about jobs are also extracted via [squeue](https://slu

### Scheduler Information

* **Server Thread count**: The number of current active ``slurmctld`` threads.
* **Server Thread count**: The number of current active ``slurmctld`` threads.
* **Queue size**: The length of the scheduler queue.
* **DBD Agent queue size**: The length of the message queue for _SlurmDBD_.
* **Last cycle**: Time in microseconds for last scheduling cycle.
@@ -75,19 +85,19 @@ The following information about jobs are also extracted via [squeue](https://slu
* **(Backfill) Total Backfilled Jobs** (since last stats cycle start): number of jobs started thanks to backfilling since last time stats were reset.
* **(Backfill) Total backfilled heterogeneous Job components**: number of heterogeneous job components started thanks to backfilling since last Slurm start.

[Information extracted from the SLURM **sdiag** command](https://slurm.schedmd.com/sdiag.html)
- Information extracted from the SLURM [**sdiag**](https://slurm.schedmd.com/sdiag.html) command.

*DBD Agent queue size*: it is particularly important to keep track of it, since an increasing number of messages
counted with this parameter almost always indicates one of three issues:
* the _SlurmDBD_ daemon is down;
* the _SlurmDBD_ daemon is down;
* the database is either down or unreachable;
* the status of the Slurm accounting DB may be inconsistent (e.g. ``sreport`` missing data, weird utilization of the cluster, etc.).


## Installation

* Read [DEVELOPMENT.md](DEVELOPMENT.md) in order to build the Prometheus Slurm Exporter. After a successful build copy the executable
`bin/prometheus-slurm-exporter` to a node with access to the Slurm command-line interface.
`bin/prometheus-slurm-exporter` to a node with access to the Slurm command-line interface.

* A [Systemd Unit][sdu] file to run the executable as service is available in [lib/systemd/prometheus-slurm-exporter.service](lib/systemd/prometheus-slurm-exporter.service).

@@ -104,7 +114,7 @@ scrape_configs:
#
# SLURM resource manager:
#
#
- job_name: 'my_slurm_exporter'
scrape_interval: 30s
@@ -151,5 +161,3 @@ This is free software: you can redistribute it and/or modify it under the terms
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.


141 changes: 141 additions & 0 deletions gpus.go
@@ -0,0 +1,141 @@
/* Copyright 2020 Joeri Hermans, Victor Penso, Matteo Dessalvi
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>. */

package main

import (
	"io/ioutil"
	"os/exec"
	"strconv"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/common/log"
)

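// GPUsMetrics holds the cluster-wide GPU counters gathered from Slurm.
type GPUsMetrics struct {
	alloc       float64
	idle        float64
	total       float64
	utilization float64
}

// GPUsGetMetrics collects the current GPU metrics.
func GPUsGetMetrics() *GPUsMetrics {
	return ParseGPUsMetrics()
}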

// ParseAllocatedGPUs sums the GPUs allocated to running jobs, as reported by sacct.
func ParseAllocatedGPUs() float64 {
	var num_gpus = 0.0

	args := []string{"-a", "-X", "--format=Allocgres", "--state=RUNNING", "--noheader", "--parsable2"}
	output := string(Execute("sacct", args))
	if len(output) > 0 {
		for _, line := range strings.Split(output, "\n") {
			if len(line) > 0 {
				line = strings.Trim(line, "\"")
				// Expects descriptors of the form "gpu:<count>"; descriptors
				// that do not parse as a number contribute 0.
				descriptor := strings.TrimPrefix(line, "gpu:")
				job_gpus, _ := strconv.ParseFloat(descriptor, 64)
				num_gpus += job_gpus
			}
		}
	}

	return num_gpus
}

// ParseTotalGPUs sums the GPUs configured on every node, as reported by sinfo.
func ParseTotalGPUs() float64 {
	var num_gpus = 0.0

	// The format string is its own argument; no shell is involved, so it needs no quoting.
	args := []string{"-h", "-o", "%n %G"}
	output := string(Execute("sinfo", args))
	if len(output) > 0 {
		for _, line := range strings.Split(output, "\n") {
			fields := strings.Fields(strings.Trim(line, "\""))
			// Skip blank or malformed lines instead of panicking on a missing GRES column.
			if len(fields) < 2 {
				continue
			}
			descriptor := strings.TrimPrefix(fields[1], "gpu:")
			descriptor = strings.Split(descriptor, "(")[0]
			node_gpus, _ := strconv.ParseFloat(descriptor, 64)
			num_gpus += node_gpus
		}
	}

	return num_gpus
}

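// ParseGPUsMetrics combines the sinfo and sacct results into a GPUsMetrics value.
func ParseGPUsMetrics() *GPUsMetrics {
	var gm GPUsMetrics
	total_gpus := ParseTotalGPUs()
	allocated_gpus := ParseAllocatedGPUs()
	gm.alloc = allocated_gpus
	gm.idle = total_gpus - allocated_gpus
	gm.total = total_gpus
	// Avoid a NaN gauge (0/0) on clusters that report no GPUs.
	if total_gpus > 0 {
		gm.utilization = allocated_gpus / total_gpus
	}
	return &gm
}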

// Execute runs the given command with its arguments and returns its standard output.
func Execute(command string, arguments []string) []byte {
	cmd := exec.Command(command, arguments...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	out, err := ioutil.ReadAll(stdout)
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
	return out
}

/*
 * Implement the Prometheus Collector interface and feed the
 * Slurm GPU metrics into it.
 * https://godoc.org/github.com/prometheus/client_golang/prometheus#Collector
 */

func NewGPUsCollector() *GPUsCollector {
	return &GPUsCollector{
		alloc:       prometheus.NewDesc("slurm_gpus_alloc", "Allocated GPUs", nil, nil),
		idle:        prometheus.NewDesc("slurm_gpus_idle", "Idle GPUs", nil, nil),
		total:       prometheus.NewDesc("slurm_gpus_total", "Total GPUs", nil, nil),
		utilization: prometheus.NewDesc("slurm_gpus_utilization", "Total GPU utilization", nil, nil),
	}
}

type GPUsCollector struct {
	alloc       *prometheus.Desc
	idle        *prometheus.Desc
	total       *prometheus.Desc
	utilization *prometheus.Desc
}

// Send all metric descriptions
func (cc *GPUsCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- cc.alloc
	ch <- cc.idle
	ch <- cc.total
	ch <- cc.utilization
}

// Collect the current GPU metrics and send them as gauges
func (cc *GPUsCollector) Collect(ch chan<- prometheus.Metric) {
	cm := GPUsGetMetrics()
	ch <- prometheus.MustNewConstMetric(cc.alloc, prometheus.GaugeValue, cm.alloc)
	ch <- prometheus.MustNewConstMetric(cc.idle, prometheus.GaugeValue, cm.idle)
	ch <- prometheus.MustNewConstMetric(cc.total, prometheus.GaugeValue, cm.total)
	ch <- prometheus.MustNewConstMetric(cc.utilization, prometheus.GaugeValue, cm.utilization)
}
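The parsing in `gpus.go` strips a `gpu:` prefix and an optional parenthesized suffix before converting the remainder to a number. A possible variant, sketched below under assumptions, isolates that step in one function so it can be tested without calling `sinfo` or `sacct`. The helper `parseGPUCount` is hypothetical and not part of this commit; it additionally assumes that a typed descriptor such as `gpu:tesla:2` carries the count in its last colon-separated field, and it does not handle comma-separated GRES lists.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseGPUCount is a hypothetical helper. It extracts the GPU count from a
// single GRES descriptor such as "gpu:4", "gpu:tesla:2" or "gpu:4(S:0-1)".
// Anything that is not a GPU entry, or does not end in a number, yields 0.
func parseGPUCount(descriptor string) float64 {
	descriptor = strings.Trim(strings.TrimSpace(descriptor), "\"")
	if !strings.HasPrefix(descriptor, "gpu:") {
		return 0
	}
	// Drop a trailing annotation such as "(S:0-1)".
	descriptor = strings.Split(descriptor, "(")[0]
	// Assumption: the count is the last colon-separated field.
	fields := strings.Split(descriptor, ":")
	count, err := strconv.ParseFloat(fields[len(fields)-1], 64)
	if err != nil {
		return 0
	}
	return count
}

func main() {
	for _, d := range []string{"gpu:4", "gpu:tesla:2(S:0-1)", "(null)"} {
		fmt.Printf("%-20s -> %v\n", d, parseGPUCount(d))
	}
}
```

Keeping the descriptor parsing separate from command execution would let a `_test.go` file cover the string handling with fixed inputs.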
13 changes: 7 additions & 6 deletions main.go
@@ -1,4 +1,4 @@
/* Copyright 2017-2020 Victor Penso, Matteo Dessalvi
/* Copyright 2017-2020 Victor Penso, Matteo Dessalvi, Joeri Hermans
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -25,13 +25,14 @@ import (

func init() {
// Metrics have to be registered to be exposed
prometheus.MustRegister(NewSchedulerCollector()) // from scheduler.go
prometheus.MustRegister(NewQueueCollector()) // from queue.go
prometheus.MustRegister(NewNodesCollector()) // from nodes.go
prometheus.MustRegister(NewCPUsCollector()) // from cpus.go
prometheus.MustRegister(NewAccountsCollector()) // from accounts.go
prometheus.MustRegister(NewUsersCollector()) // from users.go
prometheus.MustRegister(NewCPUsCollector()) // from cpus.go
prometheus.MustRegister(NewGPUsCollector()) // from gpus.go
prometheus.MustRegister(NewNodesCollector()) // from nodes.go
prometheus.MustRegister(NewPartitionsCollector()) // from partitions.go
prometheus.MustRegister(NewQueueCollector()) // from queue.go
prometheus.MustRegister(NewSchedulerCollector()) // from scheduler.go
prometheus.MustRegister(NewUsersCollector()) // from users.go
}

var listenAddress = flag.String(