Docker compose cluster for testing Slurm
Prerequisites:
- docker (25.x.x+ with cgroupsv2 or 24.x.x with cgroupsv1)
- IPv6 must be configured in docker: https://docs.docker.com/config/daemon/ipv6/
- docker-compose-plugin v2.18.1+
- ssh (client)
- jq
- python3
- python3-daemon
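A quick sanity check of the prerequisites on the host (a minimal sketch; the docker format fields assume a reasonably recent docker release):
docker --version
docker compose version
docker info --format '{{.CgroupVersion}} {{.CgroupDriver}}'   # expect "2" when running cgroupsv2
ssh -V
jq --version
python3 -c 'import daemon' && echo "python3-daemon OK"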
Add the following kernel settings on the host, e.g. in /etc/sysctl.conf or a file under /etc/sysctl.d/:
net.ipv4.tcp_max_syn_backlog=4096
net.core.netdev_max_backlog=1000
net.core.somaxconn=15000
# ARP cache garbage collection interval (seconds)
net.ipv4.neigh.default.gc_interval = 3600
# ARP cache entry stale timeout (seconds)
net.ipv4.neigh.default.gc_stale_time = 3600
# Raise ARP cache size thresholds to hold entries for all the container nodes
net.ipv4.neigh.default.gc_thresh3 = 8096
net.ipv4.neigh.default.gc_thresh2 = 4048
net.ipv4.neigh.default.gc_thresh1 = 1024
# Increase map count for elasticsearch
vm.max_map_count=262144
# Avoid running out of file descriptors
fs.file-max=10000000
fs.inotify.max_user_instances=65535
fs.inotify.max_user_watches=1048576
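To apply the kernel settings without a reboot (run as root):
sysctl --system
sysctl net.core.somaxconn fs.file-max vm.max_map_count   # spot-check a few values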
Make sure the host machine is running CgroupV2 and not hybrid mode: https://slurm.schedmd.com/faq.html#cgroupv2
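One way to confirm the unified cgroup v2 hierarchy is mounted (anything other than "cgroup2fs" indicates legacy or hybrid mode):
stat -fc %T /sys/fs/cgroup/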
Add these settings to the docker configuration: /etc/docker/daemon.json
{
  "exec-opts": [
    "native.cgroupdriver=systemd"
  ],
  "features": {
    "buildkit": true
  },
  "experimental": true,
  "cgroup-parent": "docker.slice",
  "default-cgroupns-mode": "host",
  "storage-driver": "overlay2"
}
Configure systemd to allow docker to run in its own slice to avoid systemd conflicting with it:
/etc/systemd/system/docker.slice:
[Unit]
Description=docker slice
Before=slices.target
[Slice]
CPUAccounting=true
MemoryAccounting=true
Delegate=yes
/usr/lib/systemd/system/docker.service.d/local.conf:
[Service]
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Delegate=yes
Activate the changes:
systemctl daemon-reload
systemctl restart docker.slice docker
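After the restart, verify docker picked up the cgroup and storage settings (expected output: systemd 2 overlay2):
docker info --format '{{.CgroupDriver}} {{.CgroupVersion}} {{.Driver}}'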
MariaDB Database Node:
- db
Slurm Management Nodes:
- mgmtnode
- mgmtnode2
- slurmdbd
Compute Nodes:
- node[00-09]
Login Nodes:
- login
Nginx Proxy node:
- proxy
Rest API Nodes:
- rest
Kibana (Only supports IPv4):
Elasticsearch:
Grafana:
- View http://localhost:3000/
- User: admin
- Password: admin
Open On-Demand:
- View http://localhost:8081/
- User: {user name - "fred" or "wilma"}
- Password: password
Open XDMoD:
Proxy:
- Auth REST API http://localhost:8080/auth
- Query REST API http://localhost:8080/slurm/
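A hedged sketch of exercising slurmrestd through the proxy. It assumes the /auth endpoint returns a bare JWT for the requesting user and that the path and OpenAPI version under /slurm/ match the slurmrestd build in the image; adjust both as needed. The X-SLURM-USER-NAME and X-SLURM-USER-TOKEN headers are the standard slurmrestd authentication headers.
TOKEN=$(curl -s http://localhost:8080/auth)   # assumes /auth returns a bare JWT
curl -s -H "X-SLURM-USER-NAME: fred" -H "X-SLURM-USER-TOKEN: $TOKEN" \
    http://localhost:8080/slurm/v0.0.40/ping   # path and version are assumptions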
Each cluster must have a unique class B subnet.
Default IPv4 is SUBNET="10.11". Default IPv6 is SUBNET6="2001:db8:1:1::".
Custom node lists may be provided by setting NODELIST to point to a file containing a list of nodes for the cluster, or by modifying the default generated "nodelist" file in the scaleout directory (see the sketch below).
The node list follows the following format with one node per line:
${HOSTNAME} ${CLUSTERNAME} ${IPv4} ${IPv6}
Example line:
node00 scaleout 10.11.5.0 2001:db8:1:1::5:0
Note that the service nodes cannot be changed and will always be placed into the following subnets:
${SUBNET}.1.0/24 ${SUBNET6}1:0/122
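As an illustration, a tiny script that regenerates a 10-node nodelist in the format above. The ".5." host range mirrors the example line and is only an assumption; keep custom addresses outside the reserved service subnets and match the defaults in the generated nodelist:
SUBNET="10.11"
SUBNET6="2001:db8:1:1::"
for i in $(seq 0 9); do
    # .5. host range is an assumption; match the default generated nodelist
    printf 'node%02d scaleout %s.5.%d %s5:%x\n' "$i" "$SUBNET" "$i" "$SUBNET6" "$i"
done > scaleout/nodelist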
To specify an explicit version of Slurm to be compiled and installed:
export SLURM_RELEASE=slurm-$version
Make sure to call make clean afterwards to invalidate all caches from the prior release.
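For example (the release name below is illustrative; use whichever Slurm branch or tag you intend to build), then rebuild as shown below:
export SLURM_RELEASE=slurm-23.11   # illustrative release name
make clean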
To build the images:
git submodule update --init --force --remote --recursive
make build
To run the cluster:
make
To run in cloud mode:
make clean
make cloud
Note: cloud mode will run in the foreground.
To build without the docker cache:
make nocache
To stop the cluster:
make stop
To tear down the cluster and reset cached state:
make clean
To remove all built images:
make uninstall
To open a control shell on the management node:
make bash
To open a shell on a specific node:
make HOST=node00 bash
To log in to the login node via ssh as "fred" or "wilma":
ssh-keygen -f "/home/$(whoami)/.ssh/known_hosts" -R "10.11.1.5" 2>/dev/null
ssh -o StrictHostKeyChecking=no -l fred 10.11.1.5 -X #use 'password'
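Once logged in (or from a control shell), a few standard Slurm commands to verify the cluster is scheduling, assuming the default node[00-09] compute nodes came up:
sinfo
srun -N2 hostname
sbatch --wrap='sleep 60'
squeue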
Federation mode will create multiple Slurm clusters, each with its own nodes and slurmctld daemon. Other nodes, such as login and slurmdbd, will be shared.
To create multiple federation clusters:
export FEDERATION="taco burrito quesadilla"
echo "FederationParameters=fed_display" >> scaleout/slurm/slurm.conf
truncate -s0 scaleout/nodelist
make clean
make build
make
Configure Slurm for multiple federation clusters:
make HOST=quesadilla-mgmtnode bash
sacctmgr add federation scaleout clusters=taco,burrito,quesadilla
Notify slurmdbd to use the federation after building the clusters:
export FEDERATION="taco burrito quesadilla"
make HOST=taco-mgmtnode bash
sacctmgr add federation scaleout clusters=taco,burrito,quesadilla
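To confirm the federation is in place, the standard Slurm commands can be run from any management or login node:
sacctmgr show federation
scontrol show federation
squeue --federation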
To remove the federation clusters and revert to a single cluster:
export FEDERATION="taco burrito quesadilla"
make uninstall
truncate -s0 scaleout/nodelist
Each container node sees every CPU thread on the host, so the apparent CPU count is the host's thread count multiplied by the number of nodes and the host is heavily oversubscribed. Do not attempt to run computationally intensive applications.
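For example, from a control shell (make bash), each node advertises the host's full thread count and the cluster total is roughly ten times the real hardware:
scontrol show node node00 | grep -o 'CPUTot=[0-9]*'
sinfo -o '%C'   # allocated/idle/other/total CPUs across all nodes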
If the build or startup fails with one of the following errors:
ERROR: Pool overlaps with other one on this address space
or
failed to prepare ${HASH}: max depth exceeded
ERROR: Service 'slurmdbd' failed to build : Build failed
run the following and retry:
make clean
docker network prune -f
sudo systemctl restart docker
To save all images:
make save
To load previously saved images:
make load
To manually dump the Slurm accounting data for XDMoD:
make HOST=scaleout_mgmtnode_1 bash
bash /etc/cron.hourly/dump_xdmod.sh
exit
or, equivalently, from a control shell:
make bash
exec bash /etc/cron.hourly/dump_xdmod.sh
Then shred and ingest the data on the xdmod node:
make HOST=xdmod bash
sudo -u xdmod -- /usr/bin/xdmod-shredder -r scaleout -f slurm -i /xdmod/data.csv
sudo -u xdmod -- /usr/bin/xdmod-ingestor
exit
To disable XDMoD, set the following before building. This will only disable attempts to build and start the XDMoD container:
export DISABLE_XDMOD=1
The Linux kernel has a hard limit of 65535 cgroups total. Stacking a large number of jobs or scaleout instances may result in the following error:
error: proctrack_g_create: No space left on device
When this happens, fewer jobs must be run, as this is a kernel limitation.
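A rough way to see how many cgroups the host currently has (counts every cgroup directory in the unified hierarchy):
find /sys/fs/cgroup -type d | wc -l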