Getting nvidia-device-plugin container CrashLoopBackOff | version v0.14.0 | container runtime : containerd #406
Comments
Hi @DineshwarSingh, could you comment on how the device plugin is configured / installed. Note that the device plugin also requires that the NVIDIA Container Toolkit be installed on the system and be configured as a runtime class in Containerd. Have you installed the toolkit and configured Containerd to use it as a runtime? |
Hi @elezar, Regards, |
@DineshwarSingh how is the Device Plugin deployed? What are the contents of your Containerd |
@elezar The device plugin is deployed using Helm, version v0.14.0. [plugins."io.containerd.grpc.v1.cri".containerd.runtimes] |
Thanks for the information. I assume that Could you also provide the contents of your |
Hi @elezar, disable-require = false [nvidia-container-cli] [nvidia-container-runtime] #Specify the runtimes to consider. This list is processed in order and the PATH mode = "auto"
========================================================================= |
Hi @elezar, Thanks, |
I am also seeing this in my environment. My config and output look the same as above. Happy to provide more details. |
The following packages are installed.
Containerd has been verified with the following.
The plugin was installed with the following manifest.
# https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-device-plugin
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
      annotations: {}
    spec:
      priorityClassName: system-node-critical
      securityContext: {}
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0
          imagePullPolicy: IfNotPresent
          name: nvidia-device-plugin-ctr
          env:
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:nbody
      args: ['nbody', '-gpu', '-benchmark']
      resources:
        limits:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
Using the following information to get started. https://docs.k3s.io/advanced#nvidia-container-runtime-support Let me know what other details are helpful. |
Note that using the CTR command: Note that in an earlier comment you mention
Could you confirm that the |
Good eyes. It looks like it also is detected here.
# grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
Here is the full config, but how do I know if it is the non-default?
version = 2
[plugins."io.containerd.internal.v1.opt"]
path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
stream_server_address = "127.0.0.1"
stream_server_port = "10010"
enable_selinux = false
enable_unprivileged_ports = true
enable_unprivileged_icmp = true
sandbox_image = "rancher/mirrored-pause:3.6"
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
disable_snapshot_annotations = true
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
SystemdCgroup = true
[plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://registry.default.svc.cluster.local:5000"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."registry.default.svc.cluster.local:5000".tls]
ca_file = "/etc/ssl/ca.pem"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime" |
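One way to answer the "is it the non-default?" question is to look for `default_runtime_name` in the rendered config. The following is a minimal sketch, assuming that containerd's CRI plugin falls back to `runc` when that key is absent; the helper names are hypothetical and not part of any NVIDIA or k3s tooling.

```shell
#!/bin/sh
# Hypothetical helpers. Pass the path to the rendered containerd config,
# e.g. /var/lib/rancher/k3s/agent/etc/containerd/config.toml on k3s.

# Print the CRI default runtime name.
# Assumption: containerd uses "runc" when default_runtime_name is not set.
containerd_default_runtime() {
    name=$(sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*$/\1/p' "$1" | head -n 1)
    printf '%s\n' "${name:-runc}"
}

# Succeed if a runtime handler named "nvidia" is registered at all.
containerd_has_nvidia_runtime() {
    grep -Eq 'containerd\.runtimes\."?nvidia"?\]' "$1"
}
```

For the config posted above, the first helper would print `runc` and the second would succeed, which suggests that `nvidia` is registered but is not the default, so a pod only gets it when it sets `runtimeClassName: nvidia`.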
Also note that the GPU benchmark pod also fails to schedule with the |
@zachfi since there is no The issue that the benchmark container is seeing is because the device plugin is not reporting any |
❯ k -n kube-system logs nvidia-device-plugin-qxs5n
I0608 13:59:52.537232 1 main.go:154] Starting FS watcher.
I0608 13:59:52.537319 1 main.go:161] Starting OS watcher.
I0608 13:59:52.537643 1 main.go:176] Starting Plugins.
I0608 13:59:52.537659 1 main.go:234] Loading configuration.
I0608 13:59:52.537778 1 main.go:242] Updating config with default resource matching patterns.
I0608 13:59:52.537990 1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0608 13:59:52.538003 1 main.go:256] Retreiving plugins.
W0608 13:59:52.538273 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0608 13:59:52.538319 1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0608 13:59:52.538344 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0608 13:59:52.538352 1 factory.go:115] Incompatible platform detected
E0608 13:59:52.538361 1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0608 13:59:52.538368 1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0608 13:59:52.538373 1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0608 13:59:52.538379 1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0608 13:59:52.538500 1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed |
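For anyone comparing their own logs against the one above, the line to look for is the NVML load failure. A plausible reading is that `libnvidia-ml.so.1` is missing because the plugin container was not started through the NVIDIA runtime, though that is an interpretation and not something the log itself states. A small sketch with a hypothetical helper name that flags this signature in logs piped on stdin:

```shell
#!/bin/sh
# Hypothetical helper: read device plugin logs on stdin and report whether
# they contain the NVML load failure shown in the log above.
plugin_log_diagnosis() {
    if grep -q 'could not load NVML library'; then
        echo "nvml-missing"
    else
        echo "no-nvml-error"
    fi
}

# Intended use against a live cluster (not run here):
#   kubectl -n kube-system logs nvidia-device-plugin-qxs5n | plugin_log_diagnosis
```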
Are you ALSO specifying the |
Amazing! This was the missing piece. Once I added this, the plugin deployed and registered the GPU and then the benchmark was able to run. Thank you for the assist @elezar. The daemonset example I had pulled didn't have this setting, so perhaps it is also missing from helm, and adding it would resolve the issue for @DineshwarSingh as well. |
@zachfi Hi! Which parameter did you end up adding, and where? I have exactly the same problem (I install via Helm). |
@SergeSpinoza the device plugin needs to specify The Helm daemonset template does define:
So deploying with |
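For Helm installs, a sketch of what setting the runtime class at deploy time could look like. Everything specific here is an assumption to check against your chart version: the repo alias and release name `nvdp`, the `kube-system` namespace, and above all a chart value named `runtimeClassName`. The function defaults to a dry run that only prints the command.

```shell
#!/bin/sh
# Sketch only. Assumed, not confirmed by this thread: repo alias "nvdp",
# release name "nvdp", namespace kube-system, and a chart value called
# runtimeClassName. Set DRY_RUN=0 to actually invoke helm.
deploy_device_plugin() {
    runtime_class="${1:-nvidia}"
    # Rebuild the positional parameters as the helm command line.
    set -- helm upgrade -i nvdp nvdp/nvidia-device-plugin \
        --namespace kube-system \
        --version 0.14.0 \
        --set "runtimeClassName=${runtime_class}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Running `deploy_device_plugin` with no arguments prints the command for review. The RuntimeClass named `nvidia` still has to exist in the cluster for the resulting pod to be admitted.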
@elezar Thanks. After I added runtimeClassName: nvidia, I got a different error: I have no runtimeclasses
I don't really understand what I did wrong. I followed these instructions: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
and restarted containerd.
I get:
What did I miss? |
One thing I noticed today is that if you run What I notice in that output is that it does NOT say "No processes found", and I think you have a process running on the GPU currently. |
Also, one of the versions of the daemonset I tried was using |
@SergeSpinoza I have the same experience. I think a quick checklist would help us work out what we are missing. @elezar |
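Pulling together what has come up in this thread so far, a rough checklist might be: the driver works on the host, containerd has an `nvidia` runtime handler, a RuntimeClass named `nvidia` exists, and the device plugin daemonset actually sets `runtimeClassName`. The sketch below covers only the last check, using a hypothetical helper that reads a manifest on stdin; the daemonset name and namespace in the comments are taken from the manifest posted earlier and may differ in your cluster.

```shell
#!/bin/sh
# Hypothetical helper: read a workload manifest (YAML) on stdin and print the
# first runtimeClassName it sets, or "unset" if it sets none.
manifest_runtime_class() {
    value=$(sed -n 's/^[[:space:]]*runtimeClassName:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' | head -n 1)
    printf '%s\n' "${value:-unset}"
}

# Intended use against a live cluster (not run here):
#   nvidia-smi                          # driver responds on the host
#   kubectl get runtimeclass nvidia     # the RuntimeClass object exists
#   kubectl -n kube-system get daemonset nvidia-device-plugin -o yaml \
#       | manifest_runtime_class        # "unset" means the plugin uses the default runtime
```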
My environment and setup:
Mon Jun 12 21:52:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4090 Off| 00000000:4B:00.0 Off | Off |
| 32% 32C P0 51W / 450W| 0MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off| 00000000:B1:00.0 Off | Off |
| 30% 31C P0 46W / 450W| 0MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
4. How k8s-device-plugin is deployed: basic features manifest, following this link: https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes
I have to admit that I do not fully understand how k8s-device-plugin works, and I am very thankful for your hard work. I really need to figure out the root cause in my case; any ideas are appreciated. |
Manual RuntimeClass creation in kubernetes cluster helped me. Manifest:
Docs: https://kubernetes.io/docs/concepts/containers/runtime-class/ |
I have this RuntimeClass, but it has no effect for me. Could you please share your k8s-device-plugin daemonset manifest? Thanks |
My daemonset:
|
Any update on this issue? I am having the same problem. I have followed the documentation here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html to update containerd config. I have run
I can confirm that when
I have deployed the daemonset nvidia plugin via Helm based on these instructions:
But the pod is crashing:
Running
I am able to run
What am I doing wrong? Edit: |
@simsicon a few suggestions:
|
I believe I have figured it out. At least in my case.
Most of the tutorials out there were suggesting a k3d template instead of the k3s template. I thought that was wrong and I assumed that the k3s service should "detect" the nvidia container runtime. It does but it does not make it the default one. This template seems to work: https://github.com/skirsten/k3s/blob/f78a66b44e2ecbef64122be99a9aa9118a49d7e9/pkg/agent/templates/templates_linux.go#L10 But a simpler solution, in case you don't want to force every pod to have nvidia runtime, is to add |
apiVersion: apps/v1 |
I used the k8s device plugin daemonset v0.14.0, and it works fine with the versions below.
My host runs Ubuntu 22.04.
|
Fix for me in
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
kubectl patch daemonset -n kube-system nvidia-device-plugin --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/runtimeClassName", "value": "nvidia"}]' |
For k3s I fixed it like this (snippet for the impatient). More details: https://docs.k3s.io/advanced#configuring-containerd
|
Getting an nvidia-device-plugin container CrashLoopBackOff error. Using k8s-device-plugin version v0.14.0 with containerd as the container runtime. The same setup works fine with dockerd as the container runtime.
Pod ErrorLog:
nvidia-smi output: