From 6decc1563f285fdd2b641355790e447a4664288d Mon Sep 17 00:00:00 2001 From: Chip Zoller Date: Fri, 25 Oct 2024 14:27:26 -0400 Subject: [PATCH] [Docs] Add catalog of labels to README (#928) * fix gfd Signed-off-by: Chip Zoller * table Signed-off-by: Chip Zoller * format Signed-off-by: Chip Zoller * update Signed-off-by: Chip Zoller * add more vgpu Signed-off-by: chipzoller * correct GFD version statement Signed-off-by: chipzoller * lint/fix headings Signed-off-by: chipzoller * 0.16.1 => 0.16.2 Signed-off-by: chipzoller * update ToC Signed-off-by: chipzoller * linting Signed-off-by: chipzoller * no conf support Signed-off-by: chipzoller * fencing Signed-off-by: chipzoller * use latest cuda-sample tag Signed-off-by: chipzoller * nice link to NFD Signed-off-by: chipzoller * save Signed-off-by: chipzoller * update table Signed-off-by: chipzoller * GFD tweaks Signed-off-by: chipzoller * table value updates Signed-off-by: chipzoller * fix Signed-off-by: chipzoller * comments Signed-off-by: chipzoller --------- Signed-off-by: Chip Zoller Signed-off-by: chipzoller --- README.md | 381 ++++++++++++++++----------- docs/gpu-feature-discovery/README.md | 151 ++++++----- 2 files changed, 304 insertions(+), 228 deletions(-) diff --git a/README.md b/README.md index e4d772c7a..aff243459 100644 --- a/README.md +++ b/README.md @@ -21,9 +21,10 @@ - [With CUDA Time-Slicing](#with-cuda-time-slicing) - [With CUDA MPS](#with-cuda-mps) - [IMEX Support](#imex-support) +- [Catalog of Labels](#catalog-of-labels) - [Deployment via `helm`](#deployment-via-helm) - [Configuring the device plugin's `helm` chart](#configuring-the-device-plugins-helm-chart) - - [Passing configuration to the plugin via a `ConfigMap`.](#passing-configuration-to-the-plugin-via-a-configmap) + - [Passing configuration to the plugin via a `ConfigMap`](#passing-configuration-to-the-plugin-via-a-configmap) - [Single Config File Example](#single-config-file-example) - [Multiple Config File Example](#multiple-config-file-example) - [Updating Per-Node Configuration With a Node Label](#updating-per-node-configuration-with-a-node-label) @@ -46,29 +47,32 @@ ## About The NVIDIA device plugin for Kubernetes is a Daemonset that allows you to automatically: + - Expose the number of GPUs on each nodes of your cluster - Keep track of the health of your GPUs - Run GPU enabled containers in your Kubernetes cluster. This repository contains NVIDIA's official implementation of the [Kubernetes device plugin](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/). -As of v0.16.1 this repository also holds the implementation for GPU Feature Discovery labels, +As of v0.15.0 this repository also holds the implementation for GPU Feature Discovery labels, for further information on GPU Feature Discovery see [here](docs/gpu-feature-discovery/README.md). Please note that: + - The NVIDIA device plugin API is beta as of Kubernetes v1.10. - The NVIDIA device plugin is currently lacking - - Comprehensive GPU health checking features - - GPU cleanup features + - Comprehensive GPU health checking features + - GPU cleanup features - Support will only be provided for the official NVIDIA device plugin (and not for forks or other variants of this plugin). 
## Prerequisites The list of prerequisites for running the NVIDIA device plugin is described below: -* NVIDIA drivers ~= 384.81 -* nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems) -* nvidia-container-runtime configured as the default low-level runtime -* Kubernetes version >= 1.10 + +- NVIDIA drivers ~= 384.81 +- nvidia-docker >= 2.0 || nvidia-container-toolkit >= 1.7.0 (>= 1.11.0 to use integrated GPUs on Tegra-based systems) +- nvidia-container-runtime configured as the default low-level runtime +- Kubernetes version >= 1.10 ## Quick Start @@ -86,11 +90,11 @@ Please see: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/in For instructions on installing and getting started with the NVIDIA Container Toolkit, refer to the [installation guide](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#installation-guide). - Also note the configuration instructions for: -* [`containerd`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd-for-kubernetes) -* [`CRI-O`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-cri-o) -* [`docker` (Deprecated)](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) + +- [`containerd`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd-for-kubernetes) +- [`CRI-O`](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-cri-o) +- [`docker` (Deprecated)](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker) Remembering to restart each runtime after applying the configuration changes. @@ -98,11 +102,13 @@ If the `nvidia` runtime should be set as the default runtime (required for `dock must also be included in the commands above. If this is not done, a RuntimeClass needs to be defined. ##### Notes on `CRI-O` configuration + When running `kubernetes` with `CRI-O`, add the config file to set the `nvidia-container-runtime` as the default low-level OCI runtime under `/etc/crio/crio.conf.d/99-nvidia.conf`. 
This will take priority over the default `crun` config file at `/etc/crio/crio.conf.d/10-crun.conf`: -``` + +```toml [crio] [crio.runtime] @@ -114,19 +120,25 @@ When running `kubernetes` with `CRI-O`, add the config file to set the runtime_path = "/usr/bin/nvidia-container-runtime" runtime_type = "oci" ``` + As stated in the linked documentation, this file can automatically be generated with the nvidia-ctk command: + +```shell +sudo nvidia-ctk runtime configure --runtime=crio --set-as-default --config=/etc/crio/crio.conf.d/99-nvidia.conf ``` -$ sudo nvidia-ctk runtime configure --runtime=crio --set-as-default --config=/etc/crio/crio.conf.d/99-nvidia.conf -``` + `CRI-O` uses `crun` as default low-level OCI runtime so `crun` needs to be added to the runtimes of the `nvidia-container-runtime` in the config file at `/etc/nvidia-container-runtime/config.toml`: -``` + +```toml [nvidia-container-runtime] runtimes = ["crun", "docker-runc", "runc"] ``` + And then restart `CRI-O`: -``` -$ sudo systemctl restart crio + +```shell +sudo systemctl restart crio ``` ### Enabling GPU Support in Kubernetes @@ -135,7 +147,7 @@ Once you have configured the options above on all the GPU nodes in your cluster, you can enable GPU support by deploying the following Daemonset: ```shell -$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.1/deployments/static/nvidia-device-plugin.yml +kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml ``` **Note:** This is a simple static daemonset meant to demonstrate the basic @@ -148,8 +160,8 @@ production setting. With the daemonset deployed, NVIDIA GPUs can now be requested by a container using the `nvidia.com/gpu` resource type: -```yaml -$ cat <-rc.1`). Full releases will be listed without this. ### Configuring the device plugin's `helm` chart -The `helm` chart for the latest release of the plugin (`v0.16.1`) includes +The `helm` chart for the latest release of the plugin (`v0.16.2`) includes a number of customizable values. Prior to `v0.12.0` the most commonly used values were those that had direct @@ -637,9 +675,9 @@ case of the original values is then to override an option from the `ConfigMap` if desired. Both methods are discussed in more detail below. The full set of values that can be set are found here: -[here](https://github.com/NVIDIA/k8s-device-plugin/blob/v0.16.1/deployments/helm/nvidia-device-plugin/values.yaml). +[here](https://github.com/NVIDIA/k8s-device-plugin/blob/v0.16.2/deployments/helm/nvidia-device-plugin/values.yaml). -#### Passing configuration to the plugin via a `ConfigMap`. +#### Passing configuration to the plugin via a `ConfigMap` In general, we provide a mechanism to pass _multiple_ configuration files to to the plugin's `helm` chart, with the ability to choose which configuration @@ -649,6 +687,7 @@ In this way, a single chart can be used to deploy each component, but custom configurations can be applied to different nodes throughout the cluster. There are two ways to provide a `ConfigMap` for use by the plugin: + 1. Via an external reference to a pre-defined `ConfigMap` 1. As a set of named config files to build an integrated `ConfigMap` associated with the chart @@ -657,7 +696,8 @@ In both cases, the value `config.default` can be set to point to one of the named configs in the `ConfigMap` and provide a default configuration for nodes that have not been customized via a node label (more on this later). 
-##### Single Config File Example +##### Single Config File Example + As an example, create a valid config file on your local filesystem, such as the following: ```shell @@ -675,12 +715,13 @@ EOF ``` And deploy the device plugin via helm (pointing it at this config file and giving it a name): -``` -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set-file config.map.config=/tmp/dp-example-config0.yaml + +```shell +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set-file config.map.config=/tmp/dp-example-config0.yaml ``` Under the hood this will deploy a `ConfigMap` associated with the plugin and put @@ -690,22 +731,25 @@ applied when the plugin comes online. If you don’t want the plugin’s helm chart to create the `ConfigMap` for you, you can also point it at a pre-created `ConfigMap` as follows: + +```shell +kubectl create ns nvidia-device-plugin ``` -$ kubectl create ns nvidia-device-plugin -``` -``` -$ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \ - --from-file=config=/tmp/dp-example-config0.yaml -``` + +```shell +kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \ + --from-file=config=/tmp/dp-example-config0.yaml ``` -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set config.name=nvidia-plugin-configs + +```shell +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set config.name=nvidia-plugin-configs ``` -##### Multiple Config File Example +##### Multiple Config File Example For multiple config files, the procedure is similar. @@ -726,32 +770,36 @@ EOF ``` And redeploy the device plugin via helm (pointing it at both configs with a specified default). 
-``` -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set config.default=config0 \ - --set-file config.map.config0=/tmp/dp-example-config0.yaml \ - --set-file config.map.config1=/tmp/dp-example-config1.yaml + +```shell +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set config.default=config0 \ + --set-file config.map.config0=/tmp/dp-example-config0.yaml \ + --set-file config.map.config1=/tmp/dp-example-config1.yaml ``` As before, this can also be done with a pre-created `ConfigMap` if desired: + +```shell +kubectl create ns nvidia-device-plugin ``` -$ kubectl create ns nvidia-device-plugin -``` -``` -$ kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \ - --from-file=config0=/tmp/dp-example-config0.yaml \ - --from-file=config1=/tmp/dp-example-config1.yaml -``` + +```shell +kubectl create cm -n nvidia-device-plugin nvidia-plugin-configs \ + --from-file=config0=/tmp/dp-example-config0.yaml \ + --from-file=config1=/tmp/dp-example-config1.yaml ``` -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set config.default=config0 \ - --set config.name=nvidia-plugin-configs + +```shell +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set config.default=config0 \ + --set config.name=nvidia-plugin-configs ``` **Note:** If the `config.default` flag is not explicitly set, then a default @@ -765,18 +813,20 @@ provided, it will be chosen as the default because there is no other option. With this setup, plugins on all nodes will have `config0` configured for them by default. However, the following label can be set to change which configuration is applied: -``` + +```shell kubectl label nodes –-overwrite \ - nvidia.com/device-plugin.config= + nvidia.com/device-plugin.config= ``` For example, applying a custom config for all nodes that have T4 GPUs installed on them might be: -``` + +```shell kubectl label node \ - --overwrite \ - --selector=nvidia.com/gpu.product=TESLA-T4 \ - nvidia.com/device-plugin.config=t4-config + --overwrite \ + --selector=nvidia.com/gpu.product=TESLA-T4 \ + nvidia.com/device-plugin.config=t4-config ``` **Note:** This label can be applied either _before_ or _after_ the plugin is @@ -831,31 +881,33 @@ runtimeClassName: ``` Please take a look in the -[`values.yaml`](https://github.com/NVIDIA/k8s-device-plugin/blob/v0.16.1/deployments/helm/nvidia-device-plugin/values.yaml) +[`values.yaml`](https://github.com/NVIDIA/k8s-device-plugin/blob/v0.16.2/deployments/helm/nvidia-device-plugin/values.yaml) file to see the full set of overridable parameters for the device plugin. Examples of setting these options include: Enabling compatibility with the `CPUManager` and running with a request for 100ms of CPU time and a limit of 512MB of memory. 
+ ```shell -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set compatWithCPUManager=true \ - --set resources.requests.cpu=100m \ - --set resources.limits.memory=512Mi +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set compatWithCPUManager=true \ + --set resources.requests.cpu=100m \ + --set resources.limits.memory=512Mi ``` -Enabling compatibility with the `CPUManager` and the `mixed` `migStrategy` +Enabling compatibility with the `CPUManager` and the `mixed` `migStrategy`. + ```shell -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set compatWithCPUManager=true \ - --set migStrategy=mixed +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set compatWithCPUManager=true \ + --set migStrategy=mixed ``` #### Deploying with gpu-feature-discovery for automatic node labels @@ -864,16 +916,16 @@ As of `v0.12.0`, the device plugin's helm chart has integrated support to deploy [`gpu-feature-discovery`](https://github.com/NVIDIA/gpu-feature-discovery) (GFD). You can use GFD to automatically generate labels for the -set of GPUs available on a node. Under the hood, it leverages Node Feature -Discovery to perform this labeling. +set of GPUs available on a node. Under the hood, it leverages [Node Feature Discovery](https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/index.html) to perform this labeling. To enable it, simply set `gfd.enabled=true` during helm install. + ```shell helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ - --namespace nvidia-device-plugin \ - --create-namespace \ - --set gfd.enabled=true + --version=0.16.2 \ + --namespace nvidia-device-plugin \ + --create-namespace \ + --set gfd.enabled=true ``` Under the hood this will also deploy @@ -883,8 +935,8 @@ your cluster and do not wish for it to be pulled in by this installation, you can disable it with `nfd.enabled=false`. In addition to the standard node labels applied by GFD, the following label -will also be included when deploying the plugin with the time-slicing extensions -described [above](#shared-access-to-gpus-with-cuda-time-slicing). +will also be included when deploying the plugin with the time-slicing or MPS extensions +described [above](#shared-access-to-gpus). ``` nvidia.com/.replicas = @@ -892,6 +944,7 @@ nvidia.com/.replicas = Additionally, the `nvidia.com/.product` will be modified as follows if `renameByDefault=false`. + ``` nvidia.com/.product = -SHARED ``` @@ -902,37 +955,38 @@ That is, the `SHARED` annotation ensures that a `nodeSelector` can be used to attract pods to nodes that have shared GPUs on them. Since having `renameByDefault=true` already encodes the fact that the resource is -shared on the resource name , there is no need to annotate the product +shared on the resource name, there is no need to annotate the product name with `SHARED`. Users can already find the shared resources they need by simply requesting it in their pod spec. 
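For illustration, a minimal pod sketch that uses a `nodeSelector` on the `-SHARED` product label to land on a node with shared GPUs. This assumes `renameByDefault=false` and a time-sliced Tesla T4; the pod name, image, and label value are examples only:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-pod            # example name
spec:
  restartPolicy: Never
  nodeSelector:
    # GFD-applied product label with the SHARED suffix; substitute the value reported on your nodes.
    nvidia.com/gpu.product: Tesla-T4-SHARED
  containers:
    - name: cuda-container
      # Example image; any CUDA workload image works here.
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1       # resource name is unchanged when renameByDefault=false
```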
Note: When running with `renameByDefault=false` and `migStrategy=single` both the MIG profile name and the new `SHARED` annotation will be appended to the product name, e.g.: + ``` nvidia.com/gpu.product = A100-SXM4-40GB-MIG-1g.5gb-SHARED ``` #### Deploying gpu-feature-discovery in standalone mode -As of v0.16.1, the device plugin's helm chart has integrated support to deploy -[`gpu-feature-discovery`](https://gitlab.com/nvidia/kubernetes/gpu-feature-discovery/-/tree/main) +As of v0.15.0, the device plugin's helm chart has integrated support to deploy +[`gpu-feature-discovery`](/docs/gpu-feature-discovery/README.md#overview) When gpu-feature-discovery in deploying standalone, begin by setting up the plugin's `helm` repository and updating it at follows: ```shell -$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin -$ helm repo update +helm repo add nvdp https://nvidia.github.io/k8s-device-plugin +helm repo update ``` -Then verify that the latest release (`v0.16.1`) of the plugin is available +Then verify that the latest release (`v0.16.2`) of the plugin is available (Note that this includes the GFD chart): ```shell -$ helm search repo nvdp --devel +helm search repo nvdp --devel NAME CHART VERSION APP VERSION DESCRIPTION -nvdp/nvidia-device-plugin 0.16.1 0.16.1 A Helm chart for ... +nvdp/nvidia-device-plugin 0.16.2 0.16.2 A Helm chart for ... ``` Once this repo is updated, you can begin installing packages from it to deploy @@ -941,8 +995,8 @@ the `gpu-feature-discovery` component in standalone mode. The most basic installation command without any options is then: ```shell -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version 0.16.1 \ +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version 0.16.2 \ --namespace gpu-feature-discovery \ --create-namespace \ --set devicePlugin.enabled=false @@ -952,8 +1006,8 @@ Disabling auto-deployment of NFD and running with a MIG strategy of 'mixed' in the default namespace. ```shell -$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \ - --version=0.16.1 \ +helm upgrade -i nvdp nvdp/nvidia-device-plugin \ + --version=0.16.2 \ --set allowDefaultNamespace=true \ --set nfd.enabled=false \ --set migStrategy=mixed \ @@ -972,84 +1026,96 @@ The example below installs the same chart as the method above, except that it uses a direct URL to the `helm` chart instead of via the `helm` repo. Using the default values for the flags: + ```shell -$ helm upgrade -i nvdp \ - --namespace nvidia-device-plugin \ - --create-namespace \ - https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.16.1.tgz +helm upgrade -i nvdp \ + --namespace nvidia-device-plugin \ + --create-namespace \ + https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.16.2.tgz ``` ## Building and Running Locally The next sections are focused on building the device plugin locally and running it. It is intended purely for development and testing, and not required by most users. -It assumes you are pinning to the latest release tag (i.e. `v0.16.1`), but can +It assumes you are pinning to the latest release tag (i.e. `v0.16.2`), but can easily be modified to work with any available tag or branch. 
### With Docker #### Build + Option 1, pull the prebuilt image from [Docker Hub](https://hub.docker.com/r/nvidia/k8s-device-plugin): + ```shell -$ docker pull nvcr.io/nvidia/k8s-device-plugin:v0.16.1 -$ docker tag nvcr.io/nvidia/k8s-device-plugin:v0.16.1 nvcr.io/nvidia/k8s-device-plugin:devel +docker pull nvcr.io/nvidia/k8s-device-plugin:v0.16.2 +docker tag nvcr.io/nvidia/k8s-device-plugin:v0.16.2 nvcr.io/nvidia/k8s-device-plugin:devel ``` Option 2, build without cloning the repository: + ```shell -$ docker build \ - -t nvcr.io/nvidia/k8s-device-plugin:devel \ - -f deployments/container/Dockerfile.ubuntu \ - https://github.com/NVIDIA/k8s-device-plugin.git#v0.16.1 +docker build \ + -t nvcr.io/nvidia/k8s-device-plugin:devel \ + -f deployments/container/Dockerfile.ubuntu \ + https://github.com/NVIDIA/k8s-device-plugin.git#v0.16.2 ``` Option 3, if you want to modify the code: + ```shell -$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin -$ docker build \ - -t nvcr.io/nvidia/k8s-device-plugin:devel \ - -f deployments/container/Dockerfile.ubuntu \ - . +git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin +docker build \ + -t nvcr.io/nvidia/k8s-device-plugin:devel \ + -f deployments/container/Dockerfile.ubuntu \ + . ``` #### Run + Without compatibility for the `CPUManager` static policy: + ```shell -$ docker run \ - -it \ - --security-opt=no-new-privileges \ - --cap-drop=ALL \ - --network=none \ - -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \ - nvcr.io/nvidia/k8s-device-plugin:devel +docker run \ + -it \ + --security-opt=no-new-privileges \ + --cap-drop=ALL \ + --network=none \ + -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \ + nvcr.io/nvidia/k8s-device-plugin:devel ``` With compatibility for the `CPUManager` static policy: + ```shell -$ docker run \ - -it \ - --privileged \ - --network=none \ - -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \ - nvcr.io/nvidia/k8s-device-plugin:devel --pass-device-specs +docker run \ + -it \ + --privileged \ + --network=none \ + -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins \ + nvcr.io/nvidia/k8s-device-plugin:devel --pass-device-specs ``` ### Without Docker #### Build + ```shell -$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build +C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build ``` #### Run + Without compatibility for the `CPUManager` static policy: + ```shell -$ ./k8s-device-plugin +./k8s-device-plugin ``` With compatibility for the `CPUManager` static policy: + ```shell -$ ./k8s-device-plugin --pass-device-specs +./k8s-device-plugin --pass-device-specs ``` ## Changelog @@ -1057,10 +1123,11 @@ $ ./k8s-device-plugin --pass-device-specs See the [changelog](CHANGELOG.md) ## Issues and Contributing + [Checkout the Contributing document!](CONTRIBUTING.md) -* You can report a bug by [filing a new issue](https://github.com/NVIDIA/k8s-device-plugin/issues/new) -* You can contribute by opening a [pull request](https://help.github.com/articles/using-pull-requests/) +- You can report a bug by [filing a new issue](https://github.com/NVIDIA/k8s-device-plugin/issues/new) +- You can contribute by opening a [pull request](https://help.github.com/articles/using-pull-requests/) ### Versioning diff --git a/docs/gpu-feature-discovery/README.md b/docs/gpu-feature-discovery/README.md index 17978663d..8d7a90a11 100644 --- a/docs/gpu-feature-discovery/README.md +++ 
b/docs/gpu-feature-discovery/README.md @@ -26,8 +26,7 @@ NVIDIA GPU Feature Discovery for Kubernetes is a software component that allows you to automatically generate labels for the set of GPUs available on a node. -It leverages the [Node Feature -Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) +It leverages the [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) to perform this labeling. ## Beta Version @@ -40,14 +39,15 @@ to ease the transition. The list of prerequisites for running the NVIDIA GPU Feature Discovery is described below: -* nvidia-docker version > 2.0 (see how to [install](https://github.com/NVIDIA/nvidia-docker) -and it's [prerequisites](https://github.com/nvidia/nvidia-docker/wiki/Installation-\(version-2.0\)#prerequisites)) -* docker configured with nvidia as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime). -* Kubernetes version >= 1.10 -* NVIDIA device plugin for Kubernetes (see how to [setup](https://github.com/NVIDIA/k8s-device-plugin)) -* NFD deployed on each node you want to label with the local source configured - * When deploying GPU feature discovery with helm (as described below) we provide a way to automatically deploy NFD for you - * To deploy NFD yourself, please see https://github.com/kubernetes-sigs/node-feature-discovery + +- nvidia-docker version > 2.0 (see how to [install](https://github.com/NVIDIA/nvidia-docker) +and its [prerequisites](https://github.com/nvidia/nvidia-docker/wiki/Installation-\(version-2.0\)#prerequisites)) +- docker configured with nvidia as the [default runtime](https://github.com/NVIDIA/nvidia-docker/wiki/Advanced-topics#default-runtime). +- Kubernetes version >= 1.10 +- NVIDIA device plugin for Kubernetes (see how to [setup](https://github.com/NVIDIA/k8s-device-plugin)) +- NFD deployed on each node you want to label with the local source configured + - When deploying GPU feature discovery with helm (as described below) we provide a way to automatically deploy NFD for you + - To deploy NFD yourself, please see https://github.com/kubernetes-sigs/node-feature-discovery ## Quick Start @@ -62,7 +62,7 @@ is running on every node you want to label. NVIDIA GPU Feature Discovery use the `local` source so be sure to mount volumes. See https://github.com/kubernetes-sigs/node-feature-discovery for more details. -You also need to configure the `Node Feature Discovery` to only expose vendor +You also need to configure the Node Feature Discovery to only expose vendor IDs in the PCI source. To do so, please refer to the Node Feature Discovery documentation. @@ -70,7 +70,7 @@ The following command will deploy NFD with the minimum required set of parameters to run `gpu-feature-discovery`. ```shell -kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/nfd.yaml +kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nfd.yaml ``` **Note:** This is a simple static daemonset meant to demonstrate the basic @@ -94,7 +94,7 @@ or as a Job. 
#### Daemonset ```shell -kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/gpu-feature-discovery-daemonset.yaml +kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/gpu-feature-discovery-daemonset.yaml ``` **Note:** This is a simple static daemonset meant to demonstrate the basic @@ -108,21 +108,21 @@ You must change the `NODE_NAME` value in the template to match the name of the node you want to label: ```shell -$ export NODE_NAME= -$ curl https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.15.0/deployments/static/gpu-feature-discovery-job.yaml.template \ +export NODE_NAME= +curl https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/gpu-feature-discovery-job.yaml.template \ | sed "s/NODE_NAME/${NODE_NAME}/" > gpu-feature-discovery-job.yaml -$ kubectl apply -f gpu-feature-discovery-job.yaml +kubectl apply -f gpu-feature-discovery-job.yaml ``` **Note:** This method should only be used for testing and not deployed in a -productions setting. +production setting. ### Verifying Everything Works With both NFD and GFD deployed and running, you should now be able to see GPU related labels appearing on any nodes that have GPUs installed on them. -``` +```shell $ kubectl get nodes -o yaml apiVersion: v1 items: @@ -147,13 +147,13 @@ items: nvidia.com/gpu.product: A100-SXM4-40GB ... ... - ``` ## The GFD Command line interface Available options: -``` + +```shell gpu-feature-discovery: Usage: gpu-feature-discovery [--fail-on-init-error=] [--mig-strategy=] [--oneshot | --sleep-interval=] [--no-timestamp] [--output-file= | -o ] @@ -173,7 +173,6 @@ Options: Arguments: : none | single | mixed - ``` You can also use environment variables: @@ -191,25 +190,36 @@ Environment variables override the command line options if they conflict. ## Generated Labels -This is the list of the labels generated by NVIDIA GPU Feature Discovery and -their meaning: +Below is the list of the labels generated by NVIDIA GPU Feature Discovery and their meaning. +For a similar list of labels generated or used by the device plugin, see [here](/README.md#catalog-of-labels). + +> [!NOTE] +> Label values in Kubernetes are always of type string. The table's value type describes the type within string formatting. 
| Label Name | Value Type | Meaning | Example | | -------------------------------| ---------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -------------- | -| nvidia.com/cuda.driver.major | Integer | Major of the version of NVIDIA driver | 418 | -| nvidia.com/cuda.driver.minor | Integer | Minor of the version of NVIDIA driver | 30 | -| nvidia.com/cuda.driver.rev | Integer | Revision of the version of NVIDIA driver | 40 | -| nvidia.com/cuda.runtime.major | Integer | Major of the version of CUDA | 10 | -| nvidia.com/cuda.runtime.minor | Integer | Minor of the version of CUDA | 1 | -| nvidia.com/gfd.timestamp | Integer | Timestamp of the generated labels (optional) | 1555019244 | -| nvidia.com/gpu.compute.major | Integer | Major of the compute capabilities | 3 | -| nvidia.com/gpu.compute.minor | Integer | Minor of the compute capabilities | 3 | +| nvidia.com/cuda.driver.major | Integer | (Deprecated) Major of the version of NVIDIA driver | 550 | +| nvidia.com/cuda.driver.minor | Integer | (Deprecated) Minor of the version of NVIDIA driver | 107 | +| nvidia.com/cuda.driver.rev | Integer | (Deprecated) Revision of the version of NVIDIA driver | 02 | +| nvidia.com/cuda.driver-version.major | Integer | Major of the version of NVIDIA driver | 550 | +| nvidia.com/cuda.driver-version.minor | Integer | Minor of the version of NVIDIA driver | 107 | +| nvidia.com/cuda.driver-version.revision | Integer | Revision of the version of NVIDIA driver | 02 | +| nvidia.com/cuda.driver-version.full | Integer | Full version number of NVIDIA driver | 550.107.02 | +| nvidia.com/cuda.runtime.major | Integer | (Deprecated) Major of the version of CUDA | 12 | +| nvidia.com/cuda.runtime.minor | Integer | (Deprecated) Minor of the version of CUDA | 5 | +| nvidia.com/cuda.runtime-version.major | Integer | Major of the version of CUDA | 12 | +| nvidia.com/cuda.runtime-version.minor | Integer | Minor of the version of CUDA | 5 | +| nvidia.com/cuda.runtime-version.full | Integer | Full version number of CUDA | 12.5 | +| nvidia.com/gfd.timestamp | Integer | Timestamp of the generated labels (optional) | 1724632719 | +| nvidia.com/gpu.compute.major | Integer | Major of the compute capabilities | 7 | +| nvidia.com/gpu.compute.minor | Integer | Minor of the compute capabilities | 5 | | nvidia.com/gpu.count | Integer | Number of GPUs | 2 | -| nvidia.com/gpu.family | String | Architecture family of the GPU | kepler | -| nvidia.com/gpu.machine | String | Machine type | DGX-1 | -| nvidia.com/gpu.memory | Integer | Memory of the GPU in Mb | 2048 | -| nvidia.com/gpu.product | String | Model of the GPU | GeForce-GT-710 | -| nvidia.com/gpu.mode | String | Display or Compute Mode of the GPU. Details of the GPU modes can be found [here](https://docs.nvidia.com/grid/13.0/grid-gpumodeswitch-user-guide/index.html#compute-and-graphics-mode) | compute | +| nvidia.com/gpu.family | String | Architecture family of the GPU | turing | +| nvidia.com/gpu.machine | String | Machine type. If in a public cloud provider, value may be set to the instance type. | DGX-1 | +| nvidia.com/gpu.memory | Integer | Memory of the GPU in megabytes (MB) | 15360 | +| nvidia.com/gpu.product | String | Model of the GPU. May be modified by the device plugin if a sharing strategy is employed depending on the config. | Tesla-T4 | +| nvidia.com/gpu.replicas | String | Number of GPU replicas available. 
Will be equal to the number of physical GPUs unless some sharing strategy is employed in which case the GPU count will be multiplied by replicas. | 4 |
+| nvidia.com/gpu.mode | String | Mode of the GPU. Can be either "compute" or "display". Details of the GPU modes can be found [here](https://docs.nvidia.com/grid/13.0/grid-gpumodeswitch-user-guide/index.html#compute-and-graphics-mode) | compute |
| nvidia.com/gpu.clique | String | GPUFabric ClusterUUID + CliqueID | 7b968a6d-c8aa-45e1-9e07-e1e51be99c31.1 |
| nvidia.com/gpu.imex-domain | String | IMEX domain IP list (hashed) + CliqueID | 79b326e7-d566-3483-c2a3-9b38fa5cb1c8.1 |

@@ -229,7 +239,7 @@ is partitioned into 7 equal sized MIG devices (56 total).
| nvidia.com/mig.strategy | String | MIG strategy in use | single |
| nvidia.com/gpu.product (overridden) | String | Model of the GPU (with MIG info added) | A100-SXM4-40GB-MIG-1g.5gb |
| nvidia.com/gpu.count (overridden) | Integer | Number of MIG devices | 56 |
-| nvidia.com/gpu.memory (overridden) | Integer | Memory of each MIG device in Mb | 5120 |
+| nvidia.com/gpu.memory (overridden) | Integer | Memory of each MIG device in megabytes (MB) | 5120 |
| nvidia.com/gpu.multiprocessors | Integer | Number of Multiprocessors for MIG device | 14 |
| nvidia.com/gpu.slices.gi | Integer | Number of GPU Instance slices | 1 |
| nvidia.com/gpu.slices.ci | Integer | Number of Compute Instance slices | 1 |
@@ -243,6 +253,7 @@ is partitioned into 7 equal sized MIG devices (56 total).

With this strategy, a separate set of labels for each MIG device type is
generated. The name of each MIG device type is defined as follows:
+
```
MIG_TYPE=mig-<slice count>g.<memory size>gb
e.g. MIG_TYPE=mig-3g.20gb
```

| Label Name | Value Type | Meaning | Example |
| ------------------------------------ | ---------- | ---------------------------------------- | -------------- |
| nvidia.com/mig.strategy | String | MIG strategy in use | mixed |
| nvidia.com/MIG\_TYPE.count | Integer | Number of MIG devices of this type | 2 |
-| nvidia.com/MIG\_TYPE.memory | Integer | Memory of MIG device type in Mb | 10240 |
+| nvidia.com/MIG\_TYPE.memory | Integer | Memory of MIG device type in megabytes (MB) | 10240 |
| nvidia.com/MIG\_TYPE.multiprocessors | Integer | Number of Multiprocessors for MIG device | 14 |
| nvidia.com/MIG\_TYPE.slices.ci | Integer | Number of Compute Instance slices | 1 |
| nvidia.com/MIG\_TYPE.slices.gi | Integer | Number of GPU Instance slices | 1 |
@@ -264,53 +275,50 @@ e.g. MIG_TYPE=mig-3g.20gb

## Deployment via `helm`

-The preferred method to deploy `gpu-feature-discovery` is as a daemonset using `helm`.
+The preferred method to deploy GFD is as a daemonset using `helm`.
Instructions for installing `helm` can be found
[here](https://helm.sh/docs/intro/install/).

-As of `v0.15.0`, the device plugin's helm chart has integrated support to deploy
-[`gpu-feature-discovery`](https://gitlab.com/nvidia/kubernetes/gpu-feature-discovery/-/tree/main)
+As of `v0.15.0`, the device plugin's helm chart has integrated support to deploy GFD.
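If you are already installing the device plugin chart, GFD can simply be enabled alongside it instead of being run standalone; the main README documents this with the `gfd.enabled` flag, for example:

```shell
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --version=0.16.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true
```

The rest of this section covers the standalone deployment.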
-When gpu-feature-discovery in deploying standalone, begin by setting up the
-plugin's `helm` repository and updating it at follows:
+To deploy GFD standalone, begin by setting up the plugin's `helm` repository and updating it as follows:

```shell
-$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
-$ helm repo update
+helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
+helm repo update
```

-Then verify that the latest release (`v0.15.0`) of the plugin is available
-(Note that this includes the GFD chart):
+Then verify that the latest release of the plugin is available
+(Note that this includes GFD):

```shell
$ helm search repo nvdp --devel
NAME CHART VERSION APP VERSION DESCRIPTION
-nvdp/nvidia-device-plugin 0.15.0 0.15.0 A Helm chart for ...
+nvdp/nvidia-device-plugin 0.16.2 0.16.2 A Helm chart for ...
```

-Once this repo is updated, you can begin installing packages from it to deploy
-the `gpu-feature-discovery` component in standalone mode.
+Once this repo is updated, you can begin installing packages from it to deploy GFD in standalone mode.

The most basic installation command without any options is then:

-```
-$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
- --version 0.15.0 \
+```shell
+helm upgrade -i nvdp nvdp/nvidia-device-plugin \
+ --version 0.16.2 \
 --namespace gpu-feature-discovery \
 --create-namespace \
 --set devicePlugin.enabled=false
```

Disabling auto-deployment of NFD and running with a MIG strategy of 'mixed' in
-the default namespace.
+the default namespace:

```shell
-$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
- --version=0.15.0 \
- --set allowDefaultNamespace=true \
- --set nfd.enabled=false \
- --set migStrategy=mixed \
- --set devicePlugin.enabled=false
+helm upgrade -i nvdp nvdp/nvidia-device-plugin \
+ --version=0.16.2 \
+ --set allowDefaultNamespace=true \
+ --set nfd.enabled=false \
+ --set migStrategy=mixed \
+ --set devicePlugin.enabled=false
```

**Note:** You only need to pass the `--devel` flag to `helm search repo` when
searching for a pre-release version (e.g. `-rc.1`). Full releases will be listed without this.

### Deploying via `helm install` with a direct URL to the `helm` package

-If you prefer not to install from the `nvidia-device-plugin` `helm` repo, you can
-run `helm install` directly against the tarball of the plugin's `helm` package.
+If you prefer not to install from the `nvidia-device-plugin` helm repo, you can
+run `helm install` directly against the tarball of the plugin's helm package.
The example below installs the same chart as the method above, except that
-it uses a direct URL to the `helm` chart instead of via the `helm` repo.
+it uses a direct URL to the helm chart instead of via the helm repo.
Using the default values for the flags: ```shell -$ helm upgrade -i nvdp \ - --namespace gpu-feature-discovery \ - --set devicePlugin.enabled=false \ - --create-namespace \ - https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.15.0.tgz +helm upgrade -i nvdp \ + --namespace gpu-feature-discovery \ + --set devicePlugin.enabled=false \ + --create-namespace \ + https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.16.2.tgz ``` ## Building and running locally on your native machine @@ -342,7 +350,7 @@ Download the source code: git clone https://github.com/NVIDIA/k8s-device-plugin ``` -Get dependies: +Get dependencies: ```shell make vendor @@ -350,11 +358,12 @@ make vendor Build it: -``` +```shell make build ``` Run it: -``` + +```shell ./gpu-feature-discovery --output=$(pwd)/gfd ```
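As a quick sanity check (assuming the `--output` path used above), the generated file should contain the labels GFD produced, typically one `name=value` pair per line:

```shell
# Inspect the labels written by gpu-feature-discovery; values below are illustrative only.
cat "$(pwd)/gfd"
# nvidia.com/gfd.timestamp=1724632719
# nvidia.com/gpu.count=2
# nvidia.com/gpu.product=Tesla-T4
```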