Add MPS sharing section to README

Signed-off-by: Evan Lezar <[email protected]>
NVIDIA · Feb 8, 2024 · 41fa3b0 · 41fa3b0
1 parent f6fe2ce
commit 41fa3b0
Showing 1 changed file with 94 additions and 5 deletions.
diff --git a/README.md b/README.md
@@ -348,18 +348,107 @@ options outside of this section are shared.
   launch time. As described below, a `ConfigMap` can be used to point the
   plugin at a desired configuration file when deploying via `helm`.
 
-### Shared Access to GPUs with CUDA Time-Slicing
+### Shared Access to GPUs
 
 The NVIDIA device plugin allows oversubscription of GPUs through a set of
-extended options in its configuration file. Under the hood, CUDA time-slicing
-is used to allow workloads that land on oversubscribed GPUs to interleave with
-one another. However, nothing special is done to isolate workloads that are
+extended options in its configuration file. There are two flavors of sharing
+available: Time-Slicing and MPS.
+
+**Note:** The use of time-slicing and MPS are mutually exclusive.
+
+In the case of time-slicing, CUDA time-slicing is used to allow workloads sharing a GPU to
+interleave with each other. However, nothing special is done to isolate workloads that are
 granted replicas from the same underlying GPU, and each workload has access to
 the GPU memory and runs in the same fault-domain as of all the others (meaning
 if one workload crashes, they all do).
 
+In the case of MPS, a control daemon is used to manage access to the shared GPU. This daemon also allows for memory and compute limits to be enforced per workload.
+
+### With CUDA MPS
+
+**Note:** Using MPS to share access to GPU resources requires a CDI-enabled device
+list strategy: `cdi-annotations`. The `envvar` and `volume-mounts` strategies are not supported.
+
+The extended options for sharing using MPS can be seen below:
+```
+version: v1
+sharing:
+  mps:
+    renameByDefault: <bool>
+    resources:
+    - name: <resource-name>
+      replicas: <num-replicas>
+    ...
+```
+
+That is, for each named resource under `sharing.mps.resources`, a number
+of replicas can be specified for that resource type. As is the case with
+time-slicing, these replicas represent the number of shared accesses that will
+be granted for a GPU associated with that resource type. In contrast with
+time-slicing, the amount of memory allowed per client (i.e. per partition) is
+managed by the MPS control daemon and limited to an equal fraction of the total
+device memory. In addition to controlling the amount of memory that each client
+can consume, the MPS control daemon also limits the amount of compute capacity
+that can be consumed by a client.
+
+If `renameByDefault=true`, then each resource will be advertised under the name
+`<resource-name>.shared` instead of simply `<resource-name>`.
+
+For example:
+```
+version: v1
+sharing:
+  mps:
+    resources:
+    - name: nvidia.com/gpu
+      replicas: 10
+```
+
+If this configuration were applied to a node with 8 GPUs on it, the plugin
+would now advertise 80 `nvidia.com/gpu` resources to Kubernetes instead of 8.
+
+```
+$ kubectl describe node
+...
+Capacity:
+  nvidia.com/gpu: 80
+...
+```
+
+Likewise, if the following configuration were applied to a node, then 80
+`nvidia.com/gpu.shared` resources would be advertised to Kubernetes instead of 8
+`nvidia.com/gpu` resources.
+
+```
+version: v1
+sharing:
+  timeSlicing:
+    renameByDefault: true
+    resources:
+    - name: nvidia.com/gpu
+      replicas: 10
+    ...
+```
+
+```
+$ kubectl describe node
+...
+Capacity:
+  nvidia.com/gpu.shared: 80
+...
+```
+
+Furthermore, each of these resources -- either `nvidia.com/gpu` or
+`nvidia.com/gpu.shared` -- would have access to the same fraction (1/10) of the
+total memory and compute resources of the GPU.
+
+As of now, the only supported resource available for MPS are
+`nvidia.com/gpu` resources. This implies that it is only compatible when using
+full GPUs or the single MIG strategy.
+
+#### With CUDA Time-Slicing
 
-These extended options can be seen below:
+The extended options for sharing using time-slicing can be seen below:
 ```
 version: v1
 sharing: