[feature] Dynamic MIG partitioner #361

Open
elgalu opened this issue Jun 24, 2022 · 7 comments

@elgalu

elgalu commented Jun 24, 2022

Would it be possible for the GPU Operator to make MIG setup transparent, so that end users can request per-pod GPU memory on demand while the MIG configuration is dynamically re-partitioned under the hood, i.e. without any intervention from a sysadmin or DevOps team?

# requesting eight 40Gi MIG slices
resources:
    limits:
        nvidia.com/gpu: 8
        nvidia.com/gpu_memory: "40Gi"

https://www.nvidia.com/en-us/technologies/multi-instance-gpu/

MIG instances can also be dynamically reconfigured
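
For illustration, here is a minimal sketch of the translation such a feature would have to perform: turning a requested memory size into the smallest MIG profile that satisfies it. It assumes an A100 80GB and treats the nominal profile sizes as if they were Gi values, which is a simplification. The function names are hypothetical, and `nvidia.com/gpu_memory` is the resource proposed in this issue, not one that exists today. The `nvidia.com/mig-<profile>` names follow what the device plugin exposes under the mixed MIG strategy.

```python
# Nominal memory (in GB) of the MIG profiles on an A100 80GB.
# Assumption: other GPU models would need their own table.
A100_80GB_PROFILES = {
    "1g.10gb": 10,
    "2g.20gb": 20,
    "3g.40gb": 40,
    "4g.40gb": 40,
    "7g.80gb": 80,
}


def parse_gi(quantity: str) -> int:
    """Parse a quantity such as "40Gi" into an integer number of Gi."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(quantity[:-2])


def smallest_profile(memory: str, profiles=A100_80GB_PROFILES) -> str:
    """Return the smallest profile whose nominal memory covers the request."""
    needed = parse_gi(memory)
    candidates = [(size, name) for name, size in profiles.items() if size >= needed]
    if not candidates:
        raise ValueError(f"no MIG profile provides {memory}")
    # Ties on memory are broken by name, so "3g.40gb" wins over "4g.40gb".
    return min(candidates)[1]


def to_mig_limits(count: int, memory: str) -> dict:
    """Rewrite a (count, memory) request as a mixed-strategy MIG resource limit."""
    return {f"nvidia.com/mig-{smallest_profile(memory)}": count}


print(to_mig_limits(8, "40Gi"))
```

The request above would map to eight `nvidia.com/mig-3g.40gb` instances. The hard part, as the replies below discuss, is not this lookup but re-partitioning the GPUs at scheduling time.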

@klueska
Contributor

klueska commented Jun 24, 2022

Please see this document on why this is not feasible under the current Kubernetes resource model:
Challenges Supporting Multi-Instance GPUs (MIG) in Kubernetes

Once the following newly accepted Kubernetes Enhancement proposal gets implemented, we will be able to build a device plugin that properly supports what you suggest:
kubernetes/enhancements#3064

@elgalu
Author

elgalu commented Aug 2, 2022

I updated the example to emphasise the ability to perform MIG-sliced multi-GPU training, e.g. by requesting eight 40Gi MIG slices (from different GPU cards on the same DGX). As far as I know, this is currently not possible, not even with a static MIG layout.
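
As a rough sketch of what "eight slices from different GPU cards" means for placement, the following spreads instances across GPUs, always picking the GPU with the fewest instances so far. It assumes A100 80GB cards (7 compute slices and 8 memory slices each) and ignores the real MIG rules about which slice positions a profile may occupy. `Gpu` and `place_spread` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field

COMPUTE_SLICES_PER_GPU = 7
MEMORY_SLICES_PER_GPU = 8

# (compute slices, memory slices) consumed by each A100 80GB profile.
PROFILE_COST = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}


@dataclass
class Gpu:
    index: int
    instances: list = field(default_factory=list)

    def fits(self, profile: str) -> bool:
        """Check slice budgets only; real MIG placement has extra constraints."""
        compute, memory = PROFILE_COST[profile]
        used_compute = sum(PROFILE_COST[p][0] for p in self.instances)
        used_memory = sum(PROFILE_COST[p][1] for p in self.instances)
        return (used_compute + compute <= COMPUTE_SLICES_PER_GPU
                and used_memory + memory <= MEMORY_SLICES_PER_GPU)


def place_spread(gpus: list, profile: str, count: int) -> list:
    """Place `count` instances, preferring the least-loaded GPU each time.

    Returns the GPU index chosen for each instance, in order.
    """
    placement = []
    for _ in range(count):
        candidates = [g for g in gpus if g.fits(profile)]
        if not candidates:
            raise RuntimeError(f"cannot place another {profile} instance")
        target = min(candidates, key=lambda g: (len(g.instances), g.index))
        target.instances.append(profile)
        placement.append(target.index)
    return placement


# Eight 3g.40gb instances on an 8-GPU node land on eight different cards.
print(place_spread([Gpu(i) for i in range(8)], "3g.40gb", 8))
```

On a node with only four free GPUs, the same request would still fit (two 3g.40gb instances per card), but the slices would no longer all sit on different cards.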

@klueska
Contributor

klueska commented Aug 3, 2022

Once we have Dynamic Resource Allocation (DRA), everything you propose will be possible. We do not plan to "hack" this support onto the existing plugin; instead, we will put all our effort into supporting an API like this in the new plugin for DRA.

@omer-dayan

I agree with @klueska that DRA is the right way.
However, @elgalu, I do not agree that it's not feasible under the current circumstances.
You are welcome to watch the following video (https://www.youtube.com/watch?v=zk7g3FbW7go), which shows that it has been achieved in Kubernetes.

@sshukun

sshukun commented Jan 22, 2023

Hi @klueska, I cannot wait to try this new DRA feature, but after reading the KEP, I have some concerns about how the resource driver will be implemented.

What I want is not only dynamic MIG configuration but also dynamically allocating network-attached GPUs.

In my understanding, a Resource Driver needs to define its own ResourceClaimParameter CRD, allocate and configure the devices, and interact with kubelet to prepare devices for containers. Most of this work is device specific and should, I believe, be handled by the device vendor, but allocation seems different and more complicated when the devices are dynamically attached over the network.
How would the resource driver determine which device to attach, and how would it interact with the infrastructure to attach it? My infrastructure is built with Liqid fabric switches connecting NVIDIA GPUs and bare-metal servers. A machine can be created and reprogrammed using Liqid management software. In this case, do I need to write a component myself that receives requests from the Resource Driver and interacts with Liqid?

Could you share NVIDIA's thoughts on how to implement a Resource Driver and how to support dynamically attaching devices?
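
To make the question concrete, here is a sketch of one way the split could look: a device-specific driver that delegates "attach a GPU to this node" to a separate infrastructure component. Everything here is hypothetical (`ClaimParameters`, `FabricClient`, `InMemoryFabric`, `ResourceDriver` and their methods); it is not the DRA API, not NVIDIA's design, and not a Liqid interface. The only real name is the `NVIDIA_VISIBLE_DEVICES` environment variable used by the NVIDIA container runtime.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimParameters:
    """Hypothetical stand-in for the vendor-defined claim parameters object."""
    gpu_count: int


class FabricClient(ABC):
    """Hypothetical interface to the composable infrastructure (e.g. a fabric manager)."""

    @abstractmethod
    def attach_gpu(self, node: str) -> str:
        """Attach one free GPU to `node` and return its device identifier."""


class InMemoryFabric(FabricClient):
    """Fake fabric backed by a list of free device IDs, for illustration only."""

    def __init__(self, free_gpus):
        self._free = list(free_gpus)

    def attach_gpu(self, node: str) -> str:
        if not self._free:
            raise RuntimeError("no free GPUs in the fabric pool")
        return self._free.pop(0)


class ResourceDriver:
    """Sketch of the two phases described above: allocate, then prepare."""

    def __init__(self, fabric: FabricClient):
        self._fabric = fabric
        self._allocations = {}

    def allocate(self, claim_id: str, node: str, params: ClaimParameters) -> list:
        """Allocation phase: ask the fabric to attach devices to the chosen node."""
        devices = [self._fabric.attach_gpu(node) for _ in range(params.gpu_count)]
        self._allocations[claim_id] = (node, devices)
        return devices

    def prepare(self, claim_id: str) -> dict:
        """Node phase: return what the container needs in order to see the devices."""
        _node, devices = self._allocations[claim_id]
        return {"NVIDIA_VISIBLE_DEVICES": ",".join(devices)}


driver = ResourceDriver(InMemoryFabric(["GPU-a", "GPU-b", "GPU-c"]))
driver.allocate("claim-1", "node-1", ClaimParameters(gpu_count=2))
print(driver.prepare("claim-1"))
```

In this sketch the fabric-facing component is the part I would expect to write myself, while the driver logic around it would be vendor-specific.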

@zeryx

zeryx commented May 6, 2024

> Please see this document on why this is not feasible under the current Kubernetes resource model: Challenges Supporting Multi-Instance GPUs (MIG) in Kubernetes
>
> Once the following newly accepted Kubernetes Enhancement proposal gets implemented, we will be able to build a device plugin that properly supports what you suggest: kubernetes/enhancements#3064

Looks like kubernetes/enhancements#3064 has merged! Any thoughts on this ask?

@likku123

Right now I am using https://github.com/nebuly-ai/nos for dynamic GPU partitioning. It serves the purpose for now, but I am facing issues when using it with Karpenter. These days the group is not active enough to contribute a solution.
I hope NVIDIA will come up with a plugin to address this requirement.
