1. Issue or feature description
While using the k8s-device-plugin on Kubernetes clusters, I found that GPU card allocation is not as expected. For example, on a node with 8 GPU cards, the first four cards (0 to 3) and the last four cards (4 to 7) each have faster links among themselves (for example, they connect to the same CPU). But for an application using 2 GPU cards, the device plugin may allocate cards across those sets, such as card 2 and card 6. This can lead to bad performance if the application is communication-bound.
To dig into this issue, I added logs to the allocation process and found that all links between GPU cards are recognized as P2PLinkUnknown. The call trace is as follows (at commit 5c4e43d):
k8s-device-plugin/internal/rm/allocate.go, line 46
k8s-device-plugin/vendor/github.com/NVIDIA/go-gpuallocator/gpuallocator/device.go, line 52
k8s-device-plugin/vendor/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml/nvml.go, line 595
k8s-device-plugin/vendor/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml/bindings.go, line 249
The "nvmlDeviceGetTopologyCommonAncestor" query has been supported in NVML since driver 470, while my driver version is 515 (see https://docs.nvidia.com/deploy/archive/R515/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g518f3f171c40e5832cb2b1b26e960b2b), so the query itself should be available. The root cause is the uninitialized dl.handles, which should have been initialized at k8s-device-plugin/vendor/github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml/nvml_dl.go, line 34.
Nevertheless, the Init function is not called during the setup of nvdp. So I went to the initialize function for the NVML resource manager and added an init call for the NVML binding, as follows:
import nvmlbinding "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"

...

// NewNVMLResourceManagers returns a set of ResourceManagers, one for each NVML resource in 'config'.
func NewNVMLResourceManagers(nvmllib nvml.Interface, config *spec.Config) ([]ResourceManager, error) {
	ret := nvmllib.Init()
	if ret != nvml.SUCCESS {
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	// Added: also initialize the vendored gpu-monitoring-tools NVML binding,
	// which go-gpuallocator uses for its topology queries.
	if err := nvmlbinding.Init(); err != nil {
		return nil, fmt.Errorf("failed to init nvml binding: %w", err)
	}
	defer func() {
		ret := nvmllib.Shutdown()
		if ret != nvml.SUCCESS {
			klog.Infof("Error shutting down NVML: %v", ret)
		}
	}()
	// ... (rest of the function unchanged)
With this change in place, the allocation works as expected.
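For reference, the link types can also be inspected outside the plugin. Below is a minimal standalone sketch, assuming the gpu-monitoring-tools bindings (GetDeviceCount, NewDevice, GetP2PLink) behave as in the copy vendored by v0.14.1; without the Init() call it reports P2PLinkUnknown for every pair, which matches the allocation logs above.

// topo_check.go: a minimal sketch (not part of the plugin) that prints the P2P link
// type reported by the legacy gpu-monitoring-tools NVML bindings for every GPU pair.
package main

import (
	"fmt"
	"log"

	nvml "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
)

func main() {
	// Without this Init call, dl.handles stays uninitialized and every
	// topology query falls back to P2PLinkUnknown.
	if err := nvml.Init(); err != nil {
		log.Fatalf("failed to initialize NVML: %v", err)
	}
	defer nvml.Shutdown()

	count, err := nvml.GetDeviceCount()
	if err != nil {
		log.Fatalf("failed to get device count: %v", err)
	}

	devices := make([]*nvml.Device, count)
	for i := uint(0); i < count; i++ {
		d, err := nvml.NewDevice(i)
		if err != nil {
			log.Fatalf("failed to get device %d: %v", i, err)
		}
		devices[i] = d
	}

	// Print the link type between every pair of GPUs.
	for i := 0; i < len(devices); i++ {
		for j := i + 1; j < len(devices); j++ {
			link, err := nvml.GetP2PLink(devices[i], devices[j])
			if err != nil {
				log.Fatalf("failed to query link %d-%d: %v", i, j, err)
			}
			fmt.Printf("GPU %d <-> GPU %d: %v\n", i, j, link)
		}
	}
}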
I'm not sure whether this is the intended fix or whether I misconfigured something (I use a standard setup installed with Helm, v0.14.1). If it is a bug and a fix is wanted, I would be happy to raise a PR for it.
2. Steps to reproduce the issue
1. Install k8s-device-plugin v0.14.1 in a standard Kubernetes cluster with GPU nodes (driver 515, CUDA 11.7), each containing 8 GPU cards.
2. Create an application that uses 2 GPU cards.
3. Use nvidia-smi to check which cards are allocated.
@Somefive thanks for the detailed investigation. I think the workaround is something we can accept for the time being. We have had a backlog item to migrate go-gpuallocator to our updated go-nvml bindings for some time now, and have some changes in flight.
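For comparison, here is a rough sketch of what the same pairwise topology query looks like with the updated github.com/NVIDIA/go-nvml bindings referenced above. It assumes the package-level DeviceGetCount, DeviceGetHandleByIndex, and DeviceGetTopologyCommonAncestor wrappers exist as in recent go-nvml releases, and it is not the plugin's actual code.

// A minimal sketch of a pairwise topology query via go-nvml. The printed level
// corresponds to the NVML_TOPOLOGY_* enum (internal, single/multiple PCIe switch,
// host bridge, NUMA node, system).
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("failed to initialize NVML: %v", ret)
	}
	defer nvml.Shutdown()

	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("failed to get device count: %v", ret)
	}

	for i := 0; i < count; i++ {
		for j := i + 1; j < count; j++ {
			di, ret := nvml.DeviceGetHandleByIndex(i)
			if ret != nvml.SUCCESS {
				log.Fatalf("failed to get device %d: %v", i, ret)
			}
			dj, ret := nvml.DeviceGetHandleByIndex(j)
			if ret != nvml.SUCCESS {
				log.Fatalf("failed to get device %d: %v", j, ret)
			}
			level, ret := nvml.DeviceGetTopologyCommonAncestor(di, dj)
			if ret != nvml.SUCCESS {
				log.Fatalf("topology query failed for pair %d-%d: %v", i, j, ret)
			}
			fmt.Printf("GPU %d <-> GPU %d: common ancestor level %v\n", i, j, level)
		}
	}
}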
We encountered a similar problem when we turned off the health check by setting the environment variable DP_DISABLE_HEALTHCHECKS. Since NVML was never initialized, getPreferredAllocation failed with "calling nvml.GetDeviceCount: nvml: Uninitialized".
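One possible direction for that case, sketched below under the assumption that a single helper can be called from both the health checker and the allocation path: ensureNVMLBinding is a hypothetical name, not an existing function in the plugin, and this is only an illustration of guarding the legacy binding's initialization so it no longer depends on the health check being enabled.

// Hypothetical helper (not part of k8s-device-plugin): initialize the legacy
// gpu-monitoring-tools NVML binding exactly once, no matter which code path
// (health check or getPreferredAllocation) runs first, so that disabling
// health checks via DP_DISABLE_HEALTHCHECKS does not leave NVML uninitialized.
package rm

import (
	"fmt"
	"sync"

	nvmlbinding "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
)

var (
	nvmlBindingOnce sync.Once
	nvmlBindingErr  error
)

func ensureNVMLBinding() error {
	nvmlBindingOnce.Do(func() {
		if err := nvmlbinding.Init(); err != nil {
			nvmlBindingErr = fmt.Errorf("failed to init nvml binding: %w", err)
		}
	})
	return nvmlBindingErr
}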