
Failed to leverage NVML to detect GPU links #439

Open
Somefive opened this issue Sep 18, 2023 · 3 comments


Somefive commented Sep 18, 2023

1. Issue or feature description

While using k8s-device-plugin on Kubernetes clusters, I found that GPU card allocation is not as expected. For example, on a node with 8 GPU cards, the first four (cards 0 to 3) and the last four (cards 4 to 7) have faster links among themselves (for example, they connect to the same CPU). But for an application requesting 2 GPU cards, the device plugin may allocate cards across the two sets, such as card 2 and card 6. This can lead to poor performance if the application is communication-bound.

To dig into this issue, I added logs to the allocation process and found that all links between GPU cards are recognized as P2PLinkUnknown. The call trace is as follows (a standalone sketch of this pairwise query appears after the trace):

  1. availableDevices, err := gpuallocator.NewDevicesFrom(available)
  2. p2plink, err := nvml.GetP2PLink(d1.Device, d2.Device)
  3. r := dl.lookupSymbol("nvmlDeviceGetTopologyCommonAncestor")
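
For anyone who wants to reproduce the check outside the plugin, here is a minimal standalone sketch. It assumes the GetDeviceCount, NewDevice, and GetP2PLink helpers of the github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml package that go-gpuallocator uses, so treat it as illustrative rather than exact. Without the Init call at the top, every pair is reported as P2PLinkUnknown, which is exactly the symptom above.

package main

import (
	"fmt"
	"log"

	nvml "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
)

func main() {
	// Load the NVML library and populate the binding's symbol handles.
	if err := nvml.Init(); err != nil {
		log.Fatalf("failed to init NVML binding: %v", err)
	}
	defer nvml.Shutdown()

	count, err := nvml.GetDeviceCount()
	if err != nil {
		log.Fatalf("calling nvml.GetDeviceCount: %v", err)
	}

	// Print the detected link type for every GPU pair.
	for i := uint(0); i < count; i++ {
		d1, err := nvml.NewDevice(i)
		if err != nil {
			log.Fatalf("failed to get device %d: %v", i, err)
		}
		for j := i + 1; j < count; j++ {
			d2, err := nvml.NewDevice(j)
			if err != nil {
				log.Fatalf("failed to get device %d: %v", j, err)
			}
			link, err := nvml.GetP2PLink(d1, d2)
			if err != nil {
				log.Fatalf("failed to query link %d-%d: %v", i, j, err)
			}
			fmt.Printf("GPU %d <-> GPU %d: %v\n", i, j, link)
		}
	}
}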

The "nvmlDeviceGetTopologyCommonAncestor" is supported in NVML after driver 470 while my driver version is 515 which can be found at https://docs.nvidia.com/deploy/archive/R515/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g518f3f171c40e5832cb2b1b26e960b2b

The root cause of this problem is the uninitialized dl.handles, which should have been set up by the binding's Init function. However, that Init function is never called during the setup of the device plugin (nvdp).

I then went to the initialization function for the NVML resource manager and added an Init call for the NVML binding, as follows:

import nvmlbinding "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
...
...
// NewNVMLResourceManagers returns a set of ResourceManagers, one for each NVML resource in 'config'.
func NewNVMLResourceManagers(nvmllib nvml.Interface, config *spec.Config) ([]ResourceManager, error) {
	ret := nvmllib.Init()
	if ret != nvml.SUCCESS {
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	// Added: initialize the gpu-monitoring-tools NVML binding that
	// go-gpuallocator uses for its P2P link queries.
	if err := nvmlbinding.Init(); err != nil {
		return nil, fmt.Errorf("failed to init nvml binding: %w", err)
	}
	defer func() {
		ret := nvmllib.Shutdown()
		if ret != nvml.SUCCESS {
			klog.Infof("Error shutting down NVML: %v", ret)
		}
	}()

With this change, the GPU links are detected correctly.
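
If this goes in as a proper fix, the binding should probably also be torn down symmetrically next to the existing nvmllib.Shutdown() call; a sketch of that addition (assuming the binding exposes a matching Shutdown that returns an error) would be:

	defer func() {
		// Assumed API: the gpu-monitoring-tools binding exposes a
		// Shutdown that releases what Init acquired.
		if err := nvmlbinding.Shutdown(); err != nil {
			klog.Infof("Error shutting down NVML binding: %v", err)
		}
	}()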

I'm not sure whether this is the desired fix or whether I misconfigured something (I use a standard setup with a Helm install of v0.14.1). If it is a bug and a fix is wanted, I would be happy to raise a PR for it.

2. Steps to reproduce the issue

  1. Install k8s-device-plugin v0.14.1 in a standard Kubernetes cluster with GPU nodes (driver 515, CUDA 11.7), each containing 8 GPU cards.
  2. Create an application requesting 2 GPU cards.
  3. Use nvidia-smi to check which cards were allocated.

elezar (Member) commented Sep 18, 2023

@Somefive thanks for the detailed investigation. I think the workaround is something we can accept for the time being. We have had a backlog item to migrate go-gpuallocator to our updated go-nvml bindings for some time now, and have some changes in flight.
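
For reference, the path that the plugin's preferred-allocation logic drives through go-gpuallocator looks roughly like the sketch below; NewBestEffortPolicy, Allocate, and the Device.UUID field are assumed from the current go-gpuallocator API, so treat the details as illustrative:

import (
	"fmt"

	"github.com/NVIDIA/go-gpuallocator/gpuallocator"
	nvmlbinding "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
)

// preferredAllocation picks `size` devices out of `availableUUIDs`,
// preferring sets of devices with the fastest links between them.
func preferredAllocation(availableUUIDs []string, size int) ([]string, error) {
	// Without this Init, GetP2PLink reports P2PLinkUnknown for every
	// pair and the policy cannot prefer well-connected devices.
	if err := nvmlbinding.Init(); err != nil {
		return nil, fmt.Errorf("failed to init nvml binding: %w", err)
	}
	defer nvmlbinding.Shutdown()

	available, err := gpuallocator.NewDevicesFrom(availableUUIDs)
	if err != nil {
		return nil, fmt.Errorf("failed to build device list: %w", err)
	}

	allocated := gpuallocator.NewBestEffortPolicy().Allocate(available, nil, size)

	uuids := make([]string, 0, len(allocated))
	for _, d := range allocated {
		uuids = append(uuids, d.UUID)
	}
	return uuids, nil
}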

With that said, an MR against our gitlab repo https://gitlab.com/nvidia/kubernetes/device-plugin would be appreciated.

cc @ArangoGutierrez @klueska

flyingfang commented

We encountered a similar problem when we turned off the health check by setting the DP_DISABLE_HEALTHCHECKS environment variable. Since NVML was not initialized, getPreferredAllocation failed with the error "calling nvml.GetDeviceCount: nvml: Uninitialized".

elezar (Member) commented Feb 1, 2024

@flyingfang we have released v0.14.4 of the plugin, which should address this issue.

Please test the latest release and close if your issue is addressed.
