
Failed to leverage NVML to detect GPU links #439

Open
Somefive opened this issue Sep 18, 2023 · 3 comments


Somefive commented Sep 18, 2023

1. Issue or feature description

While using k8s-device-plugin on Kubernetes clusters, I found that GPU card allocation is not as expected. For example, on a node with 8 GPU cards, the first four (cards 0 to 3) and the last four (cards 4 to 7) have faster links among themselves (for example, they connect to the same CPU). But for an application requesting 2 GPU cards, the device plugin may allocate cards across the two sets, such as card 2 and card 6. This can lead to poor performance if the application is communication-bound.

To dig into this issue, I added logs to the allocation process and found that all links between GPU cards are recognized as P2PLinkUnknown. The call trace is as follows (a standalone sketch of this pairwise query appears after the trace):

  1. availableDevices, err := gpuallocator.NewDevicesFrom(available)
  2. p2plink, err := nvml.GetP2PLink(d1.Device, d2.Device)
  3. r := dl.lookupSymbol("nvmlDeviceGetTopologyCommonAncestor")
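
For anyone who wants to reproduce the check outside the plugin, here is a minimal standalone sketch. It assumes the GetDeviceCount, NewDevice, and GetP2PLink helpers of the github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml package that go-gpuallocator uses, so treat it as illustrative rather than exact. Without the Init call at the top, every pair is reported as P2PLinkUnknown, which is exactly the symptom above.

package main

import (
	"fmt"
	"log"

	nvml "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
)

func main() {
	// Load the NVML library and populate the binding's symbol handles.
	if err := nvml.Init(); err != nil {
		log.Fatalf("failed to init NVML binding: %v", err)
	}
	defer nvml.Shutdown()

	count, err := nvml.GetDeviceCount()
	if err != nil {
		log.Fatalf("calling nvml.GetDeviceCount: %v", err)
	}

	// Print the detected link type for every GPU pair.
	for i := uint(0); i < count; i++ {
		d1, err := nvml.NewDevice(i)
		if err != nil {
			log.Fatalf("failed to get device %d: %v", i, err)
		}
		for j := i + 1; j < count; j++ {
			d2, err := nvml.NewDevice(j)
			if err != nil {
				log.Fatalf("failed to get device %d: %v", j, err)
			}
			link, err := nvml.GetP2PLink(d1, d2)
			if err != nil {
				log.Fatalf("failed to query link %d-%d: %v", i, j, err)
			}
			fmt.Printf("GPU %d <-> GPU %d: %v\n", i, j, link)
		}
	}
}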

The "nvmlDeviceGetTopologyCommonAncestor" is supported in NVML after driver 470 while my driver version is 515 which can be found at https://docs.nvidia.com/deploy/archive/R515/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g518f3f171c40e5832cb2b1b26e960b2b

The root cause of this problem is the uninitialized dl.handles, which should have been set up by the binding's Init function. However, that Init function is never called during the setup of the device plugin (nvdp).

I then went to the initialization function for the NVML resource manager and added an Init call for the NVML binding, as follows:

import nvmlbinding "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
...
...
// NewNVMLResourceManagers returns a set of ResourceManagers, one for each NVML resource in 'config'.
func NewNVMLResourceManagers(nvmllib nvml.Interface, config *spec.Config) ([]ResourceManager, error) {
	ret := nvmllib.Init()
	if ret != nvml.SUCCESS {
		return nil, fmt.Errorf("failed to initialize NVML: %v", ret)
	}
	// Added: initialize the gpu-monitoring-tools NVML binding that
	// go-gpuallocator uses for its P2P link queries.
	if err := nvmlbinding.Init(); err != nil {
		return nil, fmt.Errorf("failed to init nvml binding: %w", err)
	}
	defer func() {
		ret := nvmllib.Shutdown()
		if ret != nvml.SUCCESS {
			klog.Infof("Error shutting down NVML: %v", ret)
		}
	}()

With this change, the GPU links are detected correctly.
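
If this goes in as a proper fix, the binding should probably also be torn down symmetrically next to the existing nvmllib.Shutdown() call; a sketch of that addition (assuming the binding exposes a matching Shutdown that returns an error) would be:

	defer func() {
		// Assumed API: the gpu-monitoring-tools binding exposes a
		// Shutdown that releases what Init acquired.
		if err := nvmlbinding.Shutdown(); err != nil {
			klog.Infof("Error shutting down NVML binding: %v", err)
		}
	}()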

I'm not sure whether this is the desired fix or whether I misconfigured something (I use a standard setup with a Helm install of v0.14.1). If it is a bug and a fix is wanted, I would be happy to raise a PR for it.

2. Steps to reproduce the issue

  1. Install k8s-device-plugin v0.14.1 in a standard Kubernetes cluster with GPU nodes (driver 515, CUDA 11.7), each containing 8 GPU cards.
  2. Create an application requesting 2 GPU cards.
  3. Use nvidia-smi to check which cards were allocated.

elezar (Member) commented Sep 18, 2023

@Somefive thanks for the detailed investigation. I think the workaround is something we can accept for the time being. We have had a backlog item to migrate go-gpuallocator to our updated go-nvml bindings for some time now, and have some changes in flight.
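
For reference, the path that the plugin's preferred-allocation logic drives through go-gpuallocator looks roughly like the sketch below; NewBestEffortPolicy, Allocate, and the Device.UUID field are assumed from the current go-gpuallocator API, so treat the details as illustrative:

import (
	"fmt"

	"github.com/NVIDIA/go-gpuallocator/gpuallocator"
	nvmlbinding "github.com/NVIDIA/gpu-monitoring-tools/bindings/go/nvml"
)

// preferredAllocation picks `size` devices out of `availableUUIDs`,
// preferring sets of devices with the fastest links between them.
func preferredAllocation(availableUUIDs []string, size int) ([]string, error) {
	// Without this Init, GetP2PLink reports P2PLinkUnknown for every
	// pair and the policy cannot prefer well-connected devices.
	if err := nvmlbinding.Init(); err != nil {
		return nil, fmt.Errorf("failed to init nvml binding: %w", err)
	}
	defer nvmlbinding.Shutdown()

	available, err := gpuallocator.NewDevicesFrom(availableUUIDs)
	if err != nil {
		return nil, fmt.Errorf("failed to build device list: %w", err)
	}

	allocated := gpuallocator.NewBestEffortPolicy().Allocate(available, nil, size)

	uuids := make([]string, 0, len(allocated))
	for _, d := range allocated {
		uuids = append(uuids, d.UUID)
	}
	return uuids, nil
}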

With that said, an MR against our gitlab repo https://gitlab.com/nvidia/kubernetes/device-plugin would be appreciated.

cc @ArangoGutierrez @klueska

flyingfang commented

We encountered a similar problem when we turned off the health check by setting the DP_DISABLE_HEALTHCHECKS environment variable. Since NVML was not initialized, getPreferredAllocation failed with the error "calling nvml.GetDeviceCount: nvml: Uninitialized".

elezar (Member) commented Feb 1, 2024

@flyingfang we have released v0.14.4 of the plugin, which should address this issue.

Please test the latest release and close if your issue is addressed.
