nixos/nvidia: softdep on nvidia-uvm fails with the open driver (breaks CUDA) #334180

Open
eljamm opened this issue Aug 12, 2024 · 8 comments
Labels: 0.kind: bug, 6.topic: cuda, 6.topic: hardware, 6.topic: nixos

eljamm commented Aug 12, 2024

Describe the bug

Currently, the nvidia-uvm module is lazily loaded (EDIT: introduced in #267335) after the nvidia module:

boot.extraModprobeConfig = ''
  softdep nvidia post: nvidia-uvm
'';

This works fine with the closed-source module, but not with the open-source one. I'm currently using the stable 555.58.02, but I remember encountering this issue with the beta 560 drivers as well.

Aside from loading the module manually with modprobe, running nvidia-settings or nvidia-bug-report.sh (and possibly other commands) also causes the module to load, as reported in NVIDIA/open-gpu-kernel-modules#689.

EDIT(SomeoneSerge): There are more reports reproducing the issue in #286028

Steps To Reproduce

  1. Enable the open nvidia kernel module with hardware.nvidia.open = true;
  2. Reboot your machine
  3. Try to use an application that uses CUDA (fails)
  4. See that nvidia-uvm is not loaded:
$ lsmod | grep nvidia_uvm
$ lsmod | grep nvidia
nvidia_drm            122880  1
nvidia_modeset       1916928  1 nvidia_drm
nvidia_wmi_ec_backlight    12288  0
nvidia              10297344  1 nvidia_modeset
video                  77824  4 nvidia_wmi_ec_backlight,amdgpu,ideapad_laptop,nvidia_modeset
backlight              28672  5 video,nvidia_wmi_ec_backlight,amdgpu,ideapad_laptop,nvidia_modeset
wmi                    32768  5 video,nvidia_wmi_ec_backlight,wmi_bmof,legion_laptop,ideapad_laptop
ecc                    40960  2 ecdh_generic,nvidia
  5. Run an nvidia-* command with sudo:
$ sudo nvidia-settings
ERROR: libEGL setup error : libEGL.so.1: cannot open shared object file: No such file or directory

error: XDG_RUNTIME_DIR is invalid or not set in the environment.

(nvidia-settings:4006): GLib-GObject-CRITICAL **: 10:12:56.602: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
  6. See that nvidia-uvm is loaded:
$ lsmod | grep nvidia_uvm
nvidia_uvm           6815744  0
nvidia              10297344  2 nvidia_uvm,nvidia_modeset
$ lsmod | grep nvidia
nvidia_uvm           6815744  0
nvidia_drm            122880  1
nvidia_modeset       1916928  1 nvidia_drm
nvidia_wmi_ec_backlight    12288  0
nvidia              10297344  2 nvidia_uvm,nvidia_modeset
video                  77824  4 nvidia_wmi_ec_backlight,amdgpu,ideapad_laptop,nvidia_modeset
backlight              28672  5 video,nvidia_wmi_ec_backlight,amdgpu,ideapad_laptop,nvidia_modeset
wmi                    32768  5 video,nvidia_wmi_ec_backlight,wmi_bmof,legion_laptop,ideapad_laptop
ecc                    40960  2 ecdh_generic,nvidia
  7. Try to use an application that uses CUDA (succeeds)

Expected behavior

The nvidia-uvm module loads automatically after the nvidia module.

Additional context

It's worth noting that nvidia-uvm is the only NVIDIA module not in boot.kernelModules. When it is added there, the module loads automatically at startup, as intended.
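
As a minimal sketch of that workaround, assuming an ordinary NixOS configuration module (the surrounding module boilerplate is illustrative and not taken from this issue), the module can be listed explicitly so that it is loaded at boot rather than through the softdep:

```nix
{ ... }:
{
  # Load nvidia-uvm eagerly at boot instead of relying on the softdep.
  # User-side workaround sketch; not the upstream nixpkgs change.
  boot.kernelModules = [ "nvidia-uvm" ];
}
```

After a rebuild and reboot, `lsmod | grep nvidia_uvm` should list the module without first running an nvidia-* command.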

Notify maintainers

Pinging those who I think would be most interested in this. Apologies if I'm wrong.

@Kiskae @NickCao @NixOS/cuda-maintainers

Metadata

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.10.3-xanmod1, NixOS, 24.11 (Vicuna), 24.11.20240809.5e0ca22`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Lix, like Nix) 2.90.0`
 - nixpkgs: `/nix/store/0jr2kk95c34c0b6yxi75q4fqgb43kqkm-source`

@eljamm added the "0.kind: bug" label on Aug 12, 2024

SomeoneSerge commented Aug 12, 2024

CC #267335 @Atry

@SomeoneSerge

Maybe we can start by choosing between the softdep and an eager boot.kernelModules entry based on hardware.nvidia.open, since I think the workflow in #267335 concerns the proprietary driver. Then we'll need somebody to confirm whether eager loading works correctly with the open-source driver on Azure.
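
The selection described here could look roughly like the following. This is a sketch under stated assumptions, not the change that was eventually merged: the `useOpen` binding is a hypothetical helper name, and the comparison with `true` is there only in case the option is unset (null) in a given nixpkgs revision.

```nix
{ config, lib, ... }:
let
  # Hypothetical helper: true only when the open kernel module is explicitly enabled.
  useOpen = config.hardware.nvidia.open == true;
in
{
  # Open driver: load nvidia-uvm eagerly at boot.
  boot.kernelModules = lib.optionals useOpen [ "nvidia-uvm" ];

  # Proprietary driver: keep the lazy softdep introduced in #267335.
  boot.extraModprobeConfig = lib.optionalString (!useOpen) ''
    softdep nvidia post: nvidia-uvm
  '';
}
```

Keeping the softdep for the proprietary driver leaves the #267335 workflow untouched while the open driver gets the eager path.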

@SomeoneSerge added the "6.topic: hardware", "6.topic: cuda", and "6.topic: nixos" labels on Aug 12, 2024

SomeoneSerge commented Aug 12, 2024

The EGL error seems unrelated. But I also don't quite understand how nvidia-uvm could fail to be loaded when nvidia is (EDIT: and when a manual modprobe nvidia_uvm helps).

@muni-corn

probably related: #333123


eljamm commented Aug 13, 2024

when manual modprobe nvidia_uvm helps

For anyone who wants to try this: you can load the module directly with modprobe instead of running the sudo nvidia-* commands. It is still interesting that the latter load it as well.

That said, the nvidia module apparently does not depend on nvidia-uvm anymore when using the open drivers.

# Closed
$ modprobe --show-depends nvidia
insmod /run/booted-system/kernel-modules/lib/modules/6.10.3-xanmod1/misc/nvidia.ko "NVreg_DynamicPowerManagement=0x02" NVreg_PreserveVideoMemoryAllocations=1
insmod /run/booted-system/kernel-modules/lib/modules/6.10.3-xanmod1/misc/nvidia.ko "NVreg_DynamicPowerManagement=0x02" NVreg_PreserveVideoMemoryAllocations=1
insmod /run/booted-system/kernel-modules/lib/modules/6.10.3-xanmod1/misc/nvidia-uvm.ko  # <---

# Open
$ modprobe --show-depends nvidia
insmod /run/booted-system/kernel-modules/lib/modules/6.10.3-xanmod1/kernel/crypto/ecc.ko.xz
insmod /run/booted-system/kernel-modules/lib/modules/6.10.3-xanmod1/kernel/drivers/video/nvidia.ko.xz "NVreg_DynamicPowerManagement=0x02" NVreg_PreserveVideoMemoryAllocations=1 NVreg_OpenRmEnableUnsupportedGpus=1

nvidia-uvm deps look fine to me, though.

# Closed
$ modprobe --show-depends nvidia-uvm
insmod /run/booted-system/kernel-modules/lib/modules/6.10.3-xanmod1/misc/nvidia.ko "NVreg_DynamicPowerManagement=0x02" NVreg_PreserveVideoMemoryAllocations=1
insmod /run/booted-system/kernel-modules/lib/modules/6.10.3-xanmod1/misc/nvidia-uvm.ko

# Open
$ modprobe --show-depends nvidia-uvm
insmod /run/booted-system/kernel-modules/lib/modules/6.10.3-xanmod1/kernel/crypto/ecc.ko.xz
insmod /run/booted-system/kernel-modules/lib/modules/6.10.3-xanmod1/kernel/drivers/video/nvidia.ko.xz "NVreg_DynamicPowerManagement=0x02" NVreg_PreserveVideoMemoryAllocations=1 NVreg_OpenRmEnableUnsupportedGpus=1
insmod /run/booted-system/kernel-modules/lib/modules/6.10.3-xanmod1/kernel/drivers/video/nvidia-uvm.ko.xz

@SomeoneSerge

@eljamm I'm pretty sure nvidia-uvm in the first chunk of the logs comes from softdep.

Trying dry-run (?) without disabling the closed driver I observe:

# nix build nixpkgs#linuxPackages.nvidia_x11_beta_open
# modprobe --show-depends ./result/lib/modules/6.6.36/kernel/drivers/video/nvidia.ko.xz 
modprobe: ERROR: kmod_module 'nvidia' already exists with different path: new-path='/run/booted-system/kernel-modules/lib/modules/6.1.96/misc/nvidia.ko' old-path='/root/./result/lib/modules/6.6.36/kernel/drivers/video/nvidia.ko.xz'
modprobe: ERROR: ctx=0x6f12a0 path=/run/booted-system/kernel-modules/lib/modules/6.1.96/misc/nvidia.ko error=File exists
modprobe: ERROR: kmod_module 'nvidia' already exists with different path: new-path='/run/booted-system/kernel-modules/lib/modules/6.1.96/misc/nvidia.ko' old-path='/root/./result/lib/modules/6.6.36/kernel/drivers/video/nvidia.ko.xz'
modprobe: ERROR: ctx=0x6f12a0 path=/run/booted-system/kernel-modules/lib/modules/6.1.96/misc/nvidia.ko error=File exists
insmod /run/booted-system/kernel-modules/lib/modules/6.1.96/kernel/drivers/base/firmware_loader/firmware_class.ko.xz path=/nix/store/fpcyqd1qd665aphflm8ii1r8s9z8pywr-firmware/lib/firmware 
insmod /root/./result/lib/modules/6.6.36/kernel/drivers/video/nvidia.ko.xz 
insmod /run/booted-system/kernel-modules/lib/modules/6.1.96/misc/nvidia-uvm.ko

As I said, I think it's justified to special-case the open driver in the NixOS modules until we've figured out why softdep might not work (if anyone's willing to prepare a PR).

@SomeoneSerge

#334340 was merged as a symptomatic treatment (thanks @eljamm) but the issue remains open:

  • Why does softdep fail with the open driver?
  • What's up with the udev scripts that consistently fail?


Kiskae commented Aug 25, 2024

  • What's up with the udev scripts that consistently fail?

My guess is that it runs too early: after the driver is loaded but before it is bound to the hardware.

Specifically, the rules set in services.udev.extraRules are not restricted to a particular type of udev event, so they are executed on every event, even those where the driver isn't yet completely ready for initialization.
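
To illustrate what restricting a rule to one event type means, here is a hedged sketch. The `ACTION` match key is standard udev syntax, but the choice of the "add" event and the logging command are assumptions made for illustration only; the thread does not establish which event the real rules should match, and this is not the rule shipped in nixpkgs.

```nix
{ pkgs, ... }:
{
  services.udev.extraRules = ''
    # Hypothetical rule: react only to "add" events for the nvidia device,
    # instead of running on every udev event type.
    ACTION=="add", KERNEL=="nvidia", RUN+="${pkgs.util-linux}/bin/logger nvidia add event seen"
  '';
}
```

Without an `ACTION` match, the same `RUN` command would fire on every event type for the matching device, which is the behaviour described above.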

@SomeoneSerge changed the title from "CUDA doesn't work with nvidia-open because nvidia-uvm doesn't automatically load" to "nixos/nvidia: softdep on nvidia-uvm fails with the open driver (breaks CUDA)" on Dec 9, 2024
marchenstar added a commit to marchenstar/flake that referenced this issue Dec 22, 2024