
nvidia-open-560.28.03 gives assertion error in dmesg with 10 RTX 4500 GPUs #694

Open · 2 tasks done
QuesarVII opened this issue Aug 20, 2024 · 3 comments
Labels
bug (Something isn't working) · NV-Triaged (An NVBug has been created for dev to investigate)

Comments

@QuesarVII

NVIDIA Open GPU Kernel Modules Version

560.28.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04

Kernel Release

6.8.0-40-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

RTX 4500 Ada - quantity 10

Describe the bug

[ 40.046929] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.046960] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.227147] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.227179] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.441196] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.441227] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.542925] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.542955] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.840865] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.840896] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.972728] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 40.972759] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.157421] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.157459] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.356296] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.356326] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.557332] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.557362] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.724534] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533
[ 41.724565] NVRM: nvAssertFailedNoLog: Assertion failed: pGpu->gpuInstance < NV2080_MAX_SUBDEVICES @ gpu_mgr_sli.c:533

To Reproduce

We are using a Supermicro 4125GS-TNRT with ten RTX 4500 Ada GPUs. The dmesg errors shown above appear during driver initialization when the system boots.

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

The Ubuntu cuda and cuda-toolkit-12-6 packages from the repo depend on the nvidia-open packages outright, rather than accepting either nvidia-driver-560 or nvidia-open-560. I worked around that by creating dummy nvidia-open and nvidia-open-560 packages so I could install the closed driver package instead, and that driver works without error.
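The usual way to build such dummy packages on Ubuntu is equivs; a minimal control file might look like the sketch below (the package name matches the dependency discussed above, but the version string and description are illustrative assumptions, not the exact files used here):

Section: misc
Priority: optional
Standards-Version: 3.9.2
Package: nvidia-open
Version: 560.28.03-1
Description: dummy package to satisfy the cuda meta-package dependency
 (assumed workaround sketch; installs no driver files)

Building this with equivs-build and installing the resulting .deb with dpkg -i satisfies apt without pulling in the open kernel modules; a second control file with Package: nvidia-open-560 would cover the other dependency.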

QuesarVII added the bug label Aug 20, 2024
@mtijanic (Collaborator) commented Aug 21, 2024

Hey there! Thanks for the report, great find! Here's a fix:

diff --git a/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c b/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c
index 23c484c8..a98a7353 100644
--- a/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c
+++ b/src/nvidia/src/kernel/gpu_mgr/gpu_mgr_sli.c
@@ -528,9 +528,9 @@ gpumgrGetSliLinks(NV0000_CTRL_GPU_GET_VIDEO_LINKS_PARAMS *pVideoLinksParams)
     while ((pGpu = gpumgrGetNextGpu(gpuAttachMask, &gpuIndex)) &&
            (i < NV0000_CTRL_GPU_MAX_ATTACHED_GPUS))
     {
-        if (pGpu->gpuInstance >= NV2080_MAX_SUBDEVICES)
+        if (pGpu->gpuInstance >= NV_MAX_DEVICES)
         {
-            NV_ASSERT(pGpu->gpuInstance < NV2080_MAX_SUBDEVICES);
+            NV_ASSERT(pGpu->gpuInstance < NV_MAX_DEVICES);
             continue;
         }
 
@@ -542,7 +542,7 @@ gpumgrGetSliLinks(NV0000_CTRL_GPU_GET_VIDEO_LINKS_PARAMS *pVideoLinksParams)
                (j < NV0000_CTRL_GPU_MAX_VIDEO_LINKS))
         {
             if ((peerGpuIndex == gpuIndex) ||
-                (pPeerGpu->gpuInstance >= NV2080_MAX_SUBDEVICES))
+                (pPeerGpu->gpuInstance >= NV_MAX_DEVICES))
             {
                 continue;
             }

NV2080_MAX_SUBDEVICES is the maximum number of subdevices in a single SLI group, which is not relevant here. NV_MAX_DEVICES is the maximum number of GPUs in the system, which is what gpuInstance represents. With 10 GPUs attached, the highest gpuInstance values exceed the subdevice limit, which is why the old check asserted on every boot.
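To make the arithmetic concrete, here is a small standalone sketch (not the driver code; the constant values are my reading of nvlimits.h and should be double-checked):

#include <stdio.h>

/* Values assumed from src/common/sdk/nvidia/inc/nvlimits.h:   */
/*   NV2080_MAX_SUBDEVICES = 8  (max GPUs in one SLI group)    */
/*   NV_MAX_DEVICES        = 32 (max GPUs in the whole system) */
#define NV2080_MAX_SUBDEVICES 8
#define NV_MAX_DEVICES        32

int main(void)
{
    unsigned int gpuInstance;

    /* With 10 GPUs attached, gpuInstance runs 0..9. */
    for (gpuInstance = 0; gpuInstance < 10; gpuInstance++)
    {
        /* Old check: fires for instances 8 and 9. */
        if (gpuInstance >= NV2080_MAX_SUBDEVICES)
            printf("old check asserts for gpuInstance %u\n", gpuInstance);

        /* Fixed check: never fires on a 10-GPU system. */
        if (gpuInstance >= NV_MAX_DEVICES)
            printf("new check asserts for gpuInstance %u\n", gpuInstance);
    }
    return 0;
}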

We'll include this fix in a future release.

As far as I can tell, this only has a minor impact on OpenGL display output when multiple GPUs have displays connected to them. It should be entirely irrelevant for CUDA workloads. Except for the dmesg spam, obviously.

NV bug reference: 4817640

@mtijanic (Collaborator) commented:

> I worked around that to test using dummy nvidia-open and nvidia-open-560 packages so I could install the closed driver package instead, and that driver works without error.

By the way, the error is present with the proprietary driver as well; it's just not routed to dmesg, so it isn't visible to the end user.

mtijanic added the NV-Triaged label Aug 21, 2024
@QuesarVII (Author) commented:

This patch resolved those errors. Reading the code, it looked like it should have been testing against max devices rather than max subdevices, but I wasn't sure. Thanks for the fix!
