Display freezing: Assertion Failed. Failed to process POST_EVENT with status 0x57 #739

Open
MeguminSama opened this issue Nov 17, 2024 · 22 comments
Labels
NV-Triaged An NVBug has been created for dev to investigate


@MeguminSama

MeguminSama commented Nov 17, 2024

Seemingly at random, one display will completely freeze until I reboot. The other displays continue to work just fine. If I restart the monitor, all of my remaining displays will freeze. All monitors are DisplayPort, connected to an RTX 3070 Ti.

Running on

Arch Linux
Wayland 1.23.1
KDE Plasma 6.2.3
Linux 6.11.7 (Zen Kernel)

dmesg output is just:

~ sudo dmesg
...
[  286.249080] NVRM: nvAssertFailedNoLog: Assertion failed: CliGetEventInfo(rpc_params->hClient, rpc_params->hEvent, &pEvent) @ kernel_gsp.c:462
[  286.249091] NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1003 (POST_EVENT) from GPU0: status=0x57

With the following NVIDIA packages installed:

~ paru -Q | grep nvidia
lib32-nvidia-utils 565.57.01-1
libva-nvidia-driver 0.0.13-1
nvidia-open-dkms 565.57.01-1
nvidia-utils 565.57.01-1
opencl-nvidia 565.57.01-1

Relevant code:

NV_ASSERT_OR_RETURN(CliGetEventInfo(rpc_params->hClient,
rpc_params->hEvent, &pEvent), NV_ERR_OBJECT_NOT_FOUND);
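
For anyone comparing their own logs against this report, here is a minimal sketch that pulls the fields out of the two NVRM lines above. The function and dictionary names are made up for illustration, and the mapping of 0x57 to NV_ERR_OBJECT_NOT_FOUND is only inferred from the macro shown above together with the `status=0x57` print, not taken from any official table.

```python
import re

# Assumption: 0x57 is the numeric value of NV_ERR_OBJECT_NOT_FOUND. This is
# inferred from this report only (the macro returns that code and the next
# log line prints status=0x57).
KNOWN_STATUS = {0x57: "NV_ERR_OBJECT_NOT_FOUND"}

ASSERT_RE = re.compile(
    r"NVRM: nvAssertFailedNoLog: Assertion failed: "
    r"(?P<expr>.+) @ (?P<file>[\w.]+):(?P<line>\d+)"
)
EVENT_RE = re.compile(
    r"NVRM: _kgspProcessRpcEvent: Failed to process received event "
    r"(?P<event>0x[0-9a-fA-F]+) \((?P<name>\w+)\) from GPU(?P<gpu>\d+): "
    r"status=(?P<status>0x[0-9a-fA-F]+)"
)


def parse_nvrm_line(line):
    """Return a dict describing an NVRM assert/event line, or None."""
    m = ASSERT_RE.search(line)
    if m:
        return {
            "kind": "assert",
            "expr": m["expr"],
            "file": m["file"],
            "line": int(m["line"]),
        }
    m = EVENT_RE.search(line)
    if m:
        status = int(m["status"], 16)
        return {
            "kind": "event",
            "event": int(m["event"], 16),
            "name": m["name"],
            "gpu": int(m["gpu"]),
            "status": status,
            "status_name": KNOWN_STATUS.get(status, "unknown"),
        }
    return None
```

Feeding it the output of `sudo dmesg` line by line and keeping the non-None results should be enough to tell whether a freeze matches this issue (same event name and status) or is something else.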

@foophoof

foophoof commented Nov 19, 2024

I think I have this issue too.

Running Bazzite 41 (from Fedora Kinoite), Linux 6.11.8-305.bazzite.fc41.x86_64, AMD Ryzen 9 5900X, NVIDIA GeForce RTX 3080 Ti, KDE Plasma 6.2.3, Wayland 1.23.0.

$ rpm -qa | /bin/grep nvidia
nvidia-gpu-firmware-20241110-1.fc41.noarch
ublue-os-nvidia-addons-0.10-1.fc41.noarch
libnvidia-ml-565.57.01-4.fc41.x86_64
libnvidia-cfg-565.57.01-4.fc41.x86_64
nvidia-driver-cuda-libs-565.57.01-4.fc41.x86_64
nvidia-persistenced-565.57.01-1.fc41.x86_64
nvidia-driver-libs-565.57.01-4.fc41.x86_64
nvidia-container-toolkit-base-1.17.2-1.x86_64
libnvidia-container1-1.17.2-1.x86_64
libnvidia-container-tools-1.17.2-1.x86_64
nvidia-modprobe-565.57.01-1.fc41.x86_64
kmod-nvidia-565.57.01-1.fc41.x86_64
nvidia-kmod-common-565.57.01-2.fc41.noarch
nvidia-driver-565.57.01-4.fc41.x86_64
nvidia-libXNVCtrl-565.57.01-1.fc41.x86_64
libnvidia-ml-565.57.01-4.fc41.i686
nvidia-settings-565.57.01-1.fc41.x86_64
xorg-x11-nvidia-565.57.01-4.fc41.x86_64
nvidia-driver-cuda-565.57.01-4.fc41.x86_64
nvidia-container-toolkit-1.17.2-1.x86_64
libnvidia-fbc-565.57.01-4.fc41.x86_64
libva-nvidia-driver-0.0.13^20241108git259b7b7-1.fc41.x86_64
nvidia-driver-libs-565.57.01-4.fc41.i686
nvidia-driver-cuda-libs-565.57.01-4.fc41.i686

systemd journal:

Nov 19 21:23:18 bazzite kwin_wayland[2537]: kwin_wayland_drm: atomic commit failed: Invalid argument
Nov 19 21:23:20 bazzite kwin_wayland[2537]: kwin_wayland_drm: atomic commit failed: Invalid argument
Nov 19 21:26:38 bazzite kwin_wayland[2537]: kwin_wayland_drm: atomic commit failed: Invalid argument
Nov 19 21:28:15 bazzite kwin_wayland[2537]: kwin_wayland_drm: atomic commit failed: Invalid argument
Nov 19 21:29:09 bazzite kernel: NVRM: nvAssertFailedNoLog: Assertion failed: CliGetEventInfo(rpc_params->hClient, rpc_params->hEvent, &pEvent) @ kernel_gsp.c:462
Nov 19 21:29:09 bazzite kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1003 (POST_EVENT) from GPU0: status=0x57
Nov 19 21:29:27 bazzite steam[3128]: ERROR: ld.so: object '/usr/lib/extest/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Nov 19 21:29:27 bazzite steam[3128]: ERROR: ld.so: object '/usr/lib/extest/libextest.so' from LD_PRELOAD cannot be preloaded (wrong ELF class: ELFCLASS32): ignored.
Nov 19 21:29:31 bazzite kwin_wayland[2537]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Nov 19 21:29:36 bazzite kwin_wayland[2537]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Nov 19 21:29:41 bazzite kwin_wayland[2537]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Nov 19 21:29:46 bazzite kwin_wayland[2537]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Nov 19 21:29:51 bazzite kwin_wayland[2537]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Nov 19 21:29:56 bazzite kwin_wayland[2537]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug

Not 100% sure exactly when in this log things froze, but it was probably at about 21:29, around the time of the assertion failure.
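
To put a number on the gap between the assertion and the first compositor complaint, something like the sketch below could be run over a saved journal excerpt. It assumes syslog-style "Mon DD HH:MM:SS" prefixes in an English locale, and the year has to be supplied by hand because the journal lines do not carry one; all names here are illustrative.

```python
from datetime import datetime

ASSERT_MARK = "nvAssertFailedNoLog"
PAGEFLIP_MARK = "Pageflip timed out"


def stamp(line, year=2024):
    """Parse the leading 'Mon DD HH:MM:SS' of a journal line."""
    # The year is an assumption passed in by the caller; the log omits it.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def seconds_from_assert_to_first_timeout(lines, year=2024):
    """Seconds between the first NVRM assert and the next pageflip timeout.

    Returns None if either marker is missing.
    """
    assert_time = None
    for line in lines:
        if assert_time is None and ASSERT_MARK in line:
            assert_time = stamp(line, year)
        elif assert_time is not None and PAGEFLIP_MARK in line:
            return (stamp(line, year) - assert_time).total_seconds()
    return None
```

On the excerpt above this compares the 21:29:09 assertion with the 21:29:31 pageflip timeout. The earlier "atomic commit failed" lines are deliberately ignored, since it is not clear from the log whether they belong to the same event.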

Edit: Perhaps notably, the game that was running on the monitor that froze still works fine if I move it to my second monitor, but the main monitor is stuck displaying the most recent frame.

@mtijanic
Collaborator

Hi! Thanks for the report! The assert itself looks like a race condition that shouldn't be fatal by itself, but it's probably indicative of the larger problem. Can you please run nvidia-bug-report.sh as soon as you hit this issue and attach the logs?

(also, please respect the bug template, it helps us with triage and makes it less likely your issues will get lost)
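
Since the freeze is hard to catch by hand, one way to run nvidia-bug-report.sh "as soon as you hit this issue" is to follow the kernel log and trigger the script on the first matching line. This is only a sketch: it assumes a systemd system with `journalctl`, that the script is on PATH, and that the whole thing runs as root. The function names are invented for this example.

```python
import subprocess

TRIGGER = "_kgspProcessRpcEvent: Failed to process received event"


def should_capture(line):
    """True for the NVRM event-failure line seen in this report."""
    return "NVRM" in line and TRIGGER in line


def watch_and_capture():
    """Follow new kernel messages and run the bug-report script once."""
    # -k: kernel messages only, -f: follow, -n 0: skip old lines,
    # -o cat: print the bare message text.
    follower = subprocess.Popen(
        ["journalctl", "-k", "-f", "-n", "0", "-o", "cat"],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        for line in follower.stdout:
            if should_capture(line):
                # The report is written by the script itself.
                subprocess.run(["nvidia-bug-report.sh"], check=False)
                break
    finally:
        follower.terminate()
```

Calling `watch_and_capture()` from a root shell (or a small systemd unit) leaves it waiting until the next occurrence; it exits after one capture so that repeated asserts do not overwrite the report.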

@ptr1337

ptr1337 commented Nov 20, 2024

I also hit this issue while compiling a kernel.

Only my second display was frozen - the "main" one was working.

[ 2244.525940] NVRM: nvAssertFailedNoLog: Assertion failed: CliGetEventInfo(rpc_params->hClient, rpc_params->hEvent, &pEvent) @ kernel_gsp.c:463
[ 2244.525948] NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1003 (POST_EVENT) from GPU0: status=0x57
[ 2294.820903] r8169 0000:0b:00.0: invalid VPD tag 0xff (size 0) at offset 0; assume missing optional EEPROM

nvidia-bug-report.log.gz

@MeguminSama
Author

@mtijanic apologies for not following the template - I opened the issue from the line of code where the error occurred, so didn't get shown the template flow.

I just had the issue occur again, so I've attached the log here.

nvidia-bug-report.log.gz

@mtijanic
Collaborator

Thanks! And the lack-of-template issue is entirely on GitHub. I knew this happened if you go through the new issue -> create account -> finish new issue flow, but this one is new to me.

I'll get back to you tomorrow on the log analysis, the one above from ptr1337 unfortunately didn't yield too much useful info.

@foophoof

As a workaround, I've found that if I switch to one of the console TTYs (usually Ctrl-Alt-F1 through F6, depending on distro) and then back again, things unfreeze.

@MeguminSama
Author

Switching TTY sometimes works for me, but other times it just freezes my whole system 😔
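
If the keyboard shortcut is awkward (for example when driving the machine over SSH), the same round trip can be scripted. The sketch below assumes the `fgconsole` and `chvt` tools from the kbd package are installed and that it runs as root; the helper names and the choice of VT 3 as the scratch console are arbitrary. As noted above, the switch itself has frozen the whole system for at least one reporter, so treat it as a last resort.

```python
import subprocess
import time


def build_vt_roundtrip(current_vt, scratch_vt=3):
    """Commands that switch to a scratch VT and then back again."""
    if scratch_vt == current_vt:
        raise ValueError("scratch VT must differ from the current VT")
    return [["chvt", str(scratch_vt)], ["chvt", str(current_vt)]]


def run_vt_roundtrip(pause_s=2.0):
    """Look up the active VT, then switch away and back (needs root)."""
    result = subprocess.run(
        ["fgconsole"], capture_output=True, text=True, check=True
    )
    current = int(result.stdout.strip())
    for cmd in build_vt_roundtrip(current):
        subprocess.run(cmd, check=True)
        # Give the compositor a moment to release or reacquire the display.
        time.sleep(pause_s)
```

Keeping the command list separate from the code that executes it makes it easy to print the plan first and only call `run_vt_roundtrip()` once it looks right.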

@MeguminSama
Author

@mtijanic just wondering if you managed to spot anything? No worries if you haven't gotten around to it yet though :)

@mtijanic
Collaborator

Hi! I only had a chance for a quick pass and didn't really see anything actionable. Over the next few days I'll try to do a more detailed pass over the logs. Interestingly, I did come across these exact prints:

Nov 19 21:29:31 bazzite kwin_wayland[2537]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Nov 19 21:29:36 bazzite kwin_wayland[2537]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Nov 19 21:29:41 bazzite kwin_wayland[2537]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Nov 19 21:29:46 bazzite kwin_wayland[2537]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug

while debugging #650, so the two might be related.

You did say:

> Seemingly at random,

What is the rough repro rate here? I might need to ask you to turn some debug knobs to get more data, and this would help figure out how. I don't want to put significant perf-reducing or extra-power-consumption changes on you if it means running them for days on end.

@MeguminSama
Author

MeguminSama commented Nov 27, 2024

Thanks - unfortunately I haven't found any way to reproduce it so far. Sometimes it'll happen daily, sometimes not at all for a week, with no apparent pattern.

I'm happy to turn on debugging stuff, just let me know what to do, and I can send more logs whenever the issue happens :)

@Zerwin

Zerwin commented Dec 1, 2024

I am also having this issue. A bit of info about my system:

Operating System: EndeavourOS 
KDE Plasma Version: 6.2.4
KDE Frameworks Version: 6.8.0
Qt Version: 6.8.0
Kernel Version: 6.12.1-arch1-1 (64-bit)
Graphics Platform: Wayland
Processors: 12 × AMD Ryzen 5 3600X 6-Core Processor
Memory: 15,5 GiB of RAM
Graphics Processor: NVIDIA GeForce RTX 3070/PCIe/SSE2

As for repro rate, I hit it every few days. It also isn't always on the same monitor: I have 3 monitors attached, with a mix of HDMI and DP, and by now every monitor has been affected at least once by this bug. If it happens again I will provide an nvidia bug report as soon as I can.
One thing I also noticed is that the screen froze exactly 1 second before the kernel log lines. I know this because the clock on my monitor conveniently froze, making it easy to compare.

@Marnnite

Marnnite commented Dec 5, 2024

I also have this issue.

NVIDIA Open GPU Kernel Modules Version
nvidia-open-dkms 565.57.01-2

Operating System and Version
Arch Linux x86_64

Kernel Release
Linux pc 6.12.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 22 Nov 2024 16:04:27 +0000 x86_64 GNU/Linux

Hardware: GPU
GPU 0: NVIDIA GeForce GTX 1660 SUPER (UUID: GPU-cf667222-685b-de84-42b1-e446593243a8)

nvidia-bug-report.log.gz

dmesg

[27236.924270] NVRM: nvAssertFailedNoLog: Assertion failed: CliGetEventInfo(rpc_params->hClient, rpc_params->hEvent, &pEvent) @ kernel_gsp.c:462
[27236.924276] NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1003 (POST_EVENT) from GPU0: status=0x57

Switched to tty

[29832.451157] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[29835.651395] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0

systemd journal

Dec 05 22:00:57 pc kernel: NVRM: nvAssertFailedNoLog: Assertion failed: CliGetEventInfo(rpc_params->hClient, rpc_params->hEvent, &pEvent) @ kernel_gsp.c:462
Dec 05 22:00:57 pc kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1003 (POST_EVENT) from GPU0: status=0x57
Dec 05 22:01:04 pc kwin_wayland[2483]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dec 05 22:01:09 pc kwin_wayland[2483]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dec 05 22:01:14 pc kwin_wayland[2483]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dec 05 22:01:24 pc kwin_wayland[2483]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dec 05 22:01:29 pc kwin_wayland[2483]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dec 05 22:01:34 pc kwin_wayland[2483]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug

Switched to tty

Dec 05 22:44:11 pc kwin_wayland[2483]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dec 05 22:44:15 pc kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
Dec 05 22:44:15 pc kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
Dec 05 22:44:15 pc kwin_wayland[2483]: kwin_wayland_drm: atomic commit failed: Permission denied
Dec 05 22:44:15 pc systemd[1]: Started Getty on tty3.

This happens every so often, usually while watching YouTube. I have 2 monitors, connected with DP and HDMI; sometimes only one of them freezes, sometimes both.

@Exotic0015

The issue is still present on 565.77.

@Zerwin

Zerwin commented Dec 8, 2024

I have an nvidia-bug-report now, after it happened on my main display again. What would be the best way to provide this file? I would rather not upload a file with that much data about my private PC to GitHub. Is there an email address or something similar I could send it to?

@tired-runner

I've also observed the same issue, though I haven't seen it in around one and a half weeks now. It's happened twice, both times while watching YouTube. I'm using 3 monitors connected through DisplayPort on a 3080.

@C0rn3j

C0rn3j commented Dec 9, 2024

565.77, RTX 4090, Arch Linux, 6.12.3.

2x DP monitors + 1x HDMI TV, which annoyingly forces screen reconnects a couple of times a day, so that may or may not be relevant.

My occurrences are also pretty sparse -

Nov 24 with some 565 beta driver:

[267363.354004] NVRM: nvAssertFailedNoLog: Assertion failed: CliGetEventInfo(rpc_params->hClient, rpc_params->hEvent, &pEvent) @ kernel_gsp.c:462
[267363.354015] NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1003 (POST_EVENT) from GPU0: status=0x57

Dec 9 with 565.77:

[148301.259066] NVRM: nvAssertFailedNoLog: Assertion failed: CliGetEventInfo(rpc_params->hClient, rpc_params->hEvent, &pEvent) @ kernel_gsp.c:462
[148301.259075] NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1003 (POST_EVENT) from GPU0: status=0x57
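
The bracketed dmesg prefix is the time since boot in seconds, so these two hits can be read as "how long the machine had been up". A small sketch for converting it is shown below; the helper name is made up here, and the figure should be taken as approximate because the kernel clock behind these stamps does not necessarily advance during suspend.

```python
import re
from datetime import timedelta

STAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def uptime_of(line):
    """Return the dmesg '[seconds.micros]' prefix as a timedelta, or None."""
    m = STAMP_RE.match(line)
    if not m:
        return None
    return timedelta(seconds=float(m.group(1)))
```

Printing `uptime_of(line)` for each assert line gives a days-and-hours figure that is easier to compare across reporters than raw seconds.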


@oDisMal

oDisMal commented Dec 9, 2024

Hi, just confirming the issue still persists. For over a week now, one random monitor will freeze, although it's very typically a secondary monitor (if not every time).
Ryzen 5900x
RTX 2080 - Driver Version: 565.77
Arch 6.12.3-arch1-1
KDE Wayland

Dec 09 18:38:06 bigbro kwin_wayland[2039]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dec 09 18:38:07 bigbro kwin_wayland[2039]: kwin_wayland_drm: atomic commit failed: Invalid argument

@mtijanic
Collaborator

FYI, tracked internally as nvbug 5004630

mtijanic added the NV-Triaged (An NVBug has been created for dev to investigate) label on Dec 11, 2024
@ClutchFred

I'm also experiencing this issue. Switching TTYs unfreezes the display. It happens most of the time while gaming (Minecraft), and it typically occurs 2-3 times in a short period. After that, everything runs fine, sometimes for several hours.

Does anyone know of a version where this bug is not present?

@ptr1337

ptr1337 commented Dec 14, 2024

> I'm also experiencing this issue. Switching TTYs unfreezes the display. It happens most of the time while gaming (Minecraft), and it typically occurs 2-3 times in a short period. After that, everything runs fine, sometimes for several hours.
>
> Does anyone know of a version where this bug is not present?

I think the 565 beta did not have the issue. Please provide the output of nvidia-bug-report.sh so there is more information to work with. This is very important, as it helps the developers track the problem down.

@butztill

butztill commented Dec 17, 2024

Hi, I also have this issue. I have an HDMI and a DP screen connected as well and am using the NVIDIA Open Kernel Modules 565.77.

Here are the logs:

Dez 17 19:57:53 butztill-PC kernel: NVRM: nvAssertFailedNoLog: Assertion failed: CliGetEventInfo(rpc_params->hClient, rpc_params->hEvent, &pEvent) @ kernel_gsp.c:462
Dez 17 19:57:53 butztill-PC kernel: NVRM: _kgspProcessRpcEvent: Failed to process received event 0x1003 (POST_EVENT) from GPU0: status=0x57
Dez 17 19:57:58 butztill-PC kwin_wayland[1953]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dez 17 19:58:03 butztill-PC kwin_wayland[1953]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dez 17 19:58:08 butztill-PC kwin_wayland[1953]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dez 17 19:58:13 butztill-PC kwin_wayland[1953]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dez 17 19:58:18 butztill-PC kwin_wayland[1953]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dez 17 19:58:23 butztill-PC kwin_wayland[1953]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dez 17 19:58:29 butztill-PC wireplumber[2009]: wp-event-dispatcher: <WpAsyncEventHook:0x5be669e88210> failed: <WpSiStandardLink:0x5be66a263da0> link failed: some node was destroyed before the link was created
Dez 17 19:58:29 butztill-PC wireplumber[2009]: wp-event-dispatcher: <WpAsyncEventHook:0x5be669e88210> failed: <WpSiStandardLink:0x5be66a00d850> link failed: some node was destroyed before the link was created
Dez 17 19:58:42 butztill-PC vesktop[2655]: 'loop->recurse > 0' failed at ../pipewire/src/pipewire/thread-loop.c:425 pw_thread_loop_wait()
Dez 17 19:58:42 butztill-PC kded6[2070]: Service  ":1.276" unregistered
Dez 17 19:58:43 butztill-PC kwin_wayland[1953]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dez 17 19:58:47 butztill-PC kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 1
Dez 17 19:58:48 butztill-PC kwin_wayland[1953]: kwin_wayland_drm: Pageflip timed out! This is a kernel bug
Dez 17 19:58:50 butztill-PC kernel: [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 1
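
Several reports mention that a different monitor freezes each time, and the nvidia-drm errors name a head number. A rough way to tally which heads time out in a saved log is sketched below. The function name is invented, and how a head number maps to a physical connector is not established anywhere in this thread, so the counts only show whether the same head keeps failing.

```python
import re
from collections import Counter

FLIP_RE = re.compile(
    r"\[nvidia-drm\] \[GPU ID (0x[0-9a-fA-F]+)\] "
    r"Flip event timeout on head (\d+)"
)


def flip_timeouts_by_head(lines):
    """Count 'Flip event timeout' errors per (GPU ID, head) pair."""
    counts = Counter()
    for line in lines:
        m = FLIP_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts
```

The regex ignores the timestamp prefix, so both dmesg-style and journal-style lines (including ones with localized month names, as above) can be passed in unchanged.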
