-
Notifications
You must be signed in to change notification settings - Fork 964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
deadlock in server while shutting down due to invalid free in v1.8.0 #532
Comments
I'm afraid I don't see much TigerVNC in any of those traces, so it might be a general Xorg bug. Can you reproduce this? And does valgrind give any more clues? |
Hi Pierre.
I can’t reproduce the bug, only the valgrind error is systematic. It just happened the first time today, and this version of tigervnc is included in an application that has been tested intensively over several months.
So I expect this issue to trigger very rarely, but when it does the vncserver –kill hung for many hours.
No further clues, valgrind reported only this one error.
Maybe the root cause is being too concerned about “proper shutdown”, that applies to cleanup of external resources (lock files, shared memory, semaphore sets etc…) it should IMHO not apply to heap usage, the process is going to exit and calling free is not only a waste of time but also risky (getting memory management completely correct is very hard). But I also understand that some existing services may only support the combination of external+internal resource cleanup, and in that case identifying and correcting the heap management bug may be the only solution.
Possibly vncserver –kill could be enhanced to provide a recovery for this kind of bug (in our application we often first ask a process nicely to stop, and if for whatever reason it does not obey for more than a reasonable timeout, e.g. 10 seconds, we report a clean shutdown failure and use kill -9 to avoid blocking the caller indefinitely).
From: Pierre Ossman (Work account) [mailto:[email protected]]
Sent: 31 October 2017 14:19
To: TigerVNC/tigervnc
Cc: VAN VLIERBERGHE Stef; Author
Subject: Re: [TigerVNC/tigervnc] deadlock in server while shutting down due to invalid free in v1.8.0 (#532)
I'm afraid I don't see much TigerVNC in any of those traces, so it might be a general Xorg bug. Can you reproduce this? And does valgrind give any more clues?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#532 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AftonFmSPg1mJBZ27r5rfvgFj4d3pQBJks5sxx4ygaJpZM4QMmaM>.
…____
This message and any files transmitted with it are legally privileged and intended for the sole use of the individual(s) or entity to whom they are addressed. If you are not the intended recipient, please notify the sender by reply and delete the message and any attachments from your system. Any unauthorised use or disclosure of the content of this message is strictly prohibited and may be unlawful.
Nothing in this e-mail message amounts to a contractual or legal commitment on the part of EUROCONTROL, unless it is confirmed by appropriately signed hard copy.
Any views expressed in this message are those of the sender.
|
Sure, But that's a band aid. We would still like to find the bug causing the crash. And that will be difficult without a reproducible test case. Have you tested with our binaries? |
Sorry, maybe I was not very clear. The vncserver did not crash, it got stuck due to the signal handler being called from within a critical section in the non-re-entrant clib malloc implementation, so when this signal handler made a re-entrant call to free(), the process went into deadlock (futex_wait) as it tried to lock a non-recursive mutex already held by the same thread. It did not give up, we killed it (using kill -9 <pid>) after we understood the issue after attaching a gdb.
The “kill” band aid is indeed only a means of making the server exit in the presence of bugs.
The free() call in run_exit_handlers is a bug, as there is clearly no strategy in place to avoid the signal handler could be called from within the critical section mentioned above. Such calls should only be allowed if there was a guarantee that they would never be re-entrant, but this is very hard to accomplish [e.g. uninstalling the signal handler before any clib call that could directly or indirectly call malloc()/calloc()…/free() and re-installing it after]. Not using dynamic memory in such signal handlers is the traditional cure.
Reducing heap management activity in the cleanup signal handlers would make the termination more robust.
The bug can be reproduced in the sense that valgrind identifies an invalid free. But you cannot expect the glibc free to *guarantee* a signal will occur when abusing the interface (such as double free). I’ve tried running vnc start/attach viewer for 3 seconds/stop in a loop but that also doesn’t reproduce, so the signal is produced rarely, or the viewer needs to activate some more server logic before the bug appears.
Yes, the issue and tests were done with your binaries (from tigervnc-1.8.0.x86_64):
vvl@dhws029> file /cm/ot/TOOL/GNU.22.0.0.5/build_G.17.IP.L7/generated/tigervnc/usr/bin/Xvnc
/cm/ot/TOOL/GNU.22.0.0.5/build_G.17.IP.L7/generated/tigervnc/usr/bin/Xvnc: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.4.0, BuildID[sha1]=dfa89f0e7dd5701dd41134d402eb6cd2732179dc, stripped
vvl@dhws029> ll /cm/ot/TOOL/GNU.22.0.0.5/build_G.17.IP.L7/generated/tigervnc/usr/bin/Xvnc
-rwxr-xr-x 1 cmadm softdev 7176896 17y05m16d_16H03M47S00 /cm/ot/TOOL/GNU.22.0.0.5/build_G.17.IP.L7/generated/tigervnc/usr/bin/Xvnc
vvl@dhws029> md5sum /cm/ot/TOOL/GNU.22.0.0.5/build_G.17.IP.L7/generated/tigervnc/usr/bin/Xvnc
4d0d76507a65399a7587a75a68cffcde /cm/ot/TOOL/GNU.22.0.0.5/build_G.17.IP.L7/generated/tigervnc/usr/bin/Xvnc
Hope this clarifies.
From: Pierre Ossman (Work account) [mailto:[email protected]]
Sent: 02 November 2017 14:39
To: TigerVNC/tigervnc
Cc: VAN VLIERBERGHE Stef; Author
Subject: Re: [TigerVNC/tigervnc] deadlock in server while shutting down due to invalid free in v1.8.0 (#532)
Sure, vncserver could probably be patched to be more aggressive. In this case it crashed though, so I'm not sure why it gave up.
But that's a band aid. We would still like to find the bug causing the crash. And that will be difficult without a reproducible test case.
Have you tested with our binaries?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#532 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AftonK4n6EusKUHSpjJmBUT8OZ0Xp3-eks5sycX5gaJpZM4QMmaM>.
…____
This message and any files transmitted with it are legally privileged and intended for the sole use of the individual(s) or entity to whom they are addressed. If you are not the intended recipient, please notify the sender by reply and delete the message and any attachments from your system. Any unauthorised use or disclosure of the content of this message is strictly prohibited and may be unlawful.
Nothing in this e-mail message amounts to a contractual or legal commitment on the part of EUROCONTROL, unless it is confirmed by appropriately signed hard copy.
Any views expressed in this message are those of the sender.
|
Ah, alright. The log suggested it terminated, but I guess it logged that line and then promptly locked up. :) I agree, such work should not be done from the signal handler. However we've inherited that code from Xorg, so there might be something fundamental preventing a cleanup. I'll have a look. |
Right, so the poor signal handling is indeed in Xorg code. It's not part of TigerVNC at all. I'm afraid that's a bug report that will have to be filed with them. What we might be able to fix is the initial corruption, provided you find some good way to reproduce the issue so we can find the offending code. |
Getting the issue again more frequently. Filed a bug with xorg: https://bugs.freedesktop.org/show_bug.cgi?id=106146 Still using Xvnc TigerVNC 1.8.0 - built May 16 2017 14:01:59, and still no clue what affects the frequency of this bug. |
Since the issue has been reported to Xorg, I don't think there is much more we can do in our end. |
We call: vncserver -kill :90
stdout:
stderr:
backtrace using gdb on core dump:
Looks like the first call to free (_int_free) raised a signal and the signal handler made a re-entrant call to _int_free again causing deadlock.
Probably heap corruption, valgrind might clarify.
A simple test shows an invalid free during shutdown:
Such a free may frequently "work" and exceptionally cause the signal.
Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.
The text was updated successfully, but these errors were encountered: