-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test MULTICORE_0004 sometimes hangs #44
Comments
This test is trying to check that the fifo kernel lock works. If the test is timing out, then that implies that the main thread isn't able to make progress, indicating that the fifo lock isn't working. |
Not sure if this will help or hinder future investigation, but I'll be disabling this test for clang on this board, because enabling it was only an experiment (the hyp+smp combination on that board used to be not enabled/supported before we tried it). |
In seL4/seL4#1109 (comment) a new failure mode for MULTICORE0004 popped up, triggering an assertion instead:
Unclear to me if this is related to the hanging issue or not. In any case it might be useful for debugging to know that this assertion can be triggered in the test. Edit: this is rare, but there was one more occurrence for seL4/seL4@3c2c5ba in this run for TQMA8XQP1GB_debug_SMP_MCS_clang_64. |
I'm not sure if it's a new failure mode, the others happened for release builds. In the test the tasks on other cores are calling |
Entirely possibly that it is the same problem (that would be great), I just hadn't seen this failure mode yet. |
Adding as an observation here that I have seen this most frequently in the combination SMP+MCS+clang in the last few months. |
Ok, so some more bizarre observations.
Both commits are entirely unrelated, 2648df4 is not even in any code that is called, and it is an equivalence transformation in the contexts it can be called. The |
All that you say seems to confirm what I suspect: The locking code is broken (sometimes miscompiled?), and whether that causes problems or not is timing dependent, so any small change can either trigger it or not. That also makes it very hard to figure out whether any change actually fixes the problem or not. It would be good to add a time check to see whether Because the locking code is inlined it is very hard to review the generated assembly: We don't know which path is buggy for sure. The one I checked seemed fine, as far as I can tell. I want to dust off seL4/seL4#871, make it out-of-line, convert it to stdatomic instead of compiler intrinsics and see if the bug still happens. It would be good to check the locking algorithm with cppmem too. I suspect rewriting to stdatomic will make the problem go away. |
Not sure if it's the same issue (might be relevant), but there was similar behaviour that we had to do this. The code has changed since then |
That commit seems to paper over a bug where |
The test is already marked as
flaky
. If at all possible we should make this non-flaky.There is also a comment inside the test that says it can never properly fail and would crash if not succesful. The hang behaviour we're seeing might be such a crash, and the issue might therefore be real (or just a race in the test).
I've seen this mostly come up on
sabre
, but not sure it was only there. (Edit: some specific configs areSABRE_verification_SMP_clang_32
*1 in 15 runs,IMX8MM_EVK_release_SMP_clang_64
* 2 in the same 15 runs)Edit: also frequently on
zynqm
andzynqmp106
forHYP
+clang
+SMP
.The text was updated successfully, but these errors were encountered: