
test MULTICORE_0004 sometimes hangs #44
Open
lsf37 opened this issue Aug 13, 2021 · 10 comments

@lsf37
Member

lsf37 commented Aug 13, 2021

The test is already marked as flaky. If at all possible we should make this non-flaky.

There is also a comment inside the test that says it can never properly fail and would crash if not successful. The hang behaviour we're seeing might be such a crash, and the issue might therefore be real (or just a race in the test).

I've seen this mostly come up on sabre, but I'm not sure it was only there. (Edit: some specific configs are SABRE_verification_SMP_clang_32, 1 in 15 runs, and IMX8MM_EVK_release_SMP_clang_64, 2 in the same 15 runs.)

Edit: also frequently on zynqm and zynqmp106 for HYP + clang + SMP.

@kent-mcleod
Member

This test is trying to check that the fifo kernel lock works. If the test is timing out, then that implies that the main thread isn't able to make progress, indicating that the fifo lock isn't working.
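For reference, a minimal sketch of what a FIFO (ticket) spinlock looks like, written with the GCC/clang `__atomic` builtins. This is only an illustration of the property the test exercises, not seL4's actual kernel lock implementation:

```c
/* Illustration only: a FIFO (ticket) spinlock, not seL4's kernel lock. */
typedef struct {
    unsigned int next_ticket;  /* next ticket to hand out */
    unsigned int now_serving;  /* ticket currently allowed to enter */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    /* Take the next ticket; acquirers are served strictly in FIFO order. */
    unsigned int ticket = __atomic_fetch_add(&l->next_ticket, 1, __ATOMIC_ACQUIRE);
    while (__atomic_load_n(&l->now_serving, __ATOMIC_ACQUIRE) != ticket) {
        /* spin until it is this core's turn */
    }
}

static void ticket_unlock(ticket_lock_t *l)
{
    /* Only the lock holder writes now_serving, so a plain read is fine here. */
    __atomic_store_n(&l->now_serving, l->now_serving + 1, __ATOMIC_RELEASE);
}
```

FIFO ordering is what gives the main thread a progress guarantee: once it has taken a ticket, only the finitely many earlier ticket holders can enter their critical sections before it gets the lock.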

@lsf37
Member Author

lsf37 commented Jun 1, 2023

Not sure if this will help or hinder future investigation, but IMX8MQ_EVK_release_SMP_hyp_clang_64 is reliably hanging on MULTICORE_0004. Most other boards show this only very sporadically. Might be a different reason here (works with gcc, for instance, or in debug mode), but I thought it worth pointing out.

I'll be disabling this test for clang on this board, because enabling it was only an experiment (the hyp+smp combination on that board was not enabled/supported before we tried it).

@lsf37
Member Author

lsf37 commented Oct 5, 2023

In seL4/seL4#1109 (comment) a new failure mode for MULTICORE0004 popped up, triggering an assertion instead:

<testcase classname="sel4test" name="MULTICORE0004">
  Running test MULTICORE0004 (Test core stalling is behaving properly (flaky))
  seL4 failed assertion 'isRunnable(NODE_STATE(ksCurThread))' at /github/workspace/kernel/include/kernel/thread.h:291 in function checkBudgetRestart
  halting...
  Kernel entry via Syscall, number: 11, Yield

ODROID_C4_debug_SMP_MCS_clang_64 FAILED 

Unclear to me if this is related to the hanging issue or not. In any case it might be useful for debugging to know that this assertion can be triggered in the test.

Edit: this is rare, but there was one more occurrence for seL4/seL4@3c2c5ba in this run for TQMA8XQP1GB_debug_SMP_MCS_clang_64.

@Indanz
Contributor

Indanz commented Oct 6, 2023

I'm not sure if it's a new failure mode; the others happened for release builds.

In the test the tasks on the other cores are calling seL4_Yield() in a loop, while the main thread suspends them and then cleans them up. Because the kernel entry is via a yield call, the assert must be triggered by the MCS_DO_IF_BUDGET() at the start of handleSyscall(). This assert seems to prove that remoteTCBStall() is broken.
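To make the scenario concrete, here is a rough sketch of the structure described above (not the exact sel4test code; `NUM_HELPERS` and `helper_tcb` are placeholders):

```c
#include <sel4/sel4.h>

#define NUM_HELPERS 3                       /* placeholder */
extern seL4_CPtr helper_tcb[NUM_HELPERS];   /* TCB caps of the helper threads (placeholder) */

/* Helpers pinned to the other cores yield in a loop, so every kernel entry
 * they make is "Syscall, number: 11, Yield", matching the log above. */
static void helper_fn(void)
{
    for (;;) {
        seL4_Yield();
    }
}

/* The main thread suspends each helper. Per the reasoning above, the kernel
 * must stall the remote core (remoteTCBStall()) before the suspend takes
 * effect; the assertion fires when a helper re-enters via Yield while its
 * thread is no longer runnable. */
static void suspend_helpers(void)
{
    for (int i = 0; i < NUM_HELPERS; i++) {
        seL4_TCB_Suspend(helper_tcb[i]);
    }
}
```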

@lsf37
Member Author

lsf37 commented Oct 8, 2023

Entirely possible that it is the same problem (that would be great); I just hadn't seen this failure mode before.

@lsf37
Member Author

lsf37 commented Jun 18, 2024

Adding as an observation here that I have seen this most frequently in the combination SMP+MCS+clang in the last few months.

@lsf37
Member Author

lsf37 commented Jun 21, 2024

Ok, so some more bizarre observations.

  • The seL4 commit c28e52a1ec made the test MULTICORE0004 hang almost reliably on the tqma board in the config SMP_MCS_release_clang (and only that config, all other combinations worked fine).
  • The CI test is built with the setting BAMBOO on, which sets CONFIG_PRINT_XML, which prints XML output and buffers that output. The exact same build with BAMBOO off works fine, I could not get the test to hang. With BAMBOO on, it hangs (almost?) every time.
  • The unrelated commit 2648df4 fixes it. I can't get the test to hang in any config any more (on 5 runs -- the other case above never needed more than 2).

Both commits are entirely unrelated to the test: 2648df4 is not even in any code that is called, and it is an equivalent transformation in the contexts in which it can be called. The BAMBOO setting doesn't have any influence on the kernel at all (it is not visible to the kernel), only on user-space sel4test, and this particular test does not print anything; the only printing calls are before and after it.

@Indanz
Contributor

Indanz commented Jun 21, 2024

All that you say seems to confirm what I suspect: The locking code is broken (sometimes miscompiled?), and whether that causes problems or not is timing dependent, so any small change can either trigger it or not. That also makes it very hard to figure out whether any change actually fixes the problem or not.

It would be good to add a time check to see whether MULTICORE_0004 takes too long to complete. If it doesn't finish quickly, that is also a sign that the locking is broken. Perhaps this would give us a reliable way to detect whether there is a problem, instead of relying on the current, rarer case of a total hang.
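For illustration, a hypothetical sketch of such a check (the helper name, the flag, and the bound are made up; the real test would use whatever progress signal it already has):

```c
#include <sel4/sel4.h>

#define MAX_SPINS 1000000UL   /* made-up bound on how long we are willing to wait */

/* Bound the wait so the test can report a failure instead of hanging. */
static int wait_bounded(volatile int *progress_flag)
{
    for (unsigned long i = 0; i < MAX_SPINS; i++) {
        if (*progress_flag) {
            return 0;          /* the other side made progress in time */
        }
        seL4_Yield();          /* let the contended core run */
    }
    return -1;                 /* took too long: report as a locking failure */
}
```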

Because the locking code is inlined, it is very hard to review the generated assembly: we don't know for sure which path is buggy. The one I checked seemed fine, as far as I can tell.

I want to dust off seL4/seL4#871, make it out-of-line, convert it to stdatomic instead of compiler intrinsics and see if the bug still happens. It would be good to check the locking algorithm with cppmem too. I suspect rewriting to stdatomic will make the problem go away.
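For a rough idea of the shape, the same ticket-lock sketch from earlier expressed with C11 stdatomic; this is only a guess at what such a conversion could look like, not the code in seL4/seL4#871:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    unsigned int ticket =
        atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_acquire);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != ticket) {
        /* spin until this core's ticket comes up */
    }
}

static void ticket_unlock(ticket_lock_t *l)
{
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```

Keeping these functions out-of-line, as suggested above, would also make the generated assembly much easier to audit, and a model checker like cppmem would only need to consider the acquire/release pairs shown here.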

@heshamelmatary
Contributor

Not sure if it's the same issue (it might be relevant), but there was similar behaviour for which we had to do this. The code has changed since then.

@Indanz
Contributor

Indanz commented Aug 5, 2024

> Not sure if it's the same issue (it might be relevant), but there was similar behaviour for which we had to do this. The code has changed since then.

That commit seems to paper over a bug where __atomic_exchange_n can take an indeterminate amount of time, so to me it sounds like a work-around for a broken locking implementation instead of a proper fix. I fixed that with this commit.
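For context on why an exchange-based acquire path can wait arbitrarily long (my illustration, not the code in either commit): a plain test-and-set lock built on __atomic_exchange_n gives no bound on how long an individual core may spin under contention, which is exactly what a FIFO lock is meant to rule out.

```c
/* Illustration only: a test-and-set lock built on __atomic_exchange_n.
 * Under contention a given core can keep losing the exchange, so its wait
 * time is unbounded -- unlike a FIFO (ticket) lock. */
static void tas_lock(volatile int *lock)
{
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE)) {
        /* keep retrying; there is no ordering among waiters */
    }
}

static void tas_unlock(volatile int *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```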
