
test MULTICORE_0004 sometimes hangs #44
Open
lsf37 opened this issue Aug 13, 2021 · 10 comments

@lsf37
Member

lsf37 commented Aug 13, 2021

The test is already marked as flaky. If at all possible we should make this non-flaky.

There is also a comment inside the test that says it can never properly fail and would crash if not successful. The hang behaviour we're seeing might be such a crash, and the issue might therefore be real (or just a race in the test).

I've seen this mostly come up on sabre, but I'm not sure it was only there. (Edit: some specific configs are SABRE_verification_SMP_clang_32, 1 in 15 runs, and IMX8MM_EVK_release_SMP_clang_64, 2 in the same 15 runs.)

Edit: also frequently on zynqm and zynqmp106 for HYP + clang + SMP.

@kent-mcleod
Member

This test is trying to check that the fifo kernel lock works. If the test is timing out, then that implies that the main thread isn't able to make progress, indicating that the fifo lock isn't working.
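For reference, a minimal sketch of what a FIFO (ticket) spinlock looks like, written with the GCC/clang `__atomic` builtins. This is only an illustration of the property the test exercises, not seL4's actual kernel lock implementation:

```c
/* Illustration only: a FIFO (ticket) spinlock, not seL4's kernel lock. */
typedef struct {
    unsigned int next_ticket;  /* next ticket to hand out */
    unsigned int now_serving;  /* ticket currently allowed to enter */
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    /* Take the next ticket; acquirers are served strictly in FIFO order. */
    unsigned int ticket = __atomic_fetch_add(&l->next_ticket, 1, __ATOMIC_ACQUIRE);
    while (__atomic_load_n(&l->now_serving, __ATOMIC_ACQUIRE) != ticket) {
        /* spin until it is this core's turn */
    }
}

static void ticket_unlock(ticket_lock_t *l)
{
    /* Only the lock holder writes now_serving, so a plain read is fine here. */
    __atomic_store_n(&l->now_serving, l->now_serving + 1, __ATOMIC_RELEASE);
}
```

FIFO ordering is what gives the main thread a progress guarantee: once it has taken a ticket, only the finitely many earlier ticket holders can enter their critical sections before it gets the lock.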

@lsf37
Member Author

lsf37 commented Jun 1, 2023

Not sure if this will help or hinder future investigation, but IMX8MQ_EVK_release_SMP_hyp_clang_64 is reliably hanging on MULTICORE_0004. Most other boards show this only very sporadically. Might be a different reason here (works with gcc, for instance, or in debug mode), but I thought it worth pointing out.

I'll be disabling this test for clang on this board, because enabling it was only an experiment (the hyp+smp combination on that board was not enabled/supported before we tried it).

@lsf37
Member Author

lsf37 commented Oct 5, 2023

In seL4/seL4#1109 (comment) a new failure mode for MULTICORE0004 popped up, triggering an assertion instead:

<testcase classname="sel4test" name="MULTICORE0004">
  Running test MULTICORE0004 (Test core stalling is behaving properly (flaky))
  seL4 failed assertion 'isRunnable(NODE_STATE(ksCurThread))' at /github/workspace/kernel/include/kernel/thread.h:291 in function checkBudgetRestart
  halting...
  Kernel entry via Syscall, number: 11, Yield

ODROID_C4_debug_SMP_MCS_clang_64 FAILED 

Unclear to me if this is related to the hanging issue or not. In any case it might be useful for debugging to know that this assertion can be triggered in the test.

Edit: this is rare, but there was one more occurrence for seL4/seL4@3c2c5ba in this run for TQMA8XQP1GB_debug_SMP_MCS_clang_64.

@Indanz
Contributor

Indanz commented Oct 6, 2023

I'm not sure if it's a new failure mode; the others happened for release builds.

In the test the tasks on the other cores are calling seL4_Yield() in a loop, while the main thread suspends them and then cleans them up. Because the kernel entry is via a yield call, the assert must be triggered by the MCS_DO_IF_BUDGET() at the start of handleSyscall(). This assert seems to prove that remoteTCBStall() is broken.
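To make the scenario concrete, here is a rough sketch of the structure described above (not the exact sel4test code; `NUM_HELPERS` and `helper_tcb` are placeholders):

```c
#include <sel4/sel4.h>

#define NUM_HELPERS 3                       /* placeholder */
extern seL4_CPtr helper_tcb[NUM_HELPERS];   /* TCB caps of the helper threads (placeholder) */

/* Helpers pinned to the other cores yield in a loop, so every kernel entry
 * they make is "Syscall, number: 11, Yield", matching the log above. */
static void helper_fn(void)
{
    for (;;) {
        seL4_Yield();
    }
}

/* The main thread suspends each helper. Per the reasoning above, the kernel
 * must stall the remote core (remoteTCBStall()) before the suspend takes
 * effect; the assertion fires when a helper re-enters via Yield while its
 * thread is no longer runnable. */
static void suspend_helpers(void)
{
    for (int i = 0; i < NUM_HELPERS; i++) {
        seL4_TCB_Suspend(helper_tcb[i]);
    }
}
```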

@lsf37
Member Author

lsf37 commented Oct 8, 2023

Entirely possible that it is the same problem (that would be great); I just hadn't seen this failure mode before.

@lsf37
Member Author

lsf37 commented Jun 18, 2024

Adding as an observation here that I have seen this most frequently in the combination SMP+MCS+clang in the last few months.

@lsf37
Member Author

lsf37 commented Jun 21, 2024

Ok, so some more bizarre observations.

  • The seL4 commit c28e52a1ec made the test MULTICORE0004 hang almost reliably on the tqma board in the config SMP_MCS_release_clang (and only that config, all other combinations worked fine).
  • The CI test is built with the setting BAMBOO on, which sets CONFIG_PRINT_XML, which prints XML output and buffers that output. The exact same build with BAMBOO off works fine, I could not get the test to hang. With BAMBOO on, it hangs (almost?) every time.
  • The unrelated commit 2648df4 fixes it. I can't get the test to hang in any config any more (on 5 runs -- the other case above never needed more than 2).

Both commits are entirely unrelated to the test: 2648df4 is not even in any code that is called, and it is an equivalent transformation in the contexts in which it can be called. The BAMBOO setting doesn't have any influence on the kernel at all (it is not visible to the kernel), only on user-space sel4test, and this particular test does not print anything; the only printing calls are before and after it.

@Indanz
Contributor

Indanz commented Jun 21, 2024

All that you say seems to confirm what I suspect: The locking code is broken (sometimes miscompiled?), and whether that causes problems or not is timing dependent, so any small change can either trigger it or not. That also makes it very hard to figure out whether any change actually fixes the problem or not.

It would be good to add a time check to see whether MULTICORE_0004 takes too long to complete. If it doesn't finish quickly, that is also a sign that the locking is broken. Perhaps this would give us a reliable way to detect whether there is a problem, instead of relying on the current, rarer case of a total hang.
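For illustration, a hypothetical sketch of such a check (the helper name, the flag, and the bound are made up; the real test would use whatever progress signal it already has):

```c
#include <sel4/sel4.h>

#define MAX_SPINS 1000000UL   /* made-up bound on how long we are willing to wait */

/* Bound the wait so the test can report a failure instead of hanging. */
static int wait_bounded(volatile int *progress_flag)
{
    for (unsigned long i = 0; i < MAX_SPINS; i++) {
        if (*progress_flag) {
            return 0;          /* the other side made progress in time */
        }
        seL4_Yield();          /* let the contended core run */
    }
    return -1;                 /* took too long: report as a locking failure */
}
```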

Because the locking code is inlined, it is very hard to review the generated assembly: we don't know for sure which path is buggy. The one I checked seemed fine, as far as I can tell.

I want to dust off seL4/seL4#871, make it out-of-line, convert it to stdatomic instead of compiler intrinsics and see if the bug still happens. It would be good to check the locking algorithm with cppmem too. I suspect rewriting to stdatomic will make the problem go away.
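For a rough idea of the shape, the same ticket-lock sketch from earlier expressed with C11 stdatomic; this is only a guess at what such a conversion could look like, not the code in seL4/seL4#871:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock_t;

static void ticket_lock(ticket_lock_t *l)
{
    unsigned int ticket =
        atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_acquire);
    while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != ticket) {
        /* spin until this core's ticket comes up */
    }
}

static void ticket_unlock(ticket_lock_t *l)
{
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```

Keeping these functions out-of-line, as suggested above, would also make the generated assembly much easier to audit, and a model checker like cppmem would only need to consider the acquire/release pairs shown here.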

@heshamelmatary
Contributor

Not sure if it's the same issue (it might be relevant), but there was similar behaviour for which we had to do this. The code has changed since then.

@Indanz
Contributor

Indanz commented Aug 5, 2024

> Not sure if it's the same issue (it might be relevant), but there was similar behaviour for which we had to do this. The code has changed since then.

That commit seems to paper over a bug where __atomic_exchange_n can take an indeterminate amount of time, so to me it sounds like a work-around for a broken locking implementation instead of a proper fix. I fixed that with this commit.
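For context on why an exchange-based acquire path can wait arbitrarily long (my illustration, not the code in either commit): a plain test-and-set lock built on __atomic_exchange_n gives no bound on how long an individual core may spin under contention, which is exactly what a FIFO lock is meant to rule out.

```c
/* Illustration only: a test-and-set lock built on __atomic_exchange_n.
 * Under contention a given core can keep losing the exchange, so its wait
 * time is unbounded -- unlike a FIFO (ticket) lock. */
static void tas_lock(volatile int *lock)
{
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE)) {
        /* keep retrying; there is no ordering among waiters */
    }
}

static void tas_unlock(volatile int *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```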
