Possible Linux kernel lock contention when running multiple judgedaemons per machine #2277

Open
taoky opened this issue Dec 6, 2023 · 4 comments

@taoky

taoky commented Dec 6, 2023

Description of the problem

When rejudging a large contest, or when receiving many submissions for problems with many testcases, some submissions may take much longer in wall time than in CPU time. With a short timelimit overshoot, these submissions might be judged as TLE even if they are correct.

This is actually what happened in a recent ICPC Asia Regional Contest (with ~350 teams and an easy problem with 50 testcases). After a lot of time spent bisecting the kernel and debugging, it turned out that contention on two global locks (shrinker_rwsem and cgroup_mutex) in kernels < 6.3 can, under heavy load, block kernel operations such as cgroup operations and page fault handling inside a memory cgroup for several seconds.

(This is fixed, or at least alleviated, by kernel commit torvalds/linux@da27f79.)

Although judgedaemon (runguard) cannot "fix" this issue in code, mentioning the kernel issue in the documentation could be helpful for server admins.
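As a rough illustration of what such a documentation note could suggest, here is a minimal Python sketch that checks whether the running kernel is older than 6.3, the threshold given above. The function names are hypothetical and not part of DOMjudge. The check only looks at the version number, so treat it as a heuristic: a distribution kernel may carry backported fixes.

```python
import platform
import re

# Assumption taken from the report above: kernels older than 6.3 are affected.
FIXED_IN = (6, 3)


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a string such as '5.15.0-89-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def may_be_affected(release: str) -> bool:
    """Return True if the version number is below the assumed fixed version."""
    return parse_kernel_release(release) < FIXED_IN


if __name__ == "__main__":
    release = platform.release()
    try:
        if may_be_affected(release):
            print(f"Kernel {release} is older than 6.3: lock contention is possible "
                  "when running many judgedaemons on this machine.")
        else:
            print(f"Kernel {release} is 6.3 or newer.")
    except ValueError as exc:
        print(f"Could not check kernel version: {exc}")
```

Versions are compared as integer tuples rather than strings, so that 6.10 is correctly treated as newer than 6.3.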

Your environment

  • DOMjudge/Webserver: any compatible version
  • OS: Ubuntu 22.04 with kernel 5.15 (default) or 6.2 (latest generic kernel in jammy repo)
  • Tested in a KVM guest with 32 cores and 21 or 30 judgedaemons, and on a bare-metal dual-CPU (40 cores) server with 21 judgedaemons.

Steps to reproduce

Submit a correct solution many times at once like:

for i in $(seq 1 1000); ~/Downloads/domjudge-8.2.2/submit/submit --url http://localhost:12345/ --contest test -y G.cpp; end

Then wait for judging to finish.

Expected behaviour

Reasonable judgehost system load, and no submission takes a wall time much longer than its CPU time.

Actual behaviour

Judgehost system load >= 2 * the number of judgedaemons. With timelimit overshoot set to 1s|10%, some submissions are judged as TLE even though they only take a very short CPU time. Judging is very slow.

Any other information that you want to share?

#2157 mentions that "the call cgroup_delete_cgroup_ext did sometimes hang for multiple seconds". I'm afraid that a double check of that contest's rejudging might be necessary to ensure no correct solutions were judged as TLE...
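One way to shortlist runs for such a double check is to flag those whose wall time is far above their CPU time. The Python sketch below assumes the per-run CPU and wall times have already been exported into simple records. The `Run` type, its field names, and both thresholds are hypothetical and would need tuning for a real contest.

```python
from dataclasses import dataclass


@dataclass
class Run:
    run_id: str       # hypothetical identifier of a testcase run
    cpu_time: float   # seconds of CPU time used by the submission
    wall_time: float  # seconds of wall-clock time


def suspicious_runs(runs, min_gap=0.5, ratio=2.0):
    """Return runs whose wall time exceeds CPU time by a wide margin.

    A run is flagged only if the absolute gap is at least `min_gap` seconds
    and the wall time is at least `ratio` times the CPU time. Both values
    are illustrative assumptions, not DOMjudge settings.
    """
    return [
        r for r in runs
        if r.wall_time - r.cpu_time >= min_gap
        and r.wall_time >= ratio * r.cpu_time
    ]


if __name__ == "__main__":
    sample = [
        Run("a", cpu_time=0.10, wall_time=0.12),  # normal
        Run("b", cpu_time=0.10, wall_time=3.40),  # likely stalled in the kernel
        Run("c", cpu_time=1.90, wall_time=2.00),  # slow but CPU-bound
    ]
    for run in suspicious_runs(sample):
        print(run.run_id, run.cpu_time, run.wall_time)
```

Requiring both an absolute gap and a ratio avoids flagging very short runs, where small scheduling jitter already produces a large relative difference.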

If you are interested in this specific kernel issue, I have also written a blog post (in Simplified Chinese) that explains it to the contestants affected in this regional contest and to server admins of later contests.

@nickygerritsen
Member

Thanks a lot for this big write-up. We normally advise against running many judgehosts on one machine (since there will always be some overhead), but it might indeed be worth mentioning this explicitly.

@summershrimp

summershrimp commented Dec 6, 2023

Since you mentioned that disabling CLONE_NEWIPC would fix this issue, how about using seccomp to restrict IPC-related syscalls rather than creating an IPC namespace?

@taoky
Author

taoky commented Dec 6, 2023

> Since you mentioned that disabling CLONE_NEWIPC would fix this issue, how about using seccomp to restrict IPC-related syscalls rather than creating an IPC namespace?

Theoretically yes, but it would be a bit difficult to list all IPC-related syscalls, and the potential side effects of using seccomp are unknown.

@eldering
Member

We should add this to our documentation and then close this issue, as there's nothing to fix on the DOMjudge side.
