-
Notifications
You must be signed in to change notification settings - Fork 198
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[not for merge] tidying threads and file descriptors at shutdown #3397
Draft
benclifford
wants to merge
7
commits into
master
Choose a base branch
from
benc-stdstreams-bugs
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+52
−12
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
benclifford
force-pushed
the
benc-stdstreams-bugs
branch
4 times, most recently
from
April 29, 2024 12:26
ef0f2bf
to
ea3f8f7
Compare
benclifford
force-pushed
the
benc-stdstreams-bugs
branch
from
July 1, 2024 13:53
ea3f8f7
to
af8da67
Compare
benclifford
added a commit
that referenced
this pull request
Jul 2, 2024
This stops that thread being left still running over the end of the test - which is needed in work in PR #3397 to ensure that no threads are left running at the end of a test. If that shutdown hangs, this test will now hang rather than leave the thread behind. Along with PR #3504, this eliminates any threads left behind after parsl/tests/test_providers/test_local_provider.py
benclifford
added a commit
that referenced
this pull request
Jul 3, 2024
This stops that thread being left still running over the end of the test - which is needed in work in PR #3397 to ensure that no threads are left running at the end of a test. If that shutdown hangs, this test will now hang rather than leave the thread behind. Along with PR #3504, this eliminates any threads left behind after parsl/tests/test_providers/test_local_provider.py
benclifford
force-pushed
the
benc-stdstreams-bugs
branch
from
July 31, 2024 18:56
af8da67
to
979b009
Compare
benclifford
force-pushed
the
benc-stdstreams-bugs
branch
from
August 8, 2024 10:55
979b009
to
7339d0a
Compare
benclifford
added a commit
that referenced
this pull request
Aug 8, 2024
This releases 2 file descriptors with work queue (from 21 to 19 at the end of CI Work Queue test) and 4 file descriptors with Task Vine (from 19 to 15 at the end of CI Task Vine test) This is part of work being merged from draft PR #3397 to shut down components more cleanly, rather than relying on process exit.
benclifford
added a commit
that referenced
this pull request
Aug 8, 2024
This releases 2 file descriptors with work queue (from 21 to 19 at the end of CI Work Queue test) and 4 file descriptors with Task Vine (from 19 to 15 at the end of CI Task Vine test) This is part of work being merged from draft PR #3397 to shut down components more cleanly, rather than relying on process exit.
benclifford
force-pushed
the
benc-stdstreams-bugs
branch
2 times, most recently
from
August 16, 2024 13:01
e2b0448
to
96f1e63
Compare
benclifford
force-pushed
the
benc-stdstreams-bugs
branch
from
September 5, 2024 20:28
96f1e63
to
a4e0c61
Compare
benclifford
added a commit
that referenced
this pull request
Oct 19, 2024
…es (#3640) This PR reduces the load places on the interchange and on the whole test environment caused by repeatedly querying the interchange for connected managers. It does this by increasing the period between such requests, from the default, every 20ms, to every 100ms. In the last few days, test_drain.py began failing often. I have seen it occasionally fail before. This was initially a problem in PR #3639 which is unrelated, but I recreated the problem in CI against master as of #3627. I investigated and found this behaviour causing the failure: * test_drain configures the drain period to be 1 second * startup of a worker pool was taking more than 1 second * the worker pool enters drain state, drains and terminates immediately on being fully started up. * test_py fails its assumption that there is a worker pool to inspect after waiting for there to be worker pool to inspect. This is the race condition: the assertion at line 57 is true but line 59 returns an empty managers list. ``` 57 try_assert(lambda: len(htex.connected_managers()) == 1, check_period_ms=CONNECTED_MANAGERS_POLL_MS) 58 59 managers = htex.connected_managers() 60 assert managers[0]['active'], "The manager should be active" ``` Looking at the CI logs for a failing case, I saw direct evidence that the worker pool takes more than 1 second to start up in `manager.log`: ``` 2024-10-18 10:31:16.007 parsl:914 29414 MainThread [INFO] Python version: 3.12.7 (main, Oct 1 2024, 15:17:50) [GCC 9.4.0] [...] 2024-10-18 10:31:16.008 parsl:151 29414 MainThread [INFO] Manager initializing [this is where the worker start time for drain purposes is measured] [...] 2024-10-18 10:31:16.011 parsl:183 29414 MainThread [INFO] Manager connected to interchange 2024-10-18 10:31:17.058 parsl:243 29414 MainThread [INFO] Will request drain at 1729247477.0087547 [...] 2024-10-18 10:31:17.073 parsl:336 29414 Task-Puller [DEBUG] ready workers: 0, pending tasks: 0 ``` There's more than a second delay between "... connected to interchange" and the subsequent message "Will request drain". There's not a huge amount of stuff happening between these lines but there are things like multiprocessing initialization which starts a new process. It looks like this bit of code is slow even in the successful case: rerunning until success, I see this timing in CI: ``` 2024-10-18 12:11:17.475 parsl:183 23062 MainThread [INFO] Manager connected to interchange 2024-10-18 12:11:18.181 parsl:243 23062 MainThread [INFO] Will request drain at 1729253478.4731379 ``` which is still a large fraction of a second (but sufficiently less than a second for the test to pass). I haven't investigated what is taking that time. I haven't investigated if I also see that on my laptop. I hypothesised that a lot of these test failures come from the test environment being quite loaded. I'm especially suspicious of using `try_assert` with its default timings which are very tight (20ms) - the connected managers RPC here would be expected to run much less frequently, more like every 5 seconds in regular Parsl use. So I lengthed the period of the try_asserts in this test, to try to reduce load caused there. That makes the test pass repeatedly again. Things not investigated/debugged: * why this is taking >0.5 second even in the successful case - it's possible that this is a reasonable startup time and so the test might be lengthened by a few seconds * how to do a test without being timing reliant - draining is, by its very nature, reliant on the passage of "real time". For example, you might mock (at the libc level if not at the Python level) system time. * what other loads are present on the system - one of the points of slowly-ongoing PR #3397 shutdown tidyup is to reduce thread load left behind by earlier tests
benclifford
force-pushed
the
benc-stdstreams-bugs
branch
from
November 29, 2024 12:43
6dc3a8c
to
254d8ba
Compare
benclifford
added a commit
that referenced
this pull request
Nov 29, 2024
This poll happens at the configured htex poll period, which defaults to 10ms. Under heavy result load, this shoudn't result in much additional load: the poll loop will already be looping a lot to process the results. Under lower result load, there is a slight observable increase in CPU usage: a 30second sleep task shows this before this PR: before: real 0m37.451s user 0m2.160s sys 0m0.376s run 2, user 2.160s run 3, user 2.116s and this after this PR: real 0m37.473s user 0m2.400s sys 0m0.557s Run 2, 2.457s Run 3, 2.452s At shutdown, the ZMQ socket for incoming results is closed. This reduces both the number of threads and number of file descriptors left behind by the `--config local` tests. For example: $ pytest parsl/tests/test_monitoring/ --config local Before this PR, at end of test: 32 threads, 451 fds open. After this PR, at end of test: 1 thread, 48 fds open. This is part of PR #3397 shutdown tidyup.
github-merge-queue bot
pushed a commit
that referenced
this pull request
Dec 2, 2024
This poll happens at the configured htex poll period, which defaults to 10ms. Under heavy result load, this shoudn't result in much additional load: the poll loop will already be looping a lot to process the results. Under lower result load, there is a slight observable increase in CPU usage: a 30second sleep task shows this before this PR: before: real 0m37.451s user 0m2.160s sys 0m0.376s run 2, user 2.160s run 3, user 2.116s and this after this PR: real 0m37.473s user 0m2.400s sys 0m0.557s Run 2, 2.457s Run 3, 2.452s At shutdown, the ZMQ socket for incoming results is closed. This reduces both the number of threads and number of file descriptors left behind by the `--config local` tests. For example: $ pytest parsl/tests/test_monitoring/ --config local Before this PR, at end of test: 32 threads, 451 fds open. After this PR, at end of test: 1 thread, 48 fds open. This is part of PR #3397 shutdown tidyup. # Description Please include a summary of the change and (optionally) which issue is fixed. Please also include relevant motivation and context. # Changed Behaviour nothing should be really visible to normal users. Increased CPU usage in the above documented situations. ## Type of change - New feature - Code maintenance/cleanup
This poll happens at the configured htex poll period, which defaults to 10ms. Under heavy result load, this shoudn't result in much additional load: the poll loop will already be looping a lot to process the results. Under lower result load, there is a slight observable increase in CPU usage: a 30second sleep task shows this before this PR: before: real 0m37.451s user 0m2.160s sys 0m0.376s run 2, user 2.160s run 3, user 2.116s and this after this PR: real 0m37.473s user 0m2.400s sys 0m0.557s Run 2, 2.457s Run 3, 2.452s At shutdown, the ZMQ socket for incoming results is closed. This reduces both the number of threads and number of file descriptors left behind by the `--config local` tests. For example: $ pytest parsl/tests/test_monitoring/ --config local Before this PR, at end of test: 32 threads, 451 fds open. After this PR, at end of test: 1 thread, 48 fds open. This is part of PR #3397 shutdown tidyup.
parsl/tests/test_radical/test_mpi_funcs.py ... logger.error(f"BENC: end open fds: {end_fds} (vs start {start_fds}") logger.error(f"BENC: end threads: {threading.active_count()}") > assert threading.active_count() == 1, "test left threads running: " + repr(threading.enumerate()) E AssertionError: test left threads running: [<_MainThread(MainThread, started 140532448974656)>, <Thread(Thread-18 ( _work), started daemon 140531866793664)>, <Thread(Thread-19 (_run_proxy), started daemon 140531858400960)>, <Thread(Thread-20 ( _monitor), started daemon 140531850008256)>, <Thread(Thread-21 (_work), started daemon 140531841615552)>, <Thread(Thread-23 (_w atch), started daemon 140531229243072)>, <Thread(Thread-24 (_listener), started daemon 140531220850368)>, <Thread(Thread-25 (_l istener), started daemon 140530985985728)>, <Thread(Thread-26 (_listener), started daemon 140530977593024)>, <Thread(Thread-27 (_listener), started daemon 140530969200320)>, <Thread(Thread-28 (_listener), started daemon 140530960807616)>, <Thread(Thread- 29 (_listener), started daemon 140530952414912)>, <Thread(Thread-31 (_listener), started daemon 140530390398656)>, <Thread(Thre ad-32 (_listener), started daemon 140530365220544)>, <Thread(Thread-33 (_listener), started daemon 140530113570496)>, <Thread(T hread-35 (_listener), started daemon 140530096785088)>, <Thread(Thread-36 (_listener), started daemon 140530071606976)>, <Threa d(Thread-38 (_listener), started daemon 140529627023040)>, <Thread(Thread-39 (_listener), started daemon 140529618630336)>, <Th read(Thread-40 (_bulk_collector), started daemon 140529610237632)>] E assert 20 == 1 E + where 20 = <function active_count at 0x7fd041fcfec0>() E + where <function active_count at 0x7fd041fcfec0> = threading.active_count
benclifford
force-pushed
the
benc-stdstreams-bugs
branch
from
December 14, 2024 19:19
254d8ba
to
e608c99
Compare
github-merge-queue bot
pushed a commit
that referenced
this pull request
Dec 17, 2024
This brings in fixes for a few issues that are fixed on the radical side of things: #3722 - a race condition on task completion #3708 - cleaner shutdown handling as part of #3397 #3646 - Python 3.13 support # Changed Behaviour whatever has changed in radical-pilot # Fixes #3722 ## Type of change - Bug fix
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
this is a patch stack in my dev environment where i have been pushing on cleanup of threads and file descriptors at shutdown, with the eventual goal of having pytest fixtures assert that only the main thread is running and the same number of file descriptors are open at the start and the end. I've been merging work from this already in small pieces and will continue to merge small tidyups from it.
There is at least one large part which needs some thought and discussion, which is how to shut down the htex queue processing thread - the engine of almost everything in htex that doesn't happen immediately on user task submission.