
Fix spin-loop/cleanup failure mode within run loop #42

Merged · 3 commits · Jun 27, 2024

Conversation

@smudge (Member) commented Jun 26, 2024

This ensures that exceptions raised in thread callback hooks are rescued and properly mark jobs as failed.

This is also a good opportunity to change the num argument (of work_off(num)) to mean number of jobs (give or take a few due to max_claims), not number of iterations. Previously (before threading was introduced) I think it meant number of jobs (though jobs and iterations were 1:1). I would not have done this before the refactor, because there was no guarantee that one of success or failure would be incremented (the thread might crash for many reasons). Now, we only increment success and treat total - success as the "failure" number when we return from the method.

Fixes #23 and #41

This is also a prereq for a resolution I'm cooking up for #36
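To make the new `work_off(num)` semantics concrete, here is a minimal, self-contained sketch. The `Worker` class, the `MAX_CLAIMS` constant, and the queue representation are hypothetical stand-ins, not the gem's actual API; the point is that `num` now caps the number of jobs (give or take a batch), only `success` is counted directly, and failures fall out as `total - success` at the end.

```ruby
# Illustrative sketch of the work_off(num) semantics described above.
# Worker, MAX_CLAIMS, and the Array-backed queue are stand-ins for the
# gem's internals, not its real API.
class Worker
  MAX_CLAIMS = 5

  def initialize(queue)
    @queue = queue # each entry is a callable "job"
  end

  # `num` means number of jobs (give or take MAX_CLAIMS), not iterations.
  def work_off(num = 100)
    success = 0
    total = 0

    while total < num
      jobs = @queue.shift(MAX_CLAIMS) # stand-in for the reserve query
      break if jobs.empty?            # stop when the queue runs dry

      total += jobs.length
      jobs.each { |job| success += 1 if run_job(job) }
    end

    # Only `success` is incremented along the way; failures are derived,
    # so a crashed thread still counts as a failure.
    [success, total - success]
  end

  def run_job(job)
    job.call
    true
  rescue StandardError
    false
  end
end
```

For example, a queue of three jobs where one raises would report `[2, 1]` regardless of how the failing job crashed.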

smudge added 2 commits June 26, 2024 18:26
This ensures that exceptions raised in thread callback hooks are rescued
and properly mark jobs as failed.

Fixes Betterment#23 and Betterment#41
(like in pre-threading implementation)
pool = Concurrent::FixedThreadPool.new(jobs.length)
jobs.each do |job|
  pool.post do
    run_thread_callbacks(job) do
@smudge (Member, Author) commented:

The explanation might make more sense if I put it here next to the related code.

Previously, if run_thread_callbacks crashed for any reason, neither success nor failure would be incremented. Furthermore, the job cleanup (down at the end of run(job)) would never occur, so attempts would not be incremented and run_at would not be bumped, making the job immediately available for pickup again.

This is what I mean by the "spinloop". A job gets picked up, its thread crashes, no cleanup occurs, it gets immediately picked up again, etc. Pushing the run_thread_callbacks call into run_job allows us to clean up the job if, e.g., it fails to deserialize, or if one of our callbacks fails to connect to a secondary resource (as was the case in #41).
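The failure mode and the fix can be sketched in a minimal, self-contained form. The `Job` struct, the simulated `run_thread_callbacks`, and the cleanup logic below are hypothetical stand-ins for the gem's internals; what matters is that the callback call now sits inside the rescued section, so a hook crash still records the error, increments `attempts`, and bumps `run_at` instead of leaving the job immediately claimable.

```ruby
# Stand-in Job: tracks attempts, next run time, and last error.
Job = Struct.new(:attempts, :run_at, :error) do
  def bump!(now)
    self.attempts += 1
    self.run_at = now + 30 # back off before the next pickup
  end
end

# Simulated callback hook: crashes on the first pickup (e.g. a lost
# connection to a secondary resource), succeeds afterwards.
def run_thread_callbacks(job)
  raise "callback hook crashed" if job.attempts.zero?
  yield
end

# The fix: hooks run inside the rescued section, so cleanup always occurs.
def run_job(job, now: Time.now)
  run_thread_callbacks(job) { :ran }
  true
rescue StandardError => e
  job.error = e
  job.bump!(now) # cleanup still happens, breaking the spin-loop
  false
end
```

With this shape, the first pickup fails but bumps `attempts` and `run_at`; the retry then succeeds rather than spinning.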

  true # did work
rescue DeserializationError => e
  job_say job, "FAILED permanently with #{e.class.name}: #{e.message}", 'error'

  job.error = e
  failed(job)
  false # work failed
@smudge (Member, Author) commented:

This was a bug that improperly reported deserialization errors in the success number, which only impacted the logging output.

@@ -8,7 +8,7 @@ jobs:
   strategy:
     fail-fast: false
     matrix:
-      ruby: ['2.6', '2.7', '3.0', '3.1', '3.2']
+      ruby: ['2.7', '3.0', '3.1', '3.2']
@smudge (Member, Author) commented:

We can make this more formal later, but I wanted to unblock this build for now, without the linter churn that comes from actually changing the minimum supported Ruby.

@samandmoore (Member) left a comment:

Change looks good as a solve for the issue. Just one question about the other behavior change!


-    num.times do
+    while total < num
@samandmoore (Member) commented:

I'm not sure how likely it is to matter, but this change means we will keep looping until we have seen num jobs, so if the queue becomes empty before we hit num, we will keep looping, right?

Is that okay or desired?

@smudge (Member, Author) replied:
There's a break if empty? below that covers that case as well, so we should only ever continue the loop if there are jobs being returned in the query (and that's consistent with the way it worked pre-threading too).

@samandmoore (Member) replied:
ah! i missed that. excellent.

@samandmoore (Member) left a comment:
domainlgtm

@samandmoore (Member) left a comment:
platformlgtm

@smudge smudge merged commit fd97a38 into Betterment:main Jun 27, 2024
18 checks passed
@smudge smudge deleted the fix-spinloop branch June 27, 2024 16:35
smudge added a commit that referenced this pull request Dec 18, 2024
… for now) (#48)

This is related to #41 and #42 insofar as a kind of DB-bound "spinloop"
is still possible if a worker picks up jobs that take so little time
that the worker immediately turns around and asks for more. Until now,
there has been no way to tune how long a worker should wait in
between _successful_ iterations of its run loop.

This introduces a configuration (`min_reserve_interval`) specifying a
minimum number of seconds (default: 0 as this is not a major release)
that a worker should wait in between _successful_ job reserve queries.
An existing config (`sleep_delay`) is still used to define the number of
seconds (default: 5) that a worker should wait in between _unsuccessful_
job reserve attempts (i.e. the queue is empty).

The job execution time is subtracted from `min_reserve_interval` when
the worker sleeps, and if jobs take more than `min_reserve_interval` to
complete, then the worker will not sleep before the next reserve query.
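The sleep arithmetic described above can be sketched as follows (the method name and signature are illustrative, not the gem's actual API): the elapsed job execution time is subtracted from the configured minimum, and a non-positive remainder means no sleep at all.

```ruby
# Illustrative sketch of the min_reserve_interval timing: subtract the
# time spent executing jobs, and never sleep for a negative duration.
# The method name is a hypothetical stand-in, not the gem's API.
def reserve_sleep_seconds(min_reserve_interval, job_execution_time)
  remaining = min_reserve_interval - job_execution_time
  remaining.positive? ? remaining : 0
end
```

With the default `min_reserve_interval` of 0, the remainder is never positive, so existing workers behave exactly as before.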

/no-platform
Successfully merging this pull request may close these issues.

Difference in error behavior when a job is undefined