Fix spin-loop/cleanup failure mode within run loop #42
Conversation
This ensures that exceptions raised in thread callback hooks are rescued and properly mark jobs as failed (like in the pre-threading implementation). Fixes Betterment#23 and Betterment#41
pool = Concurrent::FixedThreadPool.new(jobs.length)
jobs.each do |job|
  pool.post do
    run_thread_callbacks(job) do
The explanation might make more sense if I put it here next to the related code.

Previously, if `run_thread_callbacks` crashed for any reason, neither `success` nor `failure` would be incremented. Furthermore, the job cleanup (down at the end of `run(job)`) would never occur, so attempts would not be incremented and run_at would not be bumped, making the job immediately available for pickup again.

This is what I mean by the "spinloop". A job gets picked up, its thread crashes, no cleanup occurs, it gets immediately picked up again, etc. Pushing the `run_thread_callbacks` call into `run_job` allows us to clean up the job if, e.g., it fails to deserialize, or if one of our callbacks fails to connect to a secondary resource (as was the case in #41).
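To illustrate the shape of the change (a sketch of the idea only, not the actual diff): once the callback invocation happens inside the method that already rescues and cleans up, a crashing hook takes the same failure path as a crashing payload. Every name and body below is illustrative, not the gem's real implementation.

```ruby
# Sketch only: stripped-down stand-in for the worker, showing why moving the
# callback invocation inside run_job prevents the spinloop described above.
class SketchWorker
  def run_thread_callbacks(job)
    raise "hook failed (e.g. lost connection to a secondary resource)" if job[:bad_hook]
    yield
  end

  def run(job)
    job[:ran] = true
  end

  def failed(job)
    job[:failed] = true # stand-in for bumping attempts/run_at and recording the error
  end

  def run_job(job)
    run_thread_callbacks(job) { run(job) } # hook errors now raise inside run_job...
    true # did work
  rescue StandardError => e
    job[:error] = e # ...so they are rescued here and the job is marked failed,
    failed(job)     # rather than being left untouched and immediately re-reserved
    false # work failed
  end
end

worker = SketchWorker.new
p worker.run_job({ bad_hook: true })  # => false (failure is recorded; no spinloop)
p worker.run_job({ bad_hook: false }) # => true
```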
  true # did work
rescue DeserializationError => e
  job_say job, "FAILED permanently with #{e.class.name}: #{e.message}", 'error'
  job.error = e
  failed(job)
  false # work failed
This was a bug that improperly reported deserialization errors in the `success` number, which only impacted the logging output.
@@ -8,7 +8,7 @@ jobs:
   strategy:
     fail-fast: false
     matrix:
-      ruby: ['2.6', '2.7', '3.0', '3.1', '3.2']
+      ruby: ['2.7', '3.0', '3.1', '3.2']
we can make this more formal, but I wanted to unblock this build for now, without the linter churn that comes from actually changing the min supported Ruby.
Change looks good as a solve for the issue. Just one question about the other behavior change!
-      num.times do
+      while total < num
So I'm not sure how likely it is to matter, but this change means that we will keep looping until we see `num` jobs, which means that if the queue becomes empty before we hit `num` we will keep looping, right?
Is that okay or desired?
There's a `break if empty?` below that covers that case as well, so we should only ever continue the loop if there are jobs being returned in the query (and that's consistent with the way it worked pre-threading too).
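For readers following along, here is a hedged, self-contained sketch of the loop shape under discussion; the fake queue and helper methods are purely illustrative stand-ins, not the gem's internals.

```ruby
# Sketch of the `while total < num` loop plus the `break if empty?` guard:
# keep looping until roughly num jobs have been seen, but stop early once the
# queue runs dry. All names and data below are illustrative.
QUEUE = Array.new(7) { |i| { id: i, ok: i.even? } } # 7 fake jobs; odd ids "fail"

def reserve_jobs(max_claims = 3)
  QUEUE.shift(max_claims) # illustrative stand-in for the reserve query
end

def run_job(job)
  job[:ok] # illustrative stand-in: true when the job "succeeds"
end

def work_off(num = 100)
  total = 0
  success = 0
  while total < num              # keep looping until ~num jobs have been seen...
    jobs = reserve_jobs
    break if jobs.empty?         # ...but bail out as soon as the queue is empty
    success += jobs.count { |job| run_job(job) }
    total += jobs.length
  end
  [success, total - success]     # failures = total - success
end

p work_off(5) # => [3, 3] here: give or take a few jobs due to the claim batch size
```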
ah! i missed that. excellent.
domainlgtm
platformlgtm
… for now) (#48)
This is related to #41 and #42 insofar as a kind of DB-bound "spinloop" is still possible if a worker picks up jobs that take so little time that the worker immediately turns around and asks for more. As of now, there has been no way to tune the amount of time a worker should wait in between _successful_ iterations of its run loop. This introduces a configuration (`min_reserve_interval`) specifying a minimum number of seconds (default: 0, as this is not a major release) that a worker should wait in between _successful_ job reserve queries. An existing config (`sleep_delay`) is still used to define the number of seconds (default: 5) that a worker should wait in between _unsuccessful_ job reserve attempts (i.e. the queue is empty). The job execution time is subtracted from `min_reserve_interval` when the worker sleeps, and if jobs take more than `min_reserve_interval` to complete then the worker will not sleep before the next reserve query. /no-platform
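To make the timing behavior concrete, here is a rough, hedged sketch of the arithmetic described above. Only the `min_reserve_interval` and `sleep_delay` settings come from the text; `work_one_batch` and everything else are assumed, illustrative names rather than the gem's API.

```ruby
min_reserve_interval = 1.0 # minimum seconds between successful reserve queries (default: 0)
sleep_delay          = 5   # seconds to wait after an unsuccessful reserve (queue empty)

def work_one_batch
  sleep(rand * 0.2) # pretend the batch took a little time to run
  rand(0..3)        # pretend 0..3 jobs were reserved and worked
end

3.times do # the real worker loops until stopped; bounded here so the sketch terminates
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  jobs_worked = work_one_batch
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  if jobs_worked.positive?
    # Successful iteration: sleep only for the portion of min_reserve_interval
    # that the jobs themselves did not already consume.
    remaining = min_reserve_interval - elapsed
    sleep(remaining) if remaining.positive?
  else
    sleep(sleep_delay) # empty queue: back off by the existing sleep_delay
  end
end
```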
This ensures that exceptions raised in thread callback hooks are rescued and properly mark jobs as failed.
This is also a good opportunity to change the `num` argument (of `work_off(num)`) to mean number of jobs (give or take a few due to `max_claims`), not number of iterations. Previously (before threading was introduced) I think it meant number of jobs (though jobs and iterations were 1:1). I would not have done this before the refactor, because there was no guarantee that one of `success` or `failure` would be incremented (the thread might crash for many reasons). Now, we only increment `success` and treat `total - success` as the "failure" number when we return from the method.

Fixes #23 and #41

This is also a prereq for a resolution I'm cooking up for #36
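For context, a small illustrative example of how a caller might read the new return value, assuming the method still returns a `[success, failure]` pair as the counting described above implies (`worker` here is hypothetical):

```ruby
success, failure = worker.work_off(100) # ~100 jobs, give or take a few due to max_claims
puts "worked #{success + failure} jobs: #{success} succeeded, #{failure} failed"
```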