-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ReentrantLock: wakeup a single task on unlock and add a short spin #56814
base: master
Are you sure you want to change the base?
ReentrantLock: wakeup a single task on unlock and add a short spin #56814
Conversation
# Instead, the backoff is simply capped at a maximum value. This can be | ||
# used to improve throughput in `compare_exchange` loops that have high | ||
# contention. | ||
@inline function spin(iteration::Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this better than just calling yield
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case we do not want to leave the core but instead just busy wait in-core for a small number of iterations before we attempt the compare_exchange
again, this way if the critical section of the lock is small enough we have a chance to acquire the lock without paying for a OS thread context switch (or a Julia scheduler task switch if you mean Base.yield
).
This is the same strategy employed by Rust here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant Base.yield
. I think we're in a different situation than Rust since we have M:N threading and a user mode scheduler which Rust doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am aware of the differences but the one that matters most in this case is that Julia has a cooperative scheduler. This means we have a tradeoff between micro-contention throughput (where we want to stay in-core) and being nice to the other tasks (by calling Base.yield
).
So I wrote a benchmark that measures the total amount of forward progress that can be made both by the tasks participating in locking as well as other unrelated tasks to see where we reach a good balance in this tradeoff. It turns out yielding to the Julia scheduler up to a limited amount seems to work great and does not suffer from the pathological case of always spinning in the scheduler (the first example in the PR description).
I will update the code and post benchmark results soon-ish (my daughter arrived yesterday!).
Thanks, @gbaraldi for working on benchmarks also, maybe I can contribute this new benchmark to your LockBench.jl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrebsguedes It would be lovely to add more benchmarks to it. Having a suite of benchmarks that stress locks in different ways would be great.
(this PR is fantastically written! professional, comprehensive, and easy to follow 👏👏) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Very well written PR.
# parked if it wants to lock the lock, but it is currently being held by some other task. | ||
const PARKED_BIT = 0b10 | ||
|
||
const MAX_SPINS = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in log2
? The optimal number of spins is chip dependent if I recall the WTF code correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will change to MAX_SPINS_LOG2
.
WTF definitely goes the extra mile of optimizing the spin limit, but parking_lot
seems to be perform fine with a hardcoded 10 so I just used what they settled on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, the number of spins is 10 (cmpxchg attempts), its just the amount of work per spin that increases in each iteration to introduce some backoff.
# contention. | ||
@inline function spin(iteration::Int) | ||
next = iteration >= MAX_SPINS ? MAX_SPINS : iteration + 1 | ||
for _ in 1:(1 << next) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code feels a bit "unJulian" I would use max
and 2^n
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! Will change.
Another point of comparison could be https://github.com/kprotty/usync which is just a "normal" lock |
I wrote https://github.com/gbaraldi/LockBench.jl and there it seems to be a good upgrade |
else | ||
# There are more tasks on the queue, we unlock and keep the parked bit set | ||
@atomic :release rl.havelock = PARKED_BIT | ||
notify(cond_wait, all=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's already a notify
call above; isn't this one going to unnecessarily schedule another task?
I propose a change in the implementation of the
ReentrantLock
to improve its overall throughput for short critical sections and fix the quadratic wake-up behavior where each unlock schedules all waiting tasks on the lock's wait queue.This implementation follows the same principles of the
Mutex
in the parking_lot Rust crate which is based on the Webkit WTF::ParkingLot class. Only the basic working principle is implemented here, further improvements such as eventual fairness will be proposed separately.The gist of the change is that we add one extra state to the lock, essentially going from:
To:
In the current implementation we must schedule all tasks to cause a conflict (state 0x2) because on unlock we only notify any task if the lock is in the conflict state. This behavior means that with high contention and a short critical section the tasks will be effectively spinning in the scheduler queue.
With the extra state the proposed implementation has enough information to know if there are other tasks to be notified or not, which means we can always notify one task at a time while preserving the optimized path of not notifying if there are no tasks waiting. To improve throughput for short critical sections we also introduce a bounded amount of spinning before attempting to park.
Results
Not spinning on the scheduler queue greatly reduces the CPU utilization of the following example:
Current:
Proposed:
In a micro-benchmark where 8 threads contend for a single lock with a very short critical section we see a ~2x improvement.
Current:
Proposed:
In the uncontended scenario the extra bookkeeping causes a 10% throughput reduction:EDIT: I reverted _trylock to the simple case to recover the uncontended throughput and now both implementations are on the same ballpark (without hurting the above numbers).
In the uncontended scenario:
Current:
Proposed:
Closes #56182