Bron-Kerbosch attestation aggregation #4507
base: unstable
Conversation
Force-pushed from 5329233 to f39becd
I think it should be ready for a first pass. Overview:
Also, the changes left a couple of previously used functions unused. Should I go ahead and delete them?
I think the general ideas are there; however, there are a few tests that stall forever that I'm working on:
I also tried what I thought would be an optimization: going through the cliques and filtering out any that are a subset of another clique before aggregating the attestations. That didn't seem to improve the situation, so I reverted it for now. EDIT: This is no longer an issue.
Force-pushed from b0da5a1 to af26f34
…sh from a static graph
…e attestations in the op pool until calculating the attestations for block inclusion
…have not tested yet
Force-pushed from b9c8fae to d2bd6af
…gregation Fix compute_neighbourhoods
…estation_aggregation
I was testing the performance with … This is even with max-aggregates-per-data set to 1! I realised that the … I'll post an updated graph once it's had a chance to warm up for a bit. The time seems to be more like the 30ms we see for regular nodes 👌
I think this is ready to go. I've been running it on one of our Holesky beacon nodes for the last few days without issue. If anything it's a little bit faster than `unstable`.

Here's the median attestation packing time for this PR (green) vs unstable (purple): [graph]

Here's the 95th percentile: [graph]

Here's total block production time (median over 12h): [graph]

95th percentile over 12h: [graph]

Investigating that spike with a shorter interval shows it's a spike to 2s+. This doesn't really correspond to a spike in attestation packing time, so I suspect it's another issue (lord knows we have state issues while packing blocks).

The version of …

I'll let @paulhauner do a once-over of the Rayon stuff (because we've been bitten before), and then we can merge.
Logs from the spike also seem to show that it was actually only a spike to 1.2s, but even with histogram buckets I don't quite see how that makes any sense
Heyo, looking very impressive!
I had a look at just the Rayon code and I think there's a potential for a deadlock in there (see my comment).
.map(|(data, aggregates)| {
    let aggregates: Vec<&CompactIndexedAttestation<T>> = aggregates
        .iter()
        .filter(|_| validity_filter(checkpoint_key, data))
I think we're at risk of the deadlock described in this HackMD (see also #1562).

The `validity_filter` function can try to access the fork choice read lock:

`let fork_choice_lock = self.canonical_head.fork_choice_read_lock();`

That means this function would behave the same as the `rest_api::validator::publish_attestations` function I describe in the HackMD.

In terms of a workaround, I'm not fully across this PR, but I feel like the parallel functionality is most important for the second `map` fn (the one which calls `bron_kerbosch`). So perhaps the `filter` and first `map` could happen in a good ol' fashioned serial iter?
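A rough sketch of the shape I mean, using hypothetical stand-in types (`Data`, `Agg`, `expensive_cliques`) rather than the PR's real `AttestationData`/`CompactIndexedAttestation`/`bron_kerbosch`: the filtering (which may take the fork choice read lock) stays on the calling thread, and only the CPU-bound clique step goes into the Rayon pool.

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for the PR's types and functions.
#[derive(Clone)]
struct Data(u64);
#[derive(Clone)]
struct Agg(u64);

fn validity_filter(data: &Data) -> bool {
    // Imagine this takes the fork choice *read* lock; keeping it on the
    // caller's thread means no lock is acquired from inside the Rayon pool.
    data.0 % 2 == 0
}

fn expensive_cliques(aggregates: &[Agg]) -> Vec<Vec<Agg>> {
    // Placeholder for the CPU-heavy clique computation.
    vec![aggregates.to_vec()]
}

fn pack(groups: Vec<(Data, Vec<Agg>)>) -> Vec<Vec<Agg>> {
    // Serial: the lock-touching filter runs on this thread only.
    let filtered: Vec<(Data, Vec<Agg>)> = groups
        .into_iter()
        .filter(|(data, _)| validity_filter(data))
        .collect();

    // Parallel: only lock-free, CPU-bound work runs in the Rayon pool.
    filtered
        .par_iter()
        .flat_map(|(_, aggregates)| expensive_cliques(aggregates))
        .collect()
}

fn main() {
    let groups = vec![(Data(0), vec![Agg(1), Agg(2)]), (Data(1), vec![Agg(3)])];
    println!("{} cliques", pack(groups).len());
}
```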
Hmm, good point. I'm trying to understand how we can deadlock if we are only reading and never holding a write lock while spawning a Rayon task. I think we are safe because we only have reads (item 2 from the HackMD) and not writes (item 1).

Reading the Rayon docs a bit more, I think how the deadlock happens is this:

- Thread 1 obtains the write lock, then spawns a Rayon job on the Rayon thread pool and blocks waiting for it to complete.
- Concurrently, another thread spawns a bunch of jobs on the Rayon thread pool which try to grab the read lock.
- Rayon's pool "schedules" (= "steals work such that") all of the reading jobs begin executing, but not the job spawned by thread 1, which is required for the lock to be released. Unlike `async` tasks in Tokio-land, there is no implicit yielding from Rayon jobs once they start executing.
- Therefore we get a deadlock: all the threads in the pool are blocked waiting for a job that can only execute in the pool, and which will therefore never execute.

Without holding an exclusive lock/mutex while spawning, I think we are safe. If all the jobs in the pool are just grabbing read locks and there is no other thread holding a write lock while it waits for a Rayon job to complete, then we are OK. Even if we have other (non-Rayon) threads grabbing the write lock, this is OK too, as they will be run concurrently by the OS, and the write lock will eventually be released, allowing the Rayon threads that are reading to make progress.
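To make the failure mode concrete, here's a minimal standalone sketch of the dangerous pattern, nothing Lighthouse-specific, just a shared `RwLock` and the global Rayon pool. It's scheduling-dependent, but with a small `RAYON_NUM_THREADS` it will usually hang, which is exactly the mechanism described above.

```rust
use rayon::prelude::*;
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let shared = Arc::new(RwLock::new(0u64));

    // "Thread 1": takes the *write* lock, then blocks on a Rayon job while
    // still holding it. This is the dangerous pattern.
    let writer = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            let mut guard = shared.write().unwrap();
            // Give the readers time to fill the pool with blocked jobs.
            thread::sleep(Duration::from_millis(100));
            // This work is injected into the global pool and can only run
            // once a pool thread becomes free.
            let sum: u64 = (0..1_000u64).into_par_iter().sum();
            *guard += sum;
        })
    };

    // Another thread fills the pool with jobs that each need the *read*
    // lock. They occupy every pool thread and block, because the writer
    // holds the write lock and won't release it until its own pool job
    // (which can now never be scheduled) completes.
    let reader = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            (0..1_000u64).into_par_iter().for_each(|_| {
                let _value = *shared.read().unwrap();
            });
        })
    };

    writer.join().unwrap();
    reader.join().unwrap();
    // Only reached if the scheduler happened to avoid the deadlock.
    println!("no deadlock this time: {}", *shared.read().unwrap());
}
```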
I went looking for places where we are already locking an `RwLock`/`Mutex` inside a Rayon job, and actually the current version of the op pool uses `rayon::join` on two lazy iterators that evaluate the same `validity_filter` (and therefore obtain a read lock on fork choice from within the Rayon pool):

lighthouse/beacon_node/operation_pool/src/lib.rs, lines 307 to 325 in 441fc16:
let (prev_cover, curr_cover) = rayon::join(
    move || {
        let _timer = metrics::start_timer(&metrics::ATTESTATION_PREV_EPOCH_PACKING_TIME);
        // If we're in the genesis epoch, just use the current epoch attestations.
        if prev_epoch_key == curr_epoch_key {
            vec![]
        } else {
            maximum_cover(prev_epoch_att, prev_epoch_limit, "prev_epoch_attestations")
        }
    },
    move || {
        let _timer = metrics::start_timer(&metrics::ATTESTATION_CURR_EPOCH_PACKING_TIME);
        maximum_cover(
            curr_epoch_att,
            T::MaxAttestations::to_usize(),
            "curr_epoch_attestations",
        )
    },
);
i.e. we are already doing this and it is not deadlocking.

Additionally, in Milhouse we are recursively obtaining a write lock inside Rayon jobs here:

https://github.com/sigp/milhouse/blob/6c82029bcbc656a3cd423d403d7551974818d45d/src/tree.rs#L430-L433

Mac and I played around with that code trying to use upgradable locks and quickly hit deadlocks, which I think implies the current pattern is safe (see our attempts in sigp/milhouse#25, sigp/milhouse#29).

TL;DR: I think that Rayon is only dangerous if you hold a lock while spawning jobs into the pool.
This was what I previously thought, but I'd never understood why this was the case and thought it had something to do with thread-local storage, context switches and mutexes. I find the explanation above quite satisfactory and hope you do too!
Also, if we were to remove the `par_iter` we'd need a collect, as @GeemoCandama showed in 376c408. If you agree with my analysis I reckon we revert that commit 🙏
Thanks for reviewing!
> Without holding an exclusive lock/mutex while spawning I think we are safe.
I think we hold an exclusive lock whilst spawning here:
lighthouse/beacon_node/beacon_chain/src/beacon_chain.rs, lines 1500 to 1502 in 2bc9115:
fork_choice
    .on_block(current_slot, block, block_root, &state)
    .map_err(|e| BlockError::BeaconChainError(e.into()))?;
We take a write-lock on fork choice, then we call `on_block`, which might read a state from the DB, which might involve rayon-parallelised tree hashing.
So the deadlock would be:

- We take the fork choice write lock in `on_block` but don't yet trigger tree hashing.
- We run the `par_iter` in this PR, which fills all the Rayon threads with functions waiting on the fork choice read lock.
- `on_block` tries to start tree hashing, but it can't because all the Rayon threads are busy.
- We are deadlocked, because we need `on_block` to finish so we can drop the fork choice write-lock.
Oh bugger, I think that could deadlock even with our current code then.
I think we need to choose between:
1. Never hold a lock while spawning Rayon tasks, or
2. Never obtain a lock in a Rayon task that could be held over a spawn.
I have a feeling that (1) is better in some way, but need to sleep on it.
Actually, does `on_block` load states? I can't see it at a quick glance. I also don't think loading a state does any tree hashing (although it will in tree-states land, so definitely something to keep an eye on).
Ah yeah, here:
lighthouse/beacon_node/beacon_chain/src/beacon_fork_choice_store.rs, lines 328 to 334 in 441fc16:
let (_, state) = self
    .store
    .get_advanced_hot_state(
        self.justified_checkpoint.root,
        max_slot,
        justified_block.state_root(),
    )
If I get some time today or tomorrow I'll try writing a test with `hiatus` that deadlocks on `unstable`.
After a lot of mucking around I got block production & tree hashing to deadlock on `unstable`, sort of. The branch is here: michaelsproul@5e253de

It was really hard to get all the bits to line up, so there are 7 separate `hiatus` breakpoints. I also had to cheat and just force tree hashing in `import_block` because I ran out of patience trying to get the tree hashing and state load in `on_block` to trigger. The main problem is that the attestations were triggering the justified checkpoint update before I could trigger it with `on_block`. I think using the mainnet spec would make this a bit easier, as we wouldn't have to justify on the epoch boundary (as we do with minimal).

I think this deadlock (on `unstable`) hasn't been observed in practice for several reasons:

1. It requires block production and block import to be running concurrently. This doesn't happen particularly often, but isn't that unlikely.
2. It requires the balances cache in fork choice to miss. This almost never happens in recent versions of Lighthouse.
3. It requires a machine with ~2 cores, because we only spawn two op pool tasks using `rayon::join`. If there are more than 2 Rayon threads in the pool then the tree hashing can make progress and drop the fork choice lock, which unblocks the op pool tasks.
In my example you can see the deadlock happen with:
RAYON_NUM_THREADS=2 FORK_NAME=capella cargo test --release --features fork_from_env -- deadlock_block_production --nocapture
If you bump the threadpool size to 3 then the deadlock doesn't happen (due to condition 3).
RAYON_NUM_THREADS=3 FORK_NAME=capella cargo test --release --features fork_from_env -- deadlock_block_production --nocapture
I haven't tried deadlocking Geemo's branch yet, but intuitively we make the deadlock more likely by spawning more threads for the attestation data (condition 3 no longer applies). Still, we are protected by the unlikeliness of condition 1 & condition 2 occurring simultaneously.
I need to think about it more.
Hey @GeemoCandama, I think to play it safe let's remove all the Rayon calls and assess the performance impact. After going deep down the Rayon rabbit hole I don't think there's a completely safe way for us to use Rayon in the op pool at the moment (see #4952). I think we should revisit parallelism in the op pool at a later date, maybe using a different dedicated thread pool, or Tokio. Hopefully #4925 buys us a little bit of wiggle room by reducing block production times overall.
I did some benchmarking with …
Issue Addressed
This branch will address #3733
Proposed Changes
Incorporate the Bron-Kerbosch algorithm that Satalia wrote into the attestation aggregation flow. More info in #3733
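For reference, a minimal sketch of the textbook Bron-Kerbosch recursion for enumerating maximal cliques (this is the classic form over adjacency sets, not Satalia's implementation or the exact code in this PR):

```rust
use std::collections::HashSet;

/// Textbook Bron-Kerbosch (without pivoting): pushes every maximal clique
/// of the graph, given as per-vertex neighbour sets, into `out`.
fn bron_kerbosch(
    adj: &[HashSet<usize>],
    r: &mut Vec<usize>,    // current clique under construction
    mut p: HashSet<usize>, // candidates that could extend `r`
    mut x: HashSet<usize>, // vertices already covered by earlier branches
    out: &mut Vec<Vec<usize>>,
) {
    if p.is_empty() && x.is_empty() {
        out.push(r.clone()); // `r` cannot be extended: it's a maximal clique
        return;
    }
    while let Some(&v) = p.iter().next() {
        r.push(v);
        let p_next = p.intersection(&adj[v]).copied().collect();
        let x_next = x.intersection(&adj[v]).copied().collect();
        bron_kerbosch(adj, r, p_next, x_next, out);
        r.pop();
        p.remove(&v);
        x.insert(v);
    }
}

fn main() {
    // Triangle 0-1-2 plus an edge 2-3: maximal cliques are {0, 1, 2} and {2, 3}.
    let adj = vec![
        HashSet::from([1, 2]),
        HashSet::from([0, 2]),
        HashSet::from([0, 1, 3]),
        HashSet::from([2]),
    ];
    let mut cliques = Vec::new();
    bron_kerbosch(
        &adj,
        &mut Vec::new(),
        (0..adj.len()).collect(),
        HashSet::new(),
        &mut cliques,
    );
    println!("{cliques:?}");
}
```

Roughly speaking, in the op pool setting each vertex is an aggregate attestation and edges connect aggregates with compatible (non-intersecting) aggregation bits, so each maximal clique is a set of aggregates that can be merged into one (see #3733 for the full write-up).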
Additional Info
@ankurdubey521 and I will be working on this.