Latent Gossip Issue - forced small syncs #15615

poulok (Maintainer) · 2024-09-30T15:24:28Z

@cody-littley, in a recent proposal from @rbair23 you posted this:

> Latent Gossip Issue
> A word of warning about enabling gossip with birth rounds (not directly related to prior paragraphs). There is currently a major inefficiency in gossip when one peer is very far behind the other. This is because the node that is ahead iterates over the entire hashgraph in order to find the relevant events to send.
>
> This gets MUCH, MUCH worse when syncs are forced to be small. Using birth rounds to limit syncs, even if gossip can get verifiable future events, is going to force there to be small syncs. The net outcome of this is going to be that nodes that fall a little behind will never be able to catch up via gossip. I collected strong experimental data showing this phenomenon; it's not just theorycrafting.
>
> This issue is fixable, but the change is non-trivial and is going to require some deep changes to sync and sync data structures. Let me know if you'd like to discuss further, and I can describe the plan of attack I had originally planned on using to resolve this problem.

I want to understand this better. If a node falls slightly behind due to a temporary network interruption, why could it not catch up via gossip after birth rounds are used to discard events that are too far in the future? Assuming that the behind node can process events and transactions at least slightly faster than the network, it will advance its birth rounds and, with them, the range of future events it can receive via gossip, and eventually catch up. What am I missing?

poulok (Maintainer, Author) · 2024-09-30T19:53:17Z

@litt3, adding you in case you are interested in this topic.

litt3 · 2024-09-30T20:59:57Z

The problem is that the further a node is behind, the greater the overhead is per sync for the sender to iterate over the hashgraph into the past. This cost is paid for each and every sync, regardless of whether 1 event or 10k events are sent.

Since switching to birth rounds entails discarding far future events, it won't be possible to perform large syncs, and the inefficiency will be more impactful. This inefficient syncing could be severe enough that transaction handling is no longer the main bottleneck. Rather, a far behind node might just not be able to get events via sync quickly enough to catch up.
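To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes, as a simplification rather than a measurement, that the backlog is static (the network creates no new events while the node catches up) and that each sync makes the sender walk every event the receiver is still missing. The function `total_events_visited` and its parameters are hypothetical names used only for illustration.

```python
def total_events_visited(backlog, events_per_sync):
    """Return (number of syncs, total events the sender walks) to clear a backlog.

    Assumption: every sync forces the sender to traverse the entire
    remaining backlog, no matter how few events it ends up sending.
    """
    visited = 0
    syncs = 0
    remaining = backlog
    while remaining > 0:
        visited += remaining                       # full walk over what is still missing
        remaining -= min(events_per_sync, remaining)
        syncs += 1
    return syncs, visited


# One large sync versus many small ones over the same 10,000-event backlog.
print(total_events_visited(10_000, 10_000))  # (1, 10000)
print(total_events_visited(10_000, 10))      # (1000, 5005000)
```

Under these assumptions the traversal work grows roughly quadratically in the backlog once syncs are forced to be small, while the number of events actually delivered stays the same.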

1 reply

litt3 Sep 30, 2024

The solution is to design a data structure that allows the sender to avoid iterating over the entire hashgraph history for each sync.

cody-littley · 2024-10-01T13:20:17Z

I agree with @litt3. Some minor clarifications.

> the greater the overhead is per sync for the sender to iterate over the hashgraph into the past.

This is the key point. If I have event X in memory and want to send it to my peer, I am required to iterate backwards from my tips until I encounter event X. If X is old, then I have to iterate over a very large number of events. This cost is paid each time a new sync is initiated, so if I'm consistently sending just a few events at a time, I'm iterating over a huge number of events for each sync. This can add seconds or tens of seconds to a single sync operation, and it occupies a full CPU core for multiple seconds to boot.
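The backward walk described above can be sketched as follows. This is a deliberately simplified model, not the platform's sync code: the hashgraph is reduced to a single chain of events, each with one parent, and `Event`, `build_chain`, `select_events`, `peer_latest`, and `max_accepted` are hypothetical names. `max_accepted` stands in for whatever limit (for example, a birth round threshold) keeps the receiver from taking newer events.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Event:
    index: int                                   # position in the chain, 0 = oldest
    parent: Optional["Event"] = field(default=None, repr=False)


def build_chain(length):
    """Build a chain of `length` events and return the newest one (the tip)."""
    tip = None
    for i in range(length):
        tip = Event(index=i, parent=tip)
    return tip


def select_events(tip, peer_latest, max_accepted):
    """Walk back from the tip until reaching an event the peer already has.

    Returns (events to send, oldest first; number of events visited).
    Events newer than `max_accepted` are visited but not sent.
    """
    to_send = []
    visited = 0
    event = tip
    while event is not None and event.index > peer_latest:
        visited += 1
        if event.index <= max_accepted:
            to_send.append(event)
        event = event.parent
    to_send.reverse()
    return to_send, visited


tip = build_chain(10_000)
small, visited_small = select_events(tip, peer_latest=99, max_accepted=109)
large, visited_large = select_events(tip, peer_latest=99, max_accepted=9_999)
print(len(small), visited_small)   # 10 9900
print(len(large), visited_large)   # 9900 9900
```

In this model the sender visits 9,900 events in both cases, whether the sync delivers 10 events or all 9,900, which is the per-sync overhead in question.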

> birth rounds entails discarding far future events

A minor point: when gossiping with an honest peer, the peer should never send future events. In general, the only time future events need to be discarded is when an attacker is sending events they know are in a peer's future.

> inefficient syncing could be severe enough

I'd suggest thinking of this as "inefficient syncing WILL be severe enough". I experimentally observed this behavior on a real network at scale. If this problem is ignored, I have a high degree of confidence that it will render the sync gossip implementation non-functional when birth rounds are enabled.

poulok (Maintainer, Author) · 2024-10-01T13:41:00Z

Thank you both for explaining - makes sense!

We could keep track of tips by birth round: a map keyed by birth round whose value holds the most recent event from each creator with that birth round. I suppose the value would itself be a map, keyed by node id. That way the sender can easily look up the tips for the latest birth round the receiver will accept and start its traversal from there, instead of from its current tips. There is probably a better way to set it up. It will be a fun problem to think about.

lpetrovic05 (Collaborator) · 2024-10-04T12:22:34Z

We used to have a sync algorithm that used child pointers, and I think it required a lot less traversal. The problem with it was not the algorithm itself but the implementation. We could check whether that algorithm would solve this issue.

1 reply

cody-littley Oct 4, 2024

Ensuring the thread safety of an algorithm that follows child pointers seems like it would add some unfortunate complexity. If you start with a set of tips and work backwards, you never have to worry about encountering new events added to the data structure (assuming events are inserted in topological order without gaps). When iterating forward, it is possible to encounter events being asynchronously added by another thread. In such an environment, the linkages between events would need to be created and broken atomically in a thread-safe manner (as opposed to the standard object references in use today).

On the other hand, if the system remembers the set of tips for each non-ancient birth round, the existing graph traversal algorithm can work without modifications and without new thread safety concerns. All new code and complexity would be limited to the maintenance of this lookup table. Performance-wise, this would add an O(1) cost for each event in the system, and an O(1) cost for each birth round that becomes ancient.
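A minimal sketch of such a lookup table, in Python for brevity, might look like the following. It is an illustration under stated assumptions, not a design for the platform: `Event`, `TipsByBirthRound`, and all method names are hypothetical, events are assumed to be added in topological order (so a later event from the same creator simply replaces the earlier one), and thread safety is not addressed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    creator: int       # node id of the event's creator
    birth_round: int
    sequence: int      # hypothetical per-creator counter, for illustration only


class TipsByBirthRound:
    """Most recent event per creator, grouped by birth round."""

    def __init__(self):
        self._tips = {}  # birth_round -> {creator -> Event}

    def add_event(self, event):
        # O(1): one dict lookup and one assignment per event.
        self._tips.setdefault(event.birth_round, {})[event.creator] = event

    def tips_at(self, birth_round):
        # Starting points for a backward traversal limited to `birth_round`.
        return dict(self._tips.get(birth_round, {}))

    def mark_ancient(self, birth_round):
        # O(1): drop the whole entry once the birth round becomes ancient.
        self._tips.pop(birth_round, None)


table = TipsByBirthRound()
for e in [Event(0, 5, 1), Event(1, 5, 1), Event(0, 5, 2), Event(0, 6, 3)]:
    table.add_event(e)

print(table.tips_at(5))  # creator 0 -> sequence 2, creator 1 -> sequence 1
table.mark_ancient(5)
print(table.tips_at(5))  # {}
```

One gap this sketch leaves open: a creator with no event in the requested birth round has no entry there, so a real lookup would need some fallback to that creator's latest event in an earlier round.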
