Path based database sync #695

darioush · 2024-12-05T20:18:55Z

darioush
Dec 5, 2024
Maintainer

Geth code supports using a path based data structure to store the trie.
The main advantages of using this data structure is that it is self-pruning, which over time leads to smaller database sizes (especially for validators which only need recent state)
The self-pruning data structure leads to smaller database size, which improves database performance (eg, most r/w operations will be impacted by lg of DB size).

The current syncing algorithm picks a root to sync to and assumes this state is available until sync completion.
This means validators would need to keep several hours worth of trie node state-diffs in memory which does not seem scalable or a good use of validator RAM.

Here we discuss some possible approaches.

Authenticated prestates for blocks (or state-diffs)

Nodes can maintain a list of k/vs read (or modified) in each block and include a hash of this in the block header. These prestates (or state-diffs) can be easily maintained on disk.
Nodes can begin querying for these p2p as the sync begins, then replay them on top of synced data.

Introducing this in the header requires a network upgrade.

Change proofs

Nodes maintain recent trie node state-diffs in memory, and the client updates their target during the sync operation.
The peer with state can omit unmodified parts of the trie, if the client specifies a recent state root that they have the corresponding trie for.

An implementation for this is used for x/merkledb in the x/sync package of avalanchego. This algorithm can be adapted to support the nested trie structure of the Ethereum state trie, and also to account for the fixed key sizes. This rough draft implementation takes this approach, which shows significant modifications are needed.

These modifications can be made (with proper abstractions and code structure, not as in the draft), or an implementation inspired by this can be done in coreth/subnet-evm.

Transition to merkledb

A different approach is to use merkledb and make the hash compatible. This rough draft implementation shows how this may be achieved. The sync part (eg, verifying and calculating proofs) is TBD.

This approach has the risk of hash incompatibility due to implementation bugs / differences, so the trie structures must be exactly identical under all conditions for the transition period.

Adapting the geth sync

The geth sync uses a sync algorithm that updates the target during sync, however it continues to create intermediary trie nodes as it goes. It uses a healing process to fetch missing trie nodes using DFS. It should be demonstrated that this approach would work with the avalanche block time (2s vs 12s on Ethereum), and additionally the code must be connected, which requires a well constructed adapter layer.

Alternatively this could be conceptually implemented on top of the existing sync.

darioush · 2024-12-05T20:34:01Z

darioush
Dec 5, 2024
Maintainer Author

IMO the authenticated prestates (or statediffs) are the most straightforward. Implementations using the merkledb or x/sync code seem too risky / intrusive based on some of the linked work in the discussion (to some extent even the hashing makes more sense, but that could also be recency bias).

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Path based database sync #695

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

Path based database sync #695

darioush Dec 5, 2024 Maintainer

Authenticated prestates for blocks (or state-diffs)

Change proofs

Transition to merkledb

Adapting the geth sync

Replies: 1 comment

darioush Dec 5, 2024 Maintainer Author

darioush
Dec 5, 2024
Maintainer

darioush
Dec 5, 2024
Maintainer Author