Add a quick test for sampling initial sync performance #7649

Closed
arya2 opened this issue Sep 29, 2023 · 8 comments

Comments

@arya2
Contributor

arya2 commented Sep 29, 2023

Motivation

It's slow to check a Zebra version's initial sync performance by running a full sync when there's a suspected regression, and a quick test that can be run on every commit may help catch regressions more easily.

Possible Design

  1. Copy cached state
  2. Start timer
  3. Delete blocks after some height
  4. Sync ~10k blocks
  5. Shut down Zebra and finish the timer
  6. Repeat 2-5 a few times with lower heights
  7. Fail test if any range takes 10% longer than expected

Use the checkpoint_sync config field instead of heights above the max checkpoint.
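A rough sketch of those steps is below. Every helper in it (rollback_cached_state_to, spawn_zebrad_with_cached_state, zebrad_tip_height, kill, and the rpc_address parameter) is a hypothetical name, not part of Zebra's test framework; step 1 (copying the cached state) would happen before this runs, and step 6 is the loop over start heights:

    use std::time::{Duration, Instant};

    // Hypothetical sketch of the proposed test flow; none of these helpers exist yet.
    async fn sampled_sync_benchmark(
        start_heights: &[u32],
        blocks_per_sample: u32,
        expected_times: &[Duration],
        rpc_address: &str,
    ) -> Result<(), Box<dyn std::error::Error>> {
        for (&start, &expected) in start_heights.iter().zip(expected_times) {
            // Step 2: start the timer.
            let timer = Instant::now();

            // Step 3: delete blocks after the start height.
            rollback_cached_state_to(start)?;

            // Step 4: sync ~10k blocks past the start height.
            let mut zebrad = spawn_zebrad_with_cached_state()?;
            while zebrad_tip_height(rpc_address).await? < start + blocks_per_sample {
                tokio::time::sleep(Duration::from_secs(10)).await;
            }

            // Step 5: shut down Zebra and stop the timer.
            zebrad.kill()?;
            let elapsed = timer.elapsed();

            // Step 7: fail if this range took more than 10% longer than expected.
            assert!(
                elapsed <= expected.mul_f64(1.10),
                "sync from height {start} took {elapsed:?}, expected at most {:?}",
                expected.mul_f64(1.10),
            );
        }

        Ok(())
    }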

@arya2 arya2 added C-enhancement Category: This is an improvement S-needs-triage Status: A bug report needs triage P-Medium ⚡ I-slow Problems with performance or responsiveness A-diagnostics Area: Diagnosing issues or monitoring performance C-testing Category: These are tests A-state Area: State / database changes C-feature Category: New features labels Sep 29, 2023
@mpguerra mpguerra added this to Zebra Sep 29, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in Zebra Sep 29, 2023
@arya2 arya2 self-assigned this Sep 29, 2023
@teor2345
Contributor

teor2345 commented Sep 29, 2023

I'm not sure if this ticket is possible with the current state format, so I wanted to let you know before you spent much time on it. I think it might also be out of scope, please check with @mpguerra.

  • Delete blocks after some height

I think the sprout tree might be a blocker here.

Let's chat about how to do this, because there are a lot of column families to cover, and some will be really tricky to delete correctly. It doesn't need to be perfect, but it does need to delete blocks, hashes, heights, headers, transactions. And maybe addresses, UTXOs, trees, anchors.

We won't be able to delete nullifiers unless we go through every block we're deleting. This is possible but slow. It's necessary because we check for duplicate nullifiers during verification. Similarly, reverting the value balance requires deserializing every transaction. If we don't, we might get occasional underflow or overflow errors.
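For the column families that are keyed by height, a range delete is one option. Here is a minimal sketch, assuming the rust-rocksdb delete_range_cf API and illustrative (not Zebra's actual) column family names; the nullifier and value pool work described above would still need a separate per-block pass:

    use rocksdb::DB;

    // Sketch: range-delete height-keyed column families above `rollback_height`.
    fn delete_blocks_above(db: &DB, rollback_height: u32) -> Result<(), rocksdb::Error> {
        for cf_name in ["hash_by_height", "block_header_by_height"] {
            if let Some(cf) = db.cf_handle(cf_name) {
                // Keys are big-endian heights, so deleting the key range
                // [rollback_height + 1, u32::MAX) covers everything above the target.
                db.delete_range_cf(
                    cf,
                    (rollback_height + 1).to_be_bytes(),
                    u32::MAX.to_be_bytes(),
                )?;
            }
        }

        // Hash-keyed data (like heights by hash or nullifiers) and the value pool
        // can't be range-deleted by height: each block above the target has to be
        // read and its transactions deserialized to find the keys to remove and
        // the value to subtract.
        Ok(())
    }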

The sprout tree is impossible to revert because leaves are discarded when they are hashed into the frontier, and the hashing is irreversible. So some sprout anchors won't verify correctly while syncing. This seems like a blocker. We'd have to change the state format to store every sprout tree.
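As a toy illustration of why (this is not Zebra's actual tree type): an incremental Merkle frontier keeps only the hashes along its right-hand edge, and appending a leaf folds it into a parent hash with no inverse operation:

    // Toy frontier: one optional hash per tree level, holding the right-hand edge.
    struct Frontier {
        ommers: Vec<Option<[u8; 32]>>,
    }

    impl Frontier {
        fn append(&mut self, mut node: [u8; 32], hash: impl Fn(&[u8; 32], &[u8; 32]) -> [u8; 32]) {
            for slot in self.ommers.iter_mut() {
                match slot.take() {
                    // Empty level: park the pending node here and stop.
                    None => {
                        *slot = Some(node);
                        return;
                    }
                    // Occupied level: combine and carry upward. The original leaf is
                    // discarded here, so the only way back to an earlier anchor is to
                    // rebuild the tree by re-appending every leaf from the start.
                    Some(left) => node = hash(&left, &node),
                }
            }
        }
    }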

Edit: Here is the current format:
https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#current-state-database-format


Here is some other feedback on this planned solution:

  • Copy cached state

Copying the state in Rust is very slow and requires a lot of disk space. Google Cloud already takes an efficient copy of the state for every CI run, so there's no need to re-do that work.

If we decide later to run the test multiple times, we can run the CI job or step multiple times to get a fresh copy of the cached state every time.

  • Start timer

This should go after the delete, because we don't care about the performance of test-only code.

  • Sync ~10k blocks

In my experience some of these performance issues only show up after 100K blocks. But let's try it with 10K and see.

  • Repeat 2-5 a few times with lower heights

I don't think we need to repeat the test, because we run CI on PRs and the main branch multiple times each day. So we'll catch most bugs within a few PRs.

But if we want to test a lower height, we could run the same test starting from the cached checkpoint state. (Just after Canopy.)

@arya2
Contributor Author

arya2 commented Sep 29, 2023

I think the sprout tree might be a blocker here.

It could store the sprout tree(s) at the start height(s) in the binary.
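A sketch of that mechanism, with a hypothetical snapshot path; the real tree type would supply the (de)serialization:

    // Hypothetical: a sprout note commitment tree snapshot, serialized ahead of time
    // at the chosen start height and embedded in the test binary at compile time.
    const SPROUT_TREE_AT_START_HEIGHT: &[u8] =
        include_bytes!("snapshots/sprout_tree_at_start_height.bin");

The test could then load that snapshot into the rolled-back state before starting the sync.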

We won't be able to delete nullifiers unless we go through every block we're deleting. This is possible but slow.

It probably doesn't need to delete many blocks; it could be a moving target as long as it only moves occasionally.

Copying the state in Rust is very slow and requires a lot of disk space. Google Cloud already takes an efficient copy of the state for every CI run, so there's no need to re-do that work.

Yep. The only new step here is rolling back the state.

In my experience some of these performance issues only show up after 100K blocks. But let's try it with 10K and see.

100K blocks sounds good too; I think 5-10K would be good for the sandblasted heights.

I don't think we need to repeat the test, because we run CI on PRs and the main branch multiple times each day. So we'll catch most bugs within a few PRs.

I think checking more height ranges will better reflect the full sync time, and it'd be useful to look for regressions in particular sections of the chain.

@arya2
Contributor Author

arya2 commented Sep 29, 2023

It could store the sprout tree(s) at the start height(s) in the binary.

Wait, we could cache the entire state at certain start heights.

@teor2345
Contributor

It could store the sprout tree(s) at the start height(s) in the binary.

Wait, we could cache the entire state at certain start heights.

It probably doesn't need to delete many blocks; it could be a moving target as long as it only moves occasionally.

The oldest cached state is currently 30K blocks behind, so that would be an option. We could modify the scheduled delete job to keep a few older states as well.

But first, we need to show that we can find a significant performance difference in a manual test. Otherwise there's no point in automating it. Luckily, we have a comparison we need to do right now:
#7618 (comment)

@arya2
Contributor Author

arya2 commented Sep 29, 2023

The oldest cached state is currently 30K blocks behind, so that would be an option. We could modify the scheduled delete job to keep a few older states as well.

We could add a test that syncs up to a precise height with something like:

    // Poll until zebrad's tip reaches the target height.
    while zebrad_tip_height(rpc_address).await? < end_height {
        zebrad.wait_for_stdout_line(None);
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
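One possible shape for the hypothetical zebrad_tip_height helper, assuming the reqwest (with its json feature) and serde_json crates and zebrad's getblockcount JSON-RPC method:

    // Query zebrad's JSON-RPC endpoint for the current tip height.
    async fn zebrad_tip_height(rpc_address: &str) -> Result<u32, Box<dyn std::error::Error>> {
        let response: serde_json::Value = reqwest::Client::new()
            .post(format!("http://{rpc_address}"))
            .json(&serde_json::json!({
                "jsonrpc": "1.0",
                "id": "sync-test",
                "method": "getblockcount",
                "params": []
            }))
            .send()
            .await?
            .json()
            .await?;

        // getblockcount returns the current tip height as a JSON number.
        Ok(response["result"]
            .as_u64()
            .ok_or("missing result in getblockcount response")? as u32)
    }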

@arya2 arya2 changed the title Add a quick test for initial sync performance Add a quick test for sampling initial sync performance Sep 29, 2023
@teor2345
Contributor

I'm just going to stop you there. We have existing tests that take a stop height, so we don't need to write anything new:

    fn sync_past_mandatory_checkpoint(network: Network) -> Result<()> {
        let height = network.mandatory_checkpoint_height() + 1200;

But I'd like to see some evidence that there is a detectable performance difference before we put effort into a test. (And this ticket needs to be scheduled by Pili.)

What do you think of the steps in this comment: #7618 (comment)

@arya2
Contributor Author

arya2 commented Sep 29, 2023

What do you think of the steps in this comment: #7618 (comment)

Comparing the sync times for 100K blocks from genesis is a great idea, though I'm not sure how to check the sync times closer to the tip.

@teor2345
Contributor

I answered in ticket #7618 to try and keep the conversation in the same place:
#7618 (comment)
