Add a quick test for sampling initial sync performance #7649

Closed
arya2 opened this issue Sep 29, 2023 · 8 comments

Comments

@arya2
Contributor

arya2 commented Sep 29, 2023

Motivation

It's slow to check a Zebra version's initial sync performance by running a full sync when there's a suspected regression, and a quick test that can be run on every commit may help catch regressions more easily.

Possible Design

  1. Copy cached state
  2. Start timer
  3. Delete blocks after some height
  4. Sync ~10k blocks
  5. Shut down Zebra and finish the timer
  6. Repeat 2-5 a few times with lower heights
  7. Fail test if any range takes 10% longer than expected

Use the checkpoint_sync config field instead of heights above the max checkpoint.
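A rough sketch of those steps is below. Every helper in it (rollback_cached_state_to, spawn_zebrad_with_cached_state, zebrad_tip_height, kill, and the rpc_address parameter) is a hypothetical name, not part of Zebra's test framework; step 1 (copying the cached state) would happen before this runs, and step 6 is the loop over start heights:

    use std::time::{Duration, Instant};

    // Hypothetical sketch of the proposed test flow; none of these helpers exist yet.
    async fn sampled_sync_benchmark(
        start_heights: &[u32],
        blocks_per_sample: u32,
        expected_times: &[Duration],
        rpc_address: &str,
    ) -> Result<(), Box<dyn std::error::Error>> {
        for (&start, &expected) in start_heights.iter().zip(expected_times) {
            // Step 2: start the timer.
            let timer = Instant::now();

            // Step 3: delete blocks after the start height.
            rollback_cached_state_to(start)?;

            // Step 4: sync ~10k blocks past the start height.
            let mut zebrad = spawn_zebrad_with_cached_state()?;
            while zebrad_tip_height(rpc_address).await? < start + blocks_per_sample {
                tokio::time::sleep(Duration::from_secs(10)).await;
            }

            // Step 5: shut down Zebra and stop the timer.
            zebrad.kill()?;
            let elapsed = timer.elapsed();

            // Step 7: fail if this range took more than 10% longer than expected.
            assert!(
                elapsed <= expected.mul_f64(1.10),
                "sync from height {start} took {elapsed:?}, expected at most {:?}",
                expected.mul_f64(1.10),
            );
        }

        Ok(())
    }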

@arya2 arya2 added C-enhancement Category: This is an improvement S-needs-triage Status: A bug report needs triage P-Medium ⚡ I-slow Problems with performance or responsiveness A-diagnostics Area: Diagnosing issues or monitoring performance C-testing Category: These are tests A-state Area: State / database changes C-feature Category: New features labels Sep 29, 2023
@mpguerra mpguerra added this to Zebra Sep 29, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in Zebra Sep 29, 2023
@arya2 arya2 self-assigned this Sep 29, 2023
@teor2345
Contributor

teor2345 commented Sep 29, 2023

I'm not sure if this ticket is possible with the current state format, so I wanted to let you know before you spent much time on it. I think it might also be out of scope, please check with @mpguerra.

  • Delete blocks after some height

I think the sprout tree might be a blocker here.

Let's chat about how to do this, because there are a lot of column families to cover, and some will be really tricky to delete correctly. It doesn't need to be perfect, but it does need to delete blocks, hashes, heights, headers, transactions. And maybe addresses, UTXOs, trees, anchors.

We won't be able to delete nullifiers unless we go through every block we're deleting. This is possible but slow. It's necessary because we check for duplicate nullifiers during verification. Similarly, reverting the value balance requires deserializing every transaction. If we don't, we might get occasional underflow or overflow errors.
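For the column families that are keyed by height, a range delete is one option. Here is a minimal sketch, assuming the rust-rocksdb delete_range_cf API and illustrative (not Zebra's actual) column family names; the nullifier and value pool work described above would still need a separate per-block pass:

    use rocksdb::DB;

    // Sketch: range-delete height-keyed column families above `rollback_height`.
    fn delete_blocks_above(db: &DB, rollback_height: u32) -> Result<(), rocksdb::Error> {
        for cf_name in ["hash_by_height", "block_header_by_height"] {
            if let Some(cf) = db.cf_handle(cf_name) {
                // Keys are big-endian heights, so deleting the key range
                // [rollback_height + 1, u32::MAX) covers everything above the target.
                db.delete_range_cf(
                    cf,
                    (rollback_height + 1).to_be_bytes(),
                    u32::MAX.to_be_bytes(),
                )?;
            }
        }

        // Hash-keyed data (like heights by hash or nullifiers) and the value pool
        // can't be range-deleted by height: each block above the target has to be
        // read and its transactions deserialized to find the keys to remove and
        // the value to subtract.
        Ok(())
    }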

The sprout tree is impossible to revert because leaves are discarded when they are hashed into the frontier, and the hashing is irreversible. So some sprout anchors won't verify correctly while syncing. This seems like a blocker. We'd have to change the state format to store every sprout tree.
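As a toy illustration of why (this is not Zebra's actual tree type): an incremental Merkle frontier keeps only the hashes along its right-hand edge, and appending a leaf folds it into a parent hash with no inverse operation:

    // Toy frontier: one optional hash per tree level, holding the right-hand edge.
    struct Frontier {
        ommers: Vec<Option<[u8; 32]>>,
    }

    impl Frontier {
        fn append(&mut self, mut node: [u8; 32], hash: impl Fn(&[u8; 32], &[u8; 32]) -> [u8; 32]) {
            for slot in self.ommers.iter_mut() {
                match slot.take() {
                    // Empty level: park the pending node here and stop.
                    None => {
                        *slot = Some(node);
                        return;
                    }
                    // Occupied level: combine and carry upward. The original leaf is
                    // discarded here, so the only way back to an earlier anchor is to
                    // rebuild the tree by re-appending every leaf from the start.
                    Some(left) => node = hash(&left, &node),
                }
            }
        }
    }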

Edit: Here is the current format:
https://github.com/ZcashFoundation/zebra/blob/main/book/src/dev/state-db-upgrades.md#current-state-database-format


Here is some other feedback on this planned solution:

  • Copy cached state

Copying the state in Rust is very slow and requires a lot of disk space. Google Cloud already takes an efficient copy of the state for every CI run, so there's no need to re-do that work.

If we decide later to run the test multiple times, we can run the CI job or step multiple times to get a fresh copy of the cached state every time.

  • Start timer

This should go after the delete, because we don't care about the performance of test-only code.

  • Sync ~10k blocks

In my experience some of these performance issues only show up after 100K blocks. But let's try it with 10K and see.

  • Repeat 2-5 a few times with lower heights

I don't think we need to repeat the test, because we run CI on PRs and the main branch multiple times each day. So we'll catch most bugs within a few PRs.

But if we want to test a lower height, we could run the same test starting from the cached checkpoint state. (Just after Canopy.)

@arya2
Contributor Author

arya2 commented Sep 29, 2023

I think the sprout tree might be a blocker here.

It could store the sprout tree(s) at the start height(s) in the binary.
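A sketch of that mechanism, with a hypothetical snapshot path; the real tree type would supply the (de)serialization:

    // Hypothetical: a sprout note commitment tree snapshot, serialized ahead of time
    // at the chosen start height and embedded in the test binary at compile time.
    const SPROUT_TREE_AT_START_HEIGHT: &[u8] =
        include_bytes!("snapshots/sprout_tree_at_start_height.bin");

The test could then load that snapshot into the rolled-back state before starting the sync.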

We won't be able to delete nullifiers unless we go through every block we're deleting. This is possible but slow.

It probably doesn't need to delete many blocks; it could be a moving target as long as it only moves occasionally.

Copying the state in Rust is very slow and requires a lot of disk space. Google Cloud already takes an efficient copy of the state for every CI run, so there's no need to re-do that work.

Yep. The only new step here is rolling back the state.

In my experience some of these performance issues only show up after 100K blocks. But let's try it with 10K and see.

100K blocks sounds good too; I think 5-10K would be good for the sandblasted heights.

I don't think we need to repeat the test, because we run CI on PRs and the main branch multiple times each day. So we'll catch most bugs within a few PRs.

I think checking more height ranges will better reflect the full sync time, and it'd be useful to look for regressions in particular sections of the chain.

@arya2
Contributor Author

arya2 commented Sep 29, 2023

It could store the sprout tree(s) at the start height(s) in the binary.

Wait, we could cache the entire state at certain start heights.

@teor2345
Contributor

It could store the sprout tree(s) at the start height(s) in the binary.

Wait, we could cache the entire state at certain start heights.

It probably doesn't need to delete many blocks; it could be a moving target as long as it only moves occasionally.

The oldest cached state is currently 30K blocks behind, so that would be an option. We could modify the scheduled delete job to keep a few older states as well.

But first, we need to show that we can find a significant performance difference in a manual test. Otherwise there's no point in automating it. Luckily, we have a comparison we need to do right now:
#7618 (comment)

@arya2
Contributor Author

arya2 commented Sep 29, 2023

The oldest cached state is currently 30K blocks behind, so that would be an option. We could modify the scheduled delete job to keep a few older states as well.

We could add a test that syncs up to a precise height with something like:

    // Poll until zebrad's tip reaches the target height.
    while zebrad_tip_height(rpc_address).await? < end_height {
        zebrad.wait_for_stdout_line(None);
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
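One possible shape for the hypothetical zebrad_tip_height helper, assuming the reqwest (with its json feature) and serde_json crates and zebrad's getblockcount JSON-RPC method:

    // Query zebrad's JSON-RPC endpoint for the current tip height.
    async fn zebrad_tip_height(rpc_address: &str) -> Result<u32, Box<dyn std::error::Error>> {
        let response: serde_json::Value = reqwest::Client::new()
            .post(format!("http://{rpc_address}"))
            .json(&serde_json::json!({
                "jsonrpc": "1.0",
                "id": "sync-test",
                "method": "getblockcount",
                "params": []
            }))
            .send()
            .await?
            .json()
            .await?;

        // getblockcount returns the current tip height as a JSON number.
        Ok(response["result"]
            .as_u64()
            .ok_or("missing result in getblockcount response")? as u32)
    }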

@arya2 arya2 changed the title Add a quick test for initial sync performance Add a quick test for sampling initial sync performance Sep 29, 2023
@teor2345
Contributor

I'm just going to stop you there. We have existing tests that take a stop height, so we don't need to write anything new:

    fn sync_past_mandatory_checkpoint(network: Network) -> Result<()> {
        let height = network.mandatory_checkpoint_height() + 1200;

But I'd like to see some evidence that there is a detectable performance difference before we put effort into a test. (And this ticket needs to be scheduled by Pili.)

What do you think of the steps in this comment: #7618 (comment)

@arya2
Contributor Author

arya2 commented Sep 29, 2023

What do you think of the steps in this comment: #7618 (comment)

Comparing the sync times for 100K blocks from genesis is a great idea, though I'm not sure how to check the sync times closer to the tip.

@teor2345
Contributor

I answered in ticket #7618 to try and keep the conversation in the same place:
#7618 (comment)
