Add a quick test for sampling initial sync performance #7649
Comments
I'm not sure this ticket is possible with the current state format, so I wanted to let you know before you spend much time on it. I think it might also be out of scope; please check with @mpguerra.
I think the sprout tree might be a blocker here. Let's chat about how to do this, because there are a lot of column families to cover, and some will be really tricky to delete correctly. It doesn't need to be perfect, but it does need to delete blocks, hashes, heights, headers, and transactions, and maybe addresses, UTXOs, trees, and anchors.

We won't be able to delete nullifiers unless we go through every block we're deleting, which is possible but slow. It's necessary because we check for duplicate nullifiers during verification. Similarly, reverting the value balance requires deserializing every transaction; if we don't, we might get occasional underflow or overflow errors.

The sprout tree is impossible to revert, because leaves are discarded when they are hashed into the frontier, and the hashing is irreversible (see the sketch after this comment). So some sprout anchors won't verify correctly while syncing. This seems like a blocker: we'd have to change the state format to store every sprout tree.

Edit: Here is the current format:

Here is some other feedback on this planned solution:
Copying the state in Rust is very slow and requires a lot of disk space. Google Cloud already takes an efficient copy of the state for every CI run, so there's no need to re-do that work. If we decide later to run the test multiple times, we can run the CI job or step multiple times to get a fresh copy of the cached state every time.
This should go after the delete, because we don't care about the performance of test-only code.
In my experience some of these performance issues only show up after 100K blocks. But let's try it with 10K and see.
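On the sprout tree point above: here's a minimal, hypothetical sketch (not Zebra's actual sprout tree code, and using a stand-in hash) of why an append-only Merkle frontier can't be rolled back. Appending a leaf keeps only the hashes along the right edge of the tree, so the data needed to recompute an earlier root is discarded:

```rust
// Toy frontier: one slot per tree level; `Some(hash)` is a completed left
// subtree waiting for its right sibling. Not Zebra code, just an illustration.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn node_hash(left: u64, right: u64) -> u64 {
    // Stand-in for the real (one-way) Merkle node hash.
    let mut h = DefaultHasher::new();
    (left, right).hash(&mut h);
    h.finish()
}

#[derive(Default)]
struct Frontier {
    slots: Vec<Option<u64>>,
}

impl Frontier {
    fn append(&mut self, mut node: u64) {
        // Carry the new node up the levels, combining it with any waiting
        // left sibling. The combined children are dropped: only the parent
        // hash survives, which is why rollback is impossible.
        for slot in self.slots.iter_mut() {
            match slot.take() {
                None => {
                    *slot = Some(node);
                    return;
                }
                Some(left) => node = node_hash(left, node),
            }
        }
        self.slots.push(Some(node));
    }
}

fn main() {
    let mut frontier = Frontier::default();
    for leaf in 0..1000u64 {
        frontier.append(leaf);
    }
    // After 1000 appends we hold only a handful of hashes, not 1000 leaves,
    // so there is no way to recompute the frontier as it was at leaf 500.
    let kept = frontier.slots.iter().filter(|s| s.is_some()).count();
    println!("hashes kept: {kept}");
}
```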
I don't think we need to repeat the test, because we run CI on PRs and the main branch. But if we want to test a lower height, we could run the same test starting from the cached checkpoint state (just after Canopy).
It could store the sprout tree(s) at the start height(s) in the binary.
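For example, here is a hypothetical sketch of what "storing the trees in the binary" could look like; the file names, heights, and lookup helper are made up for illustration:

```rust
// Hypothetical sketch: embed pre-serialized sprout trees for the supported
// test start heights directly in the binary. The file paths and heights are
// placeholders, not Zebra's real layout.
const SPROUT_TREES: &[(u32, &[u8])] = &[
    (1_500_000, include_bytes!("sprout-tree-1500000.bin")),
    (1_600_000, include_bytes!("sprout-tree-1600000.bin")),
];

// Look up the embedded tree bytes for a given start height, if any.
fn sprout_tree_bytes_for(start_height: u32) -> Option<&'static [u8]> {
    SPROUT_TREES
        .iter()
        .find(|(height, _)| *height == start_height)
        .map(|(_, bytes)| *bytes)
}
```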
It probably doesn't need to delete many blocks; it could be a moving target as long as it only moves occasionally.
Yep. The only new step here is rolling back the state.
100K blocks sounds good too; I think 5-10K would be good for sandblasted heights.
I think checking more height ranges will better reflect the full sync time, and it'd be useful to look for regressions in particular sections of the chain.
Wait, we could cache the entire state at certain start heights.
The oldest cached state is currently 30K blocks behind, so that would be an option. We could modify the scheduled delete job to keep a few older states as well. But first, we need to show that we can find a significant performance difference in a manual test; otherwise there's no point in automating it. Luckily, we have a comparison we need to do right now:
We could add a test that syncs until a precise height with something like:

```rust
// Keep waiting while the node's tip is still below the target height.
while zebrad_tip_height(rpc_address).await? < end_height {
    zebrad.wait_for_stdout_line(None);
    tokio::time::sleep(Duration::from_secs(10)).await;
}
```
I'm just going to stop you there; we have existing tests that take a stop height, so we don't need to write anything new: zebra/zebrad/tests/acceptance.rs, Lines 1123 to 1124 in 04568d2
But I'd like to see some evidence that there is a detectable performance difference before we put effort into a test. (And this ticket needs to be scheduled by Pili.) What do you think of the steps in this comment: #7618 (comment)
Comparing the sync times for 100K blocks from genesis is a great idea, though I'm not sure how to check the sync times closer to the tip.
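One rough way to time a specific height range (from genesis, or nearer the tip once a suitable cached state exists) is to record when the node crosses the start and end heights. This is only a sketch, not an existing Zebra test; the tip-height lookup is passed in as a closure because the real helper (e.g. an RPC query, as in the loop above) is assumed, not shown:

```rust
use std::future::Future;
use std::time::{Duration, Instant};

// Rough sketch: measure how long the node takes to sync a fixed range of
// blocks by polling its tip height. `tip_height` is whatever async helper
// the test already uses to read zebrad's tip.
async fn time_sync_range<F, Fut>(mut tip_height: F, start_height: u32, end_height: u32) -> Duration
where
    F: FnMut() -> Fut,
    Fut: Future<Output = u32>,
{
    // Wait until the node reaches the start of the range we want to measure.
    while tip_height().await < start_height {
        tokio::time::sleep(Duration::from_secs(10)).await;
    }

    let started = Instant::now();

    // Then time how long the node takes to cover the measured range.
    while tip_height().await < end_height {
        tokio::time::sleep(Duration::from_secs(10)).await;
    }

    started.elapsed()
}
```

Running this once from genesis and once from a cached state near the tip, on each zebrad build being compared, would give the durations to compare.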
I answered in ticket #7618 to try and keep the conversation in the same place:
Motivation
When there's a regression, checking a Zebra version's initial sync performance by running a full sync is slow. A quick test that can run on every commit would help catch regressions more easily.
Possible Design
Use the checkpoint_sync config field instead of heights above the max checkpoint.
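A minimal sketch of what that could look like, assuming the field still lives at consensus.checkpoint_sync in ZebradConfig and keeps its usual meaning (false means only the required checkpoints are used, so full verification starts right after them):

```rust
// Hypothetical sketch: build a zebrad config for the sampled sync test.
// Setting `checkpoint_sync` to false limits checkpointing to the required
// checkpoints, so full verification can be exercised without syncing past
// the highest checkpoint.
use zebrad::config::ZebradConfig;

fn sampled_full_verification_config() -> ZebradConfig {
    let mut config = ZebradConfig::default();
    config.consensus.checkpoint_sync = false;
    config
}
```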