Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change(state): Stop using iterators on column families with many deletions #7663

Merged
merged 2 commits into from
Oct 4, 2023

Conversation

teor2345
Copy link
Contributor

@teor2345 teor2345 commented Oct 4, 2023

Motivation

When we use a RocksDB iterator on a column family with lots of deleted keys, database performance decreases by up to 300x.

Closes #7618

Specifications

This is a known issue with RocksDB, here are similar issues in other projects:
https://tracker.ceph.com/issues/55324
https://jira.mariadb.org/browse/MDEV-19670

Solution

As a quick fix we can replace the iterator with get(&key).
The permanent fix in PR #7392 stops deleting these keys entirely.

Manual Testing

I did a local test of this fix, and Zebra is up to 1.2 million blocks in under 5 hours. This is faster than Marek's main branch checks on his machine, and about as fast as v1.2.0 in CI:
https://github.com/ZcashFoundation/zebra/actions/runs/6049184682/job/16453440408

I think v1.2.0 was about as fast on my machine, but I haven't done any detailed tests. (A manual check is ok, we can do detailed checks in #7649.)

Review

This is an urgent fix which is blocking CI from working.

Reviewer Checklist

  • Will the PR name make sense to users?
    • Does it need extra CHANGELOG info? (new features, breaking changes, large changes)
  • Are the PR labels correct?
  • Does the code do what the ticket and PR says?
    • Does it change concurrent code, unsafe code, or consensus rules?
  • How do you know it works? Does it have tests?

Follow Up Work

Permanent fixes for all column families that delete keys:

Performance testing for this fix:

@teor2345 teor2345 added C-bug Category: This is a bug P-Critical 🚑 I-slow Problems with performance or responsiveness A-state Area: State / database changes labels Oct 4, 2023
@teor2345 teor2345 requested a review from a team as a code owner October 4, 2023 01:06
@teor2345 teor2345 self-assigned this Oct 4, 2023
@teor2345 teor2345 requested review from arya2 and removed request for a team October 4, 2023 01:06
Co-authored-by: Arya <[email protected]>
@teor2345
Copy link
Contributor Author

teor2345 commented Oct 4, 2023

Once we've verified this performance fix, I think we should admin-merge this PR, because it is currently blocked by CI failures in #7659 and #7661.

@teor2345
Copy link
Contributor Author

teor2345 commented Oct 4, 2023

Manual Testing

I did a local test of this fix on my machine, and Zebra is up to 1.2 million blocks in under 5 hours. This is faster than Marek's main branch checks on his machine, and about as fast as v1.2.0 in CI:
https://github.com/ZcashFoundation/zebra/actions/runs/6049184682/job/16453440408

I think v1.2.0 was about as fast on my machine, but I haven't done any detailed tests. So it looks like this fixes the bug. (And the way we use iterators in other PRs doesn't cause similar bugs.)

But I'd appreciate someone else checking the performance as well. A manual check is ok, we can do detailed checks in #7649.

@upbqdn
Copy link
Member

upbqdn commented Oct 4, 2023

This PR is as fast as v1.2.0, so it resolves the slow sync.

I wonder if we could speed Zebra up even further by fixing a similar issue that degrades performance elsewhere, but we accepted it.

mergify bot added a commit that referenced this pull request Oct 4, 2023
@arya2
Copy link
Contributor

arya2 commented Oct 4, 2023

In the merge queue, the test passed but the job failed:

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 45 filtered out; finished in 10465.70s
tee: 'standard output': Broken pipe
Error: Process completed with exit code 1.

https://github.com/ZcashFoundation/zebra/actions/runs/6405648679/job/17392070905#step:9:5457

@arya2
Copy link
Contributor

arya2 commented Oct 4, 2023

@Mergifyio refresh

@mergify
Copy link
Contributor

mergify bot commented Oct 4, 2023

refresh

✅ Pull request refreshed

@teor2345
Copy link
Contributor Author

teor2345 commented Oct 4, 2023

In the merge queue, the test passed but the job failed:

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 45 filtered out; finished in 10465.70s
tee: 'standard output': Broken pipe
Error: Process completed with exit code 1.
ZcashFoundation/zebra/actions/runs/6405648679/job/17392070905#step:9:5457

I reopened PR #7564 for this.

Since this PR is failing due to bug #7665, I'm going to admin-merge it.

@teor2345 teor2345 merged commit fcc7bf4 into main Oct 4, 2023
118 of 120 checks passed
@teor2345 teor2345 deleted the fix-slow branch October 4, 2023 21:36
@upbqdn upbqdn mentioned this pull request Oct 13, 2023
38 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-state Area: State / database changes C-bug Category: This is a bug I-slow Problems with performance or responsiveness
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Zebra full sync is taking at least 50% longer over the last 3 weeks
3 participants