Test Tablets split and merge #8948
@yarongilor what's the point of creating PRs without any description?
d09408f
to
6a436a0
Compare
e9c26e8
to
82ca4f6
Compare
Started testing the new PR test in https://argus.scylladb.com/test/e6972842-591d-472f-9e39-f196d3670053/runs?additionalRuns[]=93284dff-cfe7-4b45-883f-53509355301d
Since the split-merge code is not merged to master, the test probably failed as expected with:
Waiting for a private build from Raphael in https://jenkins.scylladb.com/job/releng/job/create-private-build/306/
Not sure the PR can be fully reviewed before it is tested.
@yarongilor you can test using SCT custom branches: https://docs.google.com/presentation/d/1P5xofncoTkUI-uQ5ilG9eRhRoEfYVdJRhdps4qQ6Mx0/edit?pli=1#slide=id.g2e83887fc9a_0_0
Private build https://jenkins.scylladb.com/job/releng/job/create-private-build/307/
@raphaelsc, can you please see why it's broken?
82ca4f6
to
d63c7f8
Compare
Need to remove all packages that require Java. This was raised, but closed since it should be fixed soon: #8474
@yarongilor @soyacz I don't understand what I have to do to fix this. I need instructions. We need a custom branch of mine (i.e. work that is not yet on master), which is why I had to produce a private build.
d63c7f8
to
3e691c3
Compare
two possibilities:
@yarongilor The 2nd option sounds easier. Link to the branch: https://github.com/scylladb/scylla-dev/tree/tablet-merge
The deletions in c-s are not good enough and might become a bottleneck or fail.
8c6a9ac
to
d94a726
Compare
Building an AMI in:
7ce8506
to
063f0c7
Compare
@raphaelsc, @bhalevy, the test seems to fail, since it doesn't get more than 1 tablet:
But there might be a different issue here, related to the replication strategy.
Opened an issue for the missing split as well: scylladb/scylladb#21092
0f59b26
to
746b5a7
Compare
OK, thanks. What I realized from the above comments is that the trigger for a split could be the data size on disk and not the "pure" dataset size. So I simplified the test: it now just runs a background thread that samples the number of tablets every few seconds and saves the maximum seen over time. The test passes now.
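The background sampling described above could look roughly like the sketch below. It is not the code from this PR: the class name `MaxTabletCountSampler` and the injected `get_tablet_count` callable are hypothetical, and how the tablet count is actually fetched from the cluster is left to the caller.

```python
import threading
import time
from typing import Callable


class MaxTabletCountSampler:
    """Polls a tablet-count getter in a background thread and keeps the maximum seen.

    `get_tablet_count` is a hypothetical callable supplied by the test; it is
    assumed to return the current number of tablets for the table under test.
    """

    def __init__(self, get_tablet_count: Callable[[], int], interval_sec: float = 5.0):
        self._get_tablet_count = get_tablet_count
        self._interval_sec = interval_sec
        self._stop_event = threading.Event()
        self._lock = threading.Lock()
        self._max_count = 0
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        while not self._stop_event.is_set():
            try:
                count = self._get_tablet_count()
            except Exception:
                # A failed sample (e.g. node busy) is skipped rather than ending the thread.
                count = None
            if count is not None:
                with self._lock:
                    self._max_count = max(self._max_count, count)
            # wait() returns early when stop() is called, so shutdown is prompt.
            self._stop_event.wait(self._interval_sec)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> int:
        """Stops sampling and returns the maximum tablet count observed."""
        self._stop_event.set()
        self._thread.join()
        return self.max_count

    @property
    def max_count(self) -> int:
        with self._lock:
            return self._max_count
```

A test would call `start()` before the write load, `stop()` afterwards, and assert that the returned maximum is greater than the initial tablet count. Sampling the maximum avoids missing a split that is later undone by a merge.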
206bca7
to
a12ff32
Compare
The test fails now since a tablet merge never happens.
I don't think we should open issues against Scylla, since merge is not merged to master yet, so there's nothing to be done there.
2a4b60e
to
13a0c0d
Compare
The latest run encountered
and then a major compaction failed due to a core dump:
and a commit log error:
It's an issue with merge: the split is revoked, a merge is emitted, and the split compaction fails. It's not the end of the world (it doesn't affect the correctness of the system, and in practice we won't abort), since the split can be retried later, but we should fix it.
@raphaelsc, can you please explain why the first (and only) merge happened only after 4 full cycles, and only after 15 hours? Isn't it expected to happen in some correlation with splits?
d9fb905
to
714b835
Compare
2024-12-01T22:41:32.753+00:00 tablets-split-merge-tablets--db-node-9aefb63e-1 !INFO | scylla[4975]: [shard 0: gms] load_balancer - Table 5ee787b0-b00d-11ef-badc-4abe9bfd9e7c with tablet_count=256 has an average tablet size of 487239504
The average size hadn't reached the threshold right before we started cycle 1. That's why I asked you to dump the sstable files (and their sizes) before and after major compaction, so we can try to deduce why the workload is not triggering a merge.
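For tracking how the average tablet size evolves across cycles, the `load_balancer` log line quoted above can be parsed mechanically. The following is a minimal sketch: the regular expression is derived only from that single quoted line, so the message format is an assumption, and the names `TabletStats` and `parse_load_balancer_line` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the one log line quoted in this thread; the exact
# message format in other Scylla versions is an assumption.
LOAD_BALANCER_RE = re.compile(
    r"load_balancer - Table (?P<table_id>[0-9a-f-]+) "
    r"with tablet_count=(?P<tablet_count>\d+) "
    r"has an average tablet size of (?P<avg_tablet_size>\d+)"
)


class TabletStats(NamedTuple):
    table_id: str
    tablet_count: int
    avg_tablet_size: int  # reported as a plain integer; assumed to be bytes


def parse_load_balancer_line(line: str) -> Optional[TabletStats]:
    """Returns the tablet stats found in a log line, or None if the line does not match."""
    match = LOAD_BALANCER_RE.search(line)
    if match is None:
        return None
    return TabletStats(
        table_id=match.group("table_id"),
        tablet_count=int(match.group("tablet_count")),
        avg_tablet_size=int(match.group("avg_tablet_size")),
    )
```

Collecting these tuples over a run gives a time series of tablet count and average size, which can be compared against whatever split and merge thresholds the cluster is configured with.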
714b835
to
6e5dc16
Compare
@raphaelsc, it looks like the core dump during major compaction happens only following a tablet merge:
@yarongilor I am a bit confused about the problems you found. I understand you found two problems:
1579da8
to
9c053c3
Compare
As for 1, yes, it is updated in scylladb/scylladb#21736
A test of high load, causing rapid tablet splits. Then run a deletion burst and a major compaction to trigger rapid tablet merges.
9c053c3
to
50267bc
Compare
Opened scylladb/scylladb#21867 for the coredump.
A test of high load, causing rapid tablet splits.
Then have a deletion burst and a major compaction to trigger rapid merges as well.
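One cycle of the flow described above can be sketched as follows. This is an illustration under assumptions, not the test added by this PR: all four callables are hypothetical hooks that the caller would wire to the real load tool, deletion workload, major compaction, and tablet-count query.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CycleResult:
    count_before: int
    count_after_load: int
    count_after_compaction: int

    @property
    def split_observed(self) -> bool:
        # More tablets after the write load than before it.
        return self.count_after_load > self.count_before

    @property
    def merge_observed(self) -> bool:
        # Fewer tablets after deletions and major compaction than after the load.
        return self.count_after_compaction < self.count_after_load


def run_split_merge_cycle(
    run_write_load: Callable[[], None],
    run_deletion_burst: Callable[[], None],
    run_major_compaction: Callable[[], None],
    get_tablet_count: Callable[[], int],
) -> CycleResult:
    """Runs one load / delete / compact cycle and records tablet counts at each stage."""
    count_before = get_tablet_count()
    run_write_load()
    count_after_load = get_tablet_count()
    run_deletion_burst()
    run_major_compaction()
    count_after_compaction = get_tablet_count()
    return CycleResult(count_before, count_after_load, count_after_compaction)
```

As the discussion in this thread shows, merges may lag well behind the compaction, so a real test would likely need to poll the tablet count for some time after each stage instead of reading it once.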
Test description doc.