Add testcase for scale-out/scale-in while having 90% storage utilization #9156

Open · Tracked by #9305
pehala opened this issue Nov 7, 2024 · 15 comments
Labels: area/elastic cloud (Issues related to the elastic cloud project), area/tablets, P1 Urgent

@pehala (Contributor) commented Nov 7, 2024

  • Create a 3-node cluster with RF=3.
  • Reach 90% disk usage.
  • Measure latency under stress.
  • Perform scale-out (3 -> 4) under load.
  • Reach 90% disk usage.
  • Perform scale-out (4 -> 5) under load.
  • Measure latency under stress (scaled to 5 nodes).
  • Perform scale-in (5 -> 4) under load.
  • Measure latency under stress (scaled to 4 nodes).
  • Perform scale-in (4 -> 3) under load.
  • Measure latency under stress (scaled to 3 nodes).
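
A minimal sketch of the intended flow in Python, using hypothetical helper stubs (not the actual SCT API) purely to illustrate the ordering of steps:

```python
# Sketch of the test plan above. Every helper here is a hypothetical stub
# standing in for SCT functionality, not the real scylla-cluster-tests API.

def reach_disk_usage(cluster, target_percent):
    """Stub: write data until the fullest node reaches target_percent disk usage."""

def measure_latency(cluster, label):
    """Stub: run the stress load and record p99 read/write latency."""

def add_node(cluster):
    """Stub: scale out by one node while the background load keeps running."""

def decommission_node(cluster):
    """Stub: scale in by one node while the background load keeps running."""

def scale_out_scale_in_at_90_percent(cluster):
    # 3-node cluster, RF=3, filled to 90%
    reach_disk_usage(cluster, 90)
    measure_latency(cluster, "baseline, 3 nodes at 90%")

    add_node(cluster)                        # scale-out 3 -> 4 under load
    reach_disk_usage(cluster, 90)            # refill to 90%
    add_node(cluster)                        # scale-out 4 -> 5 under load
    measure_latency(cluster, "scaled to 5 nodes")

    decommission_node(cluster)               # scale-in 5 -> 4 under load
    measure_latency(cluster, "scaled to 4 nodes")
    decommission_node(cluster)               # scale-in 4 -> 3 under load
    measure_latency(cluster, "scaled to 3 nodes")
```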

Results

pehala added the area/elastic cloud label on Nov 7, 2024
pehala assigned Lakshmipathi and unassigned pehala on Nov 7, 2024
@Lakshmipathi commented Nov 11, 2024

older run

3-node (instance type: i4i.large) cluster scale-out at 90%.

Reached 91% disk usage, then waited for 30 minutes with no reads or writes.

< t:2024-11-03 07:10:58,323 f:full_storage_utilization_test.py l:93   c:FullStorageUtilizationTest p:INFO  > Current max disk usage after writing to keyspace10: 91% (396 GB / 392.40000000000003 GB)
< t:2024-11-03 07:10:59,353 f:full_storage_utilization_test.py l:58   c:FullStorageUtilizationTest p:INFO  > Wait for 1800 seconds

After the 30-minute idle period, started a throttled write:

< t:2024-11-03 07:42:10,941 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress write duration=30m -rate threads=10 "throttle=1400/s" -mode cql3 native -pop seq=1..5000000 -col "size=FIXED(10240) n=FIXED(1)" -schema "replication(strategy=NetworkTopologyStrategy,replication_factor=3)"
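
(For scale, a back-of-the-envelope estimate of the ingest rate implied by that command; the throttle and row size are taken from the command itself, RF=3 from its schema settings.)

```python
# Rough ingest rate implied by the throttled cassandra-stress command above.
ops_per_sec = 1400            # -rate "throttle=1400/s"
row_bytes = 10240             # -col "size=FIXED(10240) n=FIXED(1)"
replication_factor = 3        # NetworkTopologyStrategy, replication_factor=3

payload_mb_s = ops_per_sec * row_bytes / 1e6
print(f"~{payload_mb_s:.1f} MB/s raw payload, "
      f"~{payload_mb_s * replication_factor:.1f} MB/s written across replicas")
# -> ~14.3 MB/s raw payload, ~43.0 MB/s written across replicas
```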

Scale-out by adding a new node at 90% disk usage:

< t:2024-11-03 07:44:05,075 f:full_storage_utilization_test.py l:41   c:FullStorageUtilizationTest p:INFO  > Adding a new node

After 30 minutes, the scaled-out (3 -> 4) cluster has disk usage at 75%, 74%, 75%, and 70%.
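
(A quick sanity check of those numbers: the same data spread over four equal nodes should land near 3/4 of the pre-scale-out utilization, before accounting for the throttled writes that kept running.)

```python
# Expected per-node utilization after rebalancing the same data over 4 nodes.
before_percent = 91                     # per-node usage at 3 nodes (from the log)
expected_after = before_percent * 3 / 4
observed_after = [75, 74, 75, 70]       # reported per-node usage at 4 nodes

print(f"expected ~{expected_after:.0f}%, "
      f"observed avg {sum(observed_after) / len(observed_after):.1f}%")
# -> expected ~68%, observed avg 73.5% (the remainder includes the writes
#    that continued during the scale-out)
```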

Tablet migration over time
Image

max/avg disk utilization
Image

Latency
99th percentile write and read latency by Cluster (max at 90% disk utilization):

| operation | p99 latency |
|-----------|-------------|
| write     | 3.07 ms     |
| read      | 1.79 ms     |

https://argus.scylladb.com/tests/scylla-cluster-tests/c5de2f39-770c-4cf3-8d8c-66fef9d91d87

@swasik commented Nov 22, 2024

> After 30 minutes, the scaled-out (3 -> 4) cluster has disk usage at 75%, 74%, 75%, and 70%.

But I see that the chart presents average disk usage. This should change quickly as we add more disk space, even if the new space is not used. Could you also add a picture of the maximal disk usage across all nodes?
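
(For illustration, a minimal sketch of pulling both numbers from the monitoring stack. It assumes a reachable Prometheus endpoint and standard node_exporter filesystem metrics; the address and mountpoint below are placeholders.)

```python
# Sketch: report average vs. maximum disk utilization across the cluster.
# Assumes Prometheus + node_exporter metrics; adjust the address and the
# mountpoint filter to the actual monitoring setup.
import requests

PROMETHEUS = "http://monitor:9090/api/v1/query"   # placeholder address
USED_PERCENT = (
    '100 * (1 - node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} '
    '/ node_filesystem_size_bytes{mountpoint="/var/lib/scylla"})'
)

def per_node_usage():
    resp = requests.get(PROMETHEUS, params={"query": USED_PERCENT}, timeout=30)
    return [float(s["value"][1]) for s in resp.json()["data"]["result"]]

usage = per_node_usage()
print(f"avg disk usage: {sum(usage) / len(usage):.1f}%")
print(f"max disk usage: {max(usage):.1f}%")   # the number that matters at 90%
```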

@swasik commented Nov 26, 2024

The interesting fact is that after migration we have the same number of tablets everywhere, but on the new node the disk utilization is ca. 5% lower. Maybe something has not been cleaned up yet. Could we wait a bit longer to see if the utilization evens out in the end?

@Lakshmipathi

Started a new job with a 1-hour wait just before the test ends. Will check and update whether the 5% lower disk usage still exists.

@Lakshmipathi

@swasik After scale-out, waited for 40 minutes and ensured there is 0% load on all nodes. Final disk usage is 66%, 69%, 71%, and 73%, so on average the newly added node has about 5% less disk usage than the other three nodes.
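
(The ~5% figure in numbers, assuming the 66% node is the newly added one, which is the reading consistent with the stated gap:)

```python
# Gap between the new node and the average of the three original nodes.
new_node = 66                      # assumed to be the newly added node
original_nodes = [69, 71, 73]
gap = sum(original_nodes) / len(original_nodes) - new_node
print(f"{gap:.0f} percentage points")   # -> 5 percentage points
```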

@pehala (Contributor, Author) commented Nov 27, 2024

Could it be due to tablet imbalance?

@swasik commented Nov 27, 2024

> Could it be due to tablet imbalance?

I thought so too, but we have exactly the same number of tablets at each node and probably a linear distribution of the keyspace.

@paszkow (Contributor) commented Nov 28, 2024

@Lakshmipathi Do you have new, updated results? I looked at Grafana for the first experiment and have the following observations:

  • CPU load is around 15% when you add a new node. Moreover, at that time you have a pure write workload. Use read+write and you might then observe problems such as not being able to scale out in time. Tablet migration has significantly lower priority than reads, so it takes more time to stream than in a pure write workload.
  • You stated the read P99 to be 1.8 ms, but at that time you perform only 10 reads/s. That does not seem correct...

@Lakshmipathi

@pehala @swasik Maybe it is because we create keyspaces of different sizes? Could that create the 5% gap for the new node?

@Lakshmipathi

> @Lakshmipathi Do you have new, updated results? I looked at Grafana for the first experiment and have the following observations:
>
>   • CPU load is around 15% when you add a new node. Moreover, at that time you have a pure write workload. Use read+write and you might then observe problems such as not being able to scale out in time. Tablet migration has significantly lower priority than reads, so it takes more time to stream than in a pure write workload.
>   • You stated the read P99 to be 1.8 ms, but at that time you perform only 10 reads/s. That does not seem correct...

@paszkow OK, I'm running the test without throttle now; I will update the results.

pehala changed the title from "Add testcase for scaling-out while having 90% storage utilization" to "Add testcase for scale-out/scale-in while having 90% storage utilization" on Dec 9, 2024
@pehala (Contributor, Author) commented Dec 9, 2024

Updated the description and the title to match the changes made to the test plan.

@Lakshmipathi commented Dec 17, 2024

old run

[Argus](https://argus.scylladb.com/tests/scylla-cluster-tests/0c0af1c9-d798-48d1-bfaa-04767ef08d38)

Configuration:

  • 3 x i4i.2xlarge, 1.875 TB disk in each node
  • Raw data set size: 1.16 TB (RF=3)

Workload during latency measurements:

  • c-s mixed (R/W 50/50)
  • 1 KB row size (8 columns, 128 bytes each)
  • Gauss distribution: writing and reading only part of the dataset.

cassandra-stress mixed no-warmup cl=QUORUM duration=800m -schema keyspace=ks_mixed1 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=300 fixed=16875/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,6500000)'
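
For context, a rough estimate of the hot set implied by that gauss population (a ±3σ approximation, values taken from the command above):

```python
# Approximate working set of the gauss(1..650000000, 325000000, 6500000) population.
keyspace_keys = 650_000_000      # -pop 'dist=gauss(1..650000000,...)'
sigma = 6_500_000                # stddev of the gauss distribution
row_bytes = 8 * 128              # -col 'size=FIXED(128) n=FIXED(8)', ~1 KB/row

hot_keys = 6 * sigma             # ~99.7% of accesses fall within +/- 3 sigma
print(f"hot set ~{hot_keys / 1e6:.0f}M keys "
      f"({hot_keys / keyspace_keys:.1%} of the keyspace), "
      f"~{hot_keys * row_bytes / 1e9:.0f} GB of data before replication")
# -> hot set ~39M keys (6.0% of the keyspace), ~40 GB of data before replication
```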

Disk Usage
Image

Load
Image

Latency
p99 read and write latencies show improvement while throughput remains stable.

Image

Decommission:
Able to perform the first decommission (5 -> 4) but unable to decommission a second time (4 -> 3). A few times it ran out of space. With the recent run, we waited for compaction to stop, hoping to reclaim the truncated data, but it never happened.

@Lakshmipathi

> Decommission:
> Able to perform the first decommission (5 -> 4) but unable to decommission a second time (4 -> 3). A few times it ran out of space. With the recent run, we waited for compaction to stop, hoping to reclaim the truncated data, but it never happened.

It seems compaction won't be triggered when truncating an entire keyspace (since no tombstones are created), so the SCT script's step of waiting for compaction to stop is never needed.
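
(For reference, a sketch of what such a wait step typically looks like; `node.run_nodetool` is a hypothetical helper here. After truncating a whole keyspace the pending-task count simply stays at zero, since space is reclaimed by removing sstables rather than by compacting tombstones.)

```python
# Sketch of a "wait until compaction stops" step. `node.run_nodetool` is a
# hypothetical helper that runs nodetool on the node and returns its output.
import re
import time

def wait_for_compactions_to_finish(node, timeout=1800, poll_interval=30):
    deadline = time.time() + timeout
    while time.time() < deadline:
        output = node.run_nodetool("compactionstats")
        match = re.search(r"pending tasks:\s*(\d+)", output)
        if match and int(match.group(1)) == 0:
            return True
        time.sleep(poll_interval)
    return False
```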

@Lakshmipathi commented Dec 18, 2024

Results

[Argus](https://argus.scylladb.com/tests/scylla-cluster-tests/f6516657-0726-4362-9495-afd6177bb0fd)

Configuration:

  • 3 x i4i.2xlarge, 1.875 TB disk in each node
  • Raw data set size: 1.16 TB (RF=3)

Workload during latency measurements:

  • c-s mixed (R/W 50/50)
  • 1 KB row size (8 columns, 128 bytes each)
  • Gauss distribution: writing and reading only part of the dataset.

cassandra-stress mixed no-warmup cl=QUORUM duration=800m -schema keyspace=ks_mixed1 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=300 fixed=16875/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,6500000)'

Disk Usage
Image

Load
Image

Latency
p99 read and write latencies show improvement while throughput remains stable. Latency numbers for the final node decommission are in red.

Image

Reactor Stalls
Image

@pehala (Contributor, Author) commented Dec 18, 2024

@swasik The final decommission (4 -> 3) shows a significant increase in latencies, which is not acceptable; we need to investigate and see what is wrong.
@paszkow Could you please assist us in investigating why this happens? I thought it was due to overloading Scylla, but the load at the time of decommission seemed to be 60-75%, which I do not think is enough to produce latencies like this.

We also ran this on enterprise:latest; not sure if that is relevant.
