Add testcase for scale-out/scale-in while having 90% storage utilization #9156

Open · Tracked by #9305
pehala opened this issue Nov 7, 2024 · 15 comments
Labels: area/elastic cloud (Issues related to the elastic cloud project), area/tablets, P1 Urgent

@pehala (Contributor) commented Nov 7, 2024

  • Create a 3-node cluster with RF=3.
  • Reach 90% disk usage.
  • Measure latency under stress.
  • Perform scale-out (3 -> 4) under load.
  • Reach 90% disk usage.
  • Perform scale-out (4 -> 5) under load.
  • Measure latency under stress (scaled to 5 nodes).
  • Perform scale-in (5 -> 4) under load.
  • Measure latency under stress (scaled to 4 nodes).
  • Perform scale-in (4 -> 3) under load.
  • Measure latency under stress (scaled to 3 nodes).
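
A minimal sketch of the intended flow in Python, using hypothetical helper stubs (not the actual SCT API) purely to illustrate the ordering of steps:

```python
# Sketch of the test plan above. Every helper here is a hypothetical stub
# standing in for SCT functionality, not the real scylla-cluster-tests API.

def reach_disk_usage(cluster, target_percent):
    """Stub: write data until the fullest node reaches target_percent disk usage."""

def measure_latency(cluster, label):
    """Stub: run the stress load and record p99 read/write latency."""

def add_node(cluster):
    """Stub: scale out by one node while the background load keeps running."""

def decommission_node(cluster):
    """Stub: scale in by one node while the background load keeps running."""

def scale_out_scale_in_at_90_percent(cluster):
    # 3-node cluster, RF=3, filled to 90%
    reach_disk_usage(cluster, 90)
    measure_latency(cluster, "baseline, 3 nodes at 90%")

    add_node(cluster)                        # scale-out 3 -> 4 under load
    reach_disk_usage(cluster, 90)            # refill to 90%
    add_node(cluster)                        # scale-out 4 -> 5 under load
    measure_latency(cluster, "scaled to 5 nodes")

    decommission_node(cluster)               # scale-in 5 -> 4 under load
    measure_latency(cluster, "scaled to 4 nodes")
    decommission_node(cluster)               # scale-in 4 -> 3 under load
    measure_latency(cluster, "scaled to 3 nodes")
```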

Results

pehala added the area/elastic cloud label on Nov 7, 2024
pehala assigned Lakshmipathi and unassigned pehala on Nov 7, 2024
@Lakshmipathi commented Nov 11, 2024

older run

3-node (instance type: i4i.large) cluster scale-out at 90%.

Reached 91% disk usage, then waited for 30 minutes with no reads or writes.

< t:2024-11-03 07:10:58,323 f:full_storage_utilization_test.py l:93   c:FullStorageUtilizationTest p:INFO  > Current max disk usage after writing to keyspace10: 91% (396 GB / 392.40000000000003 GB)
< t:2024-11-03 07:10:59,353 f:full_storage_utilization_test.py l:58   c:FullStorageUtilizationTest p:INFO  > Wait for 1800 seconds

After the 30-minute idle period, started a throttled write:

< t:2024-11-03 07:42:10,941 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > stress_cmd=cassandra-stress write duration=30m -rate threads=10 "throttle=1400/s" -mode cql3 native -pop seq=1..5000000 -col "size=FIXED(10240) n=FIXED(1)" -schema "replication(strategy=NetworkTopologyStrategy,replication_factor=3)"
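
(For scale, a back-of-the-envelope estimate of the ingest rate implied by that command; the throttle and row size are taken from the command itself, RF=3 from its schema settings.)

```python
# Rough ingest rate implied by the throttled cassandra-stress command above.
ops_per_sec = 1400            # -rate "throttle=1400/s"
row_bytes = 10240             # -col "size=FIXED(10240) n=FIXED(1)"
replication_factor = 3        # NetworkTopologyStrategy, replication_factor=3

payload_mb_s = ops_per_sec * row_bytes / 1e6
print(f"~{payload_mb_s:.1f} MB/s raw payload, "
      f"~{payload_mb_s * replication_factor:.1f} MB/s written across replicas")
# -> ~14.3 MB/s raw payload, ~43.0 MB/s written across replicas
```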

Scale-out by adding a new node at 90% disk usage:

< t:2024-11-03 07:44:05,075 f:full_storage_utilization_test.py l:41   c:FullStorageUtilizationTest p:INFO  > Adding a new node

After 30 minutes, the scaled-out (3 -> 4) cluster has disk usage at 75%, 74%, 75%, and 70%.
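
(A quick sanity check of those numbers: the same data spread over four equal nodes should land near 3/4 of the pre-scale-out utilization, before accounting for the throttled writes that kept running.)

```python
# Expected per-node utilization after rebalancing the same data over 4 nodes.
before_percent = 91                     # per-node usage at 3 nodes (from the log)
expected_after = before_percent * 3 / 4
observed_after = [75, 74, 75, 70]       # reported per-node usage at 4 nodes

print(f"expected ~{expected_after:.0f}%, "
      f"observed avg {sum(observed_after) / len(observed_after):.1f}%")
# -> expected ~68%, observed avg 73.5% (the remainder includes the writes
#    that continued during the scale-out)
```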

Tablet migration over time
Image

max/avg disk utilization
Image

Latency
99th percentile write and read latency by Cluster (max at 90% disk utilization):

| operation | p99 latency |
|-----------|-------------|
| write     | 3.07 ms     |
| read      | 1.79 ms     |

https://argus.scylladb.com/tests/scylla-cluster-tests/c5de2f39-770c-4cf3-8d8c-66fef9d91d87

@swasik commented Nov 22, 2024

> After 30 minutes, the scaled-out (3 -> 4) cluster has disk usage at 75%, 74%, 75%, and 70%.

But I see that the chart presents average disk usage. This should change quickly as we add more disk space, even if the new space is not used. Could you also add a picture of the maximal disk usage across all nodes?
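
(For illustration, a minimal sketch of pulling both numbers from the monitoring stack. It assumes a reachable Prometheus endpoint and standard node_exporter filesystem metrics; the address and mountpoint below are placeholders.)

```python
# Sketch: report average vs. maximum disk utilization across the cluster.
# Assumes Prometheus + node_exporter metrics; adjust the address and the
# mountpoint filter to the actual monitoring setup.
import requests

PROMETHEUS = "http://monitor:9090/api/v1/query"   # placeholder address
USED_PERCENT = (
    '100 * (1 - node_filesystem_avail_bytes{mountpoint="/var/lib/scylla"} '
    '/ node_filesystem_size_bytes{mountpoint="/var/lib/scylla"})'
)

def per_node_usage():
    resp = requests.get(PROMETHEUS, params={"query": USED_PERCENT}, timeout=30)
    return [float(s["value"][1]) for s in resp.json()["data"]["result"]]

usage = per_node_usage()
print(f"avg disk usage: {sum(usage) / len(usage):.1f}%")
print(f"max disk usage: {max(usage):.1f}%")   # the number that matters at 90%
```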

@swasik commented Nov 26, 2024

The interesting fact is that after migration we have the same number of tablets everywhere, but on the new node the disk utilization is ca. 5% lower. Maybe something has not been cleaned up yet. Could we wait a bit longer to see if the utilization evens out in the end?

@Lakshmipathi

Started a new job with a 1-hour wait just before the test ends. Will check and update whether the 5% lower disk usage still exists.

@Lakshmipathi

@swasik After scale-out, waited for 40 minutes and ensured there is 0% load on all nodes. Final disk usage is 66%, 69%, 71%, and 73%, so on average the newly added node has about 5% less disk usage than the other three nodes.
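
(The ~5% figure in numbers, assuming the 66% node is the newly added one, which is the reading consistent with the stated gap:)

```python
# Gap between the new node and the average of the three original nodes.
new_node = 66                      # assumed to be the newly added node
original_nodes = [69, 71, 73]
gap = sum(original_nodes) / len(original_nodes) - new_node
print(f"{gap:.0f} percentage points")   # -> 5 percentage points
```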

@pehala (Contributor, Author) commented Nov 27, 2024

Could it be due to tablet imbalance?

@swasik commented Nov 27, 2024

> Could it be due to tablet imbalance?

I thought so too, but we have exactly the same number of tablets at each node and probably a linear distribution of the keyspace.

@paszkow (Contributor) commented Nov 28, 2024

@Lakshmipathi Do you have new, updated results? I looked at Grafana for the first experiment and have the following observations:

  • CPU load is around 15% when you add a new node. Moreover, at that time you have a pure write workload. Use read+write and you might then observe problems such as not being able to scale out in time. Tablet migration has significantly lower priority than reads, so it takes more time to stream than in a pure write workload.
  • You stated the read P99 to be 1.8 ms, but at that time you perform only 10 reads/s. That does not seem correct...

@Lakshmipathi

@pehala @swasik Maybe it is because we create keyspaces of different sizes? Could that create the 5% gap for the new node?

@Lakshmipathi

> @Lakshmipathi Do you have new, updated results? I looked at Grafana for the first experiment and have the following observations:
>
>   • CPU load is around 15% when you add a new node. Moreover, at that time you have a pure write workload. Use read+write and you might then observe problems such as not being able to scale out in time. Tablet migration has significantly lower priority than reads, so it takes more time to stream than in a pure write workload.
>   • You stated the read P99 to be 1.8 ms, but at that time you perform only 10 reads/s. That does not seem correct...

@paszkow OK, I'm running the test without throttle now; I will update the results.

pehala changed the title from "Add testcase for scaling-out while having 90% storage utilization" to "Add testcase for scale-out/scale-in while having 90% storage utilization" on Dec 9, 2024
@pehala (Contributor, Author) commented Dec 9, 2024

Updated the description and the title to match the changes made to the test plan.

@Lakshmipathi commented Dec 17, 2024

old run

[Argus](https://argus.scylladb.com/tests/scylla-cluster-tests/0c0af1c9-d798-48d1-bfaa-04767ef08d38)

Configuration:

  • 3 x i4i.2xlarge, 1.875 TB disk in each node
  • Raw data set size: 1.16 TB (RF=3)

Workload during latency measurements:

  • c-s mixed (R/W 50/50)
  • 1 KB row size (8 columns, 128 bytes each)
  • Gauss distribution: writing and reading only part of the dataset.

cassandra-stress mixed no-warmup cl=QUORUM duration=800m -schema keyspace=ks_mixed1 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=300 fixed=16875/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,6500000)'
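
For context, a rough estimate of the hot set implied by that gauss population (a ±3σ approximation, values taken from the command above):

```python
# Approximate working set of the gauss(1..650000000, 325000000, 6500000) population.
keyspace_keys = 650_000_000      # -pop 'dist=gauss(1..650000000,...)'
sigma = 6_500_000                # stddev of the gauss distribution
row_bytes = 8 * 128              # -col 'size=FIXED(128) n=FIXED(8)', ~1 KB/row

hot_keys = 6 * sigma             # ~99.7% of accesses fall within +/- 3 sigma
print(f"hot set ~{hot_keys / 1e6:.0f}M keys "
      f"({hot_keys / keyspace_keys:.1%} of the keyspace), "
      f"~{hot_keys * row_bytes / 1e9:.0f} GB of data before replication")
# -> hot set ~39M keys (6.0% of the keyspace), ~40 GB of data before replication
```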

Disk Usage
Image

Load
Image

Latency
p99 read and write latencies show improvement while throughput remains stable.

Image

Decommission:
Able to perform the first decommission (5 -> 4) but unable to decommission a second time (4 -> 3). A few times it ran out of space. With the recent run, we waited for compaction to stop, hoping to reclaim the truncated data, but it never happened.

@Lakshmipathi

> Decommission:
> Able to perform the first decommission (5 -> 4) but unable to decommission a second time (4 -> 3). A few times it ran out of space. With the recent run, we waited for compaction to stop, hoping to reclaim the truncated data, but it never happened.

It seems compaction won't be triggered when truncating an entire keyspace (since no tombstones are created), so the SCT script's step of waiting for compaction to stop is never needed.
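
(For reference, a sketch of what such a wait step typically looks like; `node.run_nodetool` is a hypothetical helper here. After truncating a whole keyspace the pending-task count simply stays at zero, since space is reclaimed by removing sstables rather than by compacting tombstones.)

```python
# Sketch of a "wait until compaction stops" step. `node.run_nodetool` is a
# hypothetical helper that runs nodetool on the node and returns its output.
import re
import time

def wait_for_compactions_to_finish(node, timeout=1800, poll_interval=30):
    deadline = time.time() + timeout
    while time.time() < deadline:
        output = node.run_nodetool("compactionstats")
        match = re.search(r"pending tasks:\s*(\d+)", output)
        if match and int(match.group(1)) == 0:
            return True
        time.sleep(poll_interval)
    return False
```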

@Lakshmipathi commented Dec 18, 2024

Results

[Argus](https://argus.scylladb.com/tests/scylla-cluster-tests/f6516657-0726-4362-9495-afd6177bb0fd)

Configuration:

  • 3 x i4i.2xlarge, 1.875 TB disk in each node
  • Raw data set size: 1.16 TB (RF=3)

Workload during latency measurements:

  • c-s mixed (R/W 50/50)
  • 1 KB row size (8 columns, 128 bytes each)
  • Gauss distribution: writing and reading only part of the dataset.

cassandra-stress mixed no-warmup cl=QUORUM duration=800m -schema keyspace=ks_mixed1 'replication(strategy=NetworkTopologyStrategy,replication_factor=3)' -mode cql3 native -rate 'threads=300 fixed=16875/s' -col 'size=FIXED(128) n=FIXED(8)' -pop 'dist=gauss(1..650000000,325000000,6500000)'

Disk Usage
Image

Load
Image

Latency
p99 read and write latencies show improvement while throughput remains stable. Latency numbers for the final node decommission are in red.

Image

Reactor Stalls
Image

@pehala (Contributor, Author) commented Dec 18, 2024

@swasik The final decommission (4 -> 3) shows a significant increase in latencies, which is not acceptable; we need to investigate and see what is wrong.
@paszkow Could you please assist us in investigating why this happens? I thought it was due to overloading Scylla, but the load at the time of decommission seemed to be 60-75%, which I do not think is enough to produce latencies like this.

We also ran this on enterprise:latest; not sure if that is relevant.
