Add multi-dc testcase for 90% storage utilization #9157

Open
Tracked by #9305
pehala opened this issue Nov 7, 2024 · 23 comments
Labels: area/elastic cloud (Issues related to the elastic cloud project), area/tablets, P3 (Medium Priority)

Comments

@pehala commented Nov 7, 2024

  • Create a 3-node cluster with rf=3.
  • Reach 90% disk usage.
  • Scale out the cluster by adding a new DC to the existing cluster under load.
  • Bump up the RF to 3 for both DCs (see the CQL sketch below).
  • Measure latency under stress.
  • Add nodes in parallel to both DCs.
  • Measure latency under stress (adjusted for 4 nodes).
  • Scale in the DC under load.
  • Measure latency under stress (adjusted for 3 nodes).
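
A minimal CQL sketch of the RF-bump step, assuming placeholder keyspace and DC names (ks, dc1, dc2). With tablets-enabled keyspaces, only one DC's RF can be changed at a time, and not by more than 1 (see the error later in this thread), so the new DC reaches RF=3 in three steps:

-- Starting point: ks is replicated only in dc1 with RF=3.
ALTER KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 1};
ALTER KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2};
ALTER KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};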
@pehala pehala added the area/elastic cloud Issues related to the elastic cloud project label Nov 7, 2024
@pehala pehala assigned Lakshmipathi and unassigned pehala Nov 7, 2024
@pehala pehala added the P3 Medium Priority label Nov 21, 2024
@cezarmoise

Last test run: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/cezar/job/byo-longevity-test/69/consoleFull

04:34:09  < t:2024-11-21 02:34:06,572 f:full_storage_utilization_test_2.py l:131  c:FullStorageUtilizationTest2 p:INFO  > Node     Total GB     Used GB      Avail GB     Used %  
04:34:29  < t:2024-11-21 02:34:28,756 f:full_storage_utilization_test_2.py l:143  c:FullStorageUtilizationTest2 p:INFO  > 1        436          396          40           91.0%
04:34:51  < t:2024-11-21 02:34:50,943 f:full_storage_utilization_test_2.py l:143  c:FullStorageUtilizationTest2 p:INFO  > 2        436          393          44           90.0%
04:35:13  < t:2024-11-21 02:35:13,134 f:full_storage_utilization_test_2.py l:143  c:FullStorageUtilizationTest2 p:INFO  > 3        436          395          42           91.0%
04:35:36  < t:2024-11-21 02:35:35,329 f:full_storage_utilization_test_2.py l:143  c:FullStorageUtilizationTest2 p:INFO  > 4        436          403          34           93.0%
04:35:36  < t:2024-11-21 02:35:35,851 f:full_storage_utilization_test_2.py l:143  c:FullStorageUtilizationTest2 p:INFO  > 5        436          37           400          9.0%
04:35:58  < t:2024-11-21 02:35:58,183 f:full_storage_utilization_test_2.py l:143  c:FullStorageUtilizationTest2 p:INFO  > 6        436          37           400          9.0%
04:36:21  < t:2024-11-21 02:36:20,536 f:full_storage_utilization_test_2.py l:143  c:FullStorageUtilizationTest2 p:INFO  > 7        436          37           400          9.0%
04:36:21  < t:2024-11-21 02:36:20,536 f:full_storage_utilization_test_2.py l:153  c:FullStorageUtilizationTest2 p:INFO  > Cluster  3052         1698         1360         56.0%

Did not redistribute data to the new DC.

Trying fix 1b0e85b

@cezarmoise

https://argus.scylladb.com/tests/scylla-cluster-tests/e70ab70a-063f-463e-a289-39a90805e597

cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Only one DC's RF can be changed at a time and not by more than 1"

@cezarmoise

https://github.com/cezarmoise/scylla-cluster-tests/tree/new-dc

Trying to alter all keyspaces before adding the DC so they use per-DC replication; that way, the change after adding the DC is only a single change.

https://argus.scylladb.com/tests/scylla-cluster-tests/4cb74447-6750-4bba-83ef-4ccad8cf6a89
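
A minimal sketch of that pre-conversion, using a placeholder keyspace ks and placeholder DC name dc1 (the test iterates over all keyspaces):

-- Before the new DC joins: convert the simple replication_factor into an explicit per-DC map.
ALTER KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
-- After the new DC joins, bringing it in is then a single one-step change:
ALTER KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 1};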

@cezarmoise commented Nov 22, 2024

Failed due to a timeout on a large keyspace; updating the test to replicate only the small keyspaces.

2024-11-22 13:41:14.531: (TestFrameworkEvent Severity.ERROR) period_type=one-time event_id=9263a51a-5198-4ce0-a22c-dfaf3d53fec5, source=FullStorageUtilizationTest2.test_scale_out (full_storage_utilization_test_2.FullStorageUtilizationTest2)() message=Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/full_storage_utilization_test_2.py", line 245, in test_scale_out
self.scale_out()
File "/home/ubuntu/scylla-cluster-tests/full_storage_utilization_test_2.py", line 57, in scale_out
self.add_new_node()
File "/home/ubuntu/scylla-cluster-tests/full_storage_utilization_test_2.py", line 63, in add_new_node
self.reconfigure_keyspaces()
File "/home/ubuntu/scylla-cluster-tests/full_storage_utilization_test_2.py", line 97, in reconfigure_keyspaces
self.execute_cql(cql)
File "/home/ubuntu/scylla-cluster-tests/full_storage_utilization_test_2.py", line 36, in execute_cql
results = session.execute(query)
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1318, in execute_verbose
return execute_orig(*args, **kwargs)
File "cassandra/cluster.py", line 2729, in cassandra.cluster.Session.execute
File "cassandra/cluster.py", line 5120, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'10.4.3.201:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.4.3.201:9042

@cezarmoise

Still timeout issues: https://argus.scylladb.com/tests/scylla-cluster-tests/5117a642-3a7a-4c9a-ba43-d1898756f556

Setting the timeout on queries to 5 minutes and trying again.
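
A minimal sketch of that change, assuming the Python cassandra-driver session used by the test (the timeout argument is the one the OperationTimedOut message points at; the helper name mirrors the traceback but is otherwise illustrative):

# Illustrative version of the test's execute_cql() with an explicit
# per-request timeout instead of the driver default.
QUERY_TIMEOUT = 300  # seconds (5 minutes)

def execute_cql(session, query):
    # cassandra.cluster.Session.execute() accepts a per-call timeout;
    # the OperationTimedOut error suggests raising it here.
    return session.execute(query, timeout=QUERY_TIMEOUT)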

@cezarmoise

[image]

https://argus.scylladb.com/tests/scylla-cluster-tests/6b7ea346-0ca8-42c8-a955-7f7f4f3d1922

Only added the small keyspaces to the new DC, as I got timeouts when trying to alter the large ones.
The big sleeps are removed here to run the test faster.

Will update with a new run.

@cezarmoise

[image]

https://argus.scylladb.com/tests/scylla-cluster-tests/5ab30315-6993-4b6e-8cca-b0b430076eca

Initial Cluster: 4 x i4i.large
Write to 70%; Sleep 30 min
Write to 90%; Sleep 30 min
Writes have RF=3
Add 3 nodes in new DC: 3 x i4i.large
Update all keyspaces with replication dc1: 3, dc2: 1
Sleep 30 minutes

NOTE:
No read/write throttling during scale-out; got some errors:
https://argus.scylladb.com/tests/scylla-cluster-tests/394cc2b3-4719-4034-b93e-38e904335565

@pehala commented Nov 26, 2024

Add 3 nodes in new DC: 3 x i4i.large

Do we know how long it took to provision the new DC?

Update all keyspaces with replication dc1: 3, dc2: 1

Why the different replication? Does the space occupied in DC2 correspond to 70% storage utilization, or should it be higher/lower?

Sleep 30 minutes

Did we verify the new DC works as expected, i.e. with reads or writes?

@cezarmoise

Add 3 nodes in new DC: 3 x i4i.large

Do we know how long it took to provision the new DC?

01:10:18  < t:2024-11-25 23:10:17,822 f:full_storage_utilization_test_2.py l:55   c:FullStorageUtilizationTest2 p:INFO  > Started scale out
01:20:53  < t:2024-11-25 23:20:53,544 f:full_storage_utilization_test_2.py l:68   c:FullStorageUtilizationTest2 p:INFO  > New node(s) added, total nodes in cluster: 7
02:29:00  < t:2024-11-26 00:28:59,828 f:full_storage_utilization_test_2.py l:59   c:FullStorageUtilizationTest2 p:INFO  > Scale out finished with time: 4722.006384372711

10 minutes to add the nodes,
1 hour to update the keyspaces.

Update all keyspaces with replication dc1: 3, dc2: 1

Why the different replication? Does the space occupied in DC2 correspond to 70% storage utilization, or should it be higher/lower?

The RF for the new DC needs to be increased by 1 at a time, so it would take 3x the time.
Will start a new run adding 4 nodes in the new DC with RF=3.

Sleep 30 minutes

Did we verify the new DC works as expected, i.e. with reads or writes?

Currently I get stress errors

2024-11-25 19:48:05.803: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=75fcf4f7-7d00-4677-8600-56c7d38c9c78 duration=30m5s: node=Node storage-utilization-master-loader-node-394cc2b3-1 [3.253.63.236 | 10.4.0.118] (dc name: eu-west-1)
stress_cmd=cassandra-stress read duration=30m -rate threads=16 "throttle=1400/s" -mode cql3 native -pop seq=1..5000000 -col "size=FIXED(10240) n=FIXED(1)" -schema "replication(strategy=NetworkTopologyStrategy,replication_factor=3)"
errors:
Stress command completed with bad status 1: Failed to connect over JMX; not collecting these stats
java.lang.RuntimeException: Failed to execute stress action
2024-11-25 19:48:03.845: (CassandraStressLogEvent Severity.ERROR) period_type=one-time event_id=75fcf4f7-7d00-4677-8600-56c7d38c9c78: type=IOException regex=java\.io\.IOException line_number=1912 node=Node storage-utilization-master-loader-node-394cc2b3-1 [3.253.63.236 | 10.4.0.118] (dc name: eu-west-1)
java.io.IOException: Operation x0 on key(s) [4b3132355032384c4b30]: Data returned was not validated

I think the stress command needs to be updated because of the new DC.

But in add_new_dc.py the commands are

"cassandra-stress read cl=LOCAL_QUORUM duration=20m -mode cql3 native -rate threads=8 -pop seq=1..20900 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5",
"cassandra-stress write cl=LOCAL_QUORUM duration=20m -mode cql3 native -rate threads=8 -pop seq=1..20900 -col 'n=FIXED(10) size=FIXED(512)' -log interval=5"

without any mention of replication, and I don't know exactly what the difference is.
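
For comparison, a hedged sketch of what a DC-aware version of the failing read command might look like, assuming cassandra-stress passes extra replication() options through to NetworkTopologyStrategy as per-DC factors (unverified) and using placeholder DC names dc1/dc2; cl=LOCAL_QUORUM keeps traffic local, as in the add_new_dc.py commands above:

cassandra-stress read cl=LOCAL_QUORUM duration=30m -rate threads=16 "throttle=1400/s" -mode cql3 native -pop seq=1..5000000 -col "size=FIXED(10240) n=FIXED(1)" -schema "replication(strategy=NetworkTopologyStrategy,dc1=3,dc2=3)"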

@pehala commented Nov 26, 2024

@Lakshmipathi any idea what's wrong?

@Lakshmipathi

@pehala I'm not quite sure why it stopped working with the new DC. Searching existing issues, I came across this one: scylladb/cassandra-stress#16

@Lakshmipathi commented Nov 27, 2024

@cezarmoise, can you share the Jenkins link for this error? I got a similar one in my simple scale-out run (https://jenkins.scylladb.com/job/scylla-staging/job/LakshmipathiGanapathi/job/byo-longevity-test/268/console)

2024-11-25 19:48:05.803: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=75fcf4f7-7d00-4677-8600-56c7d38c9c78 duration=30m5s: node=Node storage-utilization-master-loader-node-394cc2b3-1 [3.253.63.236 | 10.4.0.118] (dc name: eu-west-1)
stress_cmd=cassandra-stress read duration=30m -rate threads=16 "throttle=1400/s" -mode cql3 native -pop seq=1..5000000 -col "size=FIXED(10240) n=FIXED(1)" -schema "replication(strategy=NetworkTopologyStrategy,replication_factor=3)"
errors:
Stress command completed with bad status 1: Failed to connect over JMX; not collecting these stats
java.lang.RuntimeException: Failed to execute stress action

@cezarmoise commented Nov 27, 2024

@cezarmoise, can you share the Jenkins link for this error? I got a similar one in my simple scale-out run (https://jenkins.scylladb.com/job/scylla-staging/job/LakshmipathiGanapathi/job/byo-longevity-test/268/console)

2024-11-25 19:48:05.803: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=75fcf4f7-7d00-4677-8600-56c7d38c9c78 duration=30m5s: node=Node storage-utilization-master-loader-node-394cc2b3-1 [3.253.63.236 | 10.4.0.118] (dc name: eu-west-1)
stress_cmd=cassandra-stress read duration=30m -rate threads=16 "throttle=1400/s" -mode cql3 native -pop seq=1..5000000 -col "size=FIXED(10240) n=FIXED(1)" -schema "replication(strategy=NetworkTopologyStrategy,replication_factor=3)"
errors:
Stress command completed with bad status 1: Failed to connect over JMX; not collecting these stats
java.lang.RuntimeException: Failed to execute stress action

https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo-longevity-test/97/
https://argus.scylladb.com/tests/scylla-cluster-tests/394cc2b3-4719-4034-b93e-38e904335565

The Jenkins link probably won't be around for very long; I run a lot of builds.
#97.txt

@cezarmoise commented Nov 27, 2024

@swasik @pehala

[image]

Initial Cluster: 4 x i4i.large
Write to 70%; Sleep 30 min
Write to 90%; Sleep 30 min
Writes have RF=3
Add 4 nodes in new DC: 4 x i4i.large
Update all keyspaces with replication dc1: 3, dc2: 3

The order of operations here is:
ks1 -> dc1: 3, dc2: 1
ks1 -> dc1: 3, dc2: 2
ks1 -> dc1: 3, dc2: 3
ks2 -> dc1: 3, dc2: 1
...

At this point I get an out-of-space error:
https://argus.scylladb.com/tests/scylla-cluster-tests/4b03ebc0-3cb3-42c9-b541-02cdfe736651
https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo-longevity-test/100/

22:59:45  < t:2024-11-26 20:59:35,791 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > 2024-11-26 20:59:35.780 <2024-11-26 20:59:35.700>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=3c9fae6f-f0f1-47ca-90cb-1a3b14d7eb55: type=NO_SPACE_ERROR regex=No space left on device line_number=4972 node=storage-utilization-master-db-node-4b03ebc0-7
22:59:45 
 < t:2024-11-26 20:59:35,791 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:ERROR > 2024-11-26T20:59:35.700+00:00 storage-utilization-master-db-node-4b03ebc0-7      !ERR | scylla[5396]:  [shard 0:strm] storage_service - Shutting down communications due to I/O errors until operator intervention: Disk error: std::system_error (error system:28, No space left on device)

This happened after

ALTER KEYSPACE keyspace_large3 WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-westscylla_node_west': 3, 'eu-west-2scylla_node_west': 3}

When inserting data into the original DC, it was only at 60% capacity after keyspace_large3.

After that there are a lot of errors like this:

2024-11-26 21:32:39.988 <2024-11-26 21:32:39.943>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=1002f7e2-fa1c-4027-9151-e6f32428e106: type=RUNTIME_ERROR regex=std::runtime_error line_number=114732 node=storage-utilization-master-db-node-4b03ebc0-1
2024-11-26T21:32:39.943+00:00 storage-utilization-master-db-node-4b03ebc0-1      !ERR | scylla[5512]:  [shard 0: gms] raft_topology - topology change coordinator fiber got error std::runtime_error (raft topology: exec_global_command(barrier) failed with seastar::rpc::closed_error (connection is closed))
2024-11-26 21:32:39.985 <2024-11-26 21:32:39.943>: (DatabaseLogEvent Severity.ERROR) period_type=one-time event_id=1002f7e2-fa1c-4027-9151-e6f32428e106: type=RUNTIME_ERROR regex=std::runtime_error line_number=114724 node=storage-utilization-master-db-node-4b03ebc0-1
2024-11-26T21:32:39.943+00:00 storage-utilization-master-db-node-4b03ebc0-1      !ERR | scylla[5512]:  [shard 0: gms] raft_topology - drain rpc failed, proceed to fence old writes: std::runtime_error (raft topology: exec_global_command(barrier_and_drain) failed with seastar::rpc::closed_error (connection is closed))

@swasik commented Nov 27, 2024

Initial Cluster: 4 x i4i.large
Write to 70%; Sleep 30 min
Write to 90%; Sleep 30 min
Writes have RF=3
Add 3 nodes in new DC: 3 x i4i.large

Isn't it expected to get out of space if we have 4 nodes at 0.9 utilization and want to make the same number of replicas using just 3 nodes?

@cezarmoise

Initial Cluster: 4 x i4i.large
Write to 70%; Sleep 30 min
Write to 90%; Sleep 30 min
Writes have RF=3
Add 3 nodes in new DC: 3 x i4i.large

Isn't it expected to get out of space if we have 4 nodes at 0.9 utilization and want to make the same number of replicas using just 3 nodes?

My mistake, it should say 4 new nodes. In the graph you can see there are 4 new lines.

@cezarmoise

I will run this again, but with wait_for_tablets_balanced calls in between.
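
A minimal sketch of that plan, assuming the wait_for_tablets_balanced helper mentioned here and illustrative names for everything else (small_keyspaces, execute_cql, node, dc1/dc2):

# Illustrative only: raise the new DC's RF one step at a time and let
# tablet load balancing settle between ALTERs.
for keyspace in small_keyspaces:            # keyspaces being replicated to dc2
    for rf in (1, 2, 3):                    # RF can only move by 1 per ALTER
        execute_cql(
            session,
            f"ALTER KEYSPACE {keyspace} WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': {rf}}}"
        )
        wait_for_tablets_balanced(node)     # helper named in this comment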

@swasik commented Nov 27, 2024

I will run this again, but with wait_for_tablets_balanced calls in between.

Between which operations? I am not sure this is the right approach; the customer is not expected to wait for balancing to finish before scaling DCs.

@bhalevy could you take a look at why we get an out-of-space error here? Or recommend who we should ask?

@cezarmoise

Managed to reproduce the failures. This time, I waited for tablets to balance after altering each keyspace.

[image]

https://argus.scylladb.com/tests/scylla-cluster-tests/a318f810-3ae5-4912-9605-21434e3be97f

[image]

https://argus.scylladb.com/tests/scylla-cluster-tests/6c7cff0a-5fab-482c-8fd7-21499fe35a0e

@swasik commented Dec 2, 2024

@cezarmoise could you create a separate issue describing the bug?

@pehala pehala changed the title Add testcase for adding additional DC while having 90% storage utilization Add multi-dc testcase for 90% storage utilization Dec 9, 2024
@pehala commented Dec 9, 2024

Updated the description & name to match the changes to the test plan.

@cezarmoise

Opened scylladb/scylladb#21848 for the out-of-space issue.

@cezarmoise

[image]

To get it to work I had to make the instances in the new DC much larger than the ones in the old DC, but the test did not wait long enough: after compaction the used space is the same, but not all tables had time to compact.
