15 out of 22 stress commands in the longevity-large-partition-200k-pks-4days-gce-test CI job don't end in 30h if we set 24h stress_duration #7147

Closed
vponomaryov opened this issue Jan 25, 2024 · 1 comment · Fixed by #7411

vponomaryov commented Jan 25, 2024

Issue description

The longevity-large-partition-200k-pks-4days-gce-test CI job has a large set of stress commands:

> stress_cmd:
> - scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=300
>   -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -partition-offset=1001
>   -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s
> - scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=300
>   -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -partition-offset=1301
>   -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s
> - scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=400
>   -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -partition-offset=1601
>   -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s
> - scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=250
>   -clustering-row-count=100000 -partition-offset=251 -clustering-row-size=uniform:3072..5120
>   -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -duration=6420m
> - scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=250
>   -clustering-row-count=100000 -partition-offset=501 -clustering-row-size=uniform:3072..5120
>   -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -duration=6420m
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -clustering-row-size=uniform:3072..5120 -concurrency=100
>   -connection-count=100 -rows-per-request=10 -consistency-level=quorum -timeout=90s
>   -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -partition-offset=15 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -partition-offset=31 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -partition-offset=46 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -partition-offset=61 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -partition-offset=76 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -partition-offset=91 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -partition-offset=106 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -partition-offset=121 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -partition-offset=136 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=15
>   -clustering-row-count=100000 -partition-offset=151 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=42
>   -clustering-row-count=100000 -partition-offset=166 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=43
>   -clustering-row-count=100000 -partition-offset=208 -clustering-row-size=uniform:3072..5120
>   -concurrency=100 -connection-count=100 -consistency-level=quorum -rows-per-request=10
>   -timeout=90s -iterations 10 -validate-data

> stress_read_cmd:
> - scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=1000
>   -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -partition-offset=1001
>   -concurrency=10 -connection-count=10 -consistency-level=quorum -rows-per-request=10
>   -iterations=20 -timeout=90s
> - scylla-bench -workload=sequential -mode=write -replication-factor=3 -partition-count=250
>   -clustering-row-count=100000 -clustering-row-size=uniform:3072..5120 -concurrency=10
>   -connection-count=10 -rows-per-request=10 -consistency-level=quorum -iterations=26
>   -timeout=300s -validate-data
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=1000
>   -clustering-row-count=100000 -clustering-row-size=uniform:3072..5120 -rows-per-request=10
>   -consistency-level=quorum -timeout=90s -concurrency=100 -connection-count=100 -iterations=0
>   -duration=6420m
> - scylla-bench -workload=sequential -mode=read -replication-factor=3 -partition-count=1000
>   -clustering-row-count=200000 -clustering-row-size=uniform:10..10240 -rows-per-request=10
>   -consistency-level=quorum -timeout=90s -partition-offset=1001 -concurrency=100 -connection-count=100
>   -iterations=0 -duration=6420m

We then set stress_duration to 1440m (1 day).
As a result, the test run times out because it exceeds the test time limit:

06:02:02  Test duration: 1750
06:02:02  [Pipeline] echo
06:02:02  Test run timeout: 1810
06:02:02  [Pipeline] echo
06:02:02  Collect logs timeout: 90
06:02:02  [Pipeline] echo
06:02:02  Resource cleanup timeout: 30
06:02:02  [Pipeline] echo
06:02:02  Runner timeout: 1935
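
All of these values are in minutes. As a quick sanity check, the small sketch below converts them to hours to show how the 24h stress duration relates to the roughly 30h run timeout. The helper name `fmt_minutes` is made up for illustration and is not part of SCT.

```python
def fmt_minutes(total_minutes: int) -> str:
    """Render a minute count as 'XhYYm'."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h{minutes:02d}m"


# Values copied from the job log and the stress commands above (all in minutes).
values = {
    "stress_duration": 1440,
    "test duration": 1750,
    "test run timeout": 1810,
    "runner timeout": 1935,
    "original -duration": 6420,
}

for name, minutes in values.items():
    print(f"{name}: {minutes}m = {fmt_minutes(minutes)}")
```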

The timeout is caused by 2 write and 13 read stress commands that do not finish:
stress_cmd:

  • The first 3 writes have no duration but finish on their own in 2-3 hours.
  • The next 2 writes have a duration parameter, which gets overwritten with the correct value.
  • The remaining 13 reads have no duration parameter, so nothing is redefined and they don't end in time.

stress_read_cmd:

  • The first 2 writes have no duration, so they are not updated. They don't end because they are configured to run 20 and 26 iterations of writing a really huge dataset. Designed to take ~4 days?
  • The last 2 commands (both reads) had a duration; it was updated and they finished correctly.
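
The grouping in the two lists above can be reproduced by looking only at the flags of each command. The following is a minimal sketch, not SCT code: `classify` is a hypothetical helper, and the sample commands are shortened versions of the ones in the config above. It assumes stress_duration is set and no explicit duration argument is passed.

```python
import re


def classify(stress_cmd: str) -> str:
    """Say how a command's run time is bounded, judging only by its flags."""
    if "-duration=" in stress_cmd:
        return "duration"        # rewritten from stress_duration
    if re.search(r"-iterations[=\s]+\d+", stress_cmd):
        return "iterations"      # left as is, may outlive the test
    return "neither"             # left as is


# Shortened examples of the command shapes found in the config.
samples = [
    "scylla-bench -mode=write -partition-count=300 -timeout=90s",
    "scylla-bench -mode=write -partition-count=250 -timeout=90s -duration=6420m",
    "scylla-bench -mode=read -partition-count=15 -timeout=90s -iterations 10 -validate-data",
    "scylla-bench -mode=write -partition-count=1000 -iterations=20 -timeout=90s",
    "scylla-bench -mode=read -partition-count=1000 -iterations=0 -duration=6420m",
]

for cmd in samples:
    print(f"{classify(cmd):<10} {cmd}")
```

Only the commands reported as `duration` are touched by the override; everything else keeps whatever bound its own flags imply.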

Here is the code that handles the duration update:

    def run_stress_thread_bench(self, stress_cmd, duration=None, round_robin=False, stats_aggregate_cmds=True,
                                stop_test_on_failure=True, **_):

        if duration:
            timeout = self.get_duration(duration)
        elif self._stress_duration and '-duration=' in stress_cmd:
            timeout = self.get_duration(self._stress_duration)
            stress_cmd = re.sub(r'\s-duration[=\s]+\d+[mhd]+\s*', f' -duration={self._stress_duration}m ', stress_cmd)
        else:
            timeout = get_timeout_from_stress_cmd(stress_cmd) or self.get_duration(duration)
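
To see what the substitution in the `elif` branch does in isolation, here is a small standalone sketch. The regex and replacement string are copied from the code above; the wrapper `rewrite_duration`, the module-level `stress_duration` and the shortened sample commands are assumptions made for illustration.

```python
import re

stress_duration = 1440  # minutes, the value used in this run


def rewrite_duration(stress_cmd: str) -> str:
    """Apply the same substitution as the `elif` branch above."""
    return re.sub(r'\s-duration[=\s]+\d+[mhd]+\s*',
                  f' -duration={stress_duration}m ', stress_cmd)


with_duration = "scylla-bench -mode=write -timeout=90s -duration=6420m"
without_duration = "scylla-bench -mode=read -timeout=90s -iterations 10 -validate-data"

# The existing -duration value is replaced (a trailing space is left behind).
print(repr(rewrite_duration(with_duration)))
# There is nothing to match, so the command comes back unchanged.
print(repr(rewrite_duration(without_duration)))
```

Note that the guard checks for the literal `-duration=`, so a command written as `-duration 6420m` would also skip this branch, even though the regex itself accepts that form.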

So we should either fix the duration-overwriting logic so that it defines the duration when it is absent, or update all the affected test config files to add the duration parameter to each stress command that may run too long.
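
A rough sketch of the first option, under stated assumptions: `ensure_duration` is a hypothetical helper, not existing SCT code and not necessarily what the eventual fix does. It reuses the regex from the snippet above, replaces an existing duration, and appends one when the command has none.

```python
import re

DURATION_RE = re.compile(r'\s-duration[=\s]+\d+[mhd]+\s*')


def ensure_duration(stress_cmd: str, stress_duration: int) -> str:
    """Return stress_cmd with -duration forced to stress_duration minutes.

    Hypothetical helper: replaces an existing -duration flag, or appends
    one when the command does not define it.
    """
    new_flag = f" -duration={stress_duration}m "
    if DURATION_RE.search(stress_cmd):
        return DURATION_RE.sub(new_flag, stress_cmd).strip()
    return f"{stress_cmd.strip()}{new_flag}".strip()


print(ensure_duration("scylla-bench -mode=write -duration=6420m -timeout=90s", 1440))
print(ensure_duration("scylla-bench -mode=read -iterations 10 -validate-data", 1440))
```

Whether scylla-bench actually stops at the duration when a non-zero `-iterations` is also given is not verified here; the existing config only pairs `-duration` with `-iterations=0`, so the iterations flag may need adjusting as well.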

Impact

SCT stress commands don't finish before the test timeout.

How frequently does it reproduce?

100%

Installation details

Kernel Version: 5.15.0-1048-gcp
Scylla version (or git commit hash): 5.5.0~dev-20240119.b1ba904c4977 with build-id 7a5829efb1f6ef7b467d2dc837300abcc0b739c8

Cluster size: 5 nodes (n2-highmem-16)

Scylla Nodes used in this run:

  • longevity-large-partitions-200k-pks-db-node-ea244d4e-0-5 (35.231.237.63 | 10.142.0.178) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-ea244d4e-0-4 (35.185.19.187 | 10.142.0.176) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-ea244d4e-0-3 (35.237.75.150 | 10.142.0.174) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-ea244d4e-0-2 (34.75.220.25 | 10.142.0.82) (shards: 14)
  • longevity-large-partitions-200k-pks-db-node-ea244d4e-0-1 (34.74.198.214 | 10.142.0.75) (shards: 14)

OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/scylla-5-5-0-dev-x86-64-2024-01-20t02-19-13 (gce: undefined_region)

Test: longevity-large-partition-200k-pks-4days-gce-test
Test id: ea244d4e-60ba-40a2-8cf3-80b280fc98ba
Test name: scylla-master/longevity/longevity-large-partition-200k-pks-4days-gce-test
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor ea244d4e-60ba-40a2-8cf3-80b280fc98ba
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs ea244d4e-60ba-40a2-8cf3-80b280fc98ba

Logs:

Jenkins job URL
Argus

roydahan commented May 6, 2024

I'll refactor this test.
The problem is with the "Generic test duration" and commands that have "iterations" instead of "duration".
