fix(scan_operations): add retry policy to cql query #9600

aleksbykov · 2024-12-22T10:21:41Z

The node on which a scan operation was started could be
targeted by a disruptive nemesis. If that node was restarted or
stopped while the scan query was running, the scan operation was
terminated, and the resulting error event and message marked
the test as failed.

Add ExponentialBackoffRetryPolicy to the CQL session, which
allows the query to be retried if the node goes down; once
the node is back, the query finishes successfully.

Fixes: #9284
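
For readers unfamiliar with the idea, here is a minimal, driver-independent sketch of retrying with exponential backoff. It is not the driver's ExponentialBackoffRetryPolicy and not the code in this PR: the function names, the parameter names (max_num_retries, min_interval, max_interval) and the use of ConnectionError as the "node is down" signal are all assumptions made for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_num_retries: int, min_interval: float, max_interval: float) -> List[float]:
    """Delay before each retry: doubles from min_interval, capped at max_interval."""
    return [min(min_interval * 2 ** attempt, max_interval) for attempt in range(max_num_retries)]


def run_with_retries(
    query: Callable[[], T],
    max_num_retries: int = 5,
    min_interval: float = 0.1,
    max_interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `query`, retrying on ConnectionError (hypothetical stand-in for a node outage)."""
    last_error = None
    # The first attempt has no delay; each retry waits for the next backoff interval.
    for delay in [0.0] + backoff_delays(max_num_retries, min_interval, max_interval):
        if delay:
            sleep(delay)
        try:
            return query()
        except ConnectionError as exc:
            last_error = exc
    # All attempts failed: surface the last error to the caller.
    raise last_error
```

With this shape, a scan that hits a restarting node keeps waiting longer between attempts instead of failing on the first error, and only raises once the retry budget is exhausted.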

Testing

Test 1 - Passed
Test 2 - Failed due to a c-s error: java.io.IOException: Operation x10 on key(s) [32374e4d3230504b3030]: Data returned was not validated. Errors related to scan operations were not reproduced.
Test 3 - Failed: the job failed due to the issue "scylla server hangs on shutdown - on mapreduce (eventually getting killed after 15 minutes with no progress)" (scylladb#21568). Errors related to scan operations were not reproduced.

PR pre-checks (self review)

I added the relevant backport labels
I didn't leave commented-out/debugging code

Reminders

Add new configuration options and document them (in sdcm/sct_config.py)
Add unit tests to cover my changes (under unit-test/ folder)
Update the Readme/doc folder relevant to this change (if needed)


soyacz · 2024-12-23T07:25:31Z

sdcm/scan_operation_thread.py

@@ -460,6 +474,9 @@ def execute_query(self, session, cmd: str,
                                  | FullPartitionScanReversedOrderEvent]) -> None:
        self.log.debug('Will run command %s', cmd)
        validate_mapreduce_service_requests_start_time = time.time()
+        session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
+        session.default_timeout = self._session_execution_timeout


Is there a difference between self._session_execution_timeout and self._request_default_timeout? Maybe it could be reused?

fruch · 2024-12-23T07:28:24Z

Is this a replacement for @temichus's trials in #9370?

fruch · 2024-12-23T07:42:51Z

sdcm/scan_operation_thread.py

@@ -120,6 +125,8 @@ def execute_query(
                        | FullPartitionScanReversedOrderEvent]) -> ResultSet:
        # pylint: disable=unused-argument
        self.log.debug('Will run command %s', cmd)
+        session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)


This is a bit weird: it sits next to the code executing the query, not the code creating the session.

I would recommend consolidating the session creating code into something like:

    # contextmanager (from contextlib) rather than @property:
    # a property cannot take **kwargs or be used in a with-statement
    @contextmanager
    def cql_connection(self, **kwargs):
        with self.fullscan_params.db_cluster.cql_connection_patient(
                node=self.db_node,
                user=self.fullscan_params.user,
                password=self.fullscan_params.user_password,
                **kwargs) as session:
            session.cluster.default_retry_policy = ExponentialBackoffRetryPolicy(**self._exp_backoff_retry_policy_params)
            session.default_timeout = self._request_default_timeout
            yield session

There are way too many repetitions of applying this retry policy; it should be applied across the board, for all of the sessions.
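
To make the consolidation idea concrete, here is a small self-contained sketch of the pattern: one context manager that opens a session and applies the retry policy and timeout in a single place, so every caller gets the same configuration. FakeCluster, FakeSession, open_fake_session and ScanOperation are hypothetical stand-ins invented for this example; they are not SCT or driver classes, and the real session would come from cql_connection_patient.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class FakeCluster:
    """Hypothetical stand-in for the driver's cluster object."""
    default_retry_policy: Any = None


@dataclass
class FakeSession:
    """Hypothetical stand-in for a CQL session."""
    cluster: FakeCluster = field(default_factory=FakeCluster)
    default_timeout: float = 10.0
    closed: bool = False


@contextmanager
def open_fake_session() -> Iterator[FakeSession]:
    """Stands in for the real connection helper; closes the session on exit."""
    session = FakeSession()
    try:
        yield session
    finally:
        session.closed = True


class ScanOperation:
    def __init__(self, retry_policy: Any, request_default_timeout: float) -> None:
        self._retry_policy = retry_policy
        self._request_default_timeout = request_default_timeout

    @contextmanager
    def cql_connection(self) -> Iterator[FakeSession]:
        # Single place where every session gets the retry policy and timeout.
        with open_fake_session() as session:
            session.cluster.default_retry_policy = self._retry_policy
            session.default_timeout = self._request_default_timeout
            yield session
```

Callers would then write `with op.cql_connection() as session:` and run the query, without repeating the policy setup next to each `execute_query`.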

temichus · 2024-12-23T08:37:14Z

Is this a replacement for @temichus's trials in #9370?

yes

aleksbykov added 2 commits December 22, 2024 17:11

revert(yaml): add parameters

b27d808

aleksbykov requested review from temichus and soyacz December 22, 2024 10:21

github-actions bot assigned aleksbykov Dec 22, 2024

aleksbykov requested a review from fruch December 22, 2024 10:21

aleksbykov added the backport/6.2, backport/2024.2 and backport/6.1 labels Dec 22, 2024

aleksbykov marked this pull request as ready for review December 23, 2024 02:54

