Error: table - Memtable flush failed due to: std::filesystem::__cxx11::filesystem_error and coredump during second bootstrap after first was aborted #6865
Comments
2023-09-02 07:24:53.837 <2023-09-02 07:20:09.000>: (CoreDumpEvent Severity.ERROR) period_type=one-time event_id=eb780324-1c7f-4b28-9aeb-4afbb9873102 during_nemesis=BootstrapStreamingError node=Node longevity-parallel-topology-schema--db-node-7f42d685-10 [34.240.61.162 | 10.4.11.209] (seed: False)
download_instructions=gsutil cp gs://upload.scylladb.com/core.scylla.112.5ac4fbd29e794a1098f9309538996d99.6536.1693639209000000/core.scylla.112.5ac4fbd29e794a1098f9309538996d99.6536.1693639209000000.gz .
@bhalevy looks like a problem with the sstable creation/deletion protocol
The following is non-recoverable by design:
@aleksbykov Maybe some other process is clearing the node while scylla is (re)starting?
I will check
@bhalevy, I didn't find anything that could clean the node. In parallel, the following CQL was running on another node:
This happened on 2 nodes at different times:
This is weird... The directory seems ok after restart, e.g. on node-10:
and on node-16:
@xemul please look into this.
Issue reproduced with:
Installation details
Kernel Version: 5.15.0-1043-aws
Cluster size: 4 nodes (i3en.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
Happened again (you can find the full coredump details in Argus).
Installation details
Kernel Version: 5.15.0-1050-aws
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image:
Test:
Logs and commands
Logs:
@xemul did you ever look at this one?
@fruch it looks like the bug is in SCT: sct-ec71f603.log:
This is in:

if not new_node.db_up():
    try:
        if self.target_node.raft.get_diff_group0_token_ring_members():
            self.target_node.raft.clean_group0_garbage(raise_exception=True)
        self.log.debug("Clean old scylla data and restart scylla service")
        new_node.clean_scylla_data()
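To make the suspected failure mode concrete, here is a minimal self-contained sketch, in plain Python rather than SCT or scylla code, of an external cleanup removing a data directory while another thread is still writing into it. All names are invented for illustration; the writer's FileNotFoundError is the Python analogue of the std::filesystem error seen during the memtable flush.

# Sketch only: 'flusher' stands in for scylla flushing sstables, 'cleaner'
# stands in for an external "clean old scylla data" step racing with it.
import shutil
import tempfile
import threading
import time
from pathlib import Path

data_dir = Path(tempfile.mkdtemp(prefix="fake-scylla-data-"))

def flusher():
    # Keep creating small files, as a stand-in for memtable flushes.
    for i in range(100_000):
        try:
            (data_dir / f"sstable-{i}.tmp").write_bytes(b"x" * 1024)
        except FileNotFoundError as exc:
            # The directory vanished underneath the writer.
            print(f"flush failed: {exc}")
            return

def cleaner():
    # Remove the whole data directory while the flusher is still running.
    time.sleep(0.05)
    while data_dir.exists():
        shutil.rmtree(data_dir, ignore_errors=True)

writer = threading.Thread(target=flusher)
remover = threading.Thread(target=cleaner)
writer.start()
remover.start()
writer.join()
remover.join()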
@mykaul can you please move this issue to the scylla-cluster-tests repo?
@aleksbykov @temichus, can you look into this one?
This could have happened because scylla was restarted after being stopped with the scylla-manager client. PR #6804 should fix it: there, scylla is stopped before any subsequent operation.
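I don't know the exact change in PR #6804, so the following is only a sketch of the ordering described above (stop scylla first, then clean, then restart), reusing the method names from the quoted SCT snippet; stop_scylla_server()/start_scylla_server() and the enclosing method are assumed names, not necessarily the real fix.

# Sketch of the intended ordering; not the actual PR #6804 diff.
def recover_aborted_bootstrap(self, new_node):
    if not new_node.db_up():
        # Assumed helper: make sure the scylla service is fully down, so
        # nothing is flushing memtables while files are deleted underneath it.
        new_node.stop_scylla_server()

        if self.target_node.raft.get_diff_group0_token_ring_members():
            self.target_node.raft.clean_group0_garbage(raise_exception=True)

        self.log.debug("Clean old scylla data and restart scylla service")
        new_node.clean_scylla_data()

        # Only now restart the service and let the bootstrap be retried.
        new_node.start_scylla_server()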
Hit again in enterprise-2024.1/SCT_Enterprise_Features/FIPS/longevity-100gb-4h-fips-test#5 |
Please take a look: #6804 is in, but we still have issues.
@fruch, we got a coredump, which was the cause of the SCT error. I will prepare a patch for SCT.
Sometimes the ParallelObject and abort_action wait timeouts could race, and when they fired at the same moment, the scylla service restart done by scylla-manager in the node_setup method could raise issue scylladb#6865. Set correct timeouts for operations and fix minor typos.
PR: #7004
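As a generic illustration of that race, not SCT's ParallelObject API, the sketch below uses plain Python threading: if the abort action and the waiting side use the same timeout, both can expire in the same instant and each restarts the service; giving the abort a strictly shorter timeout (and checking whether it already ran) makes the ordering deterministic. All names here are invented.

# Generic timeout-race illustration; not SCT code.
import threading

OPERATION_TIMEOUT = 2.0   # how long node_setup waits for the bootstrap
ABORT_TIMEOUT = 1.5       # strictly shorter, so the abort fires first

bootstrap_done = threading.Event()
aborted = threading.Event()

def abort_action():
    # Abort the bootstrap if it has not finished in time.
    if not bootstrap_done.wait(ABORT_TIMEOUT):
        aborted.set()
        print("abort_action: restarting scylla service to abort bootstrap")

def node_setup_wait():
    # The waiting side; it must not restart the service a second time
    # if the abort already did.
    if not bootstrap_done.wait(OPERATION_TIMEOUT):
        if aborted.is_set():
            print("node_setup: bootstrap was aborted, skipping extra restart")
        else:
            print("node_setup: timed out, restarting scylla service")

# With ABORT_TIMEOUT == OPERATION_TIMEOUT both waits expire at the same
# moment and the two restarts race; keeping them apart removes the race.
threading.Thread(target=abort_action).start()
node_setup_wait()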
Fixed by #7004
Issue description
node10 was being bootstrapped, but the bootstrap process was aborted by a scylla restart.
After that, scylla was started, and during the next bootstrap another abort happened, along with a coredump.
After that, scylla was restarted again and bootstrapped correctly.
Impact
Describe the impact this issue causes to the user.
How frequently does it reproduce?
Describe the frequency with which this issue can be reproduced.
Installation details
Kernel Version: 5.15.0-1043-aws
Scylla version (or git commit hash): 5.4.0~dev-20230901.3bdbe620aa1e with build-id f44ecb6140c40cc466146304eb075258461d5dd7
Cluster size: 5 nodes (i4i.2xlarge)
Scylla Nodes used in this run:
OS / Image: ami-0dd654338b4cc2a70 (aws: undefined_region)
Test: longevity-schema-topology-changes-12h-test
Test id: 7f42d685-d644-4597-88ee-b71088f4d1c4
Test name: scylla-master/longevity/longevity-schema-topology-changes-12h-test
Test config file(s):
Logs and commands
$ hydra investigate show-monitor 7f42d685-d644-4597-88ee-b71088f4d1c4
$ hydra investigate show-logs 7f42d685-d644-4597-88ee-b71088f4d1c4
Logs:
Jenkins job URL
Argus