roachtest: splits/largerange/size=32GiB,nodes=6 failed #114349

Closed
cockroach-teamcity opened this issue Nov 13, 2023 · 15 comments · Fixed by #117614

cockroach-teamcity commented Nov 13, 2023

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on release-23.2 @ c702128fa39c4aa84e810b898064fd2d66ace765:

(sql_runner.go:255).Query: error executing 'SHOW RANGES FROM TABLE bank.bank': pq: replica unavailable: (n2,s2):2VOTER_DEMOTING_LEARNER unable to serve request to r462:/Table/106/1/9{7953640-8211751} [(n5,s5):5, (n2,s2):2VOTER_DEMOTING_LEARNER, (n6,s6):3, (n3,s3):6VOTER_INCOMING, next=7, gen=399]: closed timestamp: 1699893138.952467078,0 (2023-11-13 16:32:18); raft status: {"id":"2","term":8,"vote":"2","commit":50,"lead":"2","raftState":"StateLeader","applied":50,"progress":{"2":{"match":115,"next":116,"state":"StateReplicate"},"3":{"match":0,"next":1,"state":"StateSnapshot"},"5":{"match":114,"next":116,"state":"StateReplicate"},"6":{"match":0,"next":1,"state":"StateSnapshot"}},"leadtransferee":"0"}: encountered poisoned latch /Local/RangeID/462/r/AbortSpan/"509778ae-c62c-45eb-9006-67a7fe602666"@0,0
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1

Parameters: ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_metamorphicBuild=false, ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage


Jira issue: CRDB-33473

@cockroach-teamcity cockroach-teamcity added the branch-release-23.2, C-test-failure, O-roachtest, O-robot, release-blocker, and T-kv labels Nov 13, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Nov 13, 2023

shralex commented Nov 14, 2023

@kvoli looks like a test infra thing?

@andrewbaptist

I'm not sure -- what does "encountered poisoned latch" mean here?

erikgrinaker commented Nov 15, 2023

The replica circuit breakers tripped, poisoning latches for in-flight writes to avoid latch waiters getting permanently stuck.

Sounds like we lost quorum. Haven't looked at artifacts here at all.
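
For background, a minimal conceptual sketch of the poisoning mechanism -- not CockroachDB's actual latch manager, just the idea that a tripped breaker lets waiters fail fast instead of blocking indefinitely:

package main

import (
	"errors"
	"fmt"
	"time"
)

// latch is a toy write latch: waiters block until it is released or poisoned.
type latch struct {
	released chan struct{} // closed when the holder finishes normally
	poisoned chan struct{} // closed when the circuit breaker trips
}

func newLatch() *latch {
	return &latch{released: make(chan struct{}), poisoned: make(chan struct{})}
}

// wait blocks until the latch is released, poisoned, or the timeout fires.
func (l *latch) wait(timeout time.Duration) error {
	select {
	case <-l.released:
		return nil
	case <-l.poisoned:
		return errors.New("encountered poisoned latch") // waiters fail fast instead of hanging
	case <-time.After(timeout):
		return errors.New("timed out waiting for latch")
	}
}

func main() {
	l := newLatch()
	// Simulate a replica circuit breaker tripping while a write holds the latch.
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(l.poisoned)
	}()
	fmt.Println(l.wait(5 * time.Second)) // returns quickly with a poisoned-latch error
}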

@andrewbaptist

I took another look at the logs, and you are correct that a circuit breaker is tripping, but this is unexpected. We see the following:

from n1:

I231113 16:32:18.650939 332 13@kv/kvserver/store_rebalancer.go:658 ⋮ [T1,Vsystem,n1,s1,store-rebalancer,obj=‹cpu›] 3112  rebalancing r462 (‹(queries-per-second=0.4 cpu-per-second=12ms)› load) to better balance load: voters from (n1,s1):1,(n2,s2):2,(n6,s6):3 to [n2,s2 n5,s5 n6,s6]; non-voters from  to []
I231113 16:32:21.485026 332 13@kv/kvserver/replica_command.go:2352 ⋮ [T1,Vsystem,n1,s1,r462/1:‹/Table/106/1/9{79536…-82117…}›,*kvpb.AdminChangeReplicasRequest] 3255  change replicas (add [] remove []): existing descriptor r462:‹/Table/106/1/9{7953640-8211751}› [(n1,s1):1VOTER_DEMOTING_LEARNER, (n2,s2):2, (n6,s6):3, (n5,s5):5VOTER_INCOMING, next=6, gen=395]

from n2:

I231113 16:32:21.582311 52870 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n2,s2,r462/2:‹/Table/106/1/9{79536…-82117…}›] 1485  proposing SIMPLE(l6) [(n3,s3):6LEARNER]: after=[(n5,s5):5 (n2,s2):2 (n6,s6):3 (n3,s3):6LEARNER] next=7
I231113 16:32:21.952380 53123 13@kv/kvserver/replica_raft.go:384 ⋮ [T1,Vsystem,n2,s2,r462/2:‹/Table/106/1/9{79536…-82117…}›] 1529  proposing ENTER_JOINT(r2 l2 v6) [(n3,s3):6VOTER_INCOMING], [(n2,s2):2VOTER_DEMOTING_LEARNER]: after=[(n5,s5):5 (n2,s2):2VOTER_DEMOTING_LEARNER (n6,s6):3 (n3,s3):6VOTER_INCOMING] next=7
W231113 16:32:36.962659 53083 kv/kvserver/spanlatch/manager.go:577 ⋮ [T1,Vsystem,n2,s2,r462/2:‹/Table/106/1/9{79536…-82117…}›] 758  have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/97953640/RangeDescriptor›@0,0, held by write latch ‹/Local/Range/Table/106/1/97953640/RangeDescriptor›@0,0

I don't think this is overload, because this is the only range with an issue. It appears that n2 got the lease and then "got stuck".

from n3:

I231113 16:32:21.938107 33243 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n3,s3,r462/6:{-}] 327  applying snapshot 2f9b8182 from (n5,s5):5 at applied index 48
I231113 16:32:21.943148 33243 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n3,s3,r462/6:‹/Table/106/1/9{79536…-82117…}›] 216  applied snapshot 2f9b8182 from (n5,s5):5 at applied index 48 (total=5ms data=29 MiB ingestion=6@4ms)

from n5:

I231113 16:32:18.667483 238 kv/kvserver/store_remove_replica.go:290 ⋮ [T1,Vsystem,n5] 536  removing uninitialized replica [n5,s5,r462/4:{-}]
I231113 16:32:21.382503 24758 kv/kvserver/replica_raftstorage.go:579 ⋮ [T1,Vsystem,n5,s5,r462/5:‹/Table/106/1/9{79536…-82117…}›] 582  applied snapshot 9c88efbd from (n1,s1):1 at applied index 33 (total=4ms data=29 MiB ingestion=6@3ms)
I231113 16:32:21.534164 24946 kv/kvserver/replica_command.go:2282 ⋮ [T1,Vsystem,n5,replicate,s5,r462/5:‹/Table/106/1/9{79536…-82117…}›] 583  we were trying to exit a joint config but found that we are no longer in one; skipping
E231113 16:33:21.555797 24972 kv/kvserver/queue.go:1160 ⋮ [T1,Vsystem,n5,replicate,s5,r462/5:‹/Table/106/1/9{79536…-82117…}›] 855  operation ‹"replicate queue process replica 462"› timed out after 1m0.006s (given timeout 1m0s): change replicas of r462 failed: txn exec: context deadline exceeded

from n6:

I231113 16:32:22.346644 31320 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n6,s6,r462/3:{-}] 157  applying snapshot 799a0858 from (n5,s5):5VOTER_INCOMING at applied index 37
W231113 16:32:21.430857 242 kv/kvserver/split_trigger_helper.go:146 ⋮ [T1,Vsystem,n6,s6,r462/3:{-},raft] 379  would have dropped incoming MsgApp to wait for split trigger, but allowing because uninitialized replica was created 45.248846985s (>20s) ago

I can take a bit more of a look later this afternoon unless someone picks this up. But this seems like a deadlock...

@andrewbaptist andrewbaptist self-assigned this Nov 15, 2023

andrewbaptist commented Nov 15, 2023

Jotting notes for the timeline:

Unknown: n6 is not a learner but needs a snapshot for most of the time -- how did it get out of learner status?

16:31:32.965206 n1 - r70 rebalancing ‹voter› n2,s2 to n5,s5 - this is what causes n5 to become a learner since it doesn't get the snapshot when the splits start

16:31:36.039374 n1 - r461 splits to create r462 (size is 4.2GiB) - replicas n6/3  and n5/4 are still waiting for a snapshot

16:32:18.650939 n1 - store rebalancer begins rebalances from 1,2,6 (5 LEARNER) -> 2,5,6 - creating n5/4
16:32:18.652743 n1 - store rebalancer - remove n5/4 => gen 393
16:32:18.657009 n1 - add [n5/5-LEARNER] remove []): existing  [n1/1, n2/2, n6/3] => gen 394

16:32:21.366871 n1 - streamed snapshot at applied index 33 to n5/5
16:32:21.377980 n5 - applying snapshot from n1/1 at applied index 33

16:32:21.398427 n1 - change replicas add n5/5-VOTER_INCOMING remove n1/1-VOTER_DEMOTING_LEARNER 
16:32:21.427 - n1 - lease transfer to n5
16:32:21.536843 n1 - change replicas remove n1/1-LEARNER => n2/2, n6/3, n5/5 

16:32:21.558 n5 - n1 admin transfer lease transfer to n2 (last step of store rebalancing)

16:32:21.551888 n5 - change replicas add n3/6 to [n5/5, n2/2, n6/3]  (fails and retries since not leaseholder)

16:32:21.563416 n6 - would have dropped incoming MsgApp to wait for split trigger, but allowing because uninitialized replica was created 45.381409855s (>20s) ago - BUG?

16:32:21.570487 n5 - change replicas add n3/6 remove n2/2 to [n5/5, n2/2, n6/3] (succeeds)
16:32:21.582311 n2 -  proposing SIMPLE(l6) [(n3,s3):6LEARNER]: after=[n5/5 n2/2 n6/3 n3/6LEARNER 

16:32:21.943148 n3 - applied snapshot 2f9b8182 from (n5,s5):5 at applied index 48

16:32:21.952380 n2 - proposing ENTER_JOINT add n3, remove n2 after=[n5/5 n2/2-VOTER_DEMOTING_LEARNER n6/3 n3/6-VOTER_INCOMING] - HUNG
16:32:21.956000 n5 - change replicas - leave joint => gen 400 -> HUNG


16:32:22.332579 n5 -  streamed snapshot 799a0858 at applied index 37 to n6/3 - after 0.9s
16:32:22.346644 n6 - applying snapshot 799a0858 from (n5,s5):5VOTER_INCOMING at applied index 37

... system hung

16:32:36.962659 n2 - have been waiting 15s to acquire read latch ‹/Local/Range/Table/106/1/97953640/RangeDescriptor›@0,0, held by write latch ‹/Local/Range/Table/106/1/97953640/RangeDescriptor›@0,0

16:33:21.555153 n5 - error processing replica: change replicas of r462 failed: txn exec: context deadline exceeded

Lease history:

lease
n1 - 1699893085784953569 - 1699893145217695207
16:31:25.784 - 16:32:25.217

n5 - 1699893141427306670 - 1699893147427207021
16:32:21.427 - 16:32:27.427

n2 - 1699893141558284520 - 1699893147558234018
16:32:21.558 - 16:32:27.558

Raft Status

n2 -         "raft_applied_index": 50,
n3 -         "raft_applied_index": 48,
n5 -         "raft_applied_index": 50,
n6 -         "raft_applied_index": 37,

I understand why n6 is behind, since it hit the 45s timeout, but it's not clear why n3 is behind; maybe it's because of the structure of the joint configs.

At this point the two halves of the joint config are:
2, 5, 6 and 3, 5, 6
The second one can't make progress since it doesn't have a quorum.
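
To make the stall concrete, here is a tiny sketch (illustrative only, not CockroachDB code) of the joint-consensus rule: a joint configuration can only commit entries with a majority in both the outgoing and the incoming voter set.

package main

import "fmt"

// hasQuorum reports whether a majority of the given voters is live.
func hasQuorum(voters []int, live map[int]bool) bool {
	n := 0
	for _, v := range voters {
		if live[v] {
			n++
		}
	}
	return n > len(voters)/2
}

// jointCanProgress requires a quorum in both halves of the joint config.
func jointCanProgress(outgoing, incoming []int, live map[int]bool) bool {
	return hasQuorum(outgoing, live) && hasQuorum(incoming, live)
}

func main() {
	// Replicas 3 and 6 are stuck waiting for snapshots, so treat them as unable
	// to append entries; only 2 and 5 are actually replicating.
	live := map[int]bool{2: true, 5: true, 3: false, 6: false}
	outgoing := []int{2, 5, 6} // pre-change voters
	incoming := []int{3, 5, 6} // post-change voters
	fmt.Println(jointCanProgress(outgoing, incoming, live)) // false: incoming half has only 1 of 3
}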

@erikgrinaker

@pavelkalinnikov pointed out that this looks similar to #104588. Replicas 3 and 6 (confusingly on nodes n6 and n3 respectively) are both waiting for snapshots, which stalls the second quorum. The leader 2 says:

  "raft_state": {
    "replica_id": 2,
    "hard_state": {
      "term": 8,
      "vote": 2,
      "commit": 50
    },
    "lead": 2,
    "state": "StateLeader",
    "applied": 50,
    "progress": {
      "2": {
        "match": 159,
        "next": 160,
        "state": "StateReplicate"
      },
      "3": {
        "next": 1,
        "state": "StateSnapshot",
        "paused": true,
        "pending_snapshot": 45
      },
      "5": {
        "match": 159,
        "next": 160,
        "state": "StateReplicate"
      },
      "6": {
        "next": 1,
        "state": "StateSnapshot",
        "paused": true,
        "pending_snapshot": 48
      }
    }

@erikgrinaker

Yeah, this looks a lot like etcd-io/raft#84. n6 did apply a snapshot from n5 at applied index 37:

I231113 16:32:22.346644 31320 13@kv/kvserver/replica_raftstorage.go:557 ⋮ [T1,Vsystem,n6,s6,r462/3:{-}] 157  applying snapshot 799a0858 from (n5,s5):5VOTER_INCOMING at applied index 37

This is a perfectly fine snapshot, because the leader's log is truncated at index 10:

      "truncated_state": {
        "index": 10,
        "term": 5
      },

But the leader isn't cool with this, because it insists that this replica (tracked as "3" in its progress -- the replica/node numbering is confusingly swapped) apply a snapshot at index 45 or beyond, and likewise holds replica 6 (on n3) to index 48:

      "3": {
        "next": 1,
        "state": "StateSnapshot",
        "paused": true,
        "pending_snapshot": 45
      },
      "6": {
        "next": 1,
        "state": "StateSnapshot",
        "paused": true,
        "pending_snapshot": 48
      }
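
The gating condition is roughly the following (a sketch modeled on etcd/raft's Progress handling, not the exact upstream code): a follower in StateSnapshot only resumes replication once its acknowledged index reaches PendingSnapshot, regardless of whether a lower-index snapshot already reconnected it to the leader's log.

package main

import "fmt"

// progress is a pared-down stand-in for etcd/raft's tracker.Progress.
type progress struct {
	state           string // "StateReplicate", "StateProbe", or "StateSnapshot"
	match           uint64 // highest index known to be on the follower
	pendingSnapshot uint64 // leader's last index when the snapshot was requested
}

// onSnapshotApplied models the pre-fix rule: the follower only leaves
// StateSnapshot once the reported index reaches pendingSnapshot.
func (pr *progress) onSnapshotApplied(index uint64) {
	if index > pr.match {
		pr.match = index
	}
	if pr.state == "StateSnapshot" && pr.match >= pr.pendingSnapshot {
		pr.state = "StateReplicate"
	}
}

func main() {
	// r462/3 on n6: the leader requested a snapshot when its last index was 45,
	// but the snapshot actually delivered (via n5) was taken at applied index 37.
	pr := &progress{state: "StateSnapshot", pendingSnapshot: 45}
	pr.onSnapshotApplied(37)
	fmt.Println(pr.state) // still "StateSnapshot": 37 < 45, so the leader keeps waiting
}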

@erikgrinaker

While the immediate problem here is #106813 and etcd-io/raft#84, I've been trying to understand why we even allowed entering this joint quorum that wouldn't be able to make progress.

We have some safety checks when executing a configuration change, but these only take into account the liveness status of replicas, not their replication status. As the comment points out, callers generally don't check this at all, which is why we do it so far down the stack. My guess is that the replicate queue and allocator don't do any checks like this elsewhere either -- is that accurate @kvoli?

We could consider strengthening the checks here to consider replicas not in StateReplicate to be unavailable, but that would require access to the leader's Raft state. The replicate queue appears to be the only caller here though, and it only runs on the leaseholder, so maybe that's a reasonable assumption -- at the very least, we could do a best-effort check that's skipped if we don't have leader state.

Does this all seem reasonable @kvoli? Do we have any existing issues for this, or should I write one up?

// Run (and retry for a bit) a sanity check that the configuration
// resulting from this change is able to meet quorum. It's
// important to do this at this low layer as there are multiple
// entry points that are not generally too careful. For example,
// before the below check existed, the store rebalancer could
// carry out operations that would lead to a loss of quorum.
//
// See:
// https://github.com/cockroachdb/cockroach/issues/54444#issuecomment-707706553
replicas := crt.Desc.Replicas()
// We consider stores marked as "suspect" to be alive for the purposes of
// determining whether the range can achieve quorum since these stores are
// known to be currently live but have failed a liveness heartbeat in the
// recent past.
//
// Note that the allocator will avoid rebalancing to stores that are
// currently marked suspect. See uses of StorePool.getStoreList() in
// allocator.go.
liveReplicas, _ := args.liveAndDeadReplicas(replicas.Descriptors(),
	true /* includeSuspectAndDrainingStores */)
if !replicas.CanMakeProgress(
	func(rDesc roachpb.ReplicaDescriptor) bool {
		for _, inner := range liveReplicas {
			if inner.ReplicaID == rDesc.ReplicaID {
				return true
			}
		}
		return false
	}) {
	// NB: we use newQuorumError which is recognized by the replicate queue.
	return newQuorumError("range %s cannot make progress with proposed changes add=%v del=%v "+
		"based on live replicas %v", crt.Desc, crt.Added(), crt.Removed(), liveReplicas)
}
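
A best-effort variant of the strengthened check could look roughly like the sketch below. This is purely illustrative: canMakeProgressConservative, the string state names, and the single-majority quorum rule are simplifications (the real code goes through replicas.CanMakeProgress and handles joint configs); the point is only that a replica outside StateReplicate stops counting toward quorum when leader-side raft state is available, and that we fall back to liveness alone otherwise.

package main

import "fmt"

// canMakeProgressConservative is a hypothetical, simplified stand-in for the
// liveness-based check above. raftProgress maps replica IDs to their state as
// seen by the raft leader; it is nil when we have no leader-side view.
func canMakeProgressConservative(voters []int, live map[int]bool, raftProgress map[int]string) bool {
	healthy := 0
	for _, id := range voters {
		if !live[id] {
			continue // not live: never counts toward quorum
		}
		if raftProgress != nil && raftProgress[id] != "StateReplicate" {
			continue // best-effort: live but not replicating (probing or awaiting a snapshot)
		}
		healthy++
	}
	return healthy > len(voters)/2
}

func main() {
	// The r462 scenario: all four stores are live, but replicas 3 and 6 are in
	// StateSnapshot. A liveness-only check passes; the replication-aware check
	// would reject the change.
	voters := []int{2, 5, 6, 3}
	live := map[int]bool{2: true, 5: true, 6: true, 3: true}
	progress := map[int]string{2: "StateReplicate", 5: "StateReplicate", 6: "StateSnapshot", 3: "StateSnapshot"}
	fmt.Println(canMakeProgressConservative(voters, live, nil))      // true
	fmt.Println(canMakeProgressConservative(voters, live, progress)) // false
}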

@erikgrinaker

I'm going to downgrade this to a GA blocker, since it's a known, pre-existing problem. It seems unlikely that we'll be able to get the upstream Raft changes properly tested and merged in time for 23.2, so I think it's unlikely we'll actually block 23.2 over this.

@erikgrinaker erikgrinaker added the GA-blocker, C-bug, and A-kv-replication labels and removed the release-blocker label Nov 16, 2023

blathers-crl bot commented Nov 16, 2023

cc @cockroachdb/replication


kvoli commented Nov 16, 2023

My guess is that the replicate queue and allocator don't do any checks like this either elsewhere -- is that accurate @kvoli?

They do checks for replica removal:

candidates = allocatorimpl.FilterUnremovableReplicas(ctx, raftStatus, existingVoters, lastReplAdded)
log.KvDistribution.VEventf(ctx, 3, "filtered unremovable replicas from %v to get %v as candidates for removal: %s",
	existingVoters, candidates, rangeRaftProgress(raftStatus, existingVoters))
if len(candidates) > 0 {
	break
}

and for rebalancing:

if targetType == VoterTarget && raftStatus != nil && raftStatus.Progress != nil {
	replicaCandidates = simulateFilterUnremovableReplicas(
		ctx, raftStatus, replicaCandidates, newReplica.ReplicaID)
}

The replica rebalancing check assumes the newly added replica is up to date. This could be changed to not consider the newly added replica -- however, this would pause any rebalancing when a node goes down (temporarily), which doesn't appear desirable for load-bearing systems.
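
To make that trade-off concrete, here is a small sketch (hypothetical helpers; the real logic lives in FilterUnremovableReplicas / simulateFilterUnremovableReplicas) applied to r462: under the optimistic policy the incoming replica 6 is assumed caught up, so demoting replica 2 still appears to leave a quorum of up-to-date voters, even though replicas 3 and 6 were in fact both waiting for snapshots.

package main

import "fmt"

// upToDate treats the newly added replica as caught up (it will get a snapshot),
// and otherwise requires StateReplicate in the leader's raft progress.
func upToDate(id, newReplicaID int, progress map[int]string) bool {
	return id == newReplicaID || progress[id] == "StateReplicate"
}

// removable reports whether removing `candidate` still leaves a majority of
// up-to-date voters among the remaining replicas.
func removable(candidate, newReplicaID int, voters []int, progress map[int]string) bool {
	remaining, caughtUp := 0, 0
	for _, id := range voters {
		if id == candidate {
			continue
		}
		remaining++
		if upToDate(id, newReplicaID, progress) {
			caughtUp++
		}
	}
	return caughtUp > remaining/2
}

func main() {
	voters := []int{5, 2, 6, 3} // r462 after adding replica 6 (on n3)
	progress := map[int]string{5: "StateReplicate", 2: "StateReplicate", 6: "StateSnapshot", 3: "StateSnapshot"}
	// Optimistic policy: assume the incoming replica 6 is (or will be) caught up.
	fmt.Println(removable(2, 6, voters, progress)) // true: {5, 6 (assumed), 3} looks like 2 of 3
	// Pessimistic policy: no special-casing of the incoming replica.
	fmt.Println(removable(2, -1, voters, progress)) // false: only 5 of {5, 6, 3} is replicating
}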

Does this all seem reasonable @kvoli? Do we have any existing issues for this, or should I write one up?

Seems reasonable to me, I filed #114570.

@erikgrinaker

The replica rebalancing check assumes the newly added replica is up to date. This could be changed to not consider the newly added replica -- however this would pause any rebalancing when a node goes down (temporarily), which doesn't appear desirable for load bearing systems.

Yeah, let's not do that. It's reasonable to assume the snapshot will catch it up, which it will given #106813. I'll see if I can get etcd-io/raft#84 ready tomorrow.

erikgrinaker pushed a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
Prior to this commit, the leader would not take into account snapshots
reported by a follower unless they matched or exceeded the tracked
PendingSnapshot index (which is the leader's last index at the time of
requesting the snapshot). This is too inflexible: the leader should take
into account any snapshot that reconnects the follower to its log. This
PR adds a config option ResumeReplicateBelowPendingSnapshot that enables
this behavior.

In doing so, it addresses long-standing problems that we've encountered
in CockroachDB. Unless you create the snapshot immediately and locally
when raft emits an MsgSnap, it's difficult/impossible to later
synthesize a snapshot at the requested index. It is possible to get one
above the requested index which raft always accepted, but CockroachDB
delegates snapshots to followers who might be behind on applying the
log, and it is awkward to have to wait for log application to send the
snapshot just to satisfy an overly strict condition in raft.
Additionally, CockroachDB also sends snapshots preemptively when adding
a new replica since there are qualitative differences between an initial
snapshot and one needed to reconnect to the log and one does not want to
wait for raft to round-trip to the follower to realize that a snapshot
is needed. In this case, the sent snapshot is commonly behind the
PendingSnapshot since the leader transitions the follower into
StateProbe when a snapshot is already in flight.

Touches cockroachdb/cockroach#84242.
Touches cockroachdb/cockroach#87553.
Touches cockroachdb/cockroach#87554.
Touches cockroachdb/cockroach#97971.
Touches cockroachdb/cockroach#114349.
See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143.

Signed-off-by: Erik Grinaker <[email protected]>
Signed-off-by: Tobias Grieger <[email protected]>
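
A sketch of the relaxed rule described above (illustrative only; the actual change landed as etcd-io/raft#110): instead of requiring the reported snapshot index to reach PendingSnapshot, the leader resumes replication for any snapshot that reconnects the follower to its log, i.e. whose index is at or above the leader's first retained log index.

package main

import "fmt"

// resumeAfterSnapshot decides whether the leader can move a follower out of
// StateSnapshot after it reports a snapshot applied at index `match`.
// leaderFirstIndex is the first index still present in the leader's log
// (entries below it have been truncated away). Sketch only, not the upstream code.
func resumeAfterSnapshot(match, pendingSnapshot, leaderFirstIndex uint64, relaxed bool) bool {
	if relaxed {
		// New rule: any snapshot that reconnects the follower to the log suffices.
		return match+1 >= leaderFirstIndex
	}
	// Old rule: the snapshot must reach the index the leader originally asked for.
	return match >= pendingSnapshot
}

func main() {
	// r462/3 on n6: snapshot applied at 37, PendingSnapshot 45, leader's log
	// truncated at index 10 (so its first retained index is 11).
	fmt.Println(resumeAfterSnapshot(37, 45, 11, false)) // false: stuck under the old rule
	fmt.Println(resumeAfterSnapshot(37, 45, 11, true))  // true: replication can resume
}
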
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023

erikgrinaker commented Nov 17, 2023

Submitted upstream fix in etcd-io/raft#110, CRDB fix in #114640.

erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Nov 17, 2023
@erikgrinaker

Removing GA-blocker, since this is a pre-existing problem.


kvoli commented Dec 1, 2023

Assigning P2, as this is a pre-existing problem but somewhat specific to this test and we would expect the system to recover shortly in most settings.

@kvoli kvoli added the P-2 label Dec 1, 2023
@erikgrinaker erikgrinaker self-assigned this Jan 3, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 4, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 4, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 5, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 9, 2024
erikgrinaker added a commit to erikgrinaker/raft that referenced this issue Jan 9, 2024
@craig craig bot closed this as completed in 60092e0 Jan 10, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024