Data Race: concurrent access to recBatch.canFailFromLoadErrs during retry errors #785
Comments
Heh, I was actually looking at this for a different issue and thought there might be a race here... How did you fuzz this / force the issue? I'd like to add a test myself to ensure there's no regression. I can also try myself, but if you have some base test code to go on, that'd be helpful.
Oh wow! Super fast response. We are testing Bufstream with Antithesis, using franz-go in our test workload. They fuzz a bunch of behaviors, including introducing network partitions that ultimately end up surfacing as errors on the Kafka protocol. Unfortunately, this means I can't really provide you with a repeatable test you could run. Looking through the logs, the only error that surfaced prior to the race was in a fetch (which appeared to have been wedged as well):
I'll have this one fixed tomorrow, it's a fairly quick patch -- the mutex is there, I'm not sure why I didn't use the mutex around this access; I ran into roughly this exact same race a long time ago and introduced this mutex.
I have an open fix, but I'm trying to solve #777 as well in this release. I'm hoping I can figure out why the PR is failing CI today.
Some fuzz testing revealed a data race when interacting with an unhealthy Kafka cluster:
The `recBatch.canFailFromLoadErrs` boolean ends up being concurrently accessed via `recBatch.bumpRepeatedLoadErr` and `produceRequest.AppendTo`.
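For readers unfamiliar with this class of bug, here is a minimal sketch of the pattern discussed in this thread: a boolean touched from two goroutines, with every access guarded by the same mutex. The `batch` type and its methods are hypothetical stand-ins, not franz-go's actual `recBatch` code, and what the real `AppendTo` does with the flag is not shown in this thread, so the sketch simply reads it.

```go
package main

import (
	"fmt"
	"sync"
)

// batch is a hypothetical stand-in for recBatch; it is NOT franz-go's type.
type batch struct {
	mu                  sync.Mutex
	canFailFromLoadErrs bool // shared flag; every access must hold mu
	loadErrs            int
}

// bumpLoadErr plays the role of the retry path: it writes the flag.
func (b *batch) bumpLoadErr() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.loadErrs++
	b.canFailFromLoadErrs = true
}

// canFail plays the role of the serialization path (assumed here to read
// the flag). Dropping the lock in this method reintroduces the data race.
func (b *batch) canFail() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.canFailFromLoadErrs
}

// run starts n writers and n readers concurrently and waits for all of them.
func run(b *batch, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(2)
		go func() { defer wg.Done(); b.bumpLoadErr() }()
		go func() { defer wg.Done(); _ = b.canFail() }()
	}
	wg.Wait()
}

func main() {
	b := &batch{}
	run(b, 100)
	fmt.Println(b.loadErrs, b.canFail())
}
```

Running this with `go run -race` should report nothing. Removing the lock from either method should make the race detector flag the unsynchronized access, which may be a starting point for a regression test.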