Data Race: concurrent access to recBatch.canFailFromLoadErrs during retry errors #785

Closed
rodaine opened this issue Jul 19, 2024 · 4 comments · Fixed by #786
Labels: bug (Something isn't working), has pr

rodaine commented Jul 19, 2024

Some fuzz testing revealed a data race when interacting with an unhealthy Kafka cluster:

==================
WARNING: DATA RACE
Read at 0x00c000211a00 by goroutine 11:
  github.com/twmb/franz-go/pkg/kgo.(*recBuf).bumpRepeatedLoadErr()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:1298 +0x1e4
  github.com/twmb/franz-go/pkg/kgo.(*Client).bumpMetadataFailForTopics()
      github.com/twmb/[email protected]/pkg/kgo/topics_and_partitions.go:435 +0x671
  github.com/twmb/franz-go/pkg/kgo.(*Client).updateMetadata()
      github.com/twmb/[email protected]/pkg/kgo/metadata.go:336 +0x488
  github.com/twmb/franz-go/pkg/kgo.(*Client).updateMetadataLoop()
      github.com/twmb/[email protected]/pkg/kgo/metadata.go:227 +0x91b
  github.com/twmb/franz-go/pkg/kgo.NewClient.gowrap1()
      github.com/twmb/[email protected]/pkg/kgo/client.go:518 +0x33

Previous write at 0x00c000211a00 by goroutine 1481:
  github.com/twmb/franz-go/pkg/kgo.(*produceRequest).AppendTo()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:1970 +0x13c4
  github.com/twmb/franz-go/pkg/kmsg.(*RequestFormatter).AppendRequest()
      github.com/twmb/franz-go/pkg/[email protected]/api.go:241 +0x72d
  github.com/twmb/franz-go/pkg/kgo.(*brokerCxn).writeRequest()
      github.com/twmb/[email protected]/pkg/kgo/broker.go:1022 +0x5e4
  github.com/twmb/franz-go/pkg/kgo.(*broker).handleReq()
      github.com/twmb/[email protected]/pkg/kgo/broker.go:441 +0x11d3
  github.com/twmb/franz-go/pkg/kgo.(*broker).handleReqs()
      github.com/twmb/[email protected]/pkg/kgo/broker.go:286 +0xfb
  github.com/twmb/franz-go/pkg/kgo.(*broker).do.gowrap1()
      github.com/twmb/[email protected]/pkg/kgo/broker.go:260 +0xab

Goroutine 11 (running) created at:
  github.com/twmb/franz-go/pkg/kgo.NewClient()
      github.com/twmb/[email protected]/pkg/kgo/client.go:518 +0x15aa
  github.com/<redacted>/workload/simple.(*Workload).Run()
      workload/simple/simple.go:41 +0xc14
  main.main.func1()
      main.go:38 +0xab
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x91

Goroutine 1481 (finished) created at:
  github.com/twmb/franz-go/pkg/kgo.(*broker).do()
      github.com/twmb/[email protected]/pkg/kgo/broker.go:260 +0x2e4
  github.com/twmb/franz-go/pkg/kgo.(*sink).doSequenced()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:393 +0x35d
  github.com/twmb/franz-go/pkg/kgo.(*sink).produce()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:368 +0x104e
  github.com/twmb/franz-go/pkg/kgo.(*sink).drain()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:246 +0x126
  github.com/twmb/franz-go/pkg/kgo.(*sink).maybeDrain.gowrap1()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:188 +0x33
==================

The recBatch.canFailFromLoadErrs boolean ends up being concurrently accessed via recBuf.bumpRepeatedLoadErr and produceRequest.AppendTo.
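
For illustration, a minimal, self-contained sketch of the racy pattern (hypothetical names, not the franz-go code itself): the metadata/retry path reads a batch flag while the produce-request write path clears it, with no lock held around either access, so the race detector reports it.

```go
// Minimal sketch of the race shape, with hypothetical names (not the actual
// franz-go types). Run with: go run -race main.go
package main

import "sync"

type batch struct {
	canFailFromLoadErrs bool // shared flag, accessed without synchronization below
}

func main() {
	b := &batch{canFailFromLoadErrs: true}
	var wg sync.WaitGroup
	wg.Add(2)

	// Stands in for the metadata loop calling bumpRepeatedLoadErr: reads the flag.
	go func() {
		defer wg.Done()
		for i := 0; i < 1_000; i++ {
			_ = b.canFailFromLoadErrs // unsynchronized read
		}
	}()

	// Stands in for produceRequest.AppendTo on the broker write path: writes the flag.
	go func() {
		defer wg.Done()
		for i := 0; i < 1_000; i++ {
			b.canFailFromLoadErrs = false // unsynchronized write
		}
	}()

	wg.Wait()
}
```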

twmb commented Jul 19, 2024

Heh, I was actually looking at this for a different issue and thought there might be a race here...

How did you fuzz this / force the issue? I'd like to add a test myself to ensure there's no regression. I can also try it myself, but if you have some base test code to go on, that'd be helpful.

twmb added the bug label Jul 19, 2024
rodaine commented Jul 19, 2024

Oh wow! Super fast response. We are testing Bufstream with Antithesis, using franz-go in our test workload. They fuzz a bunch of behaviors, including introducing network partitions that ultimately end up surfacing as errors on the Kafka protocol. Unfortunately, this means I can't really provide you with a repeatable test you could run. Looking through the logs, the only error that surfaced prior to the race was in a fetch (which appeared to have been wedged as well):

unable to join group session: the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists
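
For anyone trying to approximate this outside Antithesis, one plausible (but unverified) local setup is to produce toward an unreachable seed broker so the metadata loop keeps retrying while the sinks keep attempting produce writes, and run it under the race detector. Whether this actually interleaves the way the trace above did is timing-dependent; the broker address, topic, and loop count here are made up.

```go
// Hedged repro sketch: assumes nothing is listening on localhost:9099 and that
// fault-injected retries are what matter; this is not the Antithesis workload.
// Run with: go run -race main.go
package main

import (
	"context"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9099"), // intentionally unreachable
		kgo.DefaultProduceTopic("race-test"),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Keep records buffered so the metadata retry loop and the produce path
	// both stay busy while the cluster remains unreachable.
	for i := 0; i < 1_000; i++ {
		cl.Produce(ctx, &kgo.Record{Value: []byte("x")}, func(*kgo.Record, error) {})
	}
	<-ctx.Done()
}
```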

twmb commented Jul 20, 2024

I'll have this one fixed tomorrow; it's a fairly quick patch. The mutex is already there -- I'm not sure why I didn't use it around this access. I ran into roughly this exact same race a long time ago, which is why the mutex was introduced in the first place.
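
A hedged sketch of the general shape being described (illustrative names, not the actual patch in #786): touch the flag only while holding the mutex that already guards the batch's owning buffer, so the retry path and the produce-write path serialize their accesses.

```go
// Sketch only: field and method names are illustrative, not the real franz-go
// internals; the actual change is in #786.
package sketch

import "sync"

type recBuf struct {
	mu sync.Mutex
	// ...
}

type recBatch struct {
	owner               *recBuf
	canFailFromLoadErrs bool
}

// Retry path: read the flag under the owner's lock.
func (b *recBatch) loadErrCanFail() bool {
	b.owner.mu.Lock()
	defer b.owner.mu.Unlock()
	return b.canFailFromLoadErrs
}

// Produce-write path: clear the flag under the same lock.
func (b *recBatch) clearLoadErrFail() {
	b.owner.mu.Lock()
	b.canFailFromLoadErrs = false
	b.owner.mu.Unlock()
}
```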

twmb commented Jul 22, 2024

I have an open fix, but I'm trying to solve #777 as well in this release. I'm hoping I can figure out why the PR is failing CI today.
