Data Race: concurrent access to recBatch.canFailFromLoadErrs during retry errors #785

Closed
rodaine opened this issue Jul 19, 2024 · 4 comments · Fixed by #786
Labels: bug (Something isn't working), has pr

rodaine commented Jul 19, 2024

Some fuzz testing revealed a data race when interacting with an unhealthy Kafka cluster:

==================
WARNING: DATA RACE
Read at 0x00c000211a00 by goroutine 11:
  github.com/twmb/franz-go/pkg/kgo.(*recBuf).bumpRepeatedLoadErr()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:1298 +0x1e4
  github.com/twmb/franz-go/pkg/kgo.(*Client).bumpMetadataFailForTopics()
      github.com/twmb/[email protected]/pkg/kgo/topics_and_partitions.go:435 +0x671
  github.com/twmb/franz-go/pkg/kgo.(*Client).updateMetadata()
      github.com/twmb/[email protected]/pkg/kgo/metadata.go:336 +0x488
  github.com/twmb/franz-go/pkg/kgo.(*Client).updateMetadataLoop()
      github.com/twmb/[email protected]/pkg/kgo/metadata.go:227 +0x91b
  github.com/twmb/franz-go/pkg/kgo.NewClient.gowrap1()
      github.com/twmb/[email protected]/pkg/kgo/client.go:518 +0x33

Previous write at 0x00c000211a00 by goroutine 1481:
  github.com/twmb/franz-go/pkg/kgo.(*produceRequest).AppendTo()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:1970 +0x13c4
  github.com/twmb/franz-go/pkg/kmsg.(*RequestFormatter).AppendRequest()
      github.com/twmb/franz-go/pkg/[email protected]/api.go:241 +0x72d
  github.com/twmb/franz-go/pkg/kgo.(*brokerCxn).writeRequest()
      github.com/twmb/[email protected]/pkg/kgo/broker.go:1022 +0x5e4
  github.com/twmb/franz-go/pkg/kgo.(*broker).handleReq()
      github.com/twmb/[email protected]/pkg/kgo/broker.go:441 +0x11d3
  github.com/twmb/franz-go/pkg/kgo.(*broker).handleReqs()
      github.com/twmb/[email protected]/pkg/kgo/broker.go:286 +0xfb
  github.com/twmb/franz-go/pkg/kgo.(*broker).do.gowrap1()
      github.com/twmb/[email protected]/pkg/kgo/broker.go:260 +0xab

Goroutine 11 (running) created at:
  github.com/twmb/franz-go/pkg/kgo.NewClient()
      github.com/twmb/[email protected]/pkg/kgo/client.go:518 +0x15aa
  github.com/<redacted>/workload/simple.(*Workload).Run()
      workload/simple/simple.go:41 +0xc14
  main.main.func1()
      main.go:38 +0xab
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x91

Goroutine 1481 (finished) created at:
  github.com/twmb/franz-go/pkg/kgo.(*broker).do()
      github.com/twmb/[email protected]/pkg/kgo/broker.go:260 +0x2e4
  github.com/twmb/franz-go/pkg/kgo.(*sink).doSequenced()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:393 +0x35d
  github.com/twmb/franz-go/pkg/kgo.(*sink).produce()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:368 +0x104e
  github.com/twmb/franz-go/pkg/kgo.(*sink).drain()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:246 +0x126
  github.com/twmb/franz-go/pkg/kgo.(*sink).maybeDrain.gowrap1()
      github.com/twmb/[email protected]/pkg/kgo/sink.go:188 +0x33
==================

The recBatch.canFailFromLoadErrs boolean ends up being concurrently accessed via recBuf.bumpRepeatedLoadErr and produceRequest.AppendTo.
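
For illustration, a minimal, self-contained sketch of the racy pattern (hypothetical names, not the franz-go code itself): the metadata/retry path reads a batch flag while the produce-request write path clears it, with no lock held around either access, so the race detector reports it.

```go
// Minimal sketch of the race shape, with hypothetical names (not the actual
// franz-go types). Run with: go run -race main.go
package main

import "sync"

type batch struct {
	canFailFromLoadErrs bool // shared flag, accessed without synchronization below
}

func main() {
	b := &batch{canFailFromLoadErrs: true}
	var wg sync.WaitGroup
	wg.Add(2)

	// Stands in for the metadata loop calling bumpRepeatedLoadErr: reads the flag.
	go func() {
		defer wg.Done()
		for i := 0; i < 1_000; i++ {
			_ = b.canFailFromLoadErrs // unsynchronized read
		}
	}()

	// Stands in for produceRequest.AppendTo on the broker write path: writes the flag.
	go func() {
		defer wg.Done()
		for i := 0; i < 1_000; i++ {
			b.canFailFromLoadErrs = false // unsynchronized write
		}
	}()

	wg.Wait()
}
```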

twmb commented Jul 19, 2024

Heh, I was actually looking at this for a different issue and thought there might be a race here...

How did you fuzz this / force the issue? I'd like to add a test myself to ensure there's no regression. I can also try it myself, but if you have some base test code to go on, that'd be helpful.

twmb added the bug label Jul 19, 2024
rodaine commented Jul 19, 2024

Oh wow! Super fast response. We are testing Bufstream with Antithesis, using franz-go in our test workload. They fuzz a bunch of behaviors, including introducing network partitions that ultimately end up surfacing as errors on the Kafka protocol. Unfortunately, this means I can't really provide you with a repeatable test you could run. Looking through the logs, the only error that surfaced prior to the race was in a fetch (which appeared to have been wedged as well):

unable to join group session: the internal broker struct chosen to issue this request has died--either the broker id is migrating or no longer exists
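
For anyone trying to approximate this outside Antithesis, one plausible (but unverified) local setup is to produce toward an unreachable seed broker so the metadata loop keeps retrying while the sinks keep attempting produce writes, and run it under the race detector. Whether this actually interleaves the way the trace above did is timing-dependent; the broker address, topic, and loop count here are made up.

```go
// Hedged repro sketch: assumes nothing is listening on localhost:9099 and that
// fault-injected retries are what matter; this is not the Antithesis workload.
// Run with: go run -race main.go
package main

import (
	"context"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	cl, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9099"), // intentionally unreachable
		kgo.DefaultProduceTopic("race-test"),
	)
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Keep records buffered so the metadata retry loop and the produce path
	// both stay busy while the cluster remains unreachable.
	for i := 0; i < 1_000; i++ {
		cl.Produce(ctx, &kgo.Record{Value: []byte("x")}, func(*kgo.Record, error) {})
	}
	<-ctx.Done()
}
```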

twmb commented Jul 20, 2024

I'll have this one fixed tomorrow; it's a fairly quick patch. The mutex is already there -- I'm not sure why I didn't use it around this access. I ran into roughly this exact same race a long time ago, which is why the mutex was introduced in the first place.
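
A hedged sketch of the general shape being described (illustrative names, not the actual patch in #786): touch the flag only while holding the mutex that already guards the batch's owning buffer, so the retry path and the produce-write path serialize their accesses.

```go
// Sketch only: field and method names are illustrative, not the real franz-go
// internals; the actual change is in #786.
package sketch

import "sync"

type recBuf struct {
	mu sync.Mutex
	// ...
}

type recBatch struct {
	owner               *recBuf
	canFailFromLoadErrs bool
}

// Retry path: read the flag under the owner's lock.
func (b *recBatch) loadErrCanFail() bool {
	b.owner.mu.Lock()
	defer b.owner.mu.Unlock()
	return b.canFailFromLoadErrs
}

// Produce-write path: clear the flag under the same lock.
func (b *recBatch) clearLoadErrFail() {
	b.owner.mu.Lock()
	b.canFailFromLoadErrs = false
	b.owner.mu.Unlock()
}
```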

twmb commented Jul 22, 2024

I have an open fix, but I'm trying to solve #777 as well in this release. I'm hoping I can figure out why the PR is failing CI today.
