
Intermittent deadlock when closing a channel using CloseAsync in 7.x #1751

Open
Andersso opened this issue Dec 19, 2024 · 5 comments

@Andersso

Describe the bug

Hi there,

Ever since upgrading from 6.x to 7.x, I've been running into intermittent deadlocks whenever I try to close a channel via CloseAsync.
I haven't been able to reproduce it locally. I have done some remote debugging, but it didn't give me any real insight (all thread pool threads are waiting for work).
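For context, the call pattern is roughly the following (a minimal sketch; the setup and names are illustrative, not my actual code):

using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "localhost" };

await using var connection = await factory.CreateConnectionAsync();
await using var channel = await connection.CreateChannelAsync();

// ... declare queues, start consumers, publish/consume ...

// This call intermittently never completes.
await channel.CloseAsync();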

I did, however, manage to run dotnet-dump dumpasync during one of these deadlocks and got the following output:

First dump

00007ebcbe050400 00007efcd1b1e298 ( ) System.Threading.Tasks.Task
  00007ebcbe9d7c48 00007efcd74554b8 (0) RabbitMQ.Client.ConsumerDispatching.ConsumerDispatcherChannelBase+<WaitForShutdownAsync>d__17
    00007ebcbe915d68 00007efcd7453028 (3) RabbitMQ.Client.Impl.Channel+<CloseAsync>d__73
      00007ebcbe915e10 00007efcd74533e0 (0) RabbitMQ.Client.Impl.AutorecoveringChannel+<CloseAsync>d__53
        00007ebcbe915ea0 00007efcd7453788 (0) <My code>

Second dump (another instance)

00007f0a56a17290 00007f4a69375380 ( ) System.Threading.Tasks.Task
  00007f0a597ed238 00007f4a6d2dd968 (0) RabbitMQ.Client.ConsumerDispatching.ConsumerDispatcherChannelBase+<WaitForShutdownAsync>d__17
    00007f0a59573cf8 00007f4a6d2da998 (3) RabbitMQ.Client.Impl.Channel+<CloseAsync>d__73
      00007f0a59573da0 00007f4a6d2dad50 (0) RabbitMQ.Client.Impl.AutorecoveringChannel+<CloseAsync>d__53
        00007f0a59573e30 00007f4a6d2db0f8 (0) <My code>

I noticed that in both dumps the stacks aren't displayed with the usual Awaiting: notation you often see in async stack traces, but that might be normal.

Reproduction steps

I haven’t pinned down a reliable way to reproduce this, but calling CloseAsync more frequently seems to increase the chances of hitting the deadlock. It also appears more common on Linux than Windows, though that might just be due to hardware differences rather than OS behavior.
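For what it's worth, the workload that seems to trigger it looks roughly like the following (a sketch only; the queue setup and loop count are made up and not the actual test suite):

using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
await using var connection = await factory.CreateConnectionAsync();

// Repeatedly open and close channels with an active consumer;
// the CloseAsync call occasionally never returns.
for (var i = 0; i < 1_000; i++)
{
    var channel = await connection.CreateChannelAsync();

    var queue = (await channel.QueueDeclareAsync()).QueueName;
    var consumer = new AsyncEventingBasicConsumer(channel);
    // (handlers are attached to consumer.ReceivedAsync in the real tests)
    await channel.BasicConsumeAsync(queue, true, consumer); // autoAck: true

    await channel.CloseAsync();
    await channel.DisposeAsync();
}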

Expected behavior

When calling CloseAsync, I’d expect the channel to close normally without causing a deadlock.

Additional context

No response

@Andersso Andersso added the bug label Dec 19, 2024
@lukebakken
Contributor

Hi, thanks for the report. As I'm sure you're aware, there's not much to work with here 😸 Obviously, the gold standard is to provide code that reproduces this issue, or at least some idea of the steps to do so.

calling CloseAsync more frequently

What does this mean? Do you have some way in your application to increase the frequency of channel closure?

@lukebakken lukebakken self-assigned this Dec 19, 2024
@lukebakken lukebakken added this to the 7.1.0 milestone Dec 19, 2024
@Andersso
Author

What does this mean? Do you have some way in your application to increase the frequency of channel closure?

We're running tests that create and close channels very frequently, and it appears that the test suite that does this the most is usually the one that gets stuck.

Anyhow, I can try to dig into this further and see if I can provide something that will help you reproduce it.

Thanks

@michaelklishin
Member

@Andersso channel and connection churn are workloads that are explicitly recommended against.

@lukebakken
Contributor

lukebakken commented Dec 19, 2024

We're running tests that create and close channels very frequently, and it appears that the test suite that does this the most is usually the one that gets stuck.

It would be extremely helpful for you to share your test code. If you can't do that, describe the test as best you can:

  • How many channels are created at any given point?
  • Are they created concurrently?
  • Is your test code a separate console app, or does it use a test framework like xUnit?

My guess is that you could be hitting a ThreadPool limit which prevents a Task from being scheduled, while another Task waits for the result. If you'd like to test that theory, please add the following code to the startup of your test program / test suite:

ThreadPool.SetMinThreads(16 * Environment.ProcessorCount, 16 * Environment.ProcessorCount);
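If your tests run under a framework like xunit rather than a console app, one way to make sure this runs before any test code is a module initializer (a sketch, assuming a .NET 5+ / C# 9 test project; the type and method names are arbitrary):

using System;
using System.Runtime.CompilerServices;
using System.Threading;

internal static class TestThreadPoolSetup
{
    // Runs once when the test assembly is loaded, before any tests execute.
    [ModuleInitializer]
    internal static void Init()
    {
        int minThreads = 16 * Environment.ProcessorCount;
        ThreadPool.SetMinThreads(minThreads, minThreads);
    }
}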

This is a related issue:
#1354

@michaelklishin
Member

Also note that the management UI has connection and channel churn metrics on the Overview page and, IIRC, on the node page as well.

So at the very least it should be easy to see the churn rate: is it 50 channels opened per second? Is it 200?
