-
-
Notifications
You must be signed in to change notification settings - Fork 931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prevent event loop polling from stopping on active redis connections #1734
base: main
Are you sure you want to change the base?
Conversation
@auvipy @thedrow would either of you guys be able to review this? And do you think this is the right approach? I was able to reproduce this issue locally using this example repo: https://github.com/pb-dod/celery_pyamqp_memory_leak/tree/test_redis_disconnect I was also able to verify this fix works there too. I think this fix is somewhat critical, because it's likely making running Celery on the Redis broker unreliable for everyone who upgraded to at least v5.2.3. |
@auvipy I tested this change in production today, and I'm still seeing the same behavior where workers stop responding the worker heartbeats after about an hour (the red lines are a deployment start/finish): I have INFO level logging turned on, and I see no indication in the logs about why the workers get stuck (the last log message is often a successful task run). Workers on different queues with very different workloads get stuck too. I'm not seeing the issue with workers getting stuck when I use RabbitMQ as a broker. I think this PR fixes an issue around workers getting stuck after several Redis disconnections, but now I'm not sure it fixes the primary cause of celery/celery#7276 This also explains why I was getting this issue without seeing Redis connection errors in my logs. |
@auvipy I added the integration test: 34e366d It looks like it's revealing a problem. Good call on adding it! I assumed the two kombu/kombu/transport/redis.py Line 1307 in 2df5be2
kombu/kombu/transport/redis.py Line 1300 in 2df5be2
However:
The way I have it currently won't work because I'm adding I'll need to re-think the solution for this. What I tested in production today effectively disabled the fix from #1476 and workers were still getting stuck. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we close this in favor of #2007?
also should we try to extract relevant integrations tests from here to add to the test suite? |
bab5ee5
to
8c16509
Compare
Fixes: celery/celery#7276
Caused by: #1476
Currently, the Redis transport will always remove the last function added to the event loop (
on_tick
) regardless of which connection is disconnected. If you restart Redis enough (and repeatedly cause connection errors), eventually_on_connection_disconnect
will remove a function from the event loop for the only active connection and the worker will get stuck.I've been seeing these issues with workers getting stuck after I migrated from the RabbitMQ broker to Redis on Celery 5.2.6 and Kombu 5.2.3.
This PR changes it to track which event loop function belongs to which connection and remove the correct function from
on_tick
.