Celery memory leak with large files #26
Comments
What do you see in the job status (in the admin for the import job)? You should see which line it failed on, or at least within 100 lines... https://github.com/auto-mat/django-import-export-celery/blob/master/import_export_celery/tasks.py#L79
Changing the Celery max-tasks-per-child and max-memory-per-child options should not affect the system OOM, and a SIGKILL suggests a system OOM kill. Increasing the instance memory should have helped. Can you launch top, or somehow connect to your dev environment, to try to see if there is actually a memory leak going on? Another thing that occurs to me, though I don't think it's possible because I don't know why it would cause a SIGKILL, is that the DB is running out of memory. After all, the entire process gets wrapped into a single transaction, which is rather huge. If your resource involves setting PK links or large or non-fixed-width fields (such as TextFields), this could mean you would potentially be setting millions of records in the DB in a single transaction. I'm not sure how well your DB is designed to handle that sort of thing. If that ends up being a problem, we may need to figure out how to break up the transaction into smaller, more palatable chunks.
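A minimal sketch of what chunked transactions could look like, assuming a django-import-export resource and a tablib dataset; the helper name and `chunk_size` are illustrative and not part of this library:

```python
# Sketch only: run an import in per-chunk transactions instead of one huge
# transaction. `resource` is a django-import-export Resource instance and
# `dataset` a tablib.Dataset; the helper itself is hypothetical.
import tablib
from django.db import transaction


def import_in_chunks(resource, dataset, chunk_size=1000, **kwargs):
    results = []
    for start in range(0, len(dataset), chunk_size):
        chunk = tablib.Dataset(*dataset[start:start + chunk_size],
                               headers=dataset.headers)
        # Each chunk gets its own transaction, so the database never has
        # to hold the whole import in a single transaction.
        with transaction.atomic():
            results.append(
                resource.import_data(chunk, use_transactions=False, **kwargs)
            )
    return results
```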
Well, the job crashes before the third step, so I am not sure I have access to that level of information in the second step. We did increase the instance memory, but it doesn't seem to help: even 13 GB of memory didn't prevent the crash (it took much longer to happen, though). I will look into your suggestion regarding the DB, because I haven't yet. Moreover, the last release of …
Edit: I checked the DB free space and I can tell it is OK, so I definitely don't think that is the cause.
Does using a much smaller file work?
It does work with a smaller file.
I think that what may be going on here is that …
Another possibility is that there is some weird interaction with whatever file backend you are using on your dev environment. For instance, this could be a problem with the S3 backend, since a large file is being streamed through the Django file APIs. But that's just a guess.
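A rough sketch of what streaming the file row by row (instead of pulling it all through the storage backend at once) could look like; `job.file` is assumed to be a Django `FieldFile`, and this is not the library's actual code:

```python
# Sketch only: lazily iterate CSV rows from a file stored on a remote
# backend (e.g. S3) instead of reading the whole file into memory.
# `job.file` is assumed to be a Django FieldFile.
import csv


def iter_csv_rows(job, encoding="utf-8"):
    with job.file.open("rb") as f:
        # Django File objects iterate line by line, fetching the
        # underlying data in chunks rather than all at once.
        lines = (line.decode(encoding) for line in f)
        yield from csv.DictReader(lines)
```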
Indeed, this …
There is actually a discussion about this issue on …
Update: I finally fixed my issue (not totally, but it is much better) by modifying …
I have also modified the …
I could provide a PR if you are interested (I would also be interested in your point of view). The bad news (for now) is that I had to disable the summary file generation, as it was too big.
This would be quite interesting; however, I'm curious how you made it work and whether this is an actual, realistic solution. I'd have to look at the PR to be sure. Also, unfortunately, import summaries are necessary in some cases; if they need to be disabled for large files, that needs to be optional.
I do agree summaries should be optionally disabled.
Problem is, how do you define large files? I have a big file of 35,000 rows where some "before_import" operations and "post_save" signals need to run. Every time I load my file, I get a memory error, but after one minute everything goes smoothly. There is no problem with a 10k-row file. I wonder if the Celery jobs can be distributed to two or more workers.
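Distributing the work is possible in principle if the chunks are independent; a hedged sketch with a hypothetical `import_chunk` task (not something this library provides) could look like:

```python
# Sketch only: fan an import out over several Celery workers.
# `import_chunk` is a hypothetical task and assumes the chunks can be
# imported independently of one another (no cross-row dependencies).
from celery import group, shared_task


@shared_task
def import_chunk(file_path, start_row, end_row):
    """Load rows [start_row, end_row) from the file and import them."""
    ...


def dispatch_import(file_path, total_rows, chunk_size=10_000):
    job = group(
        import_chunk.s(file_path, start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    )
    return job.apply_async()
```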
For future readers facing this issue (as we recently have), there are a couple of parts to a potential solution. Firstly, …
Also, there are a couple of areas of this library that can be modified to drastically reduce memory usage on large querysets - most issues come from trying to evaluate large querysets or iterate over each item. We've made the following two improvements in the context of exporting large amounts of data:
… which is incredibly slow for large querysets. If you don't need UUID support, change this to the following:
…
This helps make the 'creating' of …
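As a rough illustration of the general idea (streaming the queryset during export instead of evaluating it up front), and not the actual change being described, a sketch might look like this:

```python
# Sketch only, not the library's code: export by streaming the queryset
# with iterator() so the full result set is never cached in memory.
import csv


def export_queryset_to_csv(resource, queryset, path, chunk_size=2000):
    with open(path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(resource.get_export_headers())
        # iterator() reads rows from the database cursor in batches
        # instead of materialising the whole queryset.
        for obj in queryset.iterator(chunk_size=chunk_size):
            writer.writerow(resource.export_resource(obj))
```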
We've created a fork of this library which is purely focused on improving/optimising the export of large datasets asynchronously - hopefully we can make time to get it published. It also does away with the …
Awesome, looking forward to trying your fork!
Thanks @djw27 for these optimisations!
I am facing a blocking issue using django-import-export and django-import-export-celery.
Context
I need to import large CSV files (~250k lines) into my database.
I work on my local environment and I have a few others available (dev, staging, prod).
Issue
When I perform the import on my local environment, it takes quite a long time but eventually works fine.
But each time I try to perform an import on the dev environment, I get this error from Celery: …
It seems to be a memory usage issue, but I can't figure out why it occurs, as I have tried changing many settings (the Celery max-tasks-per-child option, the Celery max-memory-per-child option, the DEBUG Django setting). Also, I tried increasing my instance memory up to 13 GB (from 1 GB before), but the error still occurs.
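For reference, those two options map to real Celery worker settings; the values below are examples only, and they recycle worker processes rather than prevent a system-level OOM kill:

```python
# Example values only (Celery settings module / celeryconfig):
worker_max_tasks_per_child = 10        # recycle a worker after 10 tasks
worker_max_memory_per_child = 500_000  # resident memory limit, in KiB

# Equivalent CLI flags ("myproject" is a placeholder app name):
#   celery -A myproject worker --max-tasks-per-child=10 --max-memory-per-child=500000
```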
Questions
Do you have any insight that I can use to solve my issue?
Is a 250k-line file too much?
Are my Celery settings bad?