
Celery memory leak with large files #26

Open
NeilujD opened this issue Jul 13, 2020 · 15 comments

@NeilujD

NeilujD commented Jul 13, 2020

I am facing a blocking issue using django-import-export and django-import-export-celery.

Context

I need to import large CSV files (about 250k lines) into my database.
I work on my local environment and have a few others available (dev, staging, prod).

Issue

When I perform the import on my local environment, it is quite long but it eventually works great.
But each time I try to perform an import on the dev environment, I get this error from celery:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).

It seems to be a memory usage issue, but I can't figure out why it occurs, as I have tried changing many settings (the celery max-tasks-per-child option, the celery max-memory-per-child option, the DEBUG Django setting).
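
For reference, this is roughly the kind of worker configuration I have been experimenting with (the values below are only illustrative, and "myproject" is a placeholder for my actual Celery app name):

from celery import Celery

app = Celery("myproject")  # placeholder app name
# Recycle each worker child process after a fixed number of tasks
# and after a fixed amount of resident memory (value is in KiB):
app.conf.worker_max_tasks_per_child = 50
app.conf.worker_max_memory_per_child = 500_000  # ~500 MB per child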

Also, I tried increasing my instance memory to 13 GB (from 1 GB), but the error still occurs.

Questions

Do you have any insight that I can use to solve my issue?
Is a 250k-line file too much?
Are my celery settings bad?

@timthelion
Member

What do you see in the job status (in the admin for the import job)? You should see which line it failed on, or at least within 100 lines: https://github.com/auto-mat/django-import-export-celery/blob/master/import_export_celery/tasks.py#L79

@timthelion
Member

Changing the celery max-tasks-per-child and max-memory-per-child options should not affect the system OOM, and a SIGKILL suggests a system OOM kill. Increasing the instance memory should have helped. Can you launch top or somehow connect to your dev environment to try to see if there is actually a memory leak going on?

Another thing that occurs to me, though I don't think it's likely because I don't know why it would cause a SIGKILL, is that the DB is running out of memory. After all, the entire process gets wrapped into a single transaction, which is rather huge. If your resource involves setting PK links or large or non-fixed-width fields (such as TextFields), this could mean you would potentially be setting millions of records in the DB in a single transaction. I'm not sure how well your DB is designed to handle that sort of thing. If that ends up being a problem, we may need to figure out how to break the transaction up into smaller, palatable chunks.
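
Just to sketch what I mean by breaking the transaction up (nothing like this exists in the library today; import_row_batch is a made-up placeholder for the real per-row import logic):

from django.db import transaction

def import_in_chunks(rows, chunk_size=1000):
    # Commit every `chunk_size` rows in their own transaction instead of
    # wrapping the entire import in one huge transaction.
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            with transaction.atomic():
                import_row_batch(chunk)  # placeholder
            chunk = []
    if chunk:
        with transaction.atomic():
            import_row_batch(chunk)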

@NeilujD
Author

NeilujD commented Jul 15, 2020

What do you see in the job status (in the admin for the import job)? You should see which line it failed on, or at least within 100 lines: https://github.com/auto-mat/django-import-export-celery/blob/master/import_export_celery/tasks.py#L79

Well, the job crashes before the third step, so I am not sure I can access that level of information during the second step.

We did increase the instance memory, but it doesn't seem to help: even 13 GB of memory didn't prevent the crash (it just took much longer).

I will investigate your suggestion regarding the DB, because I haven't yet. Moreover, the latest release of the django-import-export package brings bulk support, so it is possible to save rows in batches of, for instance, 1000 at a time, which could fit the need you mentioned (roughly as sketched below).

Edit: I checked the DB free space and it is fine, so I definitely don't think that is the cause.
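
If I read the django-import-export docs correctly, enabling that bulk mode looks roughly like this on the resource (untested on my side so far; SomeModel is just a placeholder):

from import_export import resources
from myapp.models import SomeModel  # placeholder app/model

class SomeModelResource(resources.ModelResource):
    class Meta:
        model = SomeModel
        use_bulk = True    # save rows via bulk_create()/bulk_update() instead of per-row saves
        batch_size = 1000  # flush to the database every 1000 rows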

@timthelion
Member

Does using a much smaller file work?

@NeilujD
Author

NeilujD commented Jul 17, 2020

It does work with a smaller file.
For now, I can confirm that a 50k-line file import works.

@timthelion
Member

I think what may be going on here is that the import_set method in tablib is called (https://github.com/jazzband/tablib/blob/aaeb5c83600052f6696eb09a155112bc0252fffa/src/tablib/formats/_csv.py#L36), and import_set loads the entire file (rather than streaming it) and probably causes some sort of explosive memory growth. Try using a different input format. I don't have this problem myself, so you'll have to debug it yourself.
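
To illustrate the difference, the plain csv module can walk the file row by row without ever holding it all in memory, roughly like this (just a sketch, not something django-import-export does today):

import csv

def iter_rows(path):
    # Yields one row at a time; the whole file is never loaded into memory,
    # unlike tablib's import_set(), which reads everything up front.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row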

Another possibility is that there is some weird interaction with whatever file backend you are using on your dev environment. For example, this could be a problem with the S3 backend, since a large file is being streamed through the Django file APIs. But that's just a guess.

@NeilujD
Author

NeilujD commented Jul 17, 2020

Indeed, this tablib issue seems to be the cause of mine.
Thank you for pointing it out.
I will probably investigate this later.
If I do, I'll let you know how it goes.

@NeilujD
Author

NeilujD commented Jul 21, 2020

There is actually a discussion about this issue in the django-import-export repository: django-import-export/django-import-export#1164

@NeilujD
Author

NeilujD commented Jul 27, 2020

Update:

I finally fixed my issue (not completely, but it is much better) by modifying django-import-export-celery.
My version of the package now supports task pagination; in my case I set it to import 1000 lines per child task.
I am now able to import CSV files of about 250k lines before facing a SIGKILL.
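
Very roughly, the idea looks like this (a simplified sketch of my change, not the library's real API; import_chunk, finalize_import and launch_import are made-up names):

from celery import shared_task, chord

CHUNK_SIZE = 1000

@shared_task
def import_chunk(job_pk, start, end):
    # Load only rows [start, end) of the file and import them.
    ...

@shared_task
def finalize_import(results, job_pk):
    # Mark the import job as finished once every chunk task is done.
    ...

def launch_import(job_pk, total_rows):
    chunks = [
        import_chunk.s(job_pk, start, min(start + CHUNK_SIZE, total_rows))
        for start in range(0, total_rows, CHUNK_SIZE)
    ]
    chord(chunks)(finalize_import.s(job_pk))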

I have also modified django-import-export so it uses pyexcel instead of tablib.

I could provide a PR if you are interested (I would also be interested in your point of view).

The bad news (for now) is that I had to disable the summary file generation, as it was too big.
My idea there is to make a paginated view of the summary.

@timthelion
Member

This would be quite interesting; however, I'm curious how you made it work and whether it is an actual, realistic solution. I'd have to look at the PR to be sure. Also, unfortunately, import summaries are necessary in some cases; if they need to be disabled for large files, that needs to be optional.

@NeilujD
Author

NeilujD commented Jul 27, 2020

I do agree that disabling summaries should be optional.

@srugano

srugano commented Sep 23, 2021

The problem is: how do you define "large files"? I have a big file of 35,000 rows that needs some before_import operations and post_save signals to run. Every time I load my file, I get a memory error, but after a minute or so everything goes smoothly. There is no problem with a 10k-row file.

I wonder if the celery jobs can be distributed to two or more workers if it's configured in the settings.


@djw27

djw27 commented Nov 3, 2022

For future readers facing this issue (as we recently have), there are a couple of parts to a potential solution:

Firstly, django-import-export supports pagination out of the box; you just need to specify the chunk_size within your resource, e.g.:

class MyResource(resources.ModelResource):
    class Meta:
        model = SomeModel
        fields = (
            ...
        )
        # Iterate over chunks of 5000 objects at once
        chunk_size = 5000

There are also a couple of areas of this library that can be modified to drastically reduce memory usage on large querysets; most issues come from trying to evaluate large querysets or iterate over each item. We've made the following two improvements in the context of exporting large amounts of data:

  1. In admin_actions.py when creating an ExportJob we currently do the following:
ej = ExportJob.objects.create(
    app_label=arbitrary_obj._meta.app_label,
    model=arbitrary_obj._meta.model_name,
    queryset=json.dumps(
        [
            str(obj.pk) if isinstance(obj.pk, UUID) else obj.pk
            for obj in queryset
        ]
    ),
    site_of_origin=request.scheme + "://" + request.get_host(),
)

which is incredibly slow for large querysets. If you don't need UUID support, change this to the following:

ej = ExportJob.objects.create(
    app_label=arbitrary_obj._meta.app_label,
    model=arbitrary_obj._meta.model_name,
    queryset=list(queryset.values_list("pk", flat=True)),
    site_of_origin=request.scheme + "://" + request.get_host(),
)

This helps make the creation of ExportJob objects a little faster, but doesn't help with any celery memory leaks.

  2. Modify tasks.py to not evaluate large querysets in run_export_job():

Change qs_len = len(queryset) to qs_len = queryset.count() - this change alone stopped our memory usage from ballooning.
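
In other words (illustrative only):

qs_len = len(queryset)     # evaluates the queryset: every row is fetched and held in memory
qs_len = queryset.count()  # runs SELECT COUNT(*) in the database instead; no rows are fetched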

We've created a fork of this library which is purely focused on improving/optimising the export of large datasets asynchronously - hopefully we can make time to get it published. It also does away with the export_resource_classes() concept, which can easily lead to circular dependency headaches in our experience!

@timthelion
Member

Awesome, looking forward to trying your fork!

@NeilujD
Author

NeilujD commented Nov 21, 2022

Thanks @djw27 for these optimisations!
They seem to be useful only for export, unless I am wrong?
My issue was mostly related to import, but I guess we could figure something out from your proposal to enhance that part as well.
