Celery memory leak with large files #26
Comments
What do you see in the job status (in the admin for the import job)? You should see which line it failed on, or at least within 100 lines... https://github.com/auto-mat/django-import-export-celery/blob/master/import_export_celery/tasks.py#L79
Changing the Celery max-tasks-per-child and max-memory-per-child options should not affect the system OOM, and a SIGKILL suggests a system OOM kill. Increasing the instance memory should have helped. Can you launch top, or somehow connect to your dev environment, to try to see if there is actually a memory leak going on? Another thing that occurs to me, though I don't think it's possible because I don't know why it would cause a SIGKILL, is that the DB is running out of memory. After all, the entire process gets wrapped into a single transaction, which is rather huge. If your resource involves setting PK links or large or non-fixed-width fields (such as TextFields), this could mean you would potentially be setting millions of records in the DB in a single transaction. I'm not sure how well your DB is designed to handle that sort of thing. If that ends up being a problem, we may need to figure out how to break up the transaction into smaller, more palatable chunks.
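A minimal sketch of what chunked transactions could look like, assuming a django-import-export resource and a tablib dataset; the helper name and `chunk_size` are illustrative and not part of this library:

```python
# Sketch only: run an import in per-chunk transactions instead of one huge
# transaction. `resource` is a django-import-export Resource instance and
# `dataset` a tablib.Dataset; the helper itself is hypothetical.
import tablib
from django.db import transaction


def import_in_chunks(resource, dataset, chunk_size=1000, **kwargs):
    results = []
    for start in range(0, len(dataset), chunk_size):
        chunk = tablib.Dataset(*dataset[start:start + chunk_size],
                               headers=dataset.headers)
        # Each chunk gets its own transaction, so the database never has
        # to hold the whole import in a single transaction.
        with transaction.atomic():
            results.append(
                resource.import_data(chunk, use_transactions=False, **kwargs)
            )
    return results
```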
Well, the job crashes before the third step, so I am not sure I have access to that level of information in the second step. We did increase the instance memory, but it doesn't seem to help: even 13 GB of memory didn't prevent the crash (it took much longer to happen, though). I will look into your suggestion regarding the DB, because I haven't yet. Moreover, the last release of …
Edit: I checked the DB free space and I can tell it is OK, so I definitely don't think that is the cause.
Does using a much smaller file work?
It does work with a smaller file.
I think that what may be going on here is that …
Another possibility is that there is some weird interaction with whatever file backend you are using on your dev environment. For instance, this could be a problem with the S3 backend, since a large file is being streamed through the Django file APIs. But that's just a guess.
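A rough sketch of what streaming the file row by row (instead of pulling it all through the storage backend at once) could look like; `job.file` is assumed to be a Django `FieldFile`, and this is not the library's actual code:

```python
# Sketch only: lazily iterate CSV rows from a file stored on a remote
# backend (e.g. S3) instead of reading the whole file into memory.
# `job.file` is assumed to be a Django FieldFile.
import csv


def iter_csv_rows(job, encoding="utf-8"):
    with job.file.open("rb") as f:
        # Django File objects iterate line by line, fetching the
        # underlying data in chunks rather than all at once.
        lines = (line.decode(encoding) for line in f)
        yield from csv.DictReader(lines)
```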
Indeed, this …
There is actually a discussion about this issue on …
Update: I finally fixed my issue (not totally, but it is much better) by modifying …
I have also modified the …
I could provide a PR if you are interested (I would also be interested in your point of view). The bad news (for now) is that I had to disable the summary file generation, as it was too big.
This would be quite interesting; however, I'm curious how you made it work and whether this is an actual, realistic solution. I'd have to look at the PR to be sure. Also, unfortunately, import summaries are necessary in some cases; if they need to be disabled for large files, that needs to be optional.
I do agree summaries should be optionally disabled.
Problem is, how do you define large files? I have a big file of 35,000 rows where some "before_import" operations and "post_save" signals need to run. Every time I load my file, I get a memory error, but after one minute everything goes smoothly. There is no problem with a 10k-row file. I wonder if the Celery jobs can be distributed to two or more workers.
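Distributing the work is possible in principle if the chunks are independent; a hedged sketch with a hypothetical `import_chunk` task (not something this library provides) could look like:

```python
# Sketch only: fan an import out over several Celery workers.
# `import_chunk` is a hypothetical task and assumes the chunks can be
# imported independently of one another (no cross-row dependencies).
from celery import group, shared_task


@shared_task
def import_chunk(file_path, start_row, end_row):
    """Load rows [start_row, end_row) from the file and import them."""
    ...


def dispatch_import(file_path, total_rows, chunk_size=10_000):
    job = group(
        import_chunk.s(file_path, start, min(start + chunk_size, total_rows))
        for start in range(0, total_rows, chunk_size)
    )
    return job.apply_async()
```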
For future readers facing this issue (as we recently have), there are a couple of parts to a potential solution. Firstly, …
Also, there are a couple of areas of this library that can be modified to drastically reduce memory usage on large querysets - most issues come from trying to evaluate large querysets or iterate over each item. We've made the following two improvements in the context of exporting large amounts of data:
… which is incredibly slow for large querysets. If you don't need UUID support, change this to the following:
…
This helps make the 'creating' of …
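As a rough illustration of the general idea (streaming the queryset during export instead of evaluating it up front), and not the actual change being described, a sketch might look like this:

```python
# Sketch only, not the library's code: export by streaming the queryset
# with iterator() so the full result set is never cached in memory.
import csv


def export_queryset_to_csv(resource, queryset, path, chunk_size=2000):
    with open(path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(resource.get_export_headers())
        # iterator() reads rows from the database cursor in batches
        # instead of materialising the whole queryset.
        for obj in queryset.iterator(chunk_size=chunk_size):
            writer.writerow(resource.export_resource(obj))
```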
We've created a fork of this library which is purely focused on improving/optimising the export of large datasets asynchronously - hopefully we can make time to get it published. It also does away with the …
Awesome, looking forward to trying your fork!
Thanks @djw27 for these optimisations!
I am facing a blocking issue using django-import-export and django-import-export-celery.
Context
I need to import large CSV files (~250k lines) into my database.
I work on my local environment and I have a few others available (dev, staging, prod).
Issue
When I perform the import on my local environment, it takes quite a long time but eventually works fine.
But each time I try to perform an import on the dev environment, I get this error from Celery: …
It seems to be a memory usage issue, but I can't figure out why it occurs, as I have tried changing many settings (the Celery max-tasks-per-child option, the Celery max-memory-per-child option, the DEBUG Django setting). Also, I tried increasing my instance memory up to 13 GB (from 1 GB before), but the error still occurs.
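For reference, those two options map to real Celery worker settings; the values below are examples only, and they recycle worker processes rather than prevent a system-level OOM kill:

```python
# Example values only (Celery settings module / celeryconfig):
worker_max_tasks_per_child = 10        # recycle a worker after 10 tasks
worker_max_memory_per_child = 500_000  # resident memory limit, in KiB

# Equivalent CLI flags ("myproject" is a placeholder app name):
#   celery -A myproject worker --max-tasks-per-child=10 --max-memory-per-child=500000
```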
Questions
Do you have any insight that I can use to solve my issue?
Is a 250k-line file too much?
Are my Celery settings bad?