Potential memory leak with extract_words function #1229

Open
maximeBAY opened this issue Nov 25, 2024 · 1 comment

maximeBAY commented Nov 25, 2024

Describe the bug

There seems to be a memory leak when running the extract_words function. I've explored past issues suggesting the use of page.close() or page.get_textmap.cache_clear(), but running those doesn't solve my issue.

I'm doing:

words = page.extract_words(
    x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
)

on every page of several different PDF files, and the memory keeps increasing.

Code to reproduce the problem

words = page.extract_words(
    x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
)
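
For context, the full loop I'm running looks roughly like the sketch below. This is a minimal sketch, not my exact code: pdf_paths is a placeholder for the files being processed, and it includes the page.close() / page.get_textmap.cache_clear() workarounds from past issues, which do not stop the growth.

import pdfplumber

# pdf_paths is a placeholder for the PDF files being processed.
for path in pdf_paths:
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            words = page.extract_words(
                x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
            )
            # Workarounds suggested in past issues; neither releases the memory here.
            page.get_textmap.cache_clear()
            page.close()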

Expected behavior

Memory should not constantly increase (or should at least be released at some point) after each call to extract_words.

Actual behavior

The memory keeps increasing over time and is never released. I've run profiling using the memory-profiler package, and it all points toward the extract_words function not properly releasing memory.

Line     Mem usage    Increment  Occurrences   Line Contents
=============================================================
   511    348.7 MiB    348.7 MiB           1       @profile
   512                                             def extract_words(self, **kwargs: Any) -> T_obj_list:
   513    349.1 MiB      0.4 MiB           1           return utils.extract_words(self.chars, **kwargs)
   
Next execution:

Line     Mem usage    Increment  Occurrences   Line Contents
=============================================================
   511    349.4 MiB    349.4 MiB           1       @profile
   512                                             def extract_words(self, **kwargs: Any) -> T_obj_list:
   513    350.1 MiB      0.6 MiB           1           return utils.extract_words(self.chars, **kwargs)

As you can see, the memory is never released.
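
The growth can also be seen from the process RSS without patching pdfplumber internals. Here is a minimal sketch assuming the psutil package ("some.pdf" is a placeholder path):

import os
import psutil
import pdfplumber

proc = psutil.Process(os.getpid())

with pdfplumber.open("some.pdf") as pdf:  # placeholder path
    for i, page in enumerate(pdf.pages):
        words = page.extract_words(
            x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
        )
        # RSS climbs with every page and is never returned to the OS.
        print(f"page {i}: rss = {proc.memory_info().rss / 2**20:.1f} MiB")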

Environment

  • pdfplumber version: 0.11.4
  • Python version: 3.10
  • OS: Windows and Linux (tested on both)
@maximeBAY maximeBAY added the bug label Nov 25, 2024
jsvine (Owner) commented Dec 9, 2024

Thank you for flagging, @maximeBAY. I believe that this might be related to other memory issues flagged elsewhere. But just to check: Do you see similar memory issues if you replace page.extract_words(...) with just len(page.chars)? Or only when using .extract_words(...)?
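
For anyone wanting to run that comparison, a minimal sketch of the two variants (the path is a placeholder); run one variant at a time so the measurements don't mix:

import pdfplumber

with pdfplumber.open("some.pdf") as pdf:  # placeholder path
    for page in pdf.pages:
        # Variant A: only materialize the chars, no word grouping.
        n = len(page.chars)
        # Variant B: full word extraction (comment out Variant A first).
        # words = page.extract_words(
        #     x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
        # )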
