Potential memory leak with extract_words function #1229

Open
maximeBAY opened this issue Nov 25, 2024 · 1 comment

maximeBAY commented Nov 25, 2024

Describe the bug

There seems to be a memory leak when running the extract_words function. I've explored past issues suggesting the use of page.close() or page.get_textmap.cache_clear(), but running those doesn't solve my issue.

I'm doing:

words = page.extract_words(
    x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
)

on every page of several different PDF files, and the memory keeps increasing.

Code to reproduce the problem

words = page.extract_words(
    x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
)
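
For context, the full loop I'm running looks roughly like the sketch below. This is a minimal sketch, not my exact code: pdf_paths is a placeholder for the files being processed, and it includes the page.close() / page.get_textmap.cache_clear() workarounds from past issues, which do not stop the growth.

import pdfplumber

# pdf_paths is a placeholder for the PDF files being processed.
for path in pdf_paths:
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            words = page.extract_words(
                x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
            )
            # Workarounds suggested in past issues; neither releases the memory here.
            page.get_textmap.cache_clear()
            page.close()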

Expected behavior

Memory should not constantly increase (or should at least be released at some point) after each call to extract_words.

Actual behavior

The memory keeps increasing over time and is never released. I've run profiling using the memory-profiler package, and it all points toward the extract_words function not properly releasing memory.

Line     Mem usage    Increment  Occurrences   Line Contents
=============================================================
   511    348.7 MiB    348.7 MiB           1       @profile
   512                                             def extract_words(self, **kwargs: Any) -> T_obj_list:
   513    349.1 MiB      0.4 MiB           1           return utils.extract_words(self.chars, **kwargs)
   
Next execution:

Line     Mem usage    Increment  Occurrences   Line Contents
=============================================================
   511    349.4 MiB    349.4 MiB           1       @profile
   512                                             def extract_words(self, **kwargs: Any) -> T_obj_list:
   513    350.1 MiB      0.6 MiB           1           return utils.extract_words(self.chars, **kwargs)

As you can see, the memory is never released.
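
The growth can also be seen from the process RSS without patching pdfplumber internals. Here is a minimal sketch assuming the psutil package ("some.pdf" is a placeholder path):

import os
import psutil
import pdfplumber

proc = psutil.Process(os.getpid())

with pdfplumber.open("some.pdf") as pdf:  # placeholder path
    for i, page in enumerate(pdf.pages):
        words = page.extract_words(
            x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
        )
        # RSS climbs with every page and is never returned to the OS.
        print(f"page {i}: rss = {proc.memory_info().rss / 2**20:.1f} MiB")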

Environment

  • pdfplumber version: 0.11.4
  • Python version: 3.10
  • OS: Windows and Linux (tested on both)
@maximeBAY maximeBAY added the bug label Nov 25, 2024
jsvine (Owner) commented Dec 9, 2024

Thank you for flagging, @maximeBAY. I believe that this might be related to other memory issues flagged elsewhere. But just to check: Do you see similar memory issues if you replace page.extract_words(...) with just len(page.chars)? Or only when using .extract_words(...)?
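
For anyone wanting to run that comparison, a minimal sketch of the two variants (the path is a placeholder); run one variant at a time so the measurements don't mix:

import pdfplumber

with pdfplumber.open("some.pdf") as pdf:  # placeholder path
    for page in pdf.pages:
        # Variant A: only materialize the chars, no word grouping.
        n = len(page.chars)
        # Variant B: full word extraction (comment out Variant A first).
        # words = page.extract_words(
        #     x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
        # )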
