Describe the bug
There seems to be a memory leak when running the extract_words function. I've explored past issues that suggested using page.close() or page.get_textmap.cache_clear(), but running those doesn't solve my issue.
I'm calling:

words = page.extract_words(
    x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
)

on every page of several different PDF files, and the memory keeps increasing.
Code to reproduce the problem
words = page.extract_words(
    x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
)
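For context, here is a minimal sketch of the kind of loop I'm running (the pdf_paths list and file names are placeholders, not my actual data):

import pdfplumber

pdf_paths = ["file1.pdf", "file2.pdf"]  # placeholder paths

for path in pdf_paths:
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            words = page.extract_words(
                x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
            )
            # Suggested in past issues; does not stop the growth for me:
            page.get_textmap.cache_clear()
            page.close()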
Expected behavior
Memory should not keep increasing after every call to extract_words, or should at least be released at some point.
Actual behavior
The memory keeps increasing over time and is never released. I've run profiling with the memory-profiler package, and it all points to the extract_words function not properly releasing memory.
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   511    348.7 MiB    348.7 MiB           1   @profile
   512                                         def extract_words(self, **kwargs: Any) -> T_obj_list:
   513    349.1 MiB      0.4 MiB           1       return utils.extract_words(self.chars, **kwargs)
Next execution:
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   511    349.4 MiB    349.4 MiB           1   @profile
   512                                         def extract_words(self, **kwargs: Any) -> T_obj_list:
   513    350.1 MiB      0.6 MiB           1       return utils.extract_words(self.chars, **kwargs)
As you can see, the memory is never released.
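For reference, the numbers above come from memory-profiler's line-by-line report on pdfplumber's Page.extract_words. A rough sketch of how such a report can be reproduced from calling code, by wrapping the method with the @profile decorator instead of editing pdfplumber/page.py directly (the wrapping approach and the file path are assumptions for illustration):

from memory_profiler import profile
import pdfplumber
from pdfplumber.page import Page

# Wrap the library method so memory_profiler reports line-by-line usage,
# equivalent to adding @profile above extract_words in pdfplumber/page.py.
Page.extract_words = profile(Page.extract_words)

with pdfplumber.open("sample.pdf") as pdf:  # placeholder path
    for page in pdf.pages:
        page.extract_words(
            x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
        )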
Environment
pdfplumber version: 0.11.4
Python version: 3.10
OS: Windows and Linux (tested on both).
Thank you for flagging, @maximeBAY. I believe that this might be related to other memory issues flagged elsewhere. But just to check: Do you see similar memory issues if you replace page.extract_words(...) with just len(page.chars)? Or only when using .extract_words(...)?
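A minimal sketch of the comparison being asked about, i.e. materializing page.chars without calling .extract_words(...) (the file path is a placeholder):

import pdfplumber

with pdfplumber.open("sample.pdf") as pdf:  # placeholder path
    for page in pdf.pages:
        # Variant A: only build the character objects
        n_chars = len(page.chars)

        # Variant B: full word extraction (what the report above profiles)
        # words = page.extract_words(
        #     x_tolerance=1, y_tolerance=1, extra_attrs=["fontname", "size"]
        # )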