-
When I use extract_text(), the text prints out fine, as you can see here: With every other PDF so far I have not had any problems, but in this PDF, using extract_words() on the same part returns words that include white spaces: At first I thought I needed to change the tolerance, but judging by the x0 and x1 values, those white spaces should not be detected, unless there is something I did not understand. Here are the values: Does anyone have any idea what is going on?
Replies: 1 comment 1 reply
-
Hello, and thanks for your interest in this library. Apologies for not responding sooner. Are you able to provide the PDF and specific page number? It's a bit difficult to debug without it. But a couple of observations that may or may not be helpful:
You're passing a y_tolerance in your first example, but not in your second.
Passing extra_attrs changes how pdfplumber groups characters into words. In your example, if two characters do not share the same size and fontname, they will not be grouped into the same word.
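To illustrate the second observation, here is a minimal sketch in plain Python. It is a simplified model of the grouping rule described above, not pdfplumber's actual implementation: the function group_chars, its default tolerance, and the sample characters are all made up for illustration. It only shows why adding attributes such as fontname and size can split what looks like one word.

```python
def group_chars(chars, x_tolerance=3, extra_attrs=()):
    """Group characters on one line into words (simplified, hypothetical model).

    A new word starts when the horizontal gap to the previous character
    exceeds x_tolerance, or when any attribute named in extra_attrs differs
    from the previous character's value.
    """
    words = []
    current = []
    for ch in chars:
        if current:
            prev = current[-1]
            gap_too_wide = ch["x0"] - prev["x1"] > x_tolerance
            attrs_differ = any(ch[a] != prev[a] for a in extra_attrs)
            if gap_too_wide or attrs_differ:
                words.append(current)
                current = []
        current.append(ch)
    if current:
        words.append(current)
    return ["".join(c["text"] for c in word) for word in words]


# Made-up sample data: the word "Total", where only the first letter is bold.
chars = []
x = 0.0
for i, letter in enumerate("Total"):
    chars.append({
        "text": letter,
        "x0": x,
        "x1": x + 5.0,
        "fontname": "Helvetica-Bold" if i == 0 else "Helvetica",
        "size": 12.0,
    })
    x += 5.0

print(group_chars(chars))                                     # one word
print(group_chars(chars, extra_attrs=("fontname", "size")))   # split at the font change
```

When comparing extract_text() and extract_words() on the same page, it helps to pass the same tolerance values to both calls and to try extract_words() once without extra_attrs, so that any difference in the output can be traced to a single setting.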