extract_words() behaving poorly while extract_text() works fine with the same parameters. #503

PrimoJefe · 2021-09-09T20:54:29Z

PrimoJefe
Sep 9, 2021

When I use extract_text(), the text prints out fine, as you can see here:

With every other PDF so far I have not had any problems, but with this PDF, calling extract_words() on the same part gives a result that includes unexpected white space:

At first I thought I needed to change the tolerance, but judging by the x0 and x1 values, those white spaces should not be detected, unless there is something I have misunderstood. Here are the values:

Does anyone have any idea what is going on?
Thanks in advance.
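
The x0/x1 check described above can be sketched roughly as follows. This is only an illustration: `horizontal_gaps` is a hypothetical helper, the sample coordinates are made up, and the char dicts are assumed to carry `text`, `x0`, and `x1` keys like the entries of `page.chars`. The threshold of 3 matches pdfplumber's default x_tolerance.

```python
def horizontal_gaps(chars):
    """Return (left_text, right_text, gap) for each pair of neighbouring chars."""
    ordered = sorted(chars, key=lambda c: c["x0"])
    return [
        (a["text"], b["text"], round(b["x0"] - a["x1"], 2))
        for a, b in zip(ordered, ordered[1:])
    ]

# Made-up sample standing in for the chars of one line of text.
sample = [
    {"text": "a", "x0": 10.0, "x1": 15.0},
    {"text": "b", "x0": 15.2, "x1": 20.0},
    {"text": "c", "x0": 26.0, "x1": 31.0},
]

X_TOLERANCE = 3  # pdfplumber's default x_tolerance

for left, right, gap in horizontal_gaps(sample):
    verdict = "gap exceeds tolerance" if gap > X_TOLERANCE else "within tolerance"
    print(left, right, gap, verdict)
```

If every gap inside a word is within tolerance and the word is still split, the cause is probably something other than horizontal spacing.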

jsvine · 2021-10-15T23:03:39Z

jsvine
Oct 15, 2021
Maintainer

Hello, and thanks for your interest in this library. Apologies for not responding sooner. Are you able to provide the PDF and specific page number? It's a bit difficult to debug without it. But a couple of observations that may or may not be helpful:

You're passing a y_tolerance in your first example, but not in your second.
Passing extra_attrs changes how pdfplumber groups characters into words. In your example, if two characters do not share the same size and fontname, they will not be grouped into the same word.
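
The second observation can be illustrated with a toy model. The sketch below is not pdfplumber's actual implementation: `group_chars` and `make_chars` are hypothetical helpers, the coordinates are made up, and whitespace handling is left out. It only shows why listing fontname in extra_attrs can split a word whose characters carry two different font names (the names used are the ones reported later in this thread).

```python
def group_chars(chars, x_tolerance=3, extra_attrs=()):
    """Toy word grouping: start a new word when the horizontal gap exceeds
    x_tolerance or when any attribute listed in extra_attrs changes."""
    words = []
    current = []
    for ch in chars:
        if current:
            prev = current[-1]
            gap_too_big = ch["x0"] - prev["x1"] > x_tolerance
            attrs_differ = any(ch[a] != prev[a] for a in extra_attrs)
            if gap_too_big or attrs_differ:
                words.append(current)
                current = []
        current.append(ch)
    if current:
        words.append(current)
    return ["".join(c["text"] for c in word) for word in words]


def make_chars(text, fontnames, width=5.0, size=9.0):
    """Build made-up char dicts laid out left to right with no gaps."""
    return [
        {"text": t, "x0": i * width, "x1": (i + 1) * width,
         "fontname": f, "size": size}
        for i, (t, f) in enumerate(zip(text, fontnames))
    ]


# "T" comes from one font name, "otal" from another.
chars = make_chars("Total", ["BHKDKN+stoneSans"] + ["BHJKJD+stoneSans"] * 4)

print(group_chars(chars))                                    # ['Total']
print(group_chars(chars, extra_attrs=("fontname", "size")))  # ['T', 'otal']
```

In this model the characters touch each other, so no tolerance setting would rejoin them once the attribute check is in play.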

1 reply

PrimoJefe Oct 21, 2021
Author

Hi, thank you for your answer.

Indeed, the problem was that some characters within the same word did not have the same fontname, for some reason.
I did not expect that behavior, and the names were pretty similar too (BHKDKN+stoneSans vs BHJKJD+stoneSans).

Thank you again for your time, this library is great.
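
For anyone hitting the same thing: in PDFs, a font name made of six uppercase letters followed by `+` usually marks an embedded font subset, so two names that differ only in that prefix generally point to subsets of the same base font. Below is a minimal sketch of comparing names with the prefix stripped, assuming chars are dicts with a `fontname` key as in `page.chars`. `base_fontname` and `fonts_in_chars` are hypothetical helper names, not part of pdfplumber.

```python
import re

# Subset tag: six uppercase letters followed by a plus sign.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def base_fontname(fontname):
    """Strip a leading subset tag, if present."""
    return SUBSET_TAG.sub("", fontname)


def fonts_in_chars(chars):
    """Return the distinct raw font names and the distinct base names."""
    raw = sorted({c["fontname"] for c in chars})
    base = sorted({base_fontname(c["fontname"]) for c in chars})
    return raw, base


# Made-up chars using the two font names reported above.
chars = [
    {"text": "T", "fontname": "BHKDKN+stoneSans"},
    {"text": "o", "fontname": "BHJKJD+stoneSans"},
]

raw, base = fonts_in_chars(chars)
print(raw)   # two distinct raw names
print(base)  # one shared base name
```

If only the size check matters, dropping fontname from extra_attrs and keeping size would likely avoid the split altogether.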
