How does "extract words" work? #412

mangu75 · 2021-04-12T15:23:38Z


Hi,
I'm using extract_words to get all the words from a PDF, but I'm not sure exactly how it works.

This is how I'm using it:
pdfplumber.utils.extract_words(pdf_content.chars, x_tolerance=1, y_tolerance=1, keep_blank_chars=True)

My understanding is that it returns a list in which each entry is a group of words separated by a single space. However, sometimes an entry joins words that are more than one space apart, such as this one:

What is the explanation?

Thank you!!

jsvine · 2021-10-16T17:19:02Z

Maintainer

Hello, and apologies for the late response — I lost track of this inquiry. It's a bit difficult to say with certainty without having access to the PDF, but I believe the reason you are seeing words with multiple spaces is that you have set keep_blank_chars=True. There are two types of things that look like whitespace in PDFs: (1) literal blank characters, and (2) the absence of characters.

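By default, extract_words(...) uses both (1) and (2) to distinguish between words. But keep_blank_chars=True tells the method to only use (2): literal blank characters are kept as part of a word, and a new word starts only where there is a gap between characters wider than the tolerance.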
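To make the distinction concrete, here is a minimal sketch of the idea in plain Python. It is not pdfplumber's actual implementation: `group_words` is a hypothetical helper, it handles a single line of text only, and the character dicts carry just the `text`, `x0`, and `x1` keys with made-up coordinates.

```python
def group_words(chars, x_tolerance=1, keep_blank_chars=False):
    """Simplified, single-line word grouping (illustration only).

    Assumption: each char is a dict with "text", "x0" and "x1" keys.
    This is NOT pdfplumber's real algorithm, just the same idea.
    """
    words, current, last_x1 = [], [], None
    for ch in sorted(chars, key=lambda c: c["x0"]):
        if ch["text"].isspace() and not keep_blank_chars:
            # (1) A literal blank character ends the current word.
            if current:
                words.append("".join(current))
            current, last_x1 = [], None
            continue
        # (2) An absence of characters (a gap wider than the tolerance)
        # also ends the current word.
        if current and ch["x0"] - last_x1 > x_tolerance:
            words.append("".join(current))
            current = []
        current.append(ch["text"])
        last_x1 = ch["x1"]
    if current:
        words.append("".join(current))
    return words


# "a", two literal space characters, "b", then a real gap before "c".
chars = [
    {"text": "a", "x0": 0, "x1": 5},
    {"text": " ", "x0": 5, "x1": 8},
    {"text": " ", "x0": 8, "x1": 11},
    {"text": "b", "x0": 11, "x1": 16},
    {"text": "c", "x0": 23, "x1": 28},
]

default_words = group_words(chars)                       # ['a', 'b', 'c']
kept_words = group_words(chars, keep_blank_chars=True)   # ['a  b', 'c']
```

With keep_blank_chars=True, the two literal spaces sit directly next to their neighbours, so no gap is detected and they end up inside a single "word". That matches the symptom described in the question. Leaving keep_blank_chars at its default would split on those spaces as well.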
