possible to ID locations of words, not just chars? #916

wdchild · 2023-06-28T21:04:18Z

I've spent some time exploring pdfplumber, including the page.chars and page.rects objects and the page.extract_text method. Chars come with their positions well documented in the returned data. But is there a sensible way to identify the positions / bounding boxes of individual words in the text? For scanned images, engines like Tesseract let you bound individual words (or what they perceive as words). I'm seeing no easy way to do that with pdfplumber. What am I missing? Thanks!

jsvine (Maintainer) · 2023-06-28T21:18:24Z

Hi @wdchild, I think you're looking for the page.extract_words(...) method. You can learn more about what the method does and the parameters it accepts in the "Extracting text" section of this project's README.md file. Does that help?
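For anyone landing here later, a minimal sketch of what that can look like. It assumes the word dicts returned by page.extract_words() carry "text", "x0", "x1", "top" and "bottom" keys, as described in the README. The helper names word_boxes and words_on_page and the page_index parameter are made up for illustration and are not part of pdfplumber.

```python
def word_boxes(words):
    """Turn extract_words()-style dicts into (text, (x0, top, x1, bottom)) tuples.

    Assumes each dict has "text", "x0", "x1", "top" and "bottom" keys.
    """
    return [
        (w["text"], (w["x0"], w["top"], w["x1"], w["bottom"]))
        for w in words
    ]


def words_on_page(pdf_path, page_index=0):
    """Hypothetical helper: open a PDF and return word boxes for one page."""
    # Imported here so word_boxes() can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        # extract_words() groups neighbouring chars into words; its tolerance
        # parameters are left at their defaults in this sketch.
        return word_boxes(page.extract_words())
```

Calling words_on_page("some-file.pdf") (placeholder path) should give one tuple per word, with coordinates in the same page coordinate system as page.chars. If words get split or merged unexpectedly, the tolerance parameters described in the README are probably the first thing to try.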
