Avoid newline in long sentences #350

LivingDeadCloud · 2021-02-11T13:29:15Z

Hello everybody

I'm kind of a newbie with PDFPlumber, but I find it very easy and powerful, so I would like to use it to scan a bunch of documents.

My problem is that when a long sentence wraps across several lines, PDFPlumber inserts a newline character at every line break caused by the wrapping.

For example, look at the following PDF:

Every time the sentence is split because it reaches the end of the page, PDFPlumber adds an extra newline (the red \n in the image). I would like to avoid that, and I would also like to preserve the real newline (the green \n in the image).

As a workaround, I tried the extract_text() method with a y_tolerance bigger than the difference between the doctop of a word on the second line (e.g. "Tolkien") and the doctop of a word on the first line (e.g. "R."):


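import pdfplumber

filename = 'LOTR_intro.pdf'

text = ''
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        # extract_text() may return None for a page without text
        text += (page.extract_text() or '').lower()

    # single_page was undefined in the original snippet;
    # the first page is assumed here
    single_page = pdf.pages[0]
    print(single_page.extract_text(x_tolerance=3, y_tolerance=20))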

Note that the 'doctop' of words on the first line is 59.217, while on the second line it is 77.617, so the difference is 18.4. For this reason, I chose a y_tolerance of 20.

The issues are:

The text I get is completely garbled, with characters that (apparently) have nothing to do with the original text. Here's what I get:
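The spacing arithmetic above can be checked programmatically. What follows is only a minimal sketch: it assumes word dicts shaped like those returned by `page.extract_words()` (each carrying a `doctop` key), and the helper name `line_gaps` and its `y_tolerance` default are hypothetical choices made for illustration.

```python
def line_gaps(words, y_tolerance=3):
    """Return the vertical gaps between consecutive text lines.

    `words` is a list of dicts with a "doctop" key. A word whose doctop is
    within `y_tolerance` of the first word of the current line is treated
    as belonging to that line.
    """
    line_tops = []
    for doctop in sorted(w["doctop"] for w in words):
        if not line_tops or doctop - line_tops[-1] > y_tolerance:
            line_tops.append(doctop)
    # Round to hide floating-point noise in the subtraction.
    return [round(b - a, 3) for a, b in zip(line_tops, line_tops[1:])]


# Synthetic words using the doctop values quoted in the post.
sample = [
    {"text": "R.", "doctop": 59.217},
    {"text": "Tolkien", "doctop": 77.617},
]
print(line_gaps(sample))  # [18.4]
```

With a real file, something like `line_gaps(page.extract_words())` would list the gaps on a page, which shows how large a y_tolerance would have to be before adjacent lines get merged.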

TTTlWdaanuoohrrdrglleikki ten 1tiiLreeeg9 nnnos 5Wt,'r 5so idw n.roe y orah.slrfidtlc ai thgeWh reew asf ar aRb nsIie tInlat agwisttsye e w reib snafo is oat1 tnkoe9 dr3Tei 7gphai isenac aanH hl dltoiyr gbi1 lhbp9o iu4gtf,b9ay lan.,i ntsTwahdhsie ytesdho s noimtnonov u rtdyeche lhrbv e weeeoglr fovia tpinotteel uadnbsm e i binaenyt sgos eJ iaw.nq u mRr1ei.t9ult 5ecRtnho4.

Even if I were able to avoid the "extra" newlines with this approach, how would I preserve the real one (i.e. the one between "... story." and "Written ...")?

In the meantime, I will look for another workaround.

Thanks a lot!

mkl-public · 2021-02-11T14:18:47Z

Just a remark: You talk about "extra" vs. "real" newlines. Please be aware that all newline characters in extracted text are non-real: There (usually) is no marker or anything else in a PDF indicating where a paragraph ends. Thus, the only thing text extractors can do for generic documents is determine text lines and add newlines there (and even that is sometimes difficult, in particular for tightly set text with superscripts or subscripts, or for text lines which are not parallel to a border or possibly not even straight lines to start with).

jsvine (Maintainer) · 2021-02-12T02:19:05Z

Hi @LivingDeadCloud, and thanks for your interest in this library. Echoing @mkl-public's answer, there isn't really a native concept of newlines in PDFs ("real" or otherwise). For other PDFs, you might be able to group the text into paragraphs if there were more spacing between paragraphs than between the lines within a paragraph. Unfortunately, that doesn't seem to be the case for this PDF, which appears to have no extra spacing between paragraphs.
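For PDFs that do have extra spacing between paragraphs, the grouping idea could look roughly like the sketch below. It is not part of pdfplumber: the function name `split_paragraphs`, the `(top, text)` line format, and the `gap_factor` threshold are all assumptions made for illustration, and the sample lines are made up.

```python
from statistics import median


def split_paragraphs(lines, gap_factor=1.5):
    """Group text lines into paragraphs by vertical spacing.

    `lines` is a list of (top, text) tuples sorted top to bottom. A new
    paragraph starts wherever the gap to the previous line exceeds
    `gap_factor` times the median gap. Lines within a paragraph are joined
    with a space, so a wrapped sentence comes out on a single line.
    """
    if not lines:
        return []
    gaps = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    threshold = gap_factor * median(gaps) if gaps else 0.0
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > threshold:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return [" ".join(p) for p in paragraphs]


# Made-up layout: about 18.4 pt between lines, 30 pt before a new paragraph.
lines = [
    (59.2, "first paragraph, line one"),
    (77.6, "first paragraph, line two"),
    (107.6, "second paragraph, line one"),
    (126.0, "second paragraph, line two"),
]
for paragraph in split_paragraphs(lines):
    print(paragraph)
```

On a page with uniform line spacing, such as the one in this thread, no gap exceeds the threshold and everything ends up in a single paragraph, which is the limitation described above.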

LivingDeadCloud (Author) · Feb 12, 2021

Thanks both @jsvine and @mkl-public for your answers.

The PDF I posted is just an example; the real ones I will scan are much more complex. Anyway, I think I will craft some function using the doctop property and try to split my document into paragraphs like @jsvine suggests. Not a very solid approach, since unfortunately my PDFs will have different formats, but it's probably better than nothing!
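Because spacing alone will not work for every format, one possible fallback (not suggested anywhere in this thread, so purely an assumption) is to treat a line that ends well short of the right margin as the end of a paragraph. The sketch below uses the hypothetical names `join_wrapped_lines`, `right_margin`, and `slack`; the right edge `x1` of each line's last word would have to be collected separately, for example from the `x1` key of the word dicts returned by `page.extract_words()`.

```python
def join_wrapped_lines(lines, right_margin, slack=40.0):
    """Merge wrapped lines, keeping a break after lines that end early.

    `lines` is a list of (x1, text) tuples in reading order, where x1 is
    the right edge of the last word on the line. A line that ends more than
    `slack` points before `right_margin` is assumed to close a paragraph.
    """
    paragraphs, current = [], []
    for x1, text in lines:
        current.append(text)
        if right_margin - x1 > slack:
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing lines that never ended early
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)


# Made-up coordinates: full lines end near x = 545, short lines end earlier.
wrapped = [
    (540.0, "a long sentence that runs to the"),
    (300.0, "edge and wraps."),
    (538.0, "A second paragraph that also"),
    (120.0, "wraps."),
]
print(join_wrapped_lines(wrapped, right_margin=545.0))
```

This heuristic misfires when the last line of a paragraph happens to fill the full width, so the `slack` value would likely need tuning per document format.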

Thanks!
