Avoid newline in long sentences #350
Replies: 2 comments 1 reply
-
Just a remark: you talk about "extra" vs. "real" newlines. Please be aware that all newline characters in extracted text are non-real: there is (usually) no marker or anything else in a PDF indicating where a paragraph ends. Thus, the only thing text extractors can do for generic documents is determine text lines and add newlines there (and even that is sometimes difficult, in particular for tightly set text with superscripts or subscripts, or for text lines which are not parallel to a page border or perhaps not even straight to begin with).
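To make the point concrete, here is a minimal sketch of how an extractor might infer lines from word positions alone. It is a simplified illustration, not pdfplumber's actual implementation: the `group_into_lines` helper, its `tolerance` parameter, and the sample coordinates are all assumptions made up for this example.

```python
def group_into_lines(words, tolerance=3.0):
    """Cluster words into lines using only their vertical position.

    `words` is a list of dicts with "text", "x0" and "top" keys
    (hypothetical sample data, not read from a real PDF).
    A new line starts whenever "top" jumps by more than `tolerance`
    compared with the previous word.
    """
    lines = []
    prev_top = None
    # Visit words top-to-bottom, then left-to-right.
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if prev_top is None or word["top"] - prev_top > tolerance:
            lines.append([])
        lines[-1].append(word)
        prev_top = word["top"]
    # Within each line, order words left-to-right and join with spaces.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


# Illustrative coordinates only.
words = [
    {"text": "first", "x0": 50.0, "top": 100.0},
    {"text": "line", "x0": 80.0, "top": 100.0},
    {"text": "second", "x0": 50.0, "top": 118.4},
    {"text": "line", "x0": 95.0, "top": 118.4},
]

lines = group_into_lines(words)
text = "\n".join(lines)  # the newline is added by the extractor, not stored in the PDF
```

Nothing in the input says whether "second line" continues the same sentence or starts a new paragraph; the newline in `text` exists only because the vertical position changed.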
-
Hi @LivingDeadCloud, and thanks for your interest in this library. Echoing @mkl-public's answer, there isn't really a native concept of newlines in PDFs ("real" or otherwise). For other PDFs, you might be able to group the text into paragraphs if there were more spacing between paragraphs than between the lines within a paragraph. Unfortunately, that doesn't seem to be the case for this PDF, which appears to have no extra spacing between paragraphs.
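For PDFs that do have extra spacing between paragraphs, the idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `join_paragraphs` and `gap_factor` are hypothetical names, the input is assumed to be a list of already-detected lines with a "top" coordinate, and the sample values are invented.

```python
import statistics


def join_paragraphs(lines, gap_factor=1.5):
    """Merge lines into paragraphs based on vertical spacing.

    `lines` is a list of dicts with "text" and "top" keys, sorted
    top-to-bottom (hypothetical structure for this sketch).
    A paragraph break is assumed wherever the distance to the previous
    line exceeds `gap_factor` times the median line-to-line distance.
    """
    if not lines:
        return []
    # Distance from the top of each line to the top of the next one.
    pitches = [nxt["top"] - cur["top"] for cur, nxt in zip(lines, lines[1:])]
    if not pitches:
        return [lines[0]["text"]]
    threshold = statistics.median(pitches) * gap_factor

    paragraphs = [lines[0]["text"]]
    for line, pitch in zip(lines[1:], pitches):
        if pitch > threshold:
            paragraphs.append(line["text"])        # larger gap: new paragraph
        else:
            paragraphs[-1] += " " + line["text"]   # same paragraph: join with a space
    return paragraphs


# Invented sample: 12-unit line spacing, with a 24-unit gap before the fourth line.
sample_lines = [
    {"text": "The quick brown", "top": 100.0},
    {"text": "fox jumps over", "top": 112.0},
    {"text": "the lazy dog.", "top": 124.0},
    {"text": "A new paragraph", "top": 148.0},
    {"text": "starts here.", "top": 160.0},
]

paragraphs = join_paragraphs(sample_lines)
```

When every line is spaced equally, as appears to be the case in the PDF discussed here, no gap exceeds the threshold and the sketch returns a single paragraph, so it cannot recover the "real" newlines.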
-
Hello everybody
I'm kind of a newbie with PDFPlumber, but I find it very easy and powerful, so I would like to use it to scan a bunch of documents.
My problem is that if a sentence is long and split over several lines, PDFPlumber introduces a newline character at every point where the sentence wraps onto the next line.
For example, look at the following PDF:
Every time the sentence is split because it reaches the edge of the page, PDFPlumber adds an extra newline (the red \n in the image). I would like to avoid that, and I would also like to preserve the real newline (the green \n in the image).
As a workaround, I tried to use the extract_text() method, specifying a y_tolerance bigger than the difference between the doctop of a word on the second line (e.g. "Tolkien") and the doctop of a word on the first line (e.g. "R."):
Note that the doctop of words on the first line is 59.217, while for the second line it is 77.617, so their difference is about 18.4. For this reason, I chose a y_tolerance of 20.
The issues are:
In the meantime, I will try to look for another workaround.
Thanks a lot!