Replies: 2 comments 1 reply
-
Howdy, and thanks for the thorough writeup! In the case of this specific PDF, the issue seems to be that the PDF itself includes that space (perhaps for its own layout engine's reasons). But I can also explain more about how pdfplumber handles ligatures in general.

This specific PDF

It appears the space you're seeing comes from the PDF itself. One piece of evidence: when I open the PDF in Chrome and try searching for [...]. Perhaps more compelling, this snippet:

```python
pdf = pdfplumber.open("~/Downloads/Preface-3.pdf")
page = pdf.pages[0]
print("•".join(c["text"] for c in page.chars[355:380]))
```

... prints this: [...]

In general

As of version [...]. Regardless, the ligature-expansion affects only the output of those methods, and not the underlying information extracted by pdfplumber. You can see the specific ligatures handled here, which I based on initial research but am open to updating:

pdfplumber/pdfplumber/utils/text.py, lines 18 to 26 in 86e935d

This seems a bit more targeted than full NFKD decomposition, which I worry might have unintended effects for non-ligature Unicode characters. What do you think?
-
Thanks for the rapid, helpful answer.

I have been using the [...].

> You can see the specific ligatures handled here, which I based on initial research but am open to updating:
> pdfplumber/pdfplumber/utils/text.py, lines 18 to 26 in 86e935d (`LIGATURES = { ...`)

That looks OK for most EN documents. I would be tempted to adopt NFKD, as its designers will have thought of many things that EN-language text doesn't require. I work with collaborators who use Devanagari scripts, and although we are using EN at present, I have no idea how ligatures, and more generally diacritics, case, etc. work there. Maybe in a future release NFKD could be added as an option to [...].

[1] One typical horror is making bold characters by shifting characters a fraction of a pixel and re-emitting them. But I think PDFs are getting "cleaner".
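On the Devanagari question, a quick probe (purely illustrative, not exhaustive): conjuncts are encoded as virama sequences rather than single precomposed code points, so normalization leaves them alone, while precomposed nukta letters such as U+0958 (qa) do decompose:

```python
import unicodedata

# Devanagari qa (U+0958) carries a canonical decomposition into
# ka (U+0915) + nukta (U+093C), so NFKD (and even NFD) splits it.
qa = "\u0958"
decomposed = unicodedata.normalize("NFKD", qa)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0915', 'U+093C']

# A conjunct like "kta" is already a sequence (ka + virama + ta),
# so normalization leaves it untouched.
kta = "\u0915\u094d\u0924"
assert unicodedata.normalize("NFKD", kta) == kta
```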
-
(mainly EN)
Note
I am interested in processing large amounts (thousands of pages) of running text into HTML automatically (preserving styles, and creating empirical formatting). If anyone else is, I'm happy to share experiences.
ligatures
When text is typeset, some characters are combined into single characters (ligatures; see https://en.wikipedia.org/wiki/Ligature_(writing)). For text processing (including searching) I (and probably many others) need to split the ligatures. Examples follow:
original
Note the ligatures in "scientific" and "specifics". (Although there are no joining pixels, the "fi" glyphs are single characters.) Other ligatures in EN are ff, fl, ffl, ffi, and more rarely st, ct, ft, fj, Th. These are represented by single Unicode characters: e.g. https://www.compart.com/en/unicode/U+FB02 for fl.

(Note: the original is https://www.ipcc.ch/site/assets/uploads/2018/03/Preface-3.pdf, page 1.)
pdfplumber output converted to HTML (visual)
If the text above is run through pdfplumber we get:

Note that the ligatures are preserved as single characters BUT there is an added trailing space (presumably to estimate the width of the glyph).
Q: Where does this space come from? (It's not me.) Is it pdfminer.six or pdfplumber? I'd regard it as a bug because it's not an interword space. I agree that if the characters are rendered they might overlap, but I expect that's up to the reader to adjust, especially if the original font isn't available.

pdfplumber source

The HTML source of the above is:
Python
There are routines in Python for composing and decomposing ligatures. The default should probably be NFKD for EN and ISO-Latin text.
https://stackoverflow.com/questions/9175073/convert-hexadecimal-character-ligature-to-utf-8-character
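As a sketch of what an NFKD-based pass might look like (a hypothetical helper, not part of pdfplumber): NFKD alone also splits accented letters into base + combining mark, but a follow-up NFC pass re-composes the accents while leaving the expanded ligatures in place. Superscripts are still flattened, which is the limitation discussed above.

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    """Hypothetical post-processing helper for extracted PDF text.

    NFKD splits compatibility ligatures (fi-ligature -> "fi") but also
    decomposes accented letters; renormalizing with NFC re-composes
    those accents. Superscripts such as x2 written with U+00B2 are
    still flattened, which a targeted ligature table would avoid.
    """
    return unicodedata.normalize("NFC", unicodedata.normalize("NFKD", text))

print(expand_ligatures("scienti\ufb01c"))  # scientific
print(expand_ligatures("caf\u00e9"))       # café
```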
pdfminer
It appears that pdfminer (sic) does not decompose ligatures: https://stackoverflow.com/questions/23750863/dealing-with-ligatures-using-pdfminer-in-python
"solutions" in
pdfplumber
(note this is written from an EN-Latin point of view and may be too narrow.)