Replies: 2 comments 1 reply
-
Howdy, and thanks for the thorough writeup! In the case of this specific PDF, the issue seems to be that the PDF itself includes that space (perhaps for its own layout engine's reasons). But I can also explain more about how pdfplumber handles ligatures in general.

This specific PDF

It appears the space you're seeing comes from the PDF itself. One piece of evidence: when I open the PDF in Chrome and try searching for [...]. Perhaps more compelling, this snippet:

```python
pdf = pdfplumber.open("~/Downloads/Preface-3.pdf")
page = pdf.pages[0]
print("•".join(c["text"] for c in page.chars[355:380]))
```

... prints this: [...]

In general

As of version [...]. Regardless, the ligature-expansion affects only the output of those methods, and not the underlying information extracted by pdfplumber. You can see the specific ligatures handled here, which I based on initial research but am open to updating:

pdfplumber/pdfplumber/utils/text.py, lines 18 to 26 in 86e935d

This seems a bit more targeted than full NFKD decomposition, which I worry might have unintended effects for non-ligature Unicode characters. What do you think?
-
Thanks for the rapid, helpful answer.

I have been using the [...].

> You can see the specific ligatures handled here, which I based on initial research but am open to updating:
> pdfplumber/pdfplumber/utils/text.py, lines 18 to 26 in 86e935d (`LIGATURES = { ...`)

That looks OK for most EN documents. I would be tempted to adopt NFKD, as its designers will have thought of many things that EN-language text doesn't require. I work with collaborators who use Devanagari scripts, and although we are using EN at present, I have no idea how ligatures, and more generally diacritics, case, etc. work there. Maybe in a future release NFKD could be added as an option to [...].

[1] One typical horror is making bold characters by shifting characters a fraction of a pixel and re-emitting them. But I think PDFs are getting "cleaner".
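On the Devanagari question, a quick probe (purely illustrative, not exhaustive): conjuncts are encoded as virama sequences rather than single precomposed code points, so normalization leaves them alone, while precomposed nukta letters such as U+0958 (qa) do decompose:

```python
import unicodedata

# Devanagari qa (U+0958) carries a canonical decomposition into
# ka (U+0915) + nukta (U+093C), so NFKD (and even NFD) splits it.
qa = "\u0958"
decomposed = unicodedata.normalize("NFKD", qa)
print([f"U+{ord(c):04X}" for c in decomposed])  # ['U+0915', 'U+093C']

# A conjunct like "kta" is already a sequence (ka + virama + ta),
# so normalization leaves it untouched.
kta = "\u0915\u094d\u0924"
assert unicodedata.normalize("NFKD", kta) == kta
```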
-
(mainly EN)
Note
I am interested in processing large amounts (thousands of pages) of running text into HTML automatically (preserving styles, and creating empirical formatting). If anyone else is, I'm happy to share experiences.
ligatures
When text is typeset, some characters are combined into single characters (ligatures; see https://en.wikipedia.org/wiki/Ligature_(writing)). For text processing (including searching) I (and probably many others) need to split the ligatures. Examples follow:
original
Note the ligatures in "scientific" and "specifics". (Although there are no joining pixels, the "fi" glyphs are single characters.) Other ligatures in EN are ff, fl, ffl, ffi, and more rarely st, ct, ft, fj, Th. These are represented by single Unicode characters: e.g. https://www.compart.com/en/unicode/U+FB02 for fl.

(Note: the original is https://www.ipcc.ch/site/assets/uploads/2018/03/Preface-3.pdf, page 1.)
pdfplumber output converted to HTML (visual)
If the text above is run through pdfplumber we get:

Note that the ligatures are preserved as single characters BUT there is an added trailing space (presumably to estimate the width of the glyph).
Q: Where does this space come from? (It's not me.) Is it pdfminer.six or pdfplumber? I'd regard it as a bug because it's not an interword space. I agree that if the characters are rendered they might overlap, but I expect that's up to the reader to adjust, especially if the original font isn't available.

pdfplumber source

The HTML source of the above is:
Python
There are routines in Python for composing and decomposing ligatures. The default should probably be NFKD for EN and ISO-Latin text.
https://stackoverflow.com/questions/9175073/convert-hexadecimal-character-ligature-to-utf-8-character
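As a sketch of what an NFKD-based pass might look like (a hypothetical helper, not part of pdfplumber): NFKD alone also splits accented letters into base + combining mark, but a follow-up NFC pass re-composes the accents while leaving the expanded ligatures in place. Superscripts are still flattened, which is the limitation discussed above.

```python
import unicodedata

def expand_ligatures(text: str) -> str:
    """Hypothetical post-processing helper for extracted PDF text.

    NFKD splits compatibility ligatures (fi-ligature -> "fi") but also
    decomposes accented letters; renormalizing with NFC re-composes
    those accents. Superscripts such as x2 written with U+00B2 are
    still flattened, which a targeted ligature table would avoid.
    """
    return unicodedata.normalize("NFC", unicodedata.normalize("NFKD", text))

print(expand_ligatures("scienti\ufb01c"))  # scientific
print(expand_ligatures("caf\u00e9"))       # café
```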
pdfminer
It appears that pdfminer (sic) does not decompose ligatures: https://stackoverflow.com/questions/23750863/dealing-with-ligatures-using-pdfminer-in-python
"solutions" in
pdfplumber
(note this is written from an EN-Latin point of view and may be too narrow.)