Extracting footnotes and their targets #870

petermr · 2023-04-20T09:08:20Z


See page 3 of https://report.ipcc.ch/ar6syr/pdf/IPCC_AR6_SYR_SPM.pdf for an example (the latest IPCC report on climate change, published last month).

Has anyone developed pdfplumber code to extract footnotes (references and their targets)? The page above has an example of 5 footnotes. It's a bit similar to table extraction, and the actual footnotes (superscript number + text) are formatted as a mini-table.

Footnote refs:

Footnotes

(Note numbers are left-justified.) We have extracted these based on font characteristics and position, but any additional logic would be valuable.
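For readers looking for a starting point, here is a minimal sketch of the "font characteristics" idea applied to the footnote references in the body text. It is not the author's code. It assumes character dicts shaped like pdfplumber's `page.chars` (keys `text`, `size`, `x0`, `x1`, `top`); the function name `find_superscript_refs` and the parameters `size_ratio`, `gap` and `body_bottom` are hypothetical.

```python
from collections import Counter


def find_superscript_refs(chars, size_ratio=0.8, gap=1.0, body_bottom=None):
    """Return small-font digit runs that look like footnote reference markers.

    `chars` is a list of dicts shaped like pdfplumber's `page.chars`.
    `body_bottom` (optional) excludes everything at or below that `top` value,
    e.g. the footnote area at the bottom of the page.
    """
    region = [c for c in chars if body_bottom is None or c["top"] < body_bottom]

    # Assumption: the most frequent font size on the page is the body text size.
    sizes = Counter(round(c["size"], 1) for c in region if c["text"].strip())
    if not sizes:
        return []
    body_size = sizes.most_common(1)[0][0]

    refs, current = [], None
    for c in region:
        if not (c["text"].isdigit() and c["size"] < size_ratio * body_size):
            current = None  # any other character ends the current marker
            continue
        adjacent = (
            current is not None
            and abs(c["x0"] - current["x1"]) < gap
            and abs(c["top"] - current["top"]) < gap
        )
        if adjacent:
            # Neighbouring small digits ("1", "2") form one marker ("12").
            current["text"] += c["text"]
            current["x1"] = c["x1"]
        else:
            current = {"text": c["text"], "x0": c["x0"], "x1": c["x1"], "top": c["top"]}
            refs.append(current)
    return refs
```

With a real PDF the input would be `page.chars` from a pdfplumber page. Size alone cannot tell a superscript from a subscript (the "2" in a chemical formula, say), so a vertical-offset check against the surrounding line would probably be needed as well.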

petermr · 2023-05-26T16:56:26Z


Interim comment

I have a hack which looks for smaller-font text at the bottom of the page, prefixed by a number. This works well. I have extracted 100 footnotes from 1 chapter of an 80-page PDF (SYR/LongerReport above). I put them in an endnote container rather than keeping them on the "page", as I am merging pages into HTML. There were 0 false negatives and about 3-4 false positives (which I will write a heuristic for). It remains to turn the superscripts in the main text into <a href...> links to make the notes interactive.
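The heuristic described above (smaller font, bottom of the page, line starts with a number) could be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the IPCC report: word dicts are assumed to look like the output of pdfplumber's `page.extract_words(extra_attrs=["size"])`, and the names `group_into_lines`, `extract_footnotes`, `footnotes_from_pdf`, `bottom_fraction` and `size_ratio` are made up for this example.

```python
from collections import Counter


def group_into_lines(words, tolerance=2.0):
    """Group word dicts into lines by similar `top`, each sorted left to right."""
    lines = []
    for w in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(w["top"] - lines[-1][0]["top"]) <= tolerance:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [sorted(line, key=lambda w: w["x0"]) for line in lines]


def extract_footnotes(words, page_height, bottom_fraction=0.5, size_ratio=0.9, last_num=0):
    """Return [{"number": int, "text": str}, ...] for one page."""
    # Assumption: the most frequent font size is the body text size.
    sizes = Counter(round(w["size"], 1) for w in words)
    if not sizes:
        return []
    body_size = sizes.most_common(1)[0][0]

    footnotes = []
    for line in group_into_lines(words):
        small = max(w["size"] for w in line) < size_ratio * body_size
        low = line[0]["top"] > bottom_fraction * page_height
        if not (small and low):
            continue
        first, rest = line[0]["text"], " ".join(w["text"] for w in line[1:])
        if first.isdigit() and int(first) > last_num:
            # A leading number larger than the previous one starts a new footnote.
            last_num = int(first)
            footnotes.append({"number": last_num, "text": rest})
        elif footnotes:
            # Otherwise treat the line as a continuation of the previous footnote.
            text = " ".join(w["text"] for w in line)
            footnotes[-1]["text"] = (footnotes[-1]["text"] + " " + text).strip()
    return footnotes


def footnotes_from_pdf(path, page_number):
    """Convenience wrapper; pdfplumber is imported lazily so the helpers run without it."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number - 1]
        # extra_attrs=["size"] splits words where the font size changes,
        # so a superscript footnote number becomes its own word.
        words = page.extract_words(extra_attrs=["size"])
        return extract_footnotes(words, page.height)
```

Requiring each new number to exceed the previous one is one possible way to cut down false positives such as a continuation line that happens to begin with a figure. The dominant-size assumption breaks on pages that are mostly footnotes, so the thresholds would need tuning per document.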

2 replies

williamhakim10 Jul 9, 2024

@petermr any chance you'd be willing to share the code you use/used for this?

petermr Jul 25, 2024
Author

Hi,
only just read your post.
The code is in https://github.com/petermr/amilib/ (library) and https://github.com/petermr/amiclimate (which uses amilib). It's very much alpha so you may need to search for "footnote" and "endnotes".

I use pdfplumber to create HTML, retaining everything that pdfplumber extracts (styles, fonts, weights, coords, etc.). The logic of the footnotes is done in HTML, as there we have an object model, not just an event stream.

Try https://github.com/petermr/pyamihtml/blob/main/pyamihtmlx/ami_html.py. This assumes we have used pdfplumber (in ami_pdf.py) to create raw HTML, from which we extract footnotes.

See around line 4497 (approx). Typical code:

import lxml.etree

# HtmlLib is a helper class from the author's library (not shown here).


class Footnote:
    """extracts footnotes by size and style and perhaps position
    """

    @classmethod
    def extract_footnotes(cls, fn_xpath, footnote_text_classes, html_elem):
        # new document that reuses the styles of the original HTML
        new_html_elem = HtmlLib.create_new_html_with_old_styles(html_elem)
        # elements holding the footnote numbers, selected by XPath
        footnote_number_font_elems = html_elem.xpath(fn_xpath)
        last_footnum = 0
        # footnotes are collected as list items in the new document
        ul = lxml.etree.Element("ul")
        HtmlLib.get_body(new_html_elem).append(ul)
        # messy, need a Footnote object here
        for footnote_number_elem in footnote_number_font_elems:
            last_footnum, li = Footnote.extract_footnote_and_save(
                footnote_number_elem, footnote_text_classes, last_footnum)
            if li is not None:
                ul.append(li)
        return new_html_elem
... and more ...
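The step left open in the interim comment, turning superscripts in the main text into links, is the kind of thing an object model makes straightforward. A minimal sketch, assuming the references already appear as `<sup>` elements and the footnote texts are available as a number-to-text mapping. It uses the standard library's `xml.etree.ElementTree` rather than lxml to stay dependency-free, and `link_footnotes` plus the `fn`/`fnref` id scheme are invented for this example, not taken from amilib.

```python
import xml.etree.ElementTree as ET


def link_footnotes(body, footnotes):
    """Link numeric <sup> markers in `body` to an appended endnote list.

    `body` is an ElementTree element; `footnotes` maps the footnote
    number (as a string) to the footnote text.
    """
    # Materialise the iterator first because the tree is modified in the loop.
    for sup in list(body.iter("sup")):
        num = (sup.text or "").strip()
        if num in footnotes and len(sup) == 0:
            # Move the number into an <a> inside the <sup>; the tail text
            # after the superscript is left untouched.
            sup.text = None
            a = ET.SubElement(sup, "a", {"href": f"#fn{num}", "id": f"fnref{num}"})
            a.text = num

    # Endnote container appended at the end of the body.
    ol = ET.SubElement(body, "ol", {"class": "endnotes"})
    for num, text in footnotes.items():
        li = ET.SubElement(ol, "li", {"id": f"fn{num}", "value": num})
        li.text = text
    return body
```

For example, calling `link_footnotes` on a parsed `<body>` with the mapping `{"1": "First note"}` leaves non-numeric or unknown superscripts alone and adds one `<li id="fn1">` entry. The same approach should carry over to lxml, which offers a compatible element API.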

If you have a significant amount of work to do I would be happy to discuss how amilib can help. Do you have an example of your problem? I'd be happy to work together using test-driven-development.
