Extracting footnotes and their targets #870
petermr
started this conversation in
Ask for help with specific PDFs
Interim comment: I have a hack which looks for smaller-font text at the bottom of the page, prefixed by a number. This works well. I have extracted 100 footnotes from 1 chapter of an 80-page PDF (SYR/LongerReport above). I put them in an endnote container rather than keeping them on the "page", as I am merging pages into HTML. There were 0 false negatives and about 3-4 false positives (which I will write a heuristic for). It remains to turn the superscripts in main text into |
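For anyone wanting to try the same idea, here is a rough sketch of the "small font, bottom of page, starts with a number" heuristic. It is not the actual code used above. The function name `find_footnotes` and the thresholds `bottom_fraction` and `line_tol` are made up for illustration, and it assumes the most common font size on the page is the body size, which may not hold on pages that are mostly footnotes.

```python
import re
from collections import Counter

# A footnote line is assumed to start with a 1-3 digit number, then text.
NOTE_START = re.compile(r"^(\d{1,3})\s+(\S.*)$")

def find_footnotes(words, page_height, bottom_fraction=0.35, line_tol=3.0):
    """Return [(number, text), ...] for small-font numbered lines near the page bottom.

    `words` is a list of dicts with 'text', 'x0', 'top' and 'size' keys.
    Thresholds are illustrative guesses, not tuned values.
    """
    if not words:
        return []
    # Assumption: the most frequent (rounded) font size is the body text size.
    body_size = Counter(round(w["size"], 1) for w in words).most_common(1)[0][0]
    cutoff = page_height * (1 - bottom_fraction)
    # Keep words that are clearly smaller than body text and low on the page.
    small = [w for w in words if w["size"] < body_size - 0.5 and w["top"] > cutoff]
    small.sort(key=lambda w: (w["top"], w["x0"]))
    # Group words into lines by similar 'top' coordinate.
    lines = []
    for w in small:
        if lines and abs(w["top"] - lines[-1][0]["top"]) <= line_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    notes = []
    for line in lines:
        text = " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        m = NOTE_START.match(text)
        if m:
            # A leading number starts a new footnote.
            notes.append((int(m.group(1)), m.group(2)))
        elif notes:
            # Otherwise treat the line as a continuation of the previous note.
            number, body = notes[-1]
            notes[-1] = (number, body + " " + text)
    return notes
```

With pdfplumber the input could come from `page.extract_words(extra_attrs=["size"])` and `page.height`. A continuation line that happens to begin with a number (a year, say) would be split off as a new footnote, which is one plausible source of false positives.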
(see https://report.ipcc.ch/ar6syr/pdf/IPCC_AR6_SYR_SPM.pdf page 3 for example; the latest IPCC report (last month) on climate change).
Has anyone developed pdfplumber code to extract footnotes (references and their targets)? See the page above for an example with 5 footnotes. It's a bit similar to table extraction, and the actual footnotes (superscript number + text) are formatted as a mini-table; note that the footnote numbers are left-justified. We have extracted these based on font characteristics and position, but any additional logic would be valuable.
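As a starting point for the reference side (the superscript numbers in the main text), here is a minimal sketch of a "font characteristics and position" test at the character level. It is only an assumption about how this might be done. `find_superscript_refs`, `size_ratio` and `min_raise` are hypothetical names and thresholds, and the characters are assumed to arrive in reading order with pdfplumber-style 'text', 'size', 'x0', 'top' and 'bottom' keys.

```python
from collections import Counter

def find_superscript_refs(chars, size_ratio=0.8, min_raise=1.0):
    """Return [(number, x0, top), ...] for runs of small, raised digits.

    `chars` is a list of dicts in reading order. A digit counts as a
    superscript if it is noticeably smaller than body text and its bottom
    edge sits above the bottom of the last body-size character seen.
    """
    visible = [c for c in chars if c["text"].strip()]
    if not visible:
        return []
    # Assumption: the most frequent (rounded) font size is the body text size.
    body_size = Counter(round(c["size"], 1) for c in visible).most_common(1)[0][0]
    refs = []
    run = None        # digits collected for the current reference
    prev_body = None  # last body-size character, used as the baseline
    for c in chars:
        t = c["text"]
        small_digit = t.isascii() and t.isdigit() and c["size"] <= body_size * size_ratio
        # 'bottom' grows downward, so a raised glyph has a smaller 'bottom'.
        raised = prev_body is not None and prev_body["bottom"] - c["bottom"] >= min_raise
        if small_digit and raised:
            if run is None:
                run = {"digits": "", "x0": c["x0"], "top": c["top"]}
            run["digits"] += t
        else:
            if run is not None:
                refs.append((int(run["digits"]), run["x0"], run["top"]))
                run = None
            if t.strip() and round(c["size"], 1) == body_size:
                prev_body = c
    if run is not None:
        refs.append((int(run["digits"]), run["x0"], run["top"]))
    return refs
```

The input could be `page.chars` from pdfplumber. Because the footnote numbers at the bottom of the page sit below the last body character rather than above it, they fail the "raised" test and are not reported as references. A superscript at the very start of a line would also be missed by this sketch.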