Replies: 1 comment 1 reply
-
Ah!
First, let me say what a great resource pdfplumber is, including an active discussion list. My goal is to turn scientific publications in PDF into semantic form (SVG and domain-specific markup such as chemistry). I have been using the excellent Apache PDFBox but am now changing from Java to Python. This discussion is to find out how pdfplumber (or PDFMiner) can extract vector graphics (e.g. using .curves).

Currently pdfplumber describes a curve as having an array of coordinates, but the PDF actually contains Bezier curves. Here are examples of what can be extracted automatically by PDFBox (I then turn it into SVG).

Here's a typical PDF (all documents are Open (CC BY)): https://github.com/petermr/openDiagram/blob/master/physchem/liion/PMC7070767/fulltext.pdf
and here is a page translated into SVG https://github.com/petermr/openDiagram/blob/master/physchem/liion/PMC7070767/svg/fulltext-page.4.svg (neglect the textual overwriting artefact).
All vector graphics in the PDF (including style) are faithfully converted to SVG - e.g.: https://github.com/petermr/openDiagram/blob/master/physchem/liion/PMC7070767/svg/fulltext-page.4.svg
with a typical graphic for a triangle as
Curves are represented by Bezier segments; for example, a circle is drawn as 4 cubic arcs of 90 degrees each.
I have heuristics that convert this to a circle.
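To make the circle case concrete, here is a minimal Python sketch of one possible heuristic. It is not the heuristic referred to above, only an illustration: the function names (`circle_as_cubics`, `cubic_point`, `fit_circle`) and the 1% tolerance are assumptions. It builds the usual four-arc approximation of a circle, then tests whether four cubic segments all lie at a near-constant distance from a common centre.

```python
import math

# Control-point offset (as a fraction of the radius) for a cubic Bezier
# that approximates a 90-degree circular arc: 4/3 * tan(pi/8) ~ 0.5523.
KAPPA = 4.0 / 3.0 * math.tan(math.pi / 8.0)


def circle_as_cubics(cx, cy, r):
    """Return 4 cubic segments (p0, p1, p2, p3) approximating a circle."""
    k = KAPPA * r
    return [
        ((cx + r, cy), (cx + r, cy + k), (cx + k, cy + r), (cx, cy + r)),
        ((cx, cy + r), (cx - k, cy + r), (cx - r, cy + k), (cx - r, cy)),
        ((cx - r, cy), (cx - r, cy - k), (cx - k, cy - r), (cx, cy - r)),
        ((cx, cy - r), (cx + k, cy - r), (cx + r, cy - k), (cx + r, cy)),
    ]


def cubic_point(seg, t):
    """Evaluate one cubic Bezier segment at parameter t in [0, 1]."""
    u = 1.0 - t
    weights = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    x = sum(w * p[0] for w, p in zip(weights, seg))
    y = sum(w * p[1] for w, p in zip(weights, seg))
    return x, y


def fit_circle(segments, rel_tol=0.01):
    """Return (cx, cy, r) if 4 cubic segments look like a circle, else None."""
    if len(segments) != 4:
        return None
    # Estimate the centre as the mean of the segment start points.
    cx = sum(seg[0][0] for seg in segments) / 4.0
    cy = sum(seg[0][1] for seg in segments) / 4.0
    # Sample points along every segment and measure their distance to the centre.
    samples = [cubic_point(seg, t) for seg in segments for t in (0.0, 0.25, 0.5, 0.75)]
    radii = [math.hypot(x - cx, y - cy) for x, y in samples]
    r = sum(radii) / len(radii)
    if r == 0 or max(abs(d - r) for d in radii) > rel_tol * r:
        return None
    return cx, cy, r
```

A real detector would probably also check that the segments join end to end and would tune the tolerance to the documents at hand; this sketch skips both.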
How can I extract the same graphics using pdfplumber?
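As a rough illustration of the target output, the Python sketch below turns a list of PDF path operators into an SVG path string. The tuple layout used here, `("m", (x, y))`, `("l", (x, y))`, `("c", (x1, y1), (x2, y2), (x3, y3))` and `("h",)`, mirrors the PDF path operators but is an assumption for illustration, not a confirmed pdfplumber or PDFMiner structure. The function name `path_to_svg_d` and the optional `page_height` flip are likewise hypothetical.

```python
def path_to_svg_d(path, page_height=None):
    """Convert a list of PDF-style path operators to an SVG 'd' string.

    Assumed input: tuples of an operator letter followed by (x, y) points.
    If page_height is given, y is flipped (PDF y grows upward, SVG downward).
    """
    svg_cmd = {"m": "M", "l": "L", "c": "C"}
    parts = []
    for op, *points in path:
        if op == "h":  # close the current subpath
            parts.append("Z")
            continue
        if op not in svg_cmd:
            raise ValueError(f"unsupported path operator: {op!r}")
        coords = []
        for x, y in points:
            if page_height is not None:
                y = page_height - y
            coords.append(f"{x:g},{y:g}")
        parts.append(svg_cmd[op] + " " + " ".join(coords))
    return " ".join(parts)


# A triangle: move, two line segments, then close the path.
triangle = [("m", (0, 0)), ("l", (10, 0)), ("l", (5, 8)), ("h",)]
print(path_to_svg_d(triangle))  # M 0,0 L 10,0 L 5,8 Z
```

Because cubic segments keep their control points in the `C` command, Bezier curves would survive the conversion intact, provided the extraction step exposes those control points in the first place.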