Extracting vector graphics (SVG) from PDFs #667

petermr · 2022-06-10T09:29:46Z

First, let me say what a great resource pdfplumber is, including its active discussion list. My goal is to turn scientific publications in PDF into semantic form (SVG and domain-specific markup such as chemistry). I have been using the excellent Apache PDFBox but am now moving from Java to Python. This discussion is to find out how pdfplumber (or PDFMiner) can extract vector graphics (e.g. using .curves).
Currently pdfplumber describes a curve as an array of coordinates, but the PDF actually contains Bezier curves. Here are examples of what can be extracted automatically by PDFBox (I then turn the output into SVG).
Here's a typical PDF (all documents are Open (CC BY)): https://github.com/petermr/openDiagram/blob/master/physchem/liion/PMC7070767/fulltext.pdf
and here is a page translated into SVG (ignore the textual overwriting artefact): https://github.com/petermr/openDiagram/blob/master/physchem/liion/PMC7070767/svg/fulltext-page.4.svg
All vector graphics in the PDF (including style) are faithfully converted to SVG. A typical graphic, here a triangle, looks like this:

  <path style="fill:rgb(255,0,0);stroke:none;stroke-width:0.25;" 
           d="M281.090 488.140 L282.630 490.880 L279.500 490.880 Z"/>

Curves are represented by Beziers. This is a circle (4 cubic arcs of 90 deg. each):

  <path style="fill:rgb(179,201,62);stroke:none;stroke-width:1.0;" 
              d="M455.990 171.910 C455.990 174.410 453.960 176.450 451.460 176.450 C448.950 176.450 446.920 174.410 
                  446.920 171.910 C446.920 169.400 448.950 167.370 451.460 167.370 C453.960 167.370 455.990 169.400 
                  455.990 171.910 "/>

I have heuristics that convert this to a circle.
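
For illustration, here is a minimal sketch of what such a heuristic might look like. It is not the code I actually use. It assumes the path is exactly one `M` followed by four `C` segments, and it relies on the standard approximation that a 90 deg. circular arc has control handles of length about 0.5523 times the radius. The function name `fit_circle` and the tolerance `tol` are made up for this example.

```python
import math
import re

KAPPA = 0.5523  # handle length / radius for a 90-degree cubic arc (standard approximation)


def fit_circle(d, tol=0.05):
    """Return (cx, cy, r) if the SVG path data looks like a 4-arc circle, else None.

    Assumption: the path is 'M x y' followed by exactly four absolute 'C' segments.
    """
    commands = re.findall(r"[A-Za-z]", d)
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", d)]
    if commands != ["M"] + ["C"] * 4 or len(numbers) != 26:
        return None

    points = list(zip(numbers[0::2], numbers[1::2]))  # 13 (x, y) pairs
    on_curve = points[0::3]  # start point plus the end point of each cubic

    # The last segment must return to the start point.
    if math.dist(on_curve[0], on_curve[-1]) > tol:
        return None

    # Centre = mean of the four on-curve points; radius = mean distance to them.
    corners = on_curve[:4]
    cx = sum(x for x, _ in corners) / 4
    cy = sum(y for _, y in corners) / 4
    radii = [math.dist((cx, cy), p) for p in corners]
    r = sum(radii) / 4
    if r == 0 or max(abs(ri - r) for ri in radii) > tol * r:
        return None

    # Each control handle should be about KAPPA * r long.
    for i in range(0, 12, 3):
        handles = (math.dist(points[i], points[i + 1]),
                   math.dist(points[i + 2], points[i + 3]))
        if any(abs(h / r - KAPPA) > tol for h in handles):
            return None
    return cx, cy, r


if __name__ == "__main__":
    circle = ("M455.990 171.910 C455.990 174.410 453.960 176.450 451.460 176.450 "
              "C448.950 176.450 446.920 174.410 446.920 171.910 "
              "C446.920 169.400 448.950 167.370 451.460 167.370 "
              "C453.960 167.370 455.990 169.400 455.990 171.910")
    print(fit_circle(circle))
```

Applied to the circle above, this gives a centre near (451.46, 171.91) and a radius of about 4.54. The triangle path is rejected because its commands are not `M` plus four `C`.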

How can I extract the same graphics using pdfplumber?
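
To make the question concrete, here is a rough sketch of the most I can see how to do today. Since pdfplumber reports a curve as a list of coordinates, the points can only be joined with straight `L` segments, so any Bezier control points are lost. The helper `polyline_to_svg_path` is hypothetical (it is not part of pdfplumber). The pdfplumber usage in the trailing comments is an assumption: the key holding the points may be `"pts"` or `"points"` depending on the version, and whether y needs flipping should be checked against the installed release.

```python
def polyline_to_svg_path(points, closed=False, fill="none", stroke="black"):
    """Build an SVG <path> element that joins (x, y) points with straight lines.

    Hypothetical helper: a polyline fallback, not a Bezier reconstruction.
    """
    if not points:
        raise ValueError("need at least one point")
    (x0, y0), rest = points[0], points[1:]
    parts = [f"M{x0:.3f} {y0:.3f}"]
    parts += [f"L{x:.3f} {y:.3f}" for x, y in rest]
    if closed:
        parts.append("Z")
    d = " ".join(parts)
    return f'<path style="fill:{fill};stroke:{stroke};" d="{d}"/>'


if __name__ == "__main__":
    triangle = [(281.09, 488.14), (282.63, 490.88), (279.5, 490.88)]
    print(polyline_to_svg_path(triangle, closed=True,
                               fill="rgb(255,0,0)", stroke="none"))

# Assumed usage with pdfplumber (needs the package and a local copy of the PDF):
#
# import pdfplumber
# with pdfplumber.open("fulltext.pdf") as pdf:
#     for curve in pdf.pages[3].curves:  # page 4 of the document
#         pts = curve.get("pts") or curve.get("points")  # key name is an assumption
#         print(polyline_to_svg_path(pts))
```

What I am after is the equivalent of this with `C` segments, which needs the original control points.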

petermr · 2022-06-10T20:16:06Z

Ah!
I have just discovered #345 (comment), which suggests (I think) that pdfminer.six does not extract the control points. I wonder if anything has changed since then.

1 reply

jsvine Jun 15, 2022
Maintainer

Thanks for the kind words, @petermr! I'm not aware of any relevant changes since then, unfortunately, but it might be worth tracking/watching pdfminer/pdfminer.six#672
