Content spanning two or more pages (multi page spread pdf) #433

Lauler · 2021-05-23T09:37:49Z

In certain publications, it is common to organize content in "spreads" as opposed to in pages.

For example, in pdfs generated by newspapers, the front page is usually a single page, but then

pages 2-3 are "spread nr 1"
pages 4-5 are "spread nr 2"
etc...

In practice this means that images and sometimes even text boxes can extend from one pdf page to the next.

I have attempted to extract image coordinates from a couple of newspaper pdfs in order to efficiently generate training data for an object detection model to perform segmentation and detect images in older historical newspapers. However, I've found that the extracted image coordinates sometimes either

exceed the maximum coordinate bounds of the pdf page, or
appear as negative coordinates.

After some investigation I concluded that this was because the images extend over two pages. Whenever an image extends from page 2 over to page 3, for example, the rightmost x-coordinate that is extracted will be "out of bounds" and exceed the maximum x dimension of page 2. Likewise, when I extract coordinates for the same image on page 3, the leftmost x-coordinate will be negative.
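
The check described above can be sketched as a small helper that reports which edges of a bounding box fall outside the page. This is only an illustration: the function name `out_of_bounds_edges` and the 600 x 800 page are assumptions made for the example, not part of pdfplumber or pdfminer, and the box is assumed to be given as (x0, top, x1, bottom) with the origin at the top-left of the page.

```python
def out_of_bounds_edges(bbox, page_width, page_height):
    """Return the names of the bbox edges that fall outside the page.

    bbox is (x0, top, x1, bottom) in page coordinates, origin at top-left.
    This helper is hypothetical and not part of any library.
    """
    x0, top, x1, bottom = bbox
    edges = []
    if x0 < 0:
        edges.append("left")
    if top < 0:
        edges.append("top")
    if x1 > page_width:
        edges.append("right")
    if bottom > page_height:
        edges.append("bottom")
    return edges


# Hypothetical 600 x 800 page with an image that runs off the right edge.
print(out_of_bounds_edges((20, 500, 1180, 700), 600, 800))   # ['right']
# The same image seen from the facing page starts left of x = 0.
print(out_of_bounds_edges((-580, 500, 580, 700), 600, 800))  # ['left']
```

An image flagged "right" on one page and "left" on the following page is a plausible candidate for content that spans a spread, although other causes of out-of-bounds boxes are possible.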

I was wondering whether pdfplumber or pdfminer has any functionality to treat certain pages as spreads as opposed to only using "pages" as the unit of analysis?

In my use case, I will probably concatenate together images of page spreads after the fact and write code to "correct" the coordinates to extend over the spread. But it would be nice to know if pdfplumber or pdfminer has any built-in functionality to handle non-standard layouts that extend over multiple pdf pages.

jsvine · 2021-05-26T03:09:22Z

Maintainer

Hi @Lauler, and sounds like an interesting conundrum. Can you share any example PDFs and code that demonstrate the situation?

2 replies

Lauler May 26, 2021
Author

Here are two example pdfs (I will remove link after a couple of weeks): https://drive.google.com/drive/folders/1Sp-A9Q8_u2-I0A5av9usM8yYrXnXApwv?usp=sharing

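import pdfplumber

# laparams is passed through to pdfminer's layout analysis
pdf = pdfplumber.open("SVD_V_BILAGA_2021-05-11.pdf", laparams={"line_overlap": 0.7})
page = pdf.pages[9]  # zero-based index, so this is page number 10
im = page.to_image(resolution=72)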

The advertisement image at the bottom spans two pages. If we print that image box, page.images[1], we get the following output:

{'x0': Decimal('13.314'),
 'y0': Decimal('86.100'),
 'x1': Decimal('1499.565'),
 'y1': Decimal('376.141'),
 'width': Decimal('1486.250'),
 'height': Decimal('290.042'),
 'name': 'Im1',
 'stream': <PDFStream(282): raw=1229107, {'BitsPerComponent': 8, 'ColorSpace': <PDFObjRef:265>, 'Decode': [0, 1, 0, 1, 0, 1, 0, 1], 'Filter': /'DCTDecode', 'Height': 967, 'Intent': /'RelativeColorimetric', 'Length': 1229105, 'Metadata': <PDFObjRef:671>, 'Name': /'X', 'Subtype': /'Image', 'Type': /'XObject', 'Width': 4955}>,
 'srcsize': (Decimal('4955'), Decimal('967')),
 'imagemask': None,
 'bits': 8,
 'colorspace': [[/'ICCBased',
   <PDFStream(647): raw=592174, {'Filter': /'FlateDecode', 'Length': 592172, 'N': 4}>]],
 'object_type': 'image',
 'page_number': 10,
 'top': Decimal('754.879'),
 'bottom': Decimal('1044.920'),
 'doctop': Decimal('10934.059')}

Checking the page dimensions, we can see that the x1 coordinate (1499.565) exceeds the page width.

f"height: {page.height}, width: {page.width}" 

# output
'height: 1131.020, width: 762.520'

Similarly, if we check the same image on the next page we instead get negative coordinates for the leftmost edges of the image:

page = pdf.pages[10]
page.images[0]

# output (truncated)
{'x0': Decimal('-749.608'),
 'y0': Decimal('86.100'),
 'x1': Decimal('736.642'),
 'y1': Decimal('376.141'),
 'width': Decimal('1486.250'),
 'height': Decimal('290.042'),
 'name': 'Im0',
 ...
}
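
The two outputs above describe the same image in two page-local coordinate systems. A minimal sketch of mapping both onto one spread-wide x axis follows. It assumes the two pages sit side by side with no gap and that both have the width reported above (762.520); the function name `to_spread_x` is hypothetical and not part of pdfplumber or pdfminer.

```python
def to_spread_x(x, page_is_right, left_page_width):
    """Map a page-local x coordinate onto a two-page spread.

    Assumes the left page's left edge is spread x = 0, so coordinates
    from the right page are shifted by the left page's width.
    """
    return x + left_page_width if page_is_right else x


PAGE_WIDTH = 762.520  # page.width reported above, assumed equal for both pages

# x0/x1 of the image as reported on the left page (pdf.pages[9]) ...
left = (
    to_spread_x(13.314, False, PAGE_WIDTH),
    to_spread_x(1499.565, False, PAGE_WIDTH),
)
# ... and as reported on the right page (pdf.pages[10]).
right = (
    to_spread_x(-749.608, True, PAGE_WIDTH),
    to_spread_x(736.642, True, PAGE_WIDTH),
)

print(left)
print(right)
# How far apart the two estimates of each edge are, in points.
print([round(abs(a - b), 3) for a, b in zip(left, right)])
```

With these numbers the two estimates of each edge agree to within roughly 0.4 points. The small residual may come from the pages not having identical widths or offsets, which this sketch does not account for.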

There is a different issue thread that also observed negative coordinates and out-of-bounds rects, where the author didn't really comment on why certain images may be out of bounds (sometimes for legitimate reasons). Here is the issue: #267

Unfortunately, newspapers do not seem to be consistent regarding this convention. The above pdf example was an advertisement supplement of the newspaper. It was probably generated by an organization other than the newspaper itself, using different layout and typesetting software.

The other pdf that is shared in the folder is actual content from the newspaper itself. However, whenever an image extends over a spread there, its bounding box coordinates "cut it up" in two pieces. There is an image extending over a spread on pages 8 and 9 for example. While the image box rects are slightly out of bounds of the page, they do not actually extend the full length of the image as in the previous example.

One further comment: it would be nice if you (or pdfminer) had a convenience argument that limits/truncates out of bound coordinates. E.g. if x1 exceeds page.width then set x1 = page.width. Similarly whenever x0 and y0 take on negative coordinate values, give an option to the user that the reader automatically sets them to 0 when parsing coordinates.
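
Independently of any library support, the truncation described here can be sketched as a small helper applied after extraction. The name `clamp_bbox` is hypothetical and not part of pdfplumber or pdfminer, and the box is assumed to use pdfplumber's (x0, top, x1, bottom) values rather than y0/y1.

```python
def clamp_bbox(bbox, page_width, page_height):
    """Clip (x0, top, x1, bottom) so that every edge lies within the page.

    Hypothetical helper: negative values become 0, and values past the
    page width or height are pulled back to the page edge.
    """
    x0, top, x1, bottom = bbox
    return (
        min(max(x0, 0), page_width),
        min(max(top, 0), page_height),
        min(max(x1, 0), page_width),
        min(max(bottom, 0), page_height),
    )


# Values taken from the page.images[1] output and page dimensions above.
image_box = (13.314, 754.879, 1499.565, 1044.920)
print(clamp_bbox(image_box, 762.520, 1131.020))
# (13.314, 754.879, 762.52, 1044.92)
```

Clamping discards the information that the image continues on the facing page, so it suits the case where each page is treated on its own rather than as part of a spread.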

jsvine Jul 10, 2021
Maintainer

Thank you for sharing this example and for your detailed explanation, @Lauler. It does help clarify things. A couple of thoughts below:

> I was wondering whether pdfplumber or pdfminer has any functionality to treat certain pages as spreads as opposed to only using "pages" as the unit of analysis?

As far as I can tell, this functionality does not exist in either library. There’s no real concept of “spreads” in the PDF spec, as far as I’m aware. Creating a concept of them in pdfplumber would be possible, in theory, but feels outside the scope of the library/repository. That said, I believe the fundamental pdfplumber classes and methods could be helpful in writing external code that achieves what you desire.

> One further comment: it would be nice if you (or pdfminer) had a convenience argument that limits/truncates out of bound coordinates. E.g. if x1 exceeds page.width then set x1 = page.width. Similarly whenever x0 and y0 take on negative coordinate values, give an option to the user that the reader automatically sets them to 0 when parsing coordinates.

On this and related issues above: I believe cropped = page.crop(page.bbox) would achieve this. Does it work for you / do what you’re hoping?
