-
In certain publications, it is common to organize content in "spreads" rather than in pages. For example, in PDFs generated by newspapers, the front page is usually a single page, but pages 2-3 then form "spread nr 1". In practice this means that images, and sometimes even text boxes, can extend from one PDF page to the next. I have attempted to extract image coordinates from a couple of newspaper PDFs in order to efficiently generate training data for an object detection model that will segment and detect images in older historical newspapers. However, I've found that the extracted image coordinates sometimes either fall outside the page bounds or are negative.
After some investigation I concluded that this was because the images extend over two pages. Whenever an image extends from page 2 over to page 3, for example, the rightmost x-coordinate that is extracted will be "out of bounds" and exceed the maximum x dimension of page 2. Likewise, when I extract coordinates for the same image on page 3, the leftmost x-coordinate will be negative. I was wondering whether pdfplumber or pdfminer has any functionality to treat certain pages as spreads, as opposed to only using "pages" as the unit of analysis? In my use case, I will probably concatenate images of the page spreads after the fact and write code to "correct" the coordinates so that they extend over the spread. But it would be nice to know whether pdfplumber or pdfminer has any built-in functionality to handle non-standard layouts that extend over multiple PDF pages.
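For reference, here is a minimal sketch of the after-the-fact correction I have in mind. It is not a pdfplumber or pdfminer feature: the helper names `to_spread_bbox` and `spread_bboxes` and the tolerance `tol` are my own hypothetical choices, and it assumes that each object is a dict with `x0`, `x1`, `top`, and `bottom` keys (as pdfplumber's image dicts have) and that both pages of a spread share the same height and vertical origin.

```python
def to_spread_bbox(obj, left_page_width, on_right_page):
    """Map a page-relative bbox to spread coordinates (hypothetical helper).

    Objects on the left page keep their x values; objects on the right
    page are shifted right by the width of the left page.
    """
    offset = left_page_width if on_right_page else 0.0
    return (obj["x0"] + offset, obj["top"], obj["x1"] + offset, obj["bottom"])


def spread_bboxes(left_objs, right_objs, left_page_width, tol=1.0):
    """Collect bboxes for a two-page spread, dropping near-duplicates.

    An image that crosses the gutter is reported on both pages; after the
    shift its two bboxes should coincide, so only the first one is kept.
    `tol` is an assumed tolerance in PDF points.
    """
    candidates = [to_spread_bbox(o, left_page_width, False) for o in left_objs]
    candidates += [to_spread_bbox(o, left_page_width, True) for o in right_objs]

    kept = []
    for box in candidates:
        is_duplicate = any(
            all(abs(a - b) <= tol for a, b in zip(box, other)) for other in kept
        )
        if not is_duplicate:
            kept.append(box)
    return kept
```

With pdfplumber, the inputs would presumably be `left_page.images`, `right_page.images`, and `left_page.width` for the two pages that make up a spread.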
-
Hi @Lauler, this sounds like an interesting conundrum. Can you share any example PDFs and code that demonstrate the situation?