Help extracting tables with overlapping columns. #911
-
Interestingly, I saw this for the first time the other day. If you cannot share an example PDF, sample.2.pdf from #907 (reply in thread) displays this issue on page 1.
https://github.com/jsvine/pdfplumber/files/11782271/sample.2.pdf
The text in the
-
I have the following idea, but I'm not sure how to put it into code:
Do you think this is sound logic? It's very hacky, but I've tried other extractors without much luck.
-
It's quite possible I'm totally wrong here, but it seems like this is something that needs to be fixed/accounted for within pdfplumber.
Using the https://github.com/jsvine/pdfplumber/files/11782271/sample.2.pdf example:
import pdfplumber
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
filename = 'Downloads/sample.2.pdf'
lines = [
element for element in next(extract_pages(filename))
if isinstance(element, LTTextContainer)
]
>>> lines[16:18]
[<LTTextBoxHorizontal(16) 214.800,470.672,992.531,479.792 'Chq. No. 085900 BILL NO.133, 132, 139, ,138, 143, 140, 142, 141, 144,
156, 155, 154, 164, 163, 165, 166, 167, 176, 175, 179, 178, 186, 185, 187,
192, 193 DT.16.12.2015 RICE PAYMENT\n'>,
<LTTextBoxHorizontal(17) 598.080,470.672,648.838,479.792 '1207631 Cr.\n'>]
The x-coordinates (x0 - x1) look like:
Long line: 214.800 - 992.531
Next line: 598.080 - 648.838
So their ranges intersect/overlap.
pdfplumber:
>>> page = pdfplumber.open(filename).pages[0]
>>> next(line.strip() for line in page.extract_text(layout=True).splitlines() if 'RICE' in line)
'13-May-16 38 BP R Chq. No. 085900 BILL NO.133, 132, 139, ,138, 143, 140, 142, 141,
124644,06155766,155,154,1064,116230,7166351,C1r6.6,167,176,175,179,178,186,185,187,
192,193DT.16.12.2015 RICEPAYMENT' The
What appears to happen is that pdfplumber sorts these characters by x0:
chars = lines[16]._objs[0]._objs + lines[17]._objs[0]._objs
chars = [c for c in chars if hasattr(c, 'x0')]
# before sort
>>> ''.join(c.get_text() for c in chars)
'Chq. No. 085900 BILL NO.133, 132, 139, ,138, 143, 140, 142, 141,
144,156,155,154,164,163,165,166,167,176,175,179,178,186,185,187,
192,193DT.16.12.2015RICEPAYMENT1207631Cr.'
# ^^^^^^^^^^
# after sort
>>> ''.join(c.get_text() for c in sorted(chars, key=lambda c: c.x0))
'Chq. No. 085900 BILL NO.133, 132, 139, ,138, 143, 140, 142, 141,
144,156,155,154,164,116230,7166351,C1r6.6,167,176,175,179,178,186,185,187,192,193DT.16.12.2015RICEPAYMENT'
# ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
It seems like there needs to be a check for intersections/overlaps between text objects on the same line, so that they can be kept separate or shifted over. It seems possible to account for this, but I'm not 100% sure, as I have little knowledge of PDF internals.
Maybe it needs to be posted on the issue tracker as a "bug" (I can't find any issues or discussions related to this topic). Hopefully someone with more knowledge can let us know.
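The overlap check suggested above can be sketched in plain Python. This is only an illustration, not how pdfplumber works internally: the `Box` type and the `overlaps` helper are hypothetical names, and the coordinates are copied from the two `LTTextBoxHorizontal` objects printed earlier (x0, y0, x1, y1).

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical stand-in for a text box's bounding box."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def overlaps(a: Box, b: Box) -> bool:
    """Return True if two boxes share a line and their x-ranges intersect."""
    same_line = a.y0 < b.y1 and b.y0 < a.y1   # vertical ranges intersect
    x_overlap = a.x0 < b.x1 and b.x0 < a.x1   # horizontal ranges intersect
    return same_line and x_overlap


# Coordinates taken from the pdfminer output shown in this comment.
long_line = Box(214.800, 470.672, 992.531, 479.792, "Chq. No. 085900 BILL NO.133, ...")
amount = Box(598.080, 470.672, 648.838, 479.792, "1207631 Cr.")

print(overlaps(long_line, amount))
```

A check like this only detects the collision. Deciding whether to keep the two runs separate or shift one of them would still be a separate step.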
-
It looks like these 3 https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/utils/text.py#L219 If I remove them:
They obviously exist for a reason, and removing them will break other things, but perhaps something can be done to deal with this particular issue.
-
I submitted an issue here, as suggested by @cmdlineluser.
-
Hi @jsvine, thanks for the awesome library.
I have been using pdfplumber to extract tables from PDFs, but one place I'm stuck is when the PDF has overlapping columns (i.e., a column's text does not wrap and instead runs into the next column).
For example, the original PDF has 'aaaa b|bbb' in the 1st column (the | marks the line where the column is cut) and '1111' in the 2nd column.
When using extract_tables() or extract_text(), the 1st column is cut at the line, and the rest of its text is interleaved character by character with the 2nd column. It becomes: 'aaaa b' and 'b1b1b11'.
When using extract_words(use_text_flow=True), the text is not interleaved but is joined, starting from the last space of the 1st column: it becomes 'aaaa' and 'bbbb1111'.
Any idea how I can configure pdfplumber to extract this correctly?
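To make the symptom concrete, here is a toy reproduction in plain Python that does not use pdfplumber at all. Every name and number in it is an assumption made for illustration: characters are modelled as `(text, x0, run)` tuples, `run` is a hypothetical label for the column a character came from, and the column line is placed at x = 11. Sorting purely by `x0` reproduces the interleaved result, reading in stream order reproduces the joined result, and grouping by `run` before sorting keeps the two columns apart.

```python
def make_run(text, start_x, step, run):
    """Build toy characters as (text, x0, run) tuples, evenly spaced."""
    return [(ch, start_x + i * step, run) for i, ch in enumerate(text)]


def join(chars):
    """Concatenate the text of a list of toy characters."""
    return "".join(ch for ch, _, _ in chars)


# Column 1 does not wrap, so its tail spills past the column line at x = 11.
col1 = make_run("aaaa bbbb", start_x=0.0, step=2.0, run="col1")
col2 = make_run("1111", start_x=13.0, step=2.0, run="col2")
chars = col1 + col2  # order in which the text was written

# Sorting every character by x0 interleaves the two overlapping runs.
by_x = sorted(chars, key=lambda c: c[1])
left = join([c for c in by_x if c[1] < 11.0])
right = join([c for c in by_x if c[1] >= 11.0])

# Reading in stream order avoids interleaving but glues the runs together.
flow_words = join(chars).split()

# Grouping by run first, then sorting within each run, keeps them separate.
by_run = {
    run: join(sorted((c for c in chars if c[2] == run), key=lambda c: c[1]))
    for run in ("col1", "col2")
}

print(left, "|", right)
print(flow_words)
print(by_run)
```

In a real PDF the `run` label is not available directly, so it would have to be inferred, for example from the text boxes that pdfminer reports.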