Help extracting tables with overlapping columns. #911

gnadlr · 2023-06-21T17:14:31Z

Hi, @jsvine, thanks for the awesome library.

I have been using pdfplumber to extract tables from PDFs, but one place I'm stuck is when the PDF has overlapping columns (i.e., the column does not wrap text).

For example, the original pdf has 'aaaa b|bbb' in 1st column (the | is the line where the column is cut), and '1111' in second column

When using extract_tables() or extract_text(), the first column is cut at the line, and the rest of the 1st column's text seems to be interleaved, character by character, with the 2nd column. It becomes: 'aaaa b' and 'b1b1b11'.

When using extract_words(use_text_flow=True), the text does not overlap but is joined, starting from the last space of 1st column: it becomes 'aaaa' and 'bbbb1111'.

Any idea how I can configure the extraction to get the correct result?

cmdlineluser · 2023-06-21T18:07:21Z

Interestingly, I saw this for the first time the other day.

If you cannot share an example PDF, sample.2.pdf from #907 (reply in thread) displays this issue on page 1.

https://github.com/jsvine/pdfplumber/files/11782271/sample.2.pdf

The text in the Particulars column is truncated on rows 2 and 4. It is there when you call .extract_text(); however, there is overlapping/mingling.

gnadlr · 2023-06-21T19:12:53Z

Author

I have the following idea but am not sure how to put it into code:

1. Extract using both extract_tables and extract_words, so we have the 2 wrong versions described above, and try to manipulate them to get the correct version.
2. Parse the extract_words output into a table (by locating box coordinates).
3. Determine for each row whether column 1 overlaps column 2 (compare each row of the 2 wrong versions; if they match, there is no overlap).
4. If there is overlap, locate the correct 1st character of the 2nd column. It will be the 2nd character of the 2nd column from the extract_tables method (keeping with my example 'aaaa b' and 'b1b1b11', this will be '1').
5. Locate the 1st occurrence of that character in the 2nd column from the extract_words method (in my example 'aaaa' and 'bbbb1111', it will be at the 5th position of the 2nd column).
6. Split the string at this position and add the front part back to the 1st column of extract_words (so add 'bbbb' back to 'aaaa' -> 'aaaa bbbb' -> supposedly correct). Keep the last part, '1111'.
7. Subtract the length of the 1st column from extract_tables (subtract 'aaaa b' from this supposedly correct 'aaaa bbbb'; we get 'bbb').
8. Check whether 'bbb' and '1111' could be mingled alternately to form the 2nd column text from the extract_tables method ('bbb' and '1111' to form 'b1b1b11'?).
9. If it does not match, return to step 5 and try the 2nd occurrence; repeat with the next occurrence until step 8 matches.
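
A minimal sketch of steps 3 to 9 in plain Python, working only on the strings from the example. The function names `is_interleaving` and `recover_columns` are hypothetical, not part of pdfplumber, and the sketch assumes that the text moved back into column 1 is rejoined with a single space and that the 2nd character of the extract_tables column really is the start of column 2, which may not hold for every row.

```python
def is_interleaving(a, b, merged):
    """Return True if `merged` can be built by interleaving `a` and `b`,
    keeping the characters of each string in their original order."""
    if len(a) + len(b) != len(merged):
        return False
    # reachable[j] is True when a[:i] and b[:j] can form merged[:i + j]
    reachable = [False] * (len(b) + 1)
    reachable[0] = True
    for j in range(1, len(b) + 1):
        reachable[j] = reachable[j - 1] and b[j - 1] == merged[j - 1]
    for i in range(1, len(a) + 1):
        reachable[0] = reachable[0] and a[i - 1] == merged[i - 1]
        for j in range(1, len(b) + 1):
            from_a = reachable[j] and a[i - 1] == merged[i + j - 1]
            from_b = reachable[j - 1] and b[j - 1] == merged[i + j - 1]
            reachable[j] = from_a or from_b
    return reachable[len(b)]


def recover_columns(table_row, words_row):
    """table_row: (col1, col2) as produced via extract_tables.
    words_row: (col1, col2) as rebuilt from extract_words.
    Returns the repaired (col1, col2), or None if no split position fits."""
    t1, t2 = table_row
    w1, w2 = words_row
    if (t1, t2) == (w1, w2):          # step 3: both versions agree, no overlap
        return w1, w2
    if len(t2) < 2:
        return None
    marker = t2[1]                    # step 4: assumed first char of column 2
    pos = w2.find(marker)             # step 5: first occurrence
    while pos != -1:
        col1 = f"{w1} {w2[:pos]}"     # step 6: move the front part back (assumes one space)
        col2 = w2[pos:]
        overflow = col1[len(t1):]     # step 7: the part cut off at the column line
        if is_interleaving(overflow, col2, t2):   # step 8
            return col1, col2
        pos = w2.find(marker, pos + 1)            # step 9: try the next occurrence
    return None


# With the example from this thread:
# recover_columns(('aaaa b', 'b1b1b11'), ('aaaa', 'bbbb1111'))
# returns ('aaaa bbbb', '1111')
```

The interleaving check accepts any order-preserving mix rather than a strict one-by-one alternation, because the mingled text in the example ends with two characters from the same column ('...b11').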

Do you think this is sound logic? It's very hacky, but I've tried other extractors without much luck.

cmdlineluser · 2023-06-22T03:16:59Z

It's quite possible I'm totally wrong here but it seems like this is something that needs to be fixed/accounted for within pdfplumber.

Using the https://github.com/jsvine/pdfplumber/files/11782271/sample.2.pdf example:

import pdfplumber
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

filename = 'Downloads/sample.2.pdf'
# Collect the text containers from the layout of the first page
lines = [
    element for element in next(extract_pages(filename))
    if isinstance(element, LTTextContainer)
]

>>> lines[16:18]
[<LTTextBoxHorizontal(16) 214.800,470.672,992.531,479.792 'Chq. No. 085900 BILL NO.133, 132, 139, ,138, 143, 140, 142, 141, 144, 
                                                           156, 155, 154, 164, 163, 165, 166, 167, 176, 175, 179, 178, 186, 185, 187, 
                                                           192, 193 DT.16.12.2015 RICE PAYMENT\n'>,
 <LTTextBoxHorizontal(17) 598.080,470.672,648.838,479.792 '1207631 Cr.\n'>]

The coords look like: x0,y0,x1,y1

The long line has x0 - x1 range of: 214.800 - 992.531

Next line: 598.080 - 648.838

So their ranges intersect/overlap.
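
As a quick check of that claim, here is a small sketch that computes the horizontal overlap of the two bounding boxes printed above. The helper name `x_overlap` is made up for illustration; the coordinates are copied from the pdfminer output.

```python
def x_overlap(a, b):
    """Width of the horizontal overlap between two (x0, y0, x1, y1) boxes.
    Returns 0.0 when the x-ranges do not intersect."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


long_line = (214.800, 470.672, 992.531, 479.792)   # LTTextBoxHorizontal(16)
amount = (598.080, 470.672, 648.838, 479.792)      # LTTextBoxHorizontal(17)

overlap = x_overlap(long_line, amount)
# The amount box lies entirely inside the long line's x-range,
# so the overlap equals the full width of the amount box.
print(round(overlap, 3))
```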

pdfplumber:

>>> page = pdfplumber.open(filename).pages[0]
>>> next(line.strip() for line in page.extract_text(layout=True).splitlines() if 'RICE' in line)

'13-May-16 38 BP   R   Chq. No. 085900 BILL NO.133, 132, 139, ,138, 143, 140, 142, 141,
 124644,06155766,155,154,1064,116230,7166351,C1r6.6,167,176,175,179,178,186,185,187,
 192,193DT.16.12.2015 RICEPAYMENT'

The '1207631 Cr.' appears to be mingled in with the rest of the line.

116230,7166351,C1r6.6
 ^ ^ ^ ^ ^ ^ ^ ^ ^ ^

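It looks like pdfplumber sorts these characters by x0 when "clustering":

# _objs is a private pdfminer attribute: the text lines inside a box, then the chars inside a line
chars = lines[16]._objs[0]._objs + lines[17]._objs[0]._objs
# Keep only objects that have coordinates (drops LTAnno spacing objects)
chars = [ c for c in chars if hasattr(c, 'x0') ]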

# before sort
>>> ''.join(c.get_text() for c in chars)
'Chq. No. 085900 BILL NO.133, 132, 139, ,138, 143, 140, 142, 141,
 144,156,155,154,164,163,165,166,167,176,175,179,178,186,185,187,
 192,193DT.16.12.2015RICEPAYMENT1207631Cr.'
#                               ^^^^^^^^^^

# after sort 
>>> ''.join(c.get_text() for c in sorted(chars, key=lambda c: c.x0))
'Chq. No. 085900 BILL NO.133, 132, 139, ,138, 143, 140, 142, 141,
 144,156,155,154,164,116230,7166351,C1r6.6,167,176,175,179,178,186,185,187,192,193DT.16.12.2015RICEPAYMENT'
#                     ^ ^ ^ ^ ^ ^ ^ ^ ^ ^

It seems like there needs to be a check if there are any intersections/overlaps for text objects on the same line, and keep them separated/shift them over.
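
To make the effect concrete, here is a toy sketch using plain dicts instead of real pdfplumber/pdfminer objects. It reproduces the 'aaaa b' / 'b1b1b11' example from the original question by sorting on x0, and then shows one possible way of keeping the two text runs separate. The `run` key is hypothetical (real char objects carry no such field; the grouping would have to come from the text box or line each char belongs to), and this is only an illustration, not how pdfplumber handles it.

```python
def make_run(text, start_x0, run_id, char_width=10.0):
    """Build fake char dicts with evenly spaced x0 values."""
    return [
        {"text": ch, "x0": start_x0 + i * char_width, "run": run_id}
        for i, ch in enumerate(text)
    ]


# Column 1 text overflows past x=60; column 2 text starts at x=65.
chars = make_run("aaaa bbbb", 0.0, run_id=0) + make_run("1111", 65.0, run_id=1)


def join_sorted_by_x0(chars):
    """Sort every char by x0, ignoring which run it came from."""
    return "".join(c["text"] for c in sorted(chars, key=lambda c: c["x0"]))


def join_runs_separately(chars):
    """Group chars by run first, order runs by their leftmost x0,
    then sort by x0 only within each run."""
    runs = {}
    for c in chars:
        runs.setdefault(c["run"], []).append(c)
    ordered = sorted(runs.values(), key=lambda run: min(c["x0"] for c in run))
    return [
        "".join(c["text"] for c in sorted(run, key=lambda c: c["x0"]))
        for run in ordered
    ]


print(join_sorted_by_x0(chars))     # aaaa bb1b1b11
print(join_runs_separately(chars))  # ['aaaa bbbb', '1111']
```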

It seems like it's possible to account for this, but I'm not 100% sure as I have little knowledge of PDF internals.

Maybe it needs to be posted on the issues tracker as a "bug". (I can't find any issues/discussions related to this topic)

Hopefully someone with more knowledge can let us know.

2 replies

jsvine Jun 28, 2023
Maintainer

Thanks, @cmdlineluser! Yep, I think you're right that there's a bug here. As mentioned in #912 (comment) just now, I think I've diagnosed it and am hoping to confirm and push a fix soon.

cmdlineluser Jun 28, 2023

Thank you for the confirmation @jsvine - very interesting about the clipping path stuff - this is not something I knew about.

cmdlineluser · 2023-06-22T03:57:13Z

It looks like these 3 sorted() calls are involved:

https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/utils/text.py#L219
https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/utils/text.py#L248
https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/utils/text.py#L460

If I remove them:

>>> next(line['text'] for line in page.extract_text_lines() if 'RICE' in line['text'])
'13-May-16 38 BP R Chq. No. 085900 BILL NO.133, 132, 139, ,138, 143, 140, 142, 141,
 144,156,155,154,164,163,165, 166,167,176,175,179,178,186,185,187,192,193
 DT.16.12.2015 RICEPAYMENT26406576 0 1207631 Cr.'

They obviously exist for a reason and removing them will break other things, but perhaps something can be done to deal with this particular issue.

gnadlr · 2023-06-22T06:17:53Z

Author

I submitted an issue here as suggested @cmdlineluser

