Unable to detect borderless table #931

bdthanh · 2023-07-09T17:02:47Z

Hi,
Thanks for the amazing tool, it helps a lot with our project! However, I cannot detect the borderless tables on pages 4 and 5 of the following PDF:
tables_test_cases.pdf
Here is the code I am using:

import pdfplumber

# Filter predicate: drop rects that have real area (the "invisible lines"),
# keep thin rects (drawn as lines) and every non-rect object.
def reject_2d_rects(obj):
    is_rect = obj["object_type"] == "rect"
    is_thin = obj["width"] < 1 or obj["height"] < 1
    return not (is_rect and not is_thin)

pdf = pdfplumber.open("tables_test_cases.pdf")
page0 = pdf.pages[3]
page0 = page0.filter(reject_2d_rects)

ts = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}

# This code saves the debug visual output.
im = page0.to_image(resolution=200)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

# Extract the tables.
tables = page0.extract_tables(table_settings=ts)
for table in tables:
    print()
    for row in table:
        print(row)

If I remove the code that filters out the invisible lines, this is what I get:

If I remove the table_settings, the table is still not detected.
How can I detect borderless tables and remove invisible lines at the same time, given that our documents might contain both types of table?

I also tried this code to ignore invisible lines, but it didn't work either:

def keep_visible_lines(obj):
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] is None
    return True
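
To see what each predicate actually keeps, it can help to run both on a few hand-written objects before touching the PDF. The sketch below restates the two filters from this post and applies them to made-up dicts that mimic the keys used above (`object_type`, `width`, `height`, `non_stroking_color`). The sample values are assumptions for illustration and are not taken from tables_test_cases.pdf.

```python
# The two predicates from the post, restated so this block runs on its own.
def reject_2d_rects(obj):
    is_rect = obj["object_type"] == "rect"
    is_thin = obj["width"] < 1 or obj["height"] < 1
    # Keep everything except rects that have real area.
    return not (is_rect and not is_thin)


def keep_visible_lines(obj):
    if obj["object_type"] == "rect":
        return obj["non_stroking_color"] is None
    return True


# Made-up sample objects (assumed values, for illustration only).
samples = {
    "thin rect (drawn as a line)": {
        "object_type": "rect", "width": 200.0, "height": 0.5,
        "non_stroking_color": (0, 0, 0),
    },
    "filled cell background": {
        "object_type": "rect", "width": 200.0, "height": 20.0,
        "non_stroking_color": (1, 1, 1),
    },
    "unfilled cell outline": {
        "object_type": "rect", "width": 200.0, "height": 20.0,
        "non_stroking_color": None,
    },
    "char": {
        "object_type": "char", "width": 5.0, "height": 10.0,
        "non_stroking_color": (0, 0, 0),
    },
}

# For each sample: (kept by reject_2d_rects, kept by keep_visible_lines).
kept = {
    name: (reject_2d_rects(obj), keep_visible_lines(obj))
    for name, obj in samples.items()
}

for name, (by_size, by_fill) in kept.items():
    print(f"{name:30} reject_2d_rects={by_size!s:5} keep_visible_lines={by_fill}")
```

Both predicates decide per object, from size or fill alone; neither looks at where a rect sits relative to other rects.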

cmdlineluser · 2023-07-09T19:25:41Z

I'm not sure they are invisible lines.

It looks like each rect is wrapped inside a slightly wider rect:

1 reply

bdthanh Jul 10, 2023
Author

hmm that's weird!

samkit-jain · 2023-07-09T20:38:47Z

samkit-jain
Jul 9, 2023
Collaborator

Hi @bdthanh, I appreciate your kind words and am glad the library is helping you solve problems. Have you tried the snap_tolerance setting? Using it like

import pdfplumber

pdf = pdfplumber.open("tables_test_cases.pdf")
page0 = pdf.pages[3]

ts = {"vertical_strategy": "lines", "horizontal_strategy": "lines", "snap_tolerance": 10}

# This code saves the debug visual output.
im = page0.to_image(resolution=200)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

# Extract the tables.
tables = page0.extract_tables(table_settings=ts)
for table in tables:
    print()
    for row in table:
        print(row)

gives the output as


['Maecenas congue nibh et nisl egestas, et pulvinar orci rhoncus. Ut tellus mauris, ultricies eget']
['gravida ac, tristique vel velit. Sed malesuada dolor nisi, nec ornare turpis scelerisque in.']
['Praesent consectetur fringilla nisl, ac interdum orci auctor non. Nullam at arcu fermentum,']
['elementum risus quis, scelerisque est. Morbi ante enim, dapibus ac justo sed, pulvinar auctor']
['urna. Mauris sit amet lorem ante. Cras aliquam facilisis velit eu eleifend. Morbi sit amet']
['accumsan leo.']
['Curabitur sodales urna et tortor posuere, vel condimentum nulla luctus. Morbi sit amet enim']
['eget diam ornare iaculis et quis ante. Pellentesque lobortis mattis sapien, quis convallis ligula']
['scelerisque efficitur. Fusce vitae elementum sapien. Interdum et malesuada fames ac ante']
['ipsum primis in faucibus. In malesuada sapien quis accumsan rutrum. Praesent imperdiet']
['molestie lectus, nec blandit nisl lobortis eget. Praesent egestas fermentum odio, sed tristique']
['tortor. Phasellus sit amet imperdiet diam. Duis blandit pellentesque volutpat. Interdum et']
['malesuada fames ac ante ipsum primis in faucibus.']

['Quisque nec congue est. Nulla lobortis augue vitae eleifend congue. Sed leo quam, facilisis']
['vitae maximus sed, pharetra ut est. Phasellus libero metus, efficitur vel tincidunt nec, iaculis']
['at odio. Etiam ut diam neque. Ut sed risus eget ante pharetra imperdiet tempor eu mi.']
['Quisque mollis mi sit amet nulla finibus consectetur. Curabitur rutrum mauris at posuere']
['rutrum. Praesent tempor ut lacus non sollicitudin. Suspendisse pellentesque lectus eget']
['mauris varius, ac porttitor eros scelerisque. Donec et pharetra augue. Pellentesque']
['vestibulum sapien dolor, non rutrum justo congue at.']
['Ut sit amet rhoncus quam, id consectetur dui. Orci varius natoque penatibus et magnis dis']
['parturient montes, nascetur ridiculus mus. Mauris faucibus eros bibendum dolor aliquam,']
['non convallis turpis facilisis. Integer posuere augue commodo tortor pharetra, pulvinar porta']
['est tempor. Integer scelerisque non velit ut iaculis. Pellentesque sodales viverra risus, at']
['egestas nisl molestie a. Aenean non quam metus. Mauris quis nulla mollis dolor porta']
['elementum pellentesque vel risus. Suspendisse condimentum nibh magna, in suscipit ante']
['lobortis nec. Etiam sodales urna vitae porttitor tincidunt. Vestibulum ante leo, viverra feugiat']
['ipsum vel, egestas vehicula augue. Integer at facilisis ante. Aenean efficitur tincidunt nisl vel']
['commodo.']
['Third Table for Testing: Borderless']

['Column 1', 'Column 2', 'Column 3']
['Row 1 Column 1', 'Row 1 Column 2', 'Row 1 Column 3']
['Row 2 Column 1', 'Row 2 Column 2', 'Row 2 Column 3']
['Row 3 Column 1', 'Row 3 Column 2', 'Row 3 Column 3']
['Testing no border lines: This table should have 4 rows (including the header) and 3 columns', None, None]

You can remove the last line in post processing.
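
One possible way to do that post-processing is sketched below. It assumes the unwanted row always looks like the one in the output above: text in the first cell and `None` in every other cell. The helper name `drop_spanning_rows` is hypothetical and not part of pdfplumber.

```python
def drop_spanning_rows(table):
    """Drop multi-column rows whose only content is in the first cell.

    Assumption: a caption swallowed into the table shows up as
    [text, None, None, ...], as in the output above.
    """
    cleaned = []
    for row in table:
        is_spanning = len(row) > 1 and all(cell is None for cell in row[1:])
        if not is_spanning:
            cleaned.append(row)
    return cleaned


# The last table from the output above.
table = [
    ['Column 1', 'Column 2', 'Column 3'],
    ['Row 1 Column 1', 'Row 1 Column 2', 'Row 1 Column 3'],
    ['Row 2 Column 1', 'Row 2 Column 2', 'Row 2 Column 3'],
    ['Row 3 Column 1', 'Row 3 Column 2', 'Row 3 Column 3'],
    ['Testing no border lines: This table should have 4 rows '
     '(including the header) and 3 columns', None, None],
]

cleaned = drop_spanning_rows(table)
for row in cleaned:
    print(row)
```

A genuine merged header row would match the same pattern and be dropped too, so this only fits documents where such rows are known to be captions.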

3 replies

bdthanh Jul 10, 2023
Author

Thanks for your response.

It works for the table on page 4, but the one on page 5 is detected incorrectly:

bdthanh Jul 10, 2023
Author

With the code you provided, my other table also has a problem. This table is colored, so there are multiple invisible lines (the white cell above has no problem); see image:

samkit-jain Jul 10, 2023
Collaborator

Yes, it is entirely possible that the right configuration is table-specific. Have you tried the explicit_vertical_lines setting? Furthermore, if you crop the page with p = p.crop((0.0 * float(p.width), 0.46 * float(p.height), 1.0 * float(p.width), 0.56 * float(p.height))) and use the following table settings

{
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
    "snap_tolerance": 5
}

you get the result

['Students Test 1', 'Test 2', 'Test 3', '', 'Midterm Finals', '']
['Alice', '34', '31', '40', '75', '81']
['Bob', '25', '28', '38', '78', '82']
['Charles', '19', '29', '32', '67', '79']

which, even though the header row is incorrect, may still be useful if you only need the non-header rows.
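
The fractional crop used above can be wrapped in a small helper so the fractions are easier to adjust per table. This is only a sketch: `fractional_bbox` is a hypothetical name, and the page size in the example (612 x 792 points) is an assumption, not the measured size of the page in the PDF.

```python
def fractional_bbox(width, height, top_frac, bottom_frac,
                    left_frac=0.0, right_frac=1.0):
    """Convert page fractions into an (x0, top, x1, bottom) tuple."""
    if not 0.0 <= left_frac < right_frac <= 1.0:
        raise ValueError("need 0 <= left_frac < right_frac <= 1")
    if not 0.0 <= top_frac < bottom_frac <= 1.0:
        raise ValueError("need 0 <= top_frac < bottom_frac <= 1")
    w, h = float(width), float(height)
    return (left_frac * w, top_frac * h, right_frac * w, bottom_frac * h)


# Assumed page size; with pdfplumber you would pass p.width and p.height.
bbox = fractional_bbox(612, 792, top_frac=0.46, bottom_frac=0.56)
print(bbox)

# With pdfplumber, the tuple would then be used as: p = p.crop(bbox)
```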

As for the table in your other comment, #931 (reply in thread), I am unable to find it in the PDF you shared. Could you please share the page that contains it?

cmdlineluser · 2023-07-10T10:36:04Z

I'm not sure if I'm overcomplicating things or not, but the table on page 5 looks tricky because it seems to be a mixture of both rects and text?

For this example at least, you could search for the next rect below the table to use as a stopping point to crop at.

You could then use the last header row as a new cropping point and remove the nested rects using the approach from #934

You can then build the explicit lines from the left edges of the remaining rects, together with the left-most, right-most, and bottom-most chars in the cropped area.

import pdfplumber
from operator import itemgetter

def inside(self, other):
    # True if the bbox of `self` lies within (or equals) the bbox of `other`.
    return all((
        self['x0'] >= other['x0'],
        self['top'] >= other['top'],
        self['x1'] <= other['x1'],
        self['bottom'] <= other['bottom']
    ))

def largest_parent_rect(page, self):
    # Find the widest (then tallest) rect on the page that contains `self`.
    # Returns None when that rect is `self`, i.e. there is no larger parent.
    parent_rects = [other for other in page.rects if inside(self, other)]
    if parent_rects:
        parent_rect = max(parent_rects, key=itemgetter('width', 'height'))
        if self != parent_rect:
            return parent_rect

def remove_nested_rects(page, keep_largest=False):
    # Drop every rect that sits inside a larger rect. Unless keep_largest
    # is True, the enclosing rect is dropped as well.
    def filter_condition(other):
        if other['object_type'] == 'rect':
            return tuple(other['pts']) not in rects_to_remove
        return True

    rects_to_remove = set()

    for rect in page.rects:
        parent = largest_parent_rect(page, rect)
        if parent is not None:
            rects_to_remove.add(tuple(rect['pts']))
            if keep_largest is False:
                rects_to_remove.add(tuple(parent['pts']))

    return page.filter(filter_condition)
  
pdf = pdfplumber.open("Downloads/tables_test_cases.pdf")

page5 = pdf.pages[4]

# The second table found on the page is the one of interest.
table = page5.find_tables()[1]

# First rect whose bottom edge lies below the table: the stopping point.
next_rect = next(
    (rect for rect in page5.rects if rect["bottom"] > table.bbox[3]),
    None
)

if next_rect is not None:
    bottom = next_rect["top"]
else:
    bottom = page5.bbox[3]

# Crop from the top of the last detected row down to the stopping point.
bbox = page5.bbox[0], table.rows[-1].bbox[1], table.bbox[2], bottom

crop = page5.within_bbox(bbox)
crop = remove_nested_rects(crop)

left_char = min(crop.chars, key=itemgetter('x0'))
right_char = max(crop.chars, key=itemgetter('x1'))
bottom_char = max(crop.chars, key=itemgetter('bottom'))

# Column boundaries: left edge of each rect plus the outermost chars.
explicit_vertical_lines = [rect['x0'] for rect in crop.rects]
explicit_vertical_lines.extend(
    [left_char['x0'], right_char['x1']]
)

# Row boundaries: top of every char plus the bottom of the lowest char.
explicit_horizontal_lines = [char['top'] for char in crop.chars]
explicit_horizontal_lines.append(bottom_char['bottom'])

crop.extract_table(dict(
    horizontal_strategy = "explicit",
    vertical_strategy = "explicit",
    explicit_horizontal_lines = explicit_horizontal_lines,
    explicit_vertical_lines = explicit_vertical_lines
))

[['Students', 'Test 1', 'Test 2', 'Test 3', 'Midterm', 'Finals'],
 ['Alice', '34', '31', '40', '75', '81'],
 ['Bob', '25', '28', '38', '78', '82'],
 ['Charles', '19', '29', '32', '67', '79']]

You can find out what columns fall under Test Scores

>>> page5.within_bbox(table.rows[1].bbox).extract_text()
'Test 1 Test 2 Test 3'
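
The same question can also be answered by geometry instead of text: compare the horizontal extent of the "Test Scores" header cell with the extent of each column. The sketch below works on plain tuples; the helper name `columns_under` and all coordinates are made up for illustration and are not read from the PDF.

```python
def columns_under(group_bbox, columns):
    """Return the names of columns that horizontally overlap group_bbox.

    group_bbox is (x0, top, x1, bottom); columns maps a column name to
    its (x0, x1) extent. Columns that merely touch an edge do not count.
    """
    group_x0, _, group_x1, _ = group_bbox
    return [
        name
        for name, (x0, x1) in columns.items()
        if x0 < group_x1 and x1 > group_x0
    ]


# Assumed coordinates (illustration only).
group_bbox = (150.0, 300.0, 330.0, 315.0)   # the "Test Scores" header cell
columns = {
    "Students": (72.0, 150.0),
    "Test 1": (150.0, 210.0),
    "Test 2": (210.0, 270.0),
    "Test 3": (270.0, 330.0),
    "Midterm": (330.0, 400.0),
    "Finals": (400.0, 470.0),
}

under_test_scores = columns_under(group_bbox, columns)
print(under_test_scores)
```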

bdthanh · 2023-07-18T04:49:16Z

bdthanh
Jul 18, 2023
Author

Thanks, y'all, for your responses. However, the problem is not solved yet. Here is another example I am facing:

The first row of the first table is detected as 1 cell instead of 4
The second table cannot detect the first/last column as there is no border line.

Is there a general solution that handles this case (not specific to one problematic table, but working for tables of different sizes and formats)?
The example PDF:
test.pdf

2 replies

cmdlineluser Jul 18, 2023

Perhaps there is a better approach, but these look similar to #934.

keep_largest=True would be needed in this case, along with using all four sides of the table bbox as explicit lines:

for page in pdf.pages:
    filtered_page = remove_nested_rects(page, keep_largest=True)

    for table in filtered_page.find_tables():
        table = filtered_page.crop(table.bbox).extract_table(dict(
            explicit_horizontal_lines = [table.bbox[1], table.bbox[3]],
            explicit_vertical_lines = [table.bbox[0], table.bbox[2]]
        ))
        print("-" * 42)
        for row in table:
            print(row)

------------------------------------------
['No', 'Description 1', 'Description 2', 'Description 3']
['1', 'Scenario 1', 'Risk level 1', 'Treatment 1']
['2', 'Scenario 2', 'Risk level 2', 'Treatment 2']
['3', 'Scenario 3', 'Risk level 3', 'Treatment 3']
------------------------------------------
['No', 'Description 1', 'Description 2', 'Description 3']
['1', 'Scenario 1', 'Risk level 1', 'Treatment 1']
['2', 'Scenario 2', 'Risk level 2', 'Treatment 2']
['3', 'Scenario 3', 'Risk level 3', 'Treatment 3']

It looks like snap_tolerance is a more built-in way to get rects to merge, and you can add the explicit lines around the table to close off the first/last rows/columns:

for table in page.find_tables(dict(snap_tolerance=6)):
    page.crop(table.bbox).extract_table(dict(
        snap_tolerance = 6,
        explicit_horizontal_lines = [table.bbox[1], table.bbox[3]],
        explicit_vertical_lines = [table.bbox[0], table.bbox[2]],
    ))

[['No', 'Description 1', 'Description 2', 'Description 3'],
 ['1', 'Scenario 1', 'Risk level 1', 'Treatment 1'],
 ['2', 'Scenario 2', 'Risk level 2', 'Treatment 2'],
 ['3', 'Scenario 3', 'Risk level 3', 'Treatment 3']]
[['No', 'Description 1', 'Description 2', 'Description 3'],
 ['1', 'Scenario 1', 'Risk level 1', 'Treatment 1'],
 ['2', 'Scenario 2', 'Risk level 2', 'Treatment 2'],
 ['3', 'Scenario 3', 'Risk level 3', 'Treatment 3']]
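
To make the merging idea concrete, here is a simplified illustration of snapping: coordinates that lie within a tolerance of their neighbour are grouped and moved to the group's mean. This is a sketch of the concept only, not pdfplumber's implementation, and the edge positions are made up.

```python
def snap_values(values, tolerance):
    """Snap values lying within `tolerance` of a neighbour to a shared mean."""
    if not values:
        return []

    # Group the sorted unique values: a gap larger than the tolerance
    # starts a new cluster.
    ordered = sorted(set(values))
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        if value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])

    # Map every original value to the mean of its cluster.
    snapped = {}
    for cluster in clusters:
        mean = sum(cluster) / len(cluster)
        for value in cluster:
            snapped[value] = mean

    return [snapped[value] for value in values]


# Assumed x positions: an outer rect edge and the inner rect edge 4pt away,
# for two neighbouring columns.
edges = [100.0, 104.0, 250.0, 254.0]

merged = snap_values(edges, tolerance=6)      # each pair collapses to one line
unchanged = snap_values(edges, tolerance=3)   # tolerance too small to merge
print(merged)
print(unchanged)
```

With a tolerance smaller than the gap between the nested rect edges, the edges stay separate.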

Edit: This has me wondering if there is a scenario where using the bbox as explicit lines wouldn't make sense?

explicit_horizontal_lines = [table.bbox[1], table.bbox[3]],
explicit_vertical_lines = [table.bbox[0], table.bbox[2]],
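
If this pattern is reused across pages, the two lines can be folded into a small helper that builds the settings dict from a bbox. The sketch assumes the bbox order is (x0, top, x1, bottom), matching the indexing above; `explicit_lines_from_bbox` is a hypothetical name and the sample bbox is made up.

```python
def explicit_lines_from_bbox(bbox, **extra_settings):
    """Build table settings that close off all four sides of `bbox`.

    bbox is assumed to be (x0, top, x1, bottom). Any extra keyword
    arguments (e.g. snap_tolerance) are copied into the result.
    """
    x0, top, x1, bottom = bbox
    settings = dict(extra_settings)
    settings["explicit_horizontal_lines"] = [top, bottom]
    settings["explicit_vertical_lines"] = [x0, x1]
    return settings


# Assumed bbox (illustration only).
settings = explicit_lines_from_bbox((72.0, 300.0, 540.0, 380.0), snap_tolerance=6)
print(settings)

# Intended use with the loop above:
#   page.crop(table.bbox).extract_table(
#       explicit_lines_from_bbox(table.bbox, snap_tolerance=6)
#   )
```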

Answer selected by bdthanh

bdthanh Jul 18, 2023
Author

Thanks for your solution. Will try and get back to you!
