Unable to detect borderless table #931
-
Hi,

import pdfplumber

# This function removes the invisible lines
def reject_2d_rects(obj):
    is_rect = obj["object_type"] == "rect"
    is_thin = obj["width"] < 1 or obj["height"] < 1
    return not (is_rect and not is_thin)

pdf = pdfplumber.open("tables_test_cases.pdf")
page0 = pdf.pages[3]
page0 = page0.filter(reject_2d_rects)
ts = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}

# This code saves the debug visual output.
im = page0.to_image(resolution=200)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

# Extract the tables.
tables = page0.extract_tables(table_settings=ts)
for table in tables:
    print()
    for row in table:
        print(row)

If I remove the code to remove invisible lines then this is what I got:

If I remove the table_settings, then I also cannot detect the table.

I also tried this code to ignore invisible lines, but it didn't work either:

def keep_visible_lines(obj):
    if obj['object_type'] == 'rect':
        return obj['non_stroking_color'] is None
    return True
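To see what the two filters above actually keep, here is a minimal sketch that runs them on hand-made object dicts instead of a real page. The sample dicts are hypothetical and only carry the keys the filters read; real pdfplumber objects have many more keys.

```python
def reject_2d_rects(obj):
    is_rect = obj["object_type"] == "rect"
    is_thin = obj["width"] < 1 or obj["height"] < 1
    return not (is_rect and not is_thin)


def keep_visible_lines(obj):
    if obj["object_type"] == "rect":
        return obj["non_stroking_color"] is None
    return True


# Hypothetical sample objects (not taken from tables_test_cases.pdf).
samples = {
    "thin_rect": {"object_type": "rect", "width": 200.0, "height": 0.5,
                  "non_stroking_color": (0, 0, 0)},
    "cell_rect": {"object_type": "rect", "width": 80.0, "height": 20.0,
                  "non_stroking_color": (1, 1, 1)},
    "char": {"object_type": "char", "width": 5.0, "height": 10.0,
             "non_stroking_color": (0, 0, 0)},
}

# Names of the objects each filter would keep on the page.
kept_by_size = sorted(n for n, obj in samples.items() if reject_2d_rects(obj))
kept_by_fill = sorted(n for n, obj in samples.items() if keep_visible_lines(obj))

print(kept_by_size)
print(kept_by_fill)
```

On these samples the first filter keeps only thin rects and non-rect objects, and the second drops every rect that has a fill color. If the visible borders in the PDF are drawn as filled cell-sized rects, either filter may remove them too.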
-
I'm not sure they are invisible lines. It looks like each rect is wrapped inside a slightly wider rect.
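One way to check for this kind of wrapping is to look for pairs of rects where one encloses the other with only a small gap on each side. This is only a sketch on hand-made dicts; the `margin` parameter and the `wrapped_pairs` name are made up for illustration, and the sample coordinates are hypothetical.

```python
def wrapped_pairs(rects, margin=3.0):
    """Return (inner, outer) index pairs where `outer` encloses `inner`
    and extends past it by at most `margin` points on every side."""
    pairs = []
    for i, inner in enumerate(rects):
        for j, outer in enumerate(rects):
            if i == j:
                continue
            # Distance from each side of the inner rect to the outer rect.
            gaps = (
                inner["x0"] - outer["x0"],
                inner["top"] - outer["top"],
                outer["x1"] - inner["x1"],
                outer["bottom"] - inner["bottom"],
            )
            encloses = all(0 <= gap <= margin for gap in gaps)
            if encloses and any(gap > 0 for gap in gaps):
                pairs.append((i, j))
    return pairs


# Hypothetical rects using the same bbox keys pdfplumber exposes.
sample_rects = [
    {"x0": 10.0, "top": 10.0, "x1": 60.0, "bottom": 30.0},    # inner cell
    {"x0": 9.5, "top": 9.5, "x1": 60.5, "bottom": 30.5},      # slightly wider wrapper
    {"x0": 100.0, "top": 10.0, "x1": 150.0, "bottom": 30.0},  # unrelated rect
]

pairs = wrapped_pairs(sample_rects)
print(pairs)
```

With a real page you could pass `page.rects` instead of `sample_rects` and see how many pairs come back.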
-
Hi @bdthanh

Appreciate your kind words and glad the library is helping you solve problems. Have you tried using the snap_tolerance setting?

import pdfplumber

pdf = pdfplumber.open("tables_test_cases.pdf")
page0 = pdf.pages[3]
ts = {"vertical_strategy": "lines", "horizontal_strategy": "lines", "snap_tolerance": 10}

# This code saves the debug visual output.
im = page0.to_image(resolution=200)
im.reset().debug_tablefinder(ts)
im.save("image.png", format="PNG")

# Extract the tables.
tables = page0.extract_tables(table_settings=ts)
for table in tables:
    print()
    for row in table:
        print(row)

gives the output as

You can remove the last line in post processing.
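For anyone wondering why a larger snap_tolerance helps here: roughly, nearby parallel edges get pulled onto a shared position, so the inner and outer rect borders collapse into one line. The sketch below is a simplified illustration of that idea on plain numbers, not pdfplumber's actual implementation; `snap_values` and the sample positions are made up.

```python
def snap_values(values, tolerance):
    """Group sorted values whose neighbours differ by <= tolerance,
    then replace each value with the mean of its group."""
    if not values:
        return []
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for value in ordered[1:]:
        if value - groups[-1][-1] <= tolerance:
            groups[-1].append(value)
        else:
            groups.append([value])
    # Map every original value to the mean of the group it landed in.
    snapped = {}
    for group in groups:
        mean = sum(group) / len(group)
        for value in group:
            snapped[value] = mean
    return [snapped[value] for value in values]


# Hypothetical x-positions of vertical edges on a page.
edge_positions = [100.0, 104.0, 108.0, 250.0, 256.0]

tight = snap_values(edge_positions, tolerance=3)
loose = snap_values(edge_positions, tolerance=10)
print(tight)
print(loose)
```

With the small tolerance every edge stays separate, while the larger one merges the edges into two lines. The trade-off is that a tolerance that is too large can also merge borders of genuinely narrow columns or rows.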
-
I'm not sure if I'm overcomplicating things or not, but the table on page 5 looks tricky because it seems to be a mixture of both rects and text?

For this example at least, you could search for the next rect below the table to use as a stopping point to crop at.

You could then use the last header row as a new cropping point and remove the nested rects using the approach from #934.

You can then create the explicit lines using the left rect vertical lines along with the left, right, and bottom chars in the cropped area.

import pdfplumber
from operator import itemgetter

def inside(self, other):
    # True if self's bbox lies within (or equals) other's bbox.
    return all((
        self['x0'] >= other['x0'],
        self['top'] >= other['top'],
        self['x1'] <= other['x1'],
        self['bottom'] <= other['bottom']
    ))

def largest_parent_rect(page, self):
    # parent_rects always includes self, so None is returned
    # when self is already the largest enclosing rect.
    parent_rects = [other for other in page.rects if inside(self, other)]
    if parent_rects:
        parent_rect = max(parent_rects, key=itemgetter('width', 'height'))
        if self != parent_rect:
            return parent_rect

def remove_nested_rects(page, keep_largest=False):
    def filter_condition(other):
        if other['object_type'] == 'rect':
            return tuple(other['pts']) not in rects_to_remove
        return True

    # Rects are identified by their corner points.
    rects_to_remove = set()
    for rect in page.rects:
        parent = largest_parent_rect(page, rect)
        if parent is not None:
            rects_to_remove.add(tuple(rect['pts']))
            if keep_largest is False:
                rects_to_remove.add(tuple(parent['pts']))
    return page.filter(filter_condition)

pdf = pdfplumber.open("Downloads/tables_test_cases.pdf")
page5 = pdf.pages[4]
table = page5.find_tables()[1]

# Stop at the first rect below the detected table, or at the page bottom.
next_rect = next(
    (rect for rect in page5.rects if rect["bottom"] > table.bbox[3]),
    None
)
if next_rect is not None:
    bottom = next_rect["top"]
else:
    bottom = page5.bbox[3]

# Crop from the top of the last header row down to the stopping point.
bbox = page5.bbox[0], table.rows[-1].bbox[1], table.bbox[2], bottom
crop = page5.within_bbox(bbox)
crop = remove_nested_rects(crop)

left_char = min(crop.chars, key=itemgetter('x0'))
right_char = max(crop.chars, key=itemgetter('x1'))
bottom_char = max(crop.chars, key=itemgetter('bottom'))

# Vertical lines: left edge of each rect plus the outermost chars.
explicit_vertical_lines = [rect['x0'] for rect in crop.rects]
explicit_vertical_lines.extend(
    [ left_char['x0'], right_char['x1'] ]
)

# Horizontal lines: top of every char plus the bottom of the lowest char.
explicit_horizontal_lines = [char['top'] for char in crop.chars]
explicit_horizontal_lines.append(bottom_char['bottom'])

crop.extract_table(dict(
    horizontal_strategy = "explicit",
    vertical_strategy = "explicit",
    explicit_horizontal_lines = explicit_horizontal_lines,
    explicit_vertical_lines = explicit_vertical_lines
))

[['Students', 'Test 1', 'Test 2', 'Test 3', 'Midterm', 'Finals'],
 ['Alice', '34', '31', '40', '75', '81'],
 ['Bob', '25', '28', '38', '78', '82'],
 ['Charles', '19', '29', '32', '67', '79']]

You can find out what columns fall under

>>> page5.within_bbox(table.rows[1].bbox).extract_text()
'Test 1 Test 2 Test 3'
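The explicit-line step can be tried out without a PDF. The sketch below rebuilds it on hand-made char and rect dicts and additionally de-duplicates and sorts the positions, which the code above does not do. The `explicit_lines` helper, its `ndigits` parameter, and the sample coordinates are all hypothetical.

```python
def explicit_lines(chars, rects, ndigits=1):
    """Build sorted, de-duplicated explicit line positions from the
    left edges of rects and the extents of the chars."""
    xs = {rect["x0"] for rect in rects}
    xs.add(min(char["x0"] for char in chars))
    xs.add(max(char["x1"] for char in chars))

    ys = {char["top"] for char in chars}
    ys.add(max(char["bottom"] for char in chars))

    # Rounding merges positions that differ only by float noise.
    vertical = sorted({round(x, ndigits) for x in xs})
    horizontal = sorted({round(y, ndigits) for y in ys})
    return vertical, horizontal


# Hypothetical two-row, two-column layout.
sample_chars = [
    {"x0": 12.0, "x1": 40.0, "top": 100.0, "bottom": 110.0},
    {"x0": 62.0, "x1": 72.0, "top": 100.0, "bottom": 110.0},
    {"x0": 12.0, "x1": 30.0, "top": 120.0, "bottom": 130.0},
    {"x0": 62.0, "x1": 72.0, "top": 120.0, "bottom": 130.0},
]
sample_rects = [{"x0": 60.0}]  # left border of the second column

vertical, horizontal = explicit_lines(sample_chars, sample_rects)
print(vertical)
print(horizontal)
```

The two lists could then be passed as explicit_vertical_lines and explicit_horizontal_lines in the table settings, as in the code above.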
-
Thanks y'all for your responses. However, this problem is not solved. Here is another example that I am facing:
Is there any solution that handles this case in general (like not specific to one problem table, but for all of them with different sizes/formats)?
Perhaps there is a better approach, but they are similar to #934. keep_largest=True would be needed in this case, along with using all sides as explicit lines: