Extracting table rows with cells that have partially missing borders #866
With a table like this one, where you don't have to worry about interfering with other tables on the page, you should be fine with:
table_settings = {
    "explicit_horizontal_lines": [edge["top"] for edge in page.edges]
}
(In Passing a list of
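For anyone who wants to try that setting end to end, here is a minimal sketch. The helper names `edge_tops_settings` and `extract_with_edge_tops` are hypothetical, and rounding and de-duplicating the tops is my own addition rather than part of the suggestion. As far as I understand the table finder, a bare number in `explicit_horizontal_lines` becomes a line across the full page width, which is why a border drawn under only some columns ends up closing the whole row (and why it could interfere with other tables on the same page).

```python
def edge_tops_settings(edges, ndigits=1):
    """Build table settings with a horizontal line at the top of every edge.

    `edges` is any iterable of dicts with a "top" key, such as `page.edges`.
    Rounding and de-duplicating are an added convenience, not a requirement.
    """
    tops = sorted({round(float(edge["top"]), ndigits) for edge in edges})
    return {"explicit_horizontal_lines": tops}


def extract_with_edge_tops(pdf_path, page_index=-1):
    """Extract the largest table on one page using the edge-top lines."""
    # Imported lazily so the helper above works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_table(edge_tops_settings(page.edges))
```

Calling `extract_with_edge_tops("20230413-test.pdf")` and looking at the last element of the result should show whether the final row is now complete.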
Ah, useful to know this, thanks! I will say I did simplify the example here, since the last page often has an additional short table that I am not interested in extracting. Because I am using "extract_table" (singular), I think your warning about the edge-tops example interfering with other tables on a page isn't an issue then, since I'm only extracting the biggest table on the page?
In this example pdf, the last row has the final horizontal border missing for some of the cells (first 4 and last 6 columns)
20230413-test.pdf
With default settings, the last row pdfplumber extracts looks like:
[None, None, None, None, '23:45', '63', '0.0051007', '5.1006799', '23:45', 'N/A', '0.0530657', None, None, None, None, None, None]
If I instead tweak the table settings and use an explicit horizontal strategy of page.curves + page.edges with a larger join_x_tolerance, then I get the following output for the last row of the pdf:
['23:45', '56', '0.0266173', '26.617339', '23:45', '63', '0.0051007', '5.1006799', '23:45', 'N/A', '0.0530657', '53.0657', '23:45', '48', '0.010573', '10.572958', '']
which is what I expect.
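For reference, the settings I mean look roughly like the sketch below. The function names are just for illustration, and the tolerance of 15 is a placeholder, not the exact value I used.

```python
def explicit_horizontal_settings(curves, edges, join_x_tolerance=15):
    """Table settings that use curves and edges as explicit horizontal lines.

    `curves` and `edges` are lists of object dicts such as `page.curves`
    and `page.edges`. The default tolerance is only a placeholder.
    """
    return {
        "horizontal_strategy": "explicit",
        "explicit_horizontal_lines": list(curves) + list(edges),
        "join_x_tolerance": join_x_tolerance,
    }


def last_row(page, settings):
    """Return the last row of the largest table on the page, or None."""
    table = page.extract_table(settings)
    return table[-1] if table else None
```

With a page opened through `pdfplumber.open(...)`, the call would be `last_row(page, explicit_horizontal_settings(page.curves, page.edges))`.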
It seems I can also use table settings with the same horizontal strategy as above, but instead tweaking the intersection_x_tolerance to a fairly large value. Is there a better way to ensure that the last row gets extracted properly? Of the strategies I tried above, which would you suggest for the fewest false positives or incorrect extractions if I want to use it as generally as possible? (I know, easier said than done, heh)
Thanks!!
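In case it helps with comparing the two variants, this is a rough sketch of how they could be checked side by side by counting the None cells in the last row. The helper names and the tolerance of 15 are made up for illustration, and a low None count is only a crude signal, not proof that the extraction is correct.

```python
def count_missing(row):
    """Number of cells the extractor returned as None."""
    return sum(cell is None for cell in row)


def candidate_settings(page, tolerance=15):
    """The two variants from the post; `tolerance` is a placeholder value."""
    base = {
        "horizontal_strategy": "explicit",
        "explicit_horizontal_lines": page.curves + page.edges,
    }
    return {
        "join_x": {**base, "join_x_tolerance": tolerance},
        "intersection_x": {**base, "intersection_x_tolerance": tolerance},
    }


def compare_last_rows(page, tolerance=15):
    """Map each variant name to (missing cell count, last row)."""
    results = {}
    for name, settings in candidate_settings(page, tolerance).items():
        table = page.extract_table(settings)
        row = table[-1] if table else []
        results[name] = (count_missing(row), row)
    return results
```

Applied to the two rows quoted above, `count_missing` gives 10 for the default-settings row and 0 for the row from the explicit strategy.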