How do I extract table from this PDF? #445
-
Hello, I am trying to extract the data from this PDF: https://www.nasdaq.com/docs/2020/12/30/Holiday-Calendar-Commodities-Markets.pdf

I have tried with table_settings like this:

But it skips the last row. I have tried using the very nice debug tool: im.reset().debug_tablefinder()
Replies: 1 comment 1 reply
-
Hi @jakobdo

Appreciate your interest in the library. I would recommend a 2-step process here. Taking the 3rd page as an example, if you use debug_tablefinder() with the lines strategy ({"vertical_strategy": "lines", "horizontal_strategy": "lines"}), you'll notice the output as: [image]

Step 1: Extract the table using the table strategy and store the vertical coordinates as provided by the first row of the table.

tables = page.find_tables()
header_row = tables[0].rows[0].cells
vertical_lines = [cell[0] for cell in header_row] + [header_row[-1][2]]
# Output -> [Decimal('48.625'), Decimal('399.503'), Decimal('684.000')]

Step 2: Run table extraction using the explicit vertical lines strategy with the following table_settings:

table_settings = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": vertical_lines,  # [Decimal('48.625'), Decimal('399.503'), Decimal('684.000')]
}

and the result will be: [output]
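For anyone who wants the two steps from the reply in one place, here is a minimal sketch. The function names `vertical_lines_from_header` and `extract_with_explicit_verticals` are hypothetical helpers introduced for illustration and are not part of pdfplumber. The `pdf_path` argument is assumed to be a local copy of the PDF, and the default `page_index=2` corresponds to the 3rd page used in the reply. The `None` filtering is a defensive assumption for rows with missing cells.

```python
from typing import List, Optional, Sequence, Tuple

# A cell bounding box as pdfplumber reports it: (x0, top, x1, bottom).
BBox = Tuple[float, float, float, float]


def vertical_lines_from_header(header_cells: Sequence[Optional[BBox]]) -> List[float]:
    """Step 1 helper (hypothetical): left edge of each header cell,
    plus the right edge of the last header cell."""
    cells = [c for c in header_cells if c is not None]  # skip missing cells
    if not cells:
        return []
    return [c[0] for c in cells] + [cells[-1][2]]


def extract_with_explicit_verticals(pdf_path: str, page_index: int = 2):
    """Steps 1 and 2 combined (hypothetical wrapper around pdfplumber calls)."""
    # Third-party dependency, imported here so the helper above
    # can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]

        # Step 1: find the table with the default settings and read
        # the column boundaries from its first row.
        tables = page.find_tables()
        if not tables:
            return None
        vertical_lines = vertical_lines_from_header(tables[0].rows[0].cells)

        # Step 2: re-run extraction with those boundaries as explicit
        # vertical lines, keeping the "lines" strategy horizontally.
        table_settings = {
            "vertical_strategy": "explicit",
            "horizontal_strategy": "lines",
            "explicit_vertical_lines": vertical_lines,
        }
        return page.extract_table(table_settings)
```

Calling `extract_with_explicit_verticals("Holiday-Calendar-Commodities-Markets.pdf")` on a downloaded copy should return the table as a list of rows, or `None` if no table is found on that page. Whether the last row is recovered depends on the PDF and the pdfplumber version, so it is worth checking the result with `debug_tablefinder()` as well.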