How do I extract table from this PDF? #445

jakobdo · 2021-06-08T09:02:49Z

Hello, I am trying to extract the data from this PDF: https://www.nasdaq.com/docs/2020/12/30/Holiday-Calendar-Commodities-Markets.pdf

I have tried with table_settings like this:
table_settings = {"vertical_strategy": "text", "keep_blank_chars": True}

But it skips the last row, and on page 3 it splits the rows after the name of the month.

I have tried using the very nice debug tool: im.reset().debug_tablefinder()
But I am not able to get any closer to a working solution. I hope someone can point me in the right direction for using this library, because it is really nice.

samkit-jain · 2021-06-08T10:20:27Z

Collaborator

Hi @jakobdo, I appreciate your interest in the library. I would recommend a two-step process here. Taking the 3rd page as an example, if you use debug_tablefinder() with the lines strategy ({"vertical_strategy": "lines", "horizontal_strategy": "lines"}), you can see which cells the lines strategy detects.

Step 1: Find the table with the default (lines) strategy and store the x-coordinates of the column edges, taken from the first row of the table.

tables = page.find_tables()
header_row = tables[0].rows[0].cells
vertical_lines = [cell[0] for cell in header_row] + [header_row[-1][2]]
# Output -> [Decimal('48.625'), Decimal('399.503'), Decimal('684.000')]

Step 2: Run table extraction using the explicit vertical lines strategy with the vertical_lines as input.

table_settings = {
    "vertical_strategy": "explicit",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": vertical_lines,  # [Decimal('48.625'), Decimal('399.503'), Decimal('684.000')]
}

and the result will be

['TRADINGALL WEEK-DAYSEXCEPT ON HOLIDAYS:', 'DATE:']
['New Year', 'January 1']
['Maundy Thursday', '']
['Good Friday', '']
['Easter Monday', '']
['Labor Day', 'May 1']
['Ascension Day', '']
['Whit Monday', '']
['National Day', 'May 17']
['Christmas Eve', 'December 24']
['Christmas Day', 'December 25']
['Boxing Day', 'December 26']
['New Years Eve', 'December 31']
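
For convenience, the two steps can be combined into one sketch. This is only an illustration under a few assumptions: `column_edges` and `extract_rows` are hypothetical helper names (not part of pdfplumber), each cell is taken to be an `(x0, top, x1, bottom)` tuple as returned by `find_tables()`, and empty (`None`) header cells are simply skipped, which may not be right for every document.

```python
from typing import List, Optional, Sequence, Tuple

# Cell bounding box in the order pdfplumber reports it: (x0, top, x1, bottom).
BBox = Tuple[float, float, float, float]


def column_edges(header_cells: Sequence[Optional[BBox]]) -> List[float]:
    """Return the left edge of every header cell plus the right edge of the last one."""
    cells = [c for c in header_cells if c is not None]  # assumption: ignore empty cells
    if not cells:
        raise ValueError("header row has no cells")
    return [c[0] for c in cells] + [cells[-1][2]]


def extract_rows(pdf_path: str, page_index: int = 2):
    """Hypothetical helper: run both steps on one page (index 2 is the 3rd page)."""
    import pdfplumber  # third-party; imported here so column_edges works without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]

        # Step 1: the default settings use the "lines" strategies.
        tables = page.find_tables()
        if not tables:
            return []
        edges = column_edges(tables[0].rows[0].cells)

        # Step 2: reuse the header's column edges as explicit vertical lines.
        settings = {
            "vertical_strategy": "explicit",
            "horizontal_strategy": "lines",
            "explicit_vertical_lines": edges,
        }
        return page.extract_table(settings) or []
```

Calling `extract_rows` with a local copy of the PDF should give rows like the ones listed above, although other pages may need their own column edges.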

jakobdo Jun 8, 2021
Author

Thanks a lot, @samkit-jain.
This was very "easy". I will have to play around with this library and this solution. Thanks. :)
