Extract table without vertical or horizontal lines #1060

kucoll · 2023-12-13T19:45:58Z

Hello,

First, thank you for this awesome library, it solves most of my problems.

I need help with a few questions about extracting borderless tables. Could you offer some advice?

First Case debug table

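My code is below:

import pdfplumber

pdf_file = "join.pdf"  # the PDF attached to this post
doc = pdfplumber.open(pdf_file)
page = doc.pages[0]
table_settings = {
    "snap_x_tolerance": 8
}
# Render the page with the detected table lines and cells drawn on top
page.to_image(resolution=200).debug_tablefinder(table_settings).show()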

This case is on page 1 of the PDF file, join.pdf.

The first borderless table on this page is not recognized. How should I adjust the code so that it is recognized correctly?
How can I filter out the meaningless little borders in the second table?

Second Case debug table

My code is below:

import pdfplumber

pdf_file = "join.pdf"  # the PDF attached to this post
doc = pdfplumber.open(pdf_file)
page = doc.pages[1]
table_settings = {}  # default settings
page.to_image(resolution=200).debug_tablefinder(table_settings).show()

This case is on page 2 of the PDF file, join.pdf.

I want to extract two tables from this page. The second table is extracted perfectly, but the first one seems to need different settings?

Third Case debug table

My code is below:

import pdfplumber

pdf_file = "join.pdf"  # the PDF attached to this post
doc = pdfplumber.open(pdf_file)
page = doc.pages[2]

# Keep only vertical edges tall enough to be real column separators
vlines = [e for e in page.vertical_edges if e["height"] > 10]
# Leftmost and rightmost x positions reached by any horizontal edge
extreme_xs = [
    min(e["x0"] for e in page.horizontal_edges),
    max(e["x1"] for e in page.horizontal_edges)
]
table_settings = {
    "horizontal_strategy": "text",
    "explicit_horizontal_lines": page.curves + page.edges,
    "explicit_vertical_lines": vlines + extreme_xs,
}
page.to_image(resolution=200).debug_tablefinder(table_settings).show()

This case is on page 3 of the PDF file, join.pdf.

I don't know how to extract the two tables on the third page. I also don't need the text between the two tables.

Fourth Case:

My code is below:

import pdfplumber

pdf_file = "join.pdf"  # the PDF attached to this post
doc = pdfplumber.open(pdf_file)
page = doc.pages[1]

# Carried over from the third case; not used by the settings below
vlines = [e for e in page.vertical_edges if e["height"] > 10]
extreme_xs = [
    min(e["x0"] for e in page.horizontal_edges),
    max(e["x1"] for e in page.horizontal_edges)
]
table_settings = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}
page.to_image(resolution=200).debug_tablefinder(table_settings).show()

This case is on page 5 of the PDF file, join.pdf.

How should I extract the two tables without the text in the middle? Should I filter out the line containing the text "在建开发产品" in Python code?

Fifth Case:

The PDF file for this case is join.pdf.

How can I identify the table on page 3 and page 4 as a single table? Is there any way to identify table data across pages?

@jsvine or others, can you give me some advice on the five cases above? I would be very grateful for your help.

jsvine · 2023-12-21T20:05:58Z

Maintainer

Hi @kucoll, and thanks for your interest in pdfplumber, and for providing detailed explanations, as well as the relevant PDF.

First Case

The first borderless table on this page is not recognized, how should I adjust the code to recognize it correctly?

This will be a tricky table to recognize correctly, because it contains no obvious vertical separators. You can use "vertical_strategy": "text", although this will also affect the extraction of the other tables on the page.
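One way to limit the side effect on the other tables, sketched here as an assumption rather than something suggested in this thread, is to crop the page to the region of the borderless table and apply the text strategies only there. The helper name `extract_borderless_table` and the example bounding box are hypothetical; `page.crop(...)` and `extract_table(...)` are the pdfplumber calls being relied on.

```python
# Settings that infer both column and row boundaries from text alignment.
TEXT_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_borderless_table(page, bbox, settings=TEXT_SETTINGS):
    """Crop `page` to `bbox` and extract a single table from that region.

    `bbox` is (x0, top, x1, bottom) in page coordinates. It has to be found
    by inspection (for example with debug_tablefinder); it is not known here.
    """
    region = page.crop(bbox)
    return region.extract_table(settings)


# Usage sketch (the bounding box below is made up):
#
#   import pdfplumber
#   with pdfplumber.open("join.pdf") as pdf:
#       rows = extract_borderless_table(pdf.pages[0], (30, 80, 560, 260))
```

The other tables on the page can then still be extracted from the uncropped page with the default line-based strategies.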

How to filter out the meaningless little border in the second table?

The additional borders in the second and third tables appear to be caused by "invisible" lines: lines encoded in the PDF, but in a manner that makes them invisible to the human eye. I'd suggest examining the results of page.lines and page.rects to determine the unique characteristics of those lines, and then using page.filter(...).extract_tables(...) to filter them out before extraction. (You can find some examples of this approach in other discussions in the forum.)
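As a rough sketch of that filtering step: the predicate below drops line and rect objects whose stroking color is white. That characteristic is only an assumption for illustration; the actual distinguishing property of the stray lines in join.pdf has to be found by inspecting page.lines and page.rects first. The name `make_edge_filter` is hypothetical.

```python
def make_edge_filter(hidden_colors=((1,), (1, 1, 1))):
    """Build a predicate for page.filter() that drops some lines and rects.

    Assumption: the unwanted lines are stroked in one of `hidden_colors`
    (by default white in grayscale or RGB). Adjust this to whatever the
    inspection of page.lines / page.rects actually shows.
    """
    hidden = {tuple(color) for color in hidden_colors}

    def keep(obj):
        # Leave chars, curves, images, etc. untouched.
        if obj.get("object_type") not in ("line", "rect"):
            return True
        color = obj.get("stroking_color")
        if color is None:
            return True
        # Colors may be stored as a bare number or as a sequence.
        if isinstance(color, (int, float)):
            color = (color,)
        return tuple(color) not in hidden

    return keep


# Usage sketch:
#
#   keep = make_edge_filter()
#   tables = page.filter(keep).extract_tables({"snap_x_tolerance": 8})
```

If some of the "invisible" lines turn out to be useful separators, the predicate can be narrowed (for example by position or length) so that only the unwanted ones are dropped.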

Second Case

I want to extract two tables from the first page, the second table is perfect, but the first one seems to need to adjust the code?

This is a similar situation to the first table in the prior example; I'd suggest the same approach.

Third Case / Fourth Case

This is a similar situation to the "invisible" lines discussed above; I'd suggest a similar approach (although in this case there might be some "invisible" lines you want to keep).

Fifth Case

How can I identify the table on page 3 and page 4 as a whole table, or is there any way to identify the table data across pages?

Unfortunately, there is no pdfplumber method for analyzing tables across pages. This will require you to write your own logic. There are some prior discussions that might be useful:
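Independently of those discussions, a minimal sketch of what such custom logic might look like is shown below. It assumes each page's table has already been extracted (for example with page.extract_table()), that all parts share the same column layout, and that a continuation page may repeat the header row. The function name `merge_page_tables` is hypothetical.

```python
def merge_page_tables(tables, header_rows=1):
    """Concatenate per-page tables into one list of rows.

    `tables` is a list of tables in page order, each a list of rows.
    If a later table starts with the same header rows as the first one,
    that repeated header is dropped.
    """
    merged = []
    header = None
    for table in tables:
        if not table:
            continue  # nothing was extracted from this page
        if header is None:
            header = table[:header_rows]
            merged.extend(table)
        elif table[:header_rows] == header:
            merged.extend(table[header_rows:])
        else:
            merged.extend(table)
    return merged


# Usage sketch for pages 3 and 4 (zero-based indices 2 and 3):
#
#   parts = [pdf.pages[i].extract_table() for i in (2, 3)]
#   whole = merge_page_tables(parts)
```

This does not handle a row that is split across the page break; that case would need extra logic, such as joining the last row of one page with the first row of the next.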


Eyderoe · 2024-01-07T12:53:23Z

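@kucoll Hi, I took a quick look and have a few small suggestions. Extracting a table doesn't have to mean using table extraction: you can also extract the words first and then assemble them into a table. (The downside is that the header may not come out right and may need separate handling.) Because this table is very regular and missing values are all filled with "-", words on the same row can be identified from the midpoint of each word's top and bottom boundaries; words whose midpoints fall within some tolerance of each other can be treated as one row. As for locating the table itself, I think font size could work, because the table text seems to be one size smaller than the rest. After extracting the words, ["char"]["size"] shows the size, and the distribution of the larger text lets you mark out the regions where a table might be. Hmm, thinking about it more, checking whether there is any large text in between the small text seems easier. That would also handle tables that span pages. Alternatively, checking how many words the first row of the table has should work too.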
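The row-grouping idea can be sketched roughly as follows. It assumes word dictionaries shaped like those returned by page.extract_words(), with "text", "x0", "top" and "bottom" keys; the function name `group_words_into_rows` and the default tolerance are assumptions chosen for illustration.

```python
def group_words_into_rows(words, tolerance=3.0):
    """Cluster words into rows by the midpoint of their top/bottom edges.

    A word joins the current row when its vertical midpoint lies within
    `tolerance` of the midpoint of the first word placed in that row.
    Returns a list of rows, each a list of word texts ordered left to right.
    """
    def midpoint(word):
        return (word["top"] + word["bottom"]) / 2

    rows = []  # each entry is [anchor_midpoint, [words in the row]]
    for word in sorted(words, key=midpoint):
        mid = midpoint(word)
        if rows and abs(mid - rows[-1][0]) <= tolerance:
            rows[-1][1].append(word)
        else:
            rows.append([mid, [word]])

    return [
        [w["text"] for w in sorted(row_words, key=lambda w: w["x0"])]
        for _, row_words in rows
    ]


# Usage sketch:
#
#   words = page.extract_words(extra_attrs=["size"])
#   rows = group_words_into_rows(words)
```

With extra_attrs=["size"], each word also carries its font size, which could be used to separate the smaller table text from the larger surrounding text before grouping.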


clj55 · 2024-06-28T07:28:39Z

Hi, I'm facing issues extracting invisible tables too. I can't crop the page to specific coordinates of the table, because I'm running the program on multiple PDFs where the table can appear in different positions.

https://github.com/user-attachments/files/15922643/Customer.Information.pdf

I tried these table_settings:

table_settings = {
    "vertical_strategy": "text",
    "text_keep_blank_chars": True,
    "horizontal_strategy": "text",
}

But it recognised the paragraph text as a table too.

I inspected page.rects and page.lines, but:
page.rects: identifies the text underlines
page.lines: empty list

Does anyone know how to solve this?

(reposted from #123 (comment))


Eyderoe Jun 28, 2024

Do you have more samples?
Or are all the files single-page, with nothing below the table?
