Extract table without vertical or horizontal lines #1060
-
Hi @kucoll, and thanks for your interest in
First Case
This will be a tricky table to recognize correctly, because it contains no obvious vertical separators. You can use
The additional borders in the second and third tables appear to be caused by "invisible" lines: lines encoded in the PDF, but in a manner that makes them invisible to the human eye. I'd suggest examining the results of
Second Case
This is a similar situation to the first table in the prior example; I'd suggest the same approach.
Third Case / Fourth Case
This is a similar situation to the "invisible" lines discussed above; I'd suggest a similar approach (although in this case there might be some "invisible" lines you want to keep).
Fifth Case
Unfortunately, there is no
-
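@kucoll Hi, I took a quick look and have a few small suggestions. Extracting a table doesn't have to go through table extraction; you can also extract the words first and then assemble them into a table (the downside is that the header may not come out right and may need separate handling). This table is very regular, with "-" standing in for every empty cell, so a word's row can be determined from the midpoint of its top and bottom bounds: words whose midpoints fall within some tolerance of each other can be treated as being on the same row. As for locating the table itself, I think font size could work, because the table text seems to be one size smaller than the rest. Once the words are extracted, ["char"]["size"] shows the size, and from the distribution of the larger text you can mark out the regions where a table may lie. Hmm, thinking about it more, checking whether any large text sits between runs of small text seems easier. That would also handle tables that span pages; alternatively, checking how many words the first row of the table has should work too.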
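The two ideas above (grouping words into rows by vertical midpoint, then splitting tables wherever larger text interrupts the small table text) could be sketched roughly as follows. This is a minimal sketch, not a tested recipe for the attached PDFs: it assumes each word is a plain dict with `text`, `x0`, `top`, `bottom`, and `size` keys, and the function names, the `tolerance` value, and the `max_table_size` threshold are all hypothetical and would need tuning per document.

```python
def group_words_into_rows(words, tolerance=3.0):
    """Group word dicts into rows by the midpoint of their top/bottom bounds.

    Assumption: each word has "text", "x0", "top", "bottom" (and "size") keys.
    A word joins the current row if its midpoint is within `tolerance`
    of the midpoint of that row's first word.
    """
    def midpoint(word):
        return (word["top"] + word["bottom"]) / 2

    rows = []  # each entry: (anchor midpoint, list of words)
    for word in sorted(words, key=midpoint):
        if rows and abs(midpoint(word) - rows[-1][0]) <= tolerance:
            rows[-1][1].append(word)
        else:
            rows.append((midpoint(word), [word]))

    # Within each row, order the words left to right.
    return [sorted(row_words, key=lambda w: w["x0"]) for _, row_words in rows]


def split_tables_by_font_size(rows, max_table_size):
    """Split rows into tables wherever a row contains larger (non-table) text.

    Assumption: table text is smaller than body text, so a row counts as a
    table row only if every word in it has size <= max_table_size.
    """
    tables, current = [], []
    for row in rows:
        if all(word["size"] <= max_table_size for word in row):
            current.append([word["text"] for word in row])
        elif current:
            # Larger text ends the table that was being collected.
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables
```

For a table that continues onto the next page, one option under the same assumptions is to concatenate the rows from both pages before calling `split_tables_by_font_size`, so that the table is only split where larger text actually appears between the rows.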
-
Hi, I'm facing issues extracting the invisible tables too. I can't crop the page to specific coordinates of the table because I'm running the program on multiple PDFs where the table can appear in different positions.
https://github.com/user-attachments/files/15922643/Customer.Information.pdf
I tried it with these table_settings:
I inspected the page.rects and the page.lines but
Does anyone know how to solve this? (reposted from #123 (comment))
-
Hello,
First, thank you for this awesome library, it solves most of my problems.
I need help with a few questions about borderless table extraction. Could you offer some advice?
First Case debug table
My code is below:
This case is on page 1 of the pdf file. join.pdf
Second Case debug table
My code is below:
This case is on page 2 of the pdf file. join.pdf
I want to extract two tables from the first page. The second table comes out perfectly, but the first one seems to need some adjustment to the code?
Third Case debug table
My code is below:
This case is on page 3 of the pdf file. join.pdf
I don't know how to extract the two tables on the third page. I don't need the text between the two tables.
Fourth Case:
My code is below:
This case is on page 5 of the pdf file. join.pdf
How should I extract the two tables without the text in the middle? Should I filter out the line containing the text "在建开发产品" in Python code?
Fifth Case:
The pdf file of this case is in join.pdf
How can I identify the table on pages 3 and 4 as a single table? Is there any way to identify table data that spans pages?
@jsvine or others, can you give me some advice on the above five cases? I would be very happy to get your help.