Table boundary detection when vertical_strategy is set to text #1210

ningmimg · 2024-09-22T10:46:11Z


I'm trying to extract data from a simple table. When the table is large, the outermost rows are never captured.

For example, in the image I uploaded, vertical degree 9, the last row of text is inexplicably excluded from the table. With {"vertical_strategy": "text", "horizontal_strategy": "lines"}, the module has already found the correct vertical boundaries of the table, but those boundaries do not extend far enough, so the topmost and bottommost rows of text are omitted. In another PDF, the topmost row was not omitted. Adjusting other table_settings parameters did not help either.

The second, smaller issue: the table data is center-aligned (self-generated), so why were the cells within each row separated correctly in the first image but not in the second?

The third, smaller issue: when my table is small, I don't even need to set vertical_strategy to "text" for the table to be recognized. When the table is larger, that no longer works, and the result is empty.

My guess is that pdfplumber applies an implicit height condition when extracting the table, so the table columns do not extend all the way even when they are detected with vertical_strategy set to "text".

I know that other methods (such as manually specifying lines) can recognize the table, but I think the best solution would still be to extract the complete table through simple adjustments to table_settings or other parameters. I also consulted GPT-4, Claude, and similar tools, but their suggestions (mostly table_settings adjustments) didn't help.

I would like to know why this happens with find_table and similar methods. Is it caused by my settings or by incorrect usage?

Looking forward to your insights!

new2.pdf
0.pdf

pdfplumber version: 0.11.4
Python version: 3.8
OS: Windows

neoyxm · 2024-09-23T07:51:41Z


This doesn't appear to be a Chinese-run project, so I'd suggest asking in English.

0 replies

jsvine (Maintainer) · 2024-10-03T02:52:45Z

Hi @ningmimg, the issue here seems to be that the left and right borders aren't explicitly defined. A common strategy for dealing with that is to pass those positions explicitly using the "explicit_vertical_lines": [...] table setting, deriving them from the other graphical elements on the page, such as the horizontal lines that span the width of the table.
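
A minimal sketch of that approach, under some assumptions: the helper names (outer_borders_from_horizontal_edges, extract_with_explicit_borders), the 1-point flatness tolerance, and the 95% width cutoff are all illustrative and not part of pdfplumber. It also assumes the table's ruling lines show up in page.edges and that the widest horizontal edges span the full table width, which may not hold for every PDF.

```python
def outer_borders_from_horizontal_edges(edges, width_ratio=0.95):
    """Guess the table's left/right x positions from its horizontal edges.

    `edges` is a list of dicts with "x0", "x1", "top", "bottom" keys,
    such as the items in pdfplumber's `page.edges`.
    Returns (left, right), or None if no horizontal edge is found.
    """
    # Treat an edge as horizontal when it is (nearly) flat and has some width.
    horizontal = [
        e for e in edges
        if abs(e["bottom"] - e["top"]) < 1 and e["x1"] > e["x0"]
    ]
    if not horizontal:
        return None
    # Keep only edges about as wide as the widest one (hypothetical cutoff),
    # on the assumption that these span the whole table.
    widest = max(e["x1"] - e["x0"] for e in horizontal)
    full_width = [
        e for e in horizontal if e["x1"] - e["x0"] >= width_ratio * widest
    ]
    left = min(e["x0"] for e in full_width)
    right = max(e["x1"] for e in full_width)
    return left, right


def extract_with_explicit_borders(path, page_index=0):
    """Extract a table, adding the derived outer borders as explicit lines."""
    import pdfplumber  # imported here so the helper above has no dependencies

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        settings = {
            "vertical_strategy": "text",
            "horizontal_strategy": "lines",
        }
        borders = outer_borders_from_horizontal_edges(page.edges)
        if borders is not None:
            # Explicit lines are combined with whatever the strategy finds.
            settings["explicit_vertical_lines"] = list(borders)
        return page.extract_table(settings)
```

Calling extract_with_explicit_borders("new2.pdf") would then try the same settings as in the question, plus the two derived borders. Whether that recovers the missing rows depends on how the lines are drawn in the PDF.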

0 replies
