Detecting table boundaries when vertical_strategy is set to text #1210
ningmimg started this conversation in Ask for help with specific PDFs
Replies: 2 comments
-
This doesn't appear to be a Chinese-run project, so I'd suggest asking in English.
-
Hi @ningmimg, the issue here seems to be that the left and right borders aren't explicitly defined. A common strategy for dealing with that is to pass those positions explicitly using the "explicit_vertical_lines": [...] table setting, deriving the positions from the other graphical elements on the page, such as the horizontal lines that span the width of the table.
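A minimal sketch of that idea, assuming the table's horizontal rules show up in `page.lines` (in some PDFs they are drawn as thin rectangles and appear in `page.rects` instead). The helper names `outer_edges_from_horizontal_lines`, `build_table_settings`, and `extract_with_explicit_edges` are made up for illustration and are not part of pdfplumber; only `pdfplumber.open`, `page.lines`, and `page.extract_table` are library calls.

```python
def outer_edges_from_horizontal_lines(lines, min_width=0.0):
    """Return [left_x, right_x] covered by the horizontal lines in `lines`.

    `lines` is a list of dicts with "x0", "x1", "top", "bottom" keys,
    which is the shape of the objects in pdfplumber's `page.lines`.
    """
    horizontals = [
        ln for ln in lines
        # A line counts as horizontal if it has (almost) no height.
        if abs(ln["bottom"] - ln["top"]) < 1
        and (ln["x1"] - ln["x0"]) >= min_width
    ]
    if not horizontals:
        return []
    left = min(ln["x0"] for ln in horizontals)
    right = max(ln["x1"] for ln in horizontals)
    return [left, right]


def build_table_settings(lines):
    """Keep the original strategies and add the derived outer borders."""
    return {
        "vertical_strategy": "text",
        "horizontal_strategy": "lines",
        "explicit_vertical_lines": outer_edges_from_horizontal_lines(lines),
    }


def extract_with_explicit_edges(pdf_path, page_index=0):
    """Hypothetical wrapper: open a PDF and extract one table from one page."""
    import pdfplumber  # imported lazily so the helpers above run without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        return page.extract_table(table_settings=build_table_settings(page.lines))
```

Something like `extract_with_explicit_edges("new2.pdf")` would then run the extraction with the derived borders. If the page holds other horizontal rules besides the table's, a non-zero `min_width` can filter out the short ones.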
-
I am trying to extract the data from a simple table. When the table is large, the outermost rows are never captured.
For example, in the image I uploaded (vertical degree 9), the last row of text is inexplicably excluded from the table. With {"vertical_strategy": "text", "horizontal_strategy": "lines"}, the module finds the correct vertical boundaries of the table, but the boundaries do not extend far enough, so the topmost and bottommost rows of text are omitted. In another PDF, the topmost row was not omitted. Adjusting the other table_settings parameters did not give good results either.
The second, smaller issue: the table data is center-aligned (the PDFs are self-generated), so why are the cells within each row separated correctly in the first image but not in the second? The third, smaller issue: when the table is small, I don't even need to set vertical_strategy to text for the table to be recognized. When the table is larger, that no longer works and the result is empty.
My guess is that there is an implicit height condition during table extraction, so that the table columns do not extend over the full table in pdfplumber even when they are detected with vertical_strategy set to text.
I know that other methods (such as manually specifying lines) can recognize the table, but I think the best solution would still be to extract the complete table through simple adjustments to table_settings or other parameters. I also consulted GPT-4, Claude, and similar tools, but their suggestions (which mainly focused on table_settings adjustments) did not give good results.
I would like to know why this issue occurs with find_table and similar methods. Is it caused by my settings or by incorrect usage?
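To make the symptom concrete, here is a rough diagnostic sketch that lists the words lying entirely above or below each detected table, which is where the omitted rows would appear. It assumes the word dicts from `page.extract_words()` and the `(x0, top, x1, bottom)` tuple in `table.bbox`; the helper names `words_outside_table` and `report_missing_rows` are hypothetical, not pdfplumber functions.

```python
def words_outside_table(table_bbox, words):
    """Split word texts into those fully above and fully below the table.

    `table_bbox` is (x0, top, x1, bottom); `words` is a list of dicts with
    "text", "top", and "bottom" keys, as returned by `page.extract_words()`.
    """
    _, table_top, _, table_bottom = table_bbox
    above = [w["text"] for w in words if w["bottom"] <= table_top]
    below = [w["text"] for w in words if w["top"] >= table_bottom]
    return above, below


def report_missing_rows(pdf_path, table_settings, page_index=0):
    """Hypothetical wrapper: one (above, below) pair per detected table."""
    import pdfplumber  # imported lazily so the helper above runs without it

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        words = page.extract_words()
        tables = page.find_tables(table_settings=table_settings)
        return [words_outside_table(t.bbox, words) for t in tables]
```

If the first and last rows of the table show up in the `above` and `below` lists, the detected bounding box stops short of them. That is the behaviour described here, though this check alone does not establish the cause.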
Looking forward to your insights!
new2.pdf
0.pdf