Issue with a table cell containing multiple lines of text #895
Closed
codePigPig started this conversation in Ask for help with specific PDFs
Hi @codePigPig, could you please attach the PDF? The link you provided isn't working.
It looks similar to the issue discussed here:

def reject_2d_rects(obj):
    # Drop rects that have real area (neither dimension under 1);
    # keep thin rects and every non-rect object.
    is_rect = obj["object_type"] == "rect"
    is_thin = obj["width"] < 1 or obj["height"] < 1
    return not (is_rect and not is_thin)

filtered = page.filter(reject_2d_rects)

>>> filtered.extract_tables()[-2]
[['主要研发项目名称', '项目目的', '项目进展', '拟达到的目标', '预计对公司未来发展的影响'],
['高级舒适轿车开发',
'全新序列首款产品,提\n升市场竞争力',
'完成年度开发任务',
'开发全新电动车型,按计划\n上市销售',
'开发新产品,提升市场竞争\n力'],
['全新一代插电混动车型\n开发',
'开发新产品,提升市场\n竞争力',
'完成年度开发任务',
'开发全新一代插电混动车\n型,按计划上市销售',
'开发新产品,提升市场竞争\n力'],
['全新一代全电数字轿车\n开发',
'长安新能源首款纯电\n平台战略车型',
'完成年度开发任务',
'开发全新电动车型,按计划\n上市销售',
'开发新产品,提升市场竞争\n力'],
['智能电动数字化平台开\n发',
'打造领先的软硬件平\n台',
'完成年度研发任务',
'完成主体功能 100%开发,\n技术状态锁定',
'新汽车转型升级'],
['全新一代纯电智能整车\n平台开发',
'突破技术瓶颈,强化电\n动化、智能化产品占\n位,支撑公司中大型市\n场产品开发',
'完成年度研发任务',
'完成平台开发,实现全面平\n台化、智能化、电气化',
'加速向智能低碳出行科技公\n司转型,支撑“新汽车 新生\n态”的发展策略'],
['多动力兼容架构整车平\n台开发',
'长安兼容平台架构,支\n撑多动力产品',
'完成年度研发任务',
'拓展优化系统构架,实现多\n动力平台共平台开发',
'丰富公司产品动力选择,提\n升开发效率和成本']]
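
To see which objects the predicate in this reply keeps, it can be run on plain dicts that mimic pdfplumber objects. This is only a sketch: the `sample_objects` list is made up for illustration and does not come from the PDF in question. Whether the dropped rects are, for example, filled cell backgrounds in that file is an assumption.

```python
def reject_2d_rects(obj):
    # Same predicate as in the reply: True means "keep this object".
    is_rect = obj["object_type"] == "rect"
    is_thin = obj["width"] < 1 or obj["height"] < 1
    return not (is_rect and not is_thin)


# Hypothetical objects, shaped like the dicts pdfplumber passes to a filter.
sample_objects = [
    {"object_type": "rect", "width": 120.0, "height": 40.0},  # rect with real area
    {"object_type": "rect", "width": 120.0, "height": 0.5},   # thin rect, acts like a border
    {"object_type": "char", "width": 9.0, "height": 9.0},     # text character
    {"object_type": "line", "width": 120.0, "height": 0.0},   # ruling line
]

kept = [o for o in sample_objects if reject_2d_rects(o)]
kept_summary = [(o["object_type"], o["height"]) for o in kept]
print(kept_summary)
# → [('rect', 0.5), ('char', 9.0), ('line', 0.0)]
```

Only the rect with both dimensions of at least 1 is removed; text and thin borders survive, so the table finder is left with the border-like objects.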
Sorry to interrupt everyone, but I have a question and I would really appreciate your help. When dealing with tables in a PDF where a cell contains multiple lines of text, I'm having difficulty extracting the data in the desired structure. I'm not sure if I'm doing something wrong, as I've just started learning pdfplumber. I would greatly appreciate any guidance or suggestions from all of you. Thank you so much.
This is my table:
https://ibb.co/ZNQdxQH
Data read by pdfplumber:
[
['主要研发项目名称', '项目目的', '项目进展', '拟达到的目标', '预计对公司未来发展的影响'],
['', '全新序列首款产品,提\n升市场竞争力', '', '开发全新电动车型,按计划', '开发新产品,提升市场竞争'],
['高级舒适轿车开发', None, '完成年度开发任务', None, None],
[None, None, None, '上市销售', '力'],
['', None, '', None, None],
['全新一代插电混动车型', '开发新产品,提升市场\n竞争力', '', '开发全新一代插电混动车', '开发新产品,提升市场竞争'],
[None, None, '完成年度开发任务', None, None],
['开发', None, None, '型,按计划上市销售', '力'],
[None, None, '', None, None],
['全新一代全电数字轿车', '长安新能源首款纯电\n平台战略车型', '', '开发全新电动车型,按计划', '开发新产品,提升市场竞争'],
[None, None, '完成年度开发任务', None, None],
['开发', None, None, '上市销售', '力'],
[None, None, '', None, None],
['智能电动数字化平台开', '打造领先的软硬件平\n台', '', '完成主体功能 100%开发,', ''],
[None, None, '完成年度研发任务', None, '新汽车转型升级'],
['发', None, None, '技术状态锁定', None],
[None, None, '', None, ''],
['', '突破技术瓶颈,强化电\n动化、智能化产品占\n位,支撑公司中大型市\n场产品开发', '', '', ''],
[None, None, None, None, '加速向智能低碳出行科技公'],
['全新一代纯电智能整车', None, None, '完成平台开发,实现全面平', None],
[None, None, '完成年度研发任务', None, '司转型,支撑“新汽车 新生'],
['平台开发', None, None, '台化、智能化、电气化', None],
[None, None, '', None, '态”的发展策略'],
['', None, None, '', None],
[None, None, None, None, ''],
['多动力兼容架构整车平', '长安兼容平台架构,支\n撑多动力产品', '', '拓展优化系统构架,实现多', '丰富公司产品动力选择,提'],
[None, None, '完成年度研发任务', None, None],
['台开发', None, None, '动力平台共平台开发', '升开发效率和成本'],
[None, None, '', None, None]]
The results I want:
['主要研发项目名称', '项目目的', '项目进展', '拟达到的目标', '预计对公司未来发展的影响'],
['高级舒适轿车开发', '全新序列首款产品,提升市场竞争力', '完成年度开发任务', '开发全新电动车型,按计划 上市销售', '开发新产品,提升市场竞争 力']
.....
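
Once each logical row comes back as a single list (as in the filtered output in the reply), the main remaining difference from the desired result is the `\n` inside each cell. A minimal sketch for flattening those is below. `flatten_cells` and its `joiner` parameter are hypothetical names, not part of pdfplumber. The example uses an empty joiner, which suits Chinese text, and the desired output may need `" "` in some columns.

```python
def flatten_cells(table, joiner=""):
    """Return a copy of `table` with line breaks inside each cell removed.

    `table` is a list of rows as returned by extract_tables();
    None cells are left untouched.
    """
    cleaned = []
    for row in table:
        cleaned.append([
            None if cell is None else joiner.join(cell.split("\n"))
            for cell in row
        ])
    return cleaned


# One row copied from the filtered output, shortened to three cells.
rows = [["高级舒适轿车开发", "全新序列首款产品,提\n升市场竞争力", "完成年度开发任务"]]

flat = flatten_cells(rows)
print(flat[0][1])
# → 全新序列首款产品,提升市场竞争力
```

Passing `joiner=" "` instead keeps a space where each line break used to be.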
Problem:
1. After table extraction, when a cell contains multiple lines of text, its content is split across several rows instead of staying together in one cell.
I look forward to your reply. Thank you.