When I try to use extract_words(), I can't get some text #956
Comments
Hi @fangjiyuan - can you perhaps show what should be extracted exactly? I see one long number in the text:
>>> [ word for word in words if '193' in word['text'] ]
[{'text': '19306498777',
'x0': 339.75,
'x1': 389.25,
'top': 143.61997000000008,
'doctop': 143.61997000000008,
'bottom': 152.61997000000008,
'upright': True,
'direction': 1}]
But I don't understand the language to know if that is the phone number or not.
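The filter used in the comment above can be wrapped in a small helper. This is only a sketch: `find_words` and `load_words` are hypothetical names introduced here, and `load_words` assumes pdfplumber is installed and that the number sits on the first page. The sample dict reuses the values shown in the output above.

```python
def find_words(words, needle):
    """Return the word dicts whose 'text' contains `needle`."""
    return [w for w in words if needle in w.get("text", "")]


def load_words(pdf_path, page_index=0):
    """Extract the words of one page (hypothetical helper, needs pdfplumber)."""
    import pdfplumber  # imported lazily so find_words works without it

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[page_index].extract_words()


# Sample data shaped like the extract_words() output quoted above.
sample_words = [
    {"text": "19306498777", "x0": 339.75, "x1": 389.25},
    {"text": "receipt", "x0": 50.0, "x1": 90.0},
]
matches = find_words(sample_words, "193")
```

With a real file, `find_words(load_words("example.pdf"), "193")` would run the same check against the extracted words; the file name is a placeholder.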
I am sorry about that. I tried to re-save one of the PDFs, which may have changed the PDF's format. Can you have a look at this PDF example? What I want to get is '19306498777'. https __order.crm.hcp.telecom.sd_rest_order_crm_BSS3_prd_receipt_2023_0718_730017088912.pdf
FWIW, it seems that the same text is not copy/paste-able from a standard PDF viewer into a plain text editor. I haven't examined the potential cause closely, but this suggests the issue might not be solvable via
It's possible (but just a guess) that this might be caused by glyph remappings, cf. #851 (comment)
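One rough way to probe the glyph-remapping guess is to look at the per-character records (`page.chars` in pdfplumber) for text that did not resolve to a normal character. The sketch below is an assumption-laden heuristic, not a diagnosis: `unmapped_chars` is a hypothetical helper, and it only catches glyphs rendered as `(cid:N)` placeholders or as non-printable characters. Glyphs remapped to a wrong but valid character would pass unnoticed.

```python
import re

# pdfminer.six renders glyphs without a Unicode mapping as "(cid:N)".
CID_PATTERN = re.compile(r"\(cid:\d+\)")


def unmapped_chars(chars):
    """Return the char dicts whose text looks like an unmapped glyph."""
    flagged = []
    for ch in chars:
        text = ch.get("text", "")
        if CID_PATTERN.fullmatch(text) or not text.isprintable():
            flagged.append(ch)
    return flagged


# Hand-made records shaped like entries of page.chars.
sample_chars = [
    {"text": "1"},
    {"text": "(cid:17)"},
    {"text": "\x00"},
    {"text": "9"},
]
flagged = unmapped_chars(sample_chars)
```

If a call such as `unmapped_chars(page.chars)` returns many entries in the region where the number should be, that would be consistent with a font-encoding problem in the file rather than with the word-grouping logic.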
Describe the bug
When I try to use extract_words(), I can't get some of the text, for example the phone number.
Have you tried repairing the PDF?
It did not work.
PDF file
https __order.crm.hcp.telecom.sd_rest_order_crm_BSS3_prd_receipt_2023_0718_730017088912.pdf
Environment