Not reading the pdf file #804

drnko · 2023-02-02T00:58:35Z

Whenever I convert an image to a PDF and then try to extract text from the converted PDF, the result from pdfplumber is blank.

What am I doing wrong?

Step 1:
Convert an image (jpeg/jpg/png) to PDF using PIL.
Save the converted PDF file.

Step 2:
Open the saved PDF using pdfplumber.open().
Extract text from the opened PDF file.

===============================================================

Below is the code:

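from PIL import Image
import pdfplumber

# Step 1: convert the image to RGB and save it as a PDF
image_1 = Image.open(r'D:\ocr\images\barrel.jpg')
im_1 = image_1.convert('RGB')
im_1.save(r'test.pdf')

# Step 2: open the saved PDF and extract text from the first page
inv_pdf = pdfplumber.open('test.pdf')
print('Result:', inv_pdf.pages[0].extract_text())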

===============================================================
Terminal:

PS D:\ocr> & "C:/Program Files/Python310/python.exe" d:/ocr/testing.py
Result:

PS D:\GitOCR\ocr>

===============================================================

Below are the PDF files converted from image files:

test.pdf

test1.pdf

test2.pdf

version: pdfplumber 0.7.6

jsvine · 2023-02-02T12:57:14Z

Maintainer

Hi @drnko, and thanks for your interest in this library. It appears you're trying to do the following:

1. Convert an image to a PDF
2. Extract text from that PDF

The PDF you create in step 1 is an "image-based PDF" (see here for context), and contains no information about the actual text it represents — it's just a picture.
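
A rough way to confirm this for a given file is to compare how many character objects and how many image objects pdfplumber finds on each page. The sketch below assumes pdfplumber is installed; the helper names `classify_page` and `classify_pdf` are hypothetical and not part of pdfplumber.

```python
def classify_page(n_chars: int, n_images: int) -> str:
    """Label a page from its counts of character and image objects."""
    if n_chars > 0:
        return "text"        # at least one real text character is present
    if n_images > 0:
        return "image-only"  # only pictures, so extract_text() finds nothing
    return "empty"


def classify_pdf(path: str) -> list[str]:
    """Return one label per page of the PDF at `path`."""
    import pdfplumber  # third-party; imported here so classify_page works without it

    with pdfplumber.open(path) as pdf:
        # page.chars and page.images are the lists of objects pdfplumber parsed
        return [classify_page(len(page.chars), len(page.images)) for page in pdf.pages]


# Example (assumes test.pdf exists in the working directory):
# print(classify_pdf("test.pdf"))
```

A page labeled "image-only" would need OCR before any text can be extracted from it.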

You can try using OCR (optical character recognition) software to add a text layer back to the PDF, but pdfplumber will not work as well with those sorts of PDFs as it does with original-text PDFs.
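
As one possible illustration of the OCR route, the sketch below uses pytesseract's `image_to_pdf_or_hocr` to produce a PDF with a text layer directly from the image. It assumes both pytesseract and the Tesseract binary are installed. The helper names are hypothetical, and the quality of the extracted text depends entirely on how well the OCR recognizes the image.

```python
from pathlib import Path


def _tesseract_pdf_bytes(image_path: str) -> bytes:
    """Run Tesseract on an image and return a searchable PDF as bytes."""
    import pytesseract  # third-party; also needs the Tesseract binary on PATH

    return pytesseract.image_to_pdf_or_hocr(image_path, extension="pdf")


def image_to_searchable_pdf(image_path, pdf_path, ocr=_tesseract_pdf_bytes) -> Path:
    """Write an OCR'd PDF for `image_path` to `pdf_path` and return that path.

    `ocr` is any callable mapping an image path to PDF bytes; it defaults to
    the pytesseract-based helper above (an assumption, other OCR tools work too).
    """
    out = Path(pdf_path)
    out.write_bytes(ocr(str(image_path)))
    return out


# Example (paths are placeholders):
# image_to_searchable_pdf(r"D:\ocr\images\barrel.jpg", "test_ocr.pdf")
# The resulting file can then be opened with pdfplumber.open("test_ocr.pdf").
```

Text extracted from such a file is the OCR engine's best guess, so it may contain recognition errors or odd spacing that an original-text PDF would not have.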

drnko Feb 7, 2023
Author

@jsvine
Thanks for the reply...
We also tried OCRmyPDF to add a text layer back to the PDF, but pdfplumber does not work as well with every searchable PDF created by OCRmyPDF as it does with original-text PDFs.

Could you please tell us how we can create original-text PDFs from scanned images (jpg/jpeg/png)?
