`TypeError` raised by `extract_text` method with compressed PDF file #886

jbpenrath · 2023-05-22T14:33:00Z

Bug report

Description

I'm generating PDF documents with WeasyPrint. Since version 59.0 of that package, I can no longer extract text from the generated compressed PDF files with the pdfminer.high_level.extract_text method. The call raises TypeError: invalid length: 6. The exception is raised from a utility function called nunpack.

I first opened an issue on the WeasyPrint repository, but it appears the source of the problem could be pdfminer itself.

You can take a look at the WeasyPrint maintainer's answer to understand how pdfminer is involved in this problem.

Steps to reproduce

from io import BytesIO
from pdfminer.high_level import extract_text
from weasyprint import HTML

html = HTML(string='<h1>Hello world</h1>')
document = html.write_pdf()
extract_text(BytesIO(document)) # 💥 TypeError: invalid length: 6

liZe · 2023-05-23T20:08:22Z

Here’s a simple and uncompressed PDF to reproduce the problem, in case you’d like to avoid installing another tool 😄:
hello.pdf

The error is caused by the XRef table with /W [1 4 6]. The third field is encoded using 6 bytes, and it’s decoded here using nunpack, which is not designed to handle all integer sizes.
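To make the /W [1 4 6] point concrete, here is a minimal sketch of how one cross-reference entry splits into three fields by width, and why a 6-byte field has no matching `struct` format code. The helper `split_fields` and the sample entry bytes are hypothetical and written only for illustration; they are not pdfminer code.

```python
import struct


def split_fields(entry: bytes, widths: list[int]) -> list[bytes]:
    """Cut one fixed-size entry into consecutive fields of the given byte widths."""
    fields = []
    offset = 0
    for width in widths:
        fields.append(entry[offset:offset + width])
        offset += width
    return fields


# Hypothetical 11-byte entry laid out as /W [1 4 6]: a 1-byte field,
# a 4-byte field, and a 6-byte field (values chosen arbitrarily).
widths = [1, 4, 6]
entry = bytes([1]) + (17).to_bytes(4, "big") + (0).to_bytes(6, "big")

fields = split_fields(entry, widths)
field_lengths = [len(f) for f in fields]

# Standard-size unsigned integer codes in struct cover 1, 2, 4 and 8 bytes only,
# so no single code decodes the 6-byte third field.
struct_sizes = {code: struct.calcsize(">" + code) for code in "BHLQ"}
print(field_lengths, struct_sizes)
```

A decoder built on a fixed set of `struct` codes therefore has to special-case or reject the third field.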

Instead of using struct.unpack in nunpack, it may be useful to use int.from_bytes, which automatically works for all integer sizes.
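A minimal sketch of that suggestion follows. The name `nunpack_sketch`, its `default` parameter, and the assumption that fields are big-endian unsigned integers are illustrative choices here; this is not the code merged into pdfminer.

```python
def nunpack_sketch(data: bytes, default: int = 0) -> int:
    """Decode a big-endian unsigned integer of any byte width.

    Hypothetical stand-in for a struct-based decoder; an empty field
    falls back to `default` (an assumption made for this sketch).
    """
    if not data:
        return default
    return int.from_bytes(data, byteorder="big", signed=False)


# A 6-byte field, the width that triggered "invalid length: 6".
six_byte_value = nunpack_sketch(b"\x00\x00\x00\x00\x01\x00")
two_byte_value = nunpack_sketch(b"\x01\x02")
empty_value = nunpack_sketch(b"")
print(six_byte_value, two_byte_value, empty_value)
```

Because `int.from_bytes` accepts any length, the same function handles widths such as 3, 5, 6, or 7 bytes without a per-width branch.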

dhdaines · 2024-08-01T13:53:46Z

fixed in #1029 (and thank you for weasyprint, it is very nice software!)

jbpenrath changed the title ~~TypeError raised by `extract_text`~~ `TypeError` raised by `extract_text` method with compressed PDF file · May 22, 2023

jbpenrath mentioned this issue May 22, 2023

Feat/create document with weasyprint options openfun/marion#163

Merged

3 tasks

liZe mentioned this issue Jul 12, 2023

Weasyprint 59.0 incompatiliby with pdfminer.extract_text Kozea/WeasyPrint#1885

Closed

dhdaines added a commit to dhdaines/pdfminer.six that referenced this issue Aug 1, 2024

fix: support arbitrary width integers (fixes: pdfminer#886)

267ac8f

dhdaines linked a pull request Aug 1, 2024 that will close this issue

Accept arbitrary width integers in nunpack #1029

Open
