Trying to extract both annotations and text from a PDF. #342

Santoshzzz · 2021-02-01T04:43:56Z

Hi,
I'm trying to extract both annotations and text from a PDF. To extract the text, I'm using:

text = p0.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[])
print(text)

but I'm unable to extract the annotations in the same way.
Any help?

jsvine · 2021-02-01T14:45:44Z

Maintainer

PDFs represent annotations quite differently from standard text. You can access them via my_page.annots, but because of that difference the objects you get back are structured differently from the words returned by extract_words. See Section 8.4 of the official PDF specification for details.

2 replies

Santoshzzz Feb 1, 2021
Author

Thank you. Could you help me extract them along with their page numbers in the same program? I want to turn this into a dataframe and merge the text and annotation extracts.

jsvine Feb 4, 2021
Maintainer

Each annotation's page number is embedded within the annotation object itself. For example:

for annot in pdf.annots:
    print(annot["page_number"])
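
To combine both extracts with page numbers, something along the following lines may work. This is only a minimal sketch, assuming the library in use is pdfplumber: the helper name collect_rows and the "kind" column are made up for illustration, and the "contents" key on annotations is assumed to be present (hence the .get). Words returned by extract_words do not carry a page number by default, so the sketch takes it from the page itself.

```python
def collect_rows(pdf):
    """Return one dict per word and per annotation, tagged with its page number.

    `pdf` is expected to behave like an opened pdfplumber PDF: it has `.pages`,
    and each page has `.page_number`, `.extract_words()` and `.annots`.
    (collect_rows is a hypothetical helper, not part of any library.)
    """
    rows = []
    for page in pdf.pages:
        # Word dicts have no page number of their own, so use the page's.
        for word in page.extract_words():
            rows.append({
                "page_number": page.page_number,
                "kind": "word",
                "text": word["text"],
                "x0": word["x0"],
                "top": word["top"],
            })
        # Annotation dicts already include "page_number".
        for annot in page.annots:
            rows.append({
                "page_number": annot["page_number"],
                "kind": "annot",
                # Assumption: the annotation's text lives under "contents".
                "text": annot.get("contents"),
                "x0": annot["x0"],
                "top": annot["top"],
            })
    return rows
```

With a real file, the rows could then be passed to pandas, e.g. open the PDF with pdfplumber.open(...), call collect_rows(pdf), and hand the result to pandas.DataFrame; the "kind" column keeps words and annotations distinguishable after they are merged into one frame.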
