Add possibility to deactivate OCR #2467

thomascerbelaud · 2024-01-29T14:20:19Z

Is your feature request related to a problem? Please describe.
I am working on large text-based PDF files and would like to parse them as fast as possible while keeping a high resolution (strategy="hi_res"). I am interested in extracting tables, including tables from pictures; however, I would like to deactivate OCR for regions detected as images. The first obvious reason is speed, but the images also do not matter much for my use case.

Describe the solution you'd like
A keyword argument that enables or disables OCR would probably be the easiest thing to code and would be a nice additional feature, especially if it can differentiate between tables and images. Another nice feature would be to skip OCR on tables in text-based regions, in order to speed up the partition process.

Describe alternatives you've considered
Adding a new OCRMode.NO_OCR

Additional context
I am not interested in images such as graphs or photos.
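
To make the request concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not Unstructured's code: `SketchOCRMode`, `Region`, `fill_text`, and the `skip_kinds` parameter are all hypothetical names invented for illustration, and the OCR engine is passed in as a plain callable. The idea is that embedded PDF text is used whenever it exists, image regions are never sent to OCR, and a `NO_OCR` mode turns OCR off everywhere.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, FrozenSet, List, Optional


class SketchOCRMode(Enum):
    """Hypothetical stand-in for an OCR mode setting (not the library's enum)."""

    FULL_PAGE = "entire_page"
    INDIVIDUAL_BLOCKS = "individual_blocks"
    NO_OCR = "no_ocr"  # the value proposed in this issue


@dataclass
class Region:
    """Hypothetical layout region produced by a hi_res layout model."""

    kind: str            # e.g. "Table", "Image", "NarrativeText"
    embedded_text: str   # text already present in the PDF, "" if none
    text: Optional[str] = None


def fill_text(
    regions: List[Region],
    mode: SketchOCRMode,
    run_ocr: Callable[[Region], str],
    skip_kinds: FrozenSet[str] = frozenset({"Image"}),
) -> List[Region]:
    """Fill region.text, calling OCR only where it is both allowed and needed."""
    for region in regions:
        if region.embedded_text:
            # Text-based region: reuse the PDF's own text, no OCR required.
            region.text = region.embedded_text
        elif mode is SketchOCRMode.NO_OCR or region.kind in skip_kinds:
            # OCR disabled globally, or this kind of region is excluded.
            region.text = ""
        else:
            # Scanned content we still care about, e.g. a table in a picture.
            region.text = run_ocr(region)
    return regions


if __name__ == "__main__":
    regions = [Region("Table", "a | b"), Region("Table", ""), Region("Image", "")]
    fill_text(regions, SketchOCRMode.INDIVIDUAL_BLOCKS, lambda r: "<ocr result>")
    for r in regions:
        print(r.kind, repr(r.text))
```

With this shape, a single keyword argument covers both wishes: `NO_OCR` skips OCR entirely, while `skip_kinds` distinguishes tables from images.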

dhdaines · 2024-12-17T19:10:22Z

Yes, not only is OCR super slow, but table contents are frequently mangled because infer_table_structure uses OCR rather than any existing text.

This seems like an important problem for people who care about the quality of their data (so, I guess, not anyone using Generative AI, but...)

The Aryn partitioner does not insist on OCRing everything, but unfortunately it is tightly bound to a different framework which is not as nice as Unstructured for casual use cases...
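
As an illustration of using existing text instead of OCR for table contents, the sketch below assigns words that are already embedded in the PDF to table cells by bounding box. It is an assumption-laden toy, not how `infer_table_structure` works: `Box`, `Word`, `center_inside`, and `cell_text` are hypothetical names, the cell boxes are assumed to come from some table-structure model, and the word boxes from the PDF's text layer.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates (hypothetical helper type)."""

    x0: float
    y0: float
    x1: float
    y1: float


class Word(NamedTuple):
    """A word taken from the PDF text layer, with its bounding box."""

    text: str
    box: Box


def center_inside(inner: Box, outer: Box) -> bool:
    """True if the center point of `inner` lies within `outer`."""
    cx = (inner.x0 + inner.x1) / 2
    cy = (inner.y0 + inner.y1) / 2
    return outer.x0 <= cx <= outer.x1 and outer.y0 <= cy <= outer.y1


def cell_text(cell: Box, words: List[Word]) -> str:
    """Join the embedded words whose centers fall in `cell`, in reading order."""
    inside = [w for w in words if center_inside(w.box, cell)]
    # Sort top-to-bottom, then left-to-right; rounding y groups words on one line.
    inside.sort(key=lambda w: (round(w.box.y0), w.box.x0))
    return " ".join(w.text for w in inside)


if __name__ == "__main__":
    words = [
        Word("Net", Box(10, 10, 25, 20)),
        Word("income", Box(28, 10, 45, 20)),
        Word("42", Box(60, 10, 75, 20)),
    ]
    left_cell = Box(0, 0, 50, 30)
    right_cell = Box(50, 0, 100, 30)
    print(cell_text(left_cell, words))
    print(cell_text(right_cell, words))
```

Because the cell contents come straight from the text layer, they cannot be mangled by character recognition; OCR would only be needed for cells where no embedded words are found.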

thomascerbelaud added the enhancement New feature or request label Jan 29, 2024

Masterchen09 mentioned this issue Oct 16, 2024

feat: skip ocr for certain element types (Issue #3163) #3182

Open

dhdaines pushed a commit to dhdaines/unstructured that referenced this issue Dec 17, 2024

feat: allow disabling OCR in hi_res mode (fixes: Unstructured-IO#2467)

8cf7d27

dhdaines pushed a commit to dhdaines/unstructured that referenced this issue Dec 17, 2024

feat: allow disabling OCR in hi_res mode (fixes: Unstructured-IO#2467)

73f6c39

dhdaines linked a pull request Dec 17, 2024 that will close this issue

feat: Allow deactivating OCR entirely with hi_res strategy #3839

Open
