Is your feature request related to a problem? Please describe.
I am working on large text-based PDF files and would like to parse them as fast as possible while keeping the high-resolution strategy (strategy="hi_res"). I am interested in extracting tables, and tables from pictures, but I would like to disable OCR for regions detected as images. The first obvious reason is speed, but the images also do not matter much to me.
Describe the solution you'd like
A keyword argument to enable or disable OCR would probably be the easiest thing to code and would be a nice additional feature, especially if it can differentiate between tables and images. Another nice feature would be to skip OCR on tables in text-based regions, in order to speed up the partition process.
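To make the request concrete, here is a rough sketch of the decision I have in mind. Everything in it is hypothetical: `Region`, `needs_ocr`, and the `ocr_images` / `ocr_text_based_tables` flags are names made up for illustration and are not part of the unstructured API.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical stand-in for a layout region detected on a page."""
    kind: str           # e.g. "Table", "Image", "Text"
    embedded_text: str  # text found in the PDF text layer, "" if none


def needs_ocr(region: Region,
              ocr_images: bool = True,
              ocr_text_based_tables: bool = True) -> bool:
    """Decide whether OCR should run on a region (illustrative policy only)."""
    has_text = bool(region.embedded_text.strip())
    if region.kind == "Image":
        # Images are OCRed only when the caller asks for it.
        return ocr_images
    if region.kind == "Table":
        # A table without embedded text (e.g. a table inside a picture)
        # can only be read with OCR, so it is always OCRed.
        if not has_text:
            return True
        return ocr_text_based_tables
    # Any other region: fall back to OCR only when there is no text layer.
    return not has_text
```

With both flags set to `False`, graphs and photos are skipped, text-based tables reuse the existing text, and tables that only exist as pixels are still OCRed.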
Describe alternatives you've considered
Adding a new OCRMode.NO_OCR
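As a minimal sketch of that alternative: the enum below is a standalone stand-in, not the library's own class. The first two member values are written from memory and may not match the library exactly, `NO_OCR` is the proposed addition, and `should_run_ocr` is a hypothetical helper.

```python
from enum import Enum


class OCRMode(Enum):
    """Stand-in enum for illustration; not imported from unstructured."""
    FULL_PAGE = "entire_page"                # assumed existing mode
    INDIVIDUAL_BLOCKS = "individual_blocks"  # assumed existing mode
    NO_OCR = "no_ocr"                        # proposed new mode


def should_run_ocr(mode: OCRMode) -> bool:
    """Hypothetical helper: OCR runs for every mode except NO_OCR."""
    return mode is not OCRMode.NO_OCR
```

A single on/off mode like this is simpler than per-region flags, but it cannot distinguish tables from images.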
Additional context
I am not interested in images such as graphs or photos.
Yes, not only is OCR super slow, but table contents are frequently mangled because infer_table_structure uses OCR rather than any existing text.
This seems like an important problem for people who care about the quality of their data (so, I guess, not anyone using Generative AI, but...)
The Aryn partitioner does not insist on OCRing everything, but unfortunately it is tightly bound to a different framework which is not as nice as Unstructured for casual use cases...
dhdaines pushed a commit to dhdaines/unstructured that referenced this issue on Dec 17, 2024