This repository has been archived by the owner on Jul 11, 2023. It is now read-only.
I spent many hours experimenting with ways to extract text from PDFs. I tried a couple of different libraries, and they all had problems preserving whitespace. That turned out to be a real problem when I went to query embeddings of this text: the incorrect formatting was carried over into the answers, which won't do.
The best solution in practice turned out to be converting the PDFs to images and then using OCR to extract the text from the images.
I have this implemented in Python for now, but I will be rewriting it in TS for the production app, so I can contribute that code in the future if someone else doesn't pick it up first.
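For anyone who wants to try the approach described above, here is a rough sketch of the PDF -> images -> OCR pipeline in Python. It is not the implementation mentioned in this issue, which was never posted. It assumes the `pdf2image` and `pytesseract` packages (which in turn need Poppler and the Tesseract binary installed), and the function names, the 300 DPI default, and the Tesseract config string are illustrative choices, not settings confirmed to fix the whitespace problem.

```python
from typing import Any, Callable, List, Optional


def _render_pages(pdf_path: str, dpi: int) -> List[Any]:
    """Rasterize each PDF page to a PIL image (assumes pdf2image + Poppler)."""
    from pdf2image import convert_from_path  # imported lazily: optional dependency

    return convert_from_path(pdf_path, dpi=dpi)


def _ocr_page(image: Any) -> str:
    """OCR one page image (assumes pytesseract + the Tesseract binary)."""
    import pytesseract  # imported lazily: optional dependency

    # Assumption: these Tesseract options are a starting point for keeping
    # spacing intact; tune them for your own documents.
    config = "--psm 6 -c preserve_interword_spaces=1"
    return pytesseract.image_to_string(image, config=config)


def pdf_to_text(
    pdf_path: str,
    dpi: int = 300,
    render: Optional[Callable[[str, int], List[Any]]] = None,
    ocr: Optional[Callable[[Any], str]] = None,
) -> List[str]:
    """Return one OCR'd string per page of the PDF.

    `render` and `ocr` are hypothetical hooks so either step can be swapped
    out (for example, for a different OCR engine) or stubbed in tests.
    """
    render = render or _render_pages
    ocr = ocr or _ocr_page
    return [ocr(page) for page in render(pdf_path, dpi)]
```

Usage would look something like `text = "\n\n".join(pdf_to_text("report.pdf"))`, where `report.pdf` is a placeholder path. Keeping the result per page makes it easier to chunk the text before embedding it.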
I have the same use case as @rikuthinks. Another use case is extracting specific information: for something like an invoice, extracting names, addresses, etc.
For PDFs:
https://github.com/kartik1998/pdf-images
https://github.com/naptha/tesseract.js#tesseractjs