[FEATURE] [BUG] Language selection for Tesseract/text extraction #1064

Open
bondjimbond opened this issue Nov 19, 2024 · 1 comment

This is both a bug (the language parameter is not passed to Tesseract, even though Tesseract can work in different languages) and a feature request (adding that behaviour to Islandora).

Overview of feature request

Problem: When documents, paged content, etc. are ingested into Islandora, the Tesseract microservice runs OCR. Tesseract appears to be installed with a handful of other languages, but Islandora always runs OCR as if the document were in English; the normal ingest process offers no way to specify a different language. As a result, documents with non-English characters (e.g. accented letters, other alphabets) do not get proper OCR.

Request: a method (possibly using Contexts?) to:

  1. Identify the language via the Repository Item's Language field (would have to be configurable)
  2. Transform the language term into the correct format for Tesseract
  3. Pass the language as a parameter to Tesseract as part of the text extraction process
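
To illustrate the three steps above, here is a minimal sketch in Python. It is not how Islandora or its Tesseract microservice is implemented. It assumes the Language field yields a two-letter ISO 639-1 code, and the mapping table and the function names `to_tesseract_lang` and `build_ocr_command` are hypothetical. The only Tesseract feature it relies on is the `-l` option, which selects the language data to use.

```python
# Hypothetical mapping from a Language field value (assumed to be an
# ISO 639-1 code) to the language names Tesseract expects for "-l".
TESSERACT_LANGS = {
    "en": "eng",
    "sv": "swe",
    "fr": "fra",
    "de": "deu",
}

# Fall back to English, which matches the behaviour described in this issue.
DEFAULT_LANG = "eng"


def to_tesseract_lang(term):
    """Step 2: transform a language term into Tesseract's format."""
    if not term:
        return DEFAULT_LANG
    return TESSERACT_LANGS.get(term.strip().lower(), DEFAULT_LANG)


def build_ocr_command(image_path, term):
    """Step 3: build a Tesseract command line that passes the language."""
    return ["tesseract", image_path, "stdout", "-l", to_tesseract_lang(term)]


if __name__ == "__main__":
    # Step 1 (reading the configured Language field) is assumed to have
    # produced "sv" for a Swedish newspaper page.
    print(" ".join(build_ocr_command("page.tif", "sv")))
```

An unmapped or empty value falls back to `eng`, so items without a language set would keep getting the same OCR they get today.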

What kind of user is the feature intended for?

Anyone ingesting content

What inspired the request?

I ingested a Swedish newspaper, only to find that the machine-generated OCR did not recognize any of the accented characters. On investigation, it turns out there is no native way to enable non-English text extraction in Islandora; it can only be done through a special shell command to Tesseract.

What existing behavior do you want changed?

Send the language of the document as a parameter to Tesseract when extracting text.

Any brand new behavior do you want to add to Islandora?

Provide a context for producing this behaviour, and perhaps a configuration to identify the Language field in the Repository Item content type.

Any related open or closed issues to this feature request?

None I have identified.

@joecorall
Member
