[FEATURE] [BUG] Language selection for Tesseract/text extraction #1064

bondjimbond · 2024-11-19T15:12:15Z

This is both a bug (language parameter not being passed to Tesseract when Tesseract has the ability to work in different languages) and a feature request (creating that behaviour in Islandora).

Overview of feature request

Problem: When documents, paged content, etc. are ingested into Islandora, the Tesseract microservice runs OCR. Tesseract does seem to be installed with a handful of other languages, but Islandora always requests OCR in English; there is no way, in the normal ingest process, to specify a different language. As a result, documents containing non-English characters (e.g. accented letters or other alphabets) do not get accurate OCR.

Request: a method (possibly using Contexts?) to:

- Identify the language via the Repository Item's Language field (would have to be configurable)
- Transform the language term into the correct format for Tesseract
- Pass the language as a parameter to Tesseract as part of the text extraction process
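
The steps requested above could look roughly like the following. This is a minimal sketch, not Islandora's actual code: the mapping table, the function names, and the assumption that the Language field yields a two-letter ISO 639-1 code are all hypothetical. The only part taken from Tesseract itself is the `-l` option of its command-line tool and its three-letter language names (e.g. `swe` for Swedish).

```python
# Hypothetical mapping from ISO 639-1 codes (as a Language term might store
# them) to the language names Tesseract expects. Extend as needed; each
# language must also have its traineddata installed alongside Tesseract.
ISO_639_1_TO_TESSERACT = {
    "en": "eng",
    "sv": "swe",
    "fr": "fra",
    "de": "deu",
}

# Fall back to English, which matches the current behaviour described above.
DEFAULT_LANG = "eng"


def to_tesseract_lang(code):
    """Translate a language code into a Tesseract language name."""
    if not code:
        return DEFAULT_LANG
    normalized = code.strip().lower()
    # Accept values that are already Tesseract language names.
    if normalized in ISO_639_1_TO_TESSERACT.values():
        return normalized
    return ISO_639_1_TO_TESSERACT.get(normalized, DEFAULT_LANG)


def build_tesseract_command(image_path, language_code=None):
    """Build the argument list for a Tesseract run that writes text to stdout."""
    return [
        "tesseract",
        image_path,
        "stdout",
        "-l",
        to_tesseract_lang(language_code),
    ]


if __name__ == "__main__":
    # Example: a page from a Swedish newspaper.
    print(build_tesseract_command("page-0001.tiff", "sv"))
```

Falling back to `eng` for a missing or unrecognized code keeps existing ingests working unchanged, so only items with a populated and mapped Language field would see different OCR output.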

What kind of user is the feature intended for?

Anyone ingesting content

What inspired the request?

I ingested a Swedish newspaper, only to find that the machine-generated OCR did not recognize any of the accented characters. On investigation, it turns out there is no way to enable non-English text extraction natively in Islandora; it is only possible through a special shell command to Tesseract.

What existing behavior do you want changed?

Send the language of the document as a parameter to Tesseract when extracting text.

Any brand new behavior do you want to add to Islandora?

Provide a context for producing this behaviour, and perhaps a configuration to identify the Language field in the Repository Item content type.

Any related open or closed issues to this feature request?

None I have identified.

joecorall · 2024-11-19T15:18:53Z

Islandora/documentation#1957
