This is both a bug (language parameter not being passed to Tesseract when Tesseract has the ability to work in different languages) and a feature request (creating that behaviour in Islandora).
Overview of feature request
Problem: When documents, paged content, etc. are ingested into Islandora, the Tesseract microservice runs OCR. Tesseract appears to be installed with a handful of additional languages, but Islandora always requests OCR in English: there is no way, in the normal ingest process, to specify a different language. As a result, documents containing non-English characters (e.g. accented letters, other alphabets) do not get accurate OCR.
Request: a method (possibly using Contexts?) to:
1. Identify the language via the Repository Item's Language field (would have to be configurable)
2. Transform the language term into the correct format for Tesseract
3. Pass the language as a parameter to Tesseract as part of the text extraction process
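To make the request concrete, here is a rough sketch of the transform-and-pass steps. It is not Islandora code: the mapping table, the function names, and the English fallback are all hypothetical, and it assumes the Language field yields either a two-letter code such as `sv` or a value that is already a Tesseract code. The only parts taken from Tesseract itself are the `-l` option and the three-letter language names (`eng`, `swe`, ...).

```python
# Hypothetical mapping from two-letter language codes to Tesseract
# language names. A real implementation would need a fuller table and
# would only list languages that are actually installed.
TWO_LETTER_TO_TESSERACT = {
    "en": "eng",
    "sv": "swe",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
}

# Assumed fallback, matching the current English-only behaviour.
DEFAULT_LANGUAGE = "eng"


def to_tesseract_language(term):
    """Turn a language term from the item into a Tesseract language name."""
    key = (term or "").strip().lower()
    # Accept values that are already Tesseract names (e.g. "swe").
    if key in TWO_LETTER_TO_TESSERACT.values():
        return key
    # Otherwise translate, falling back to English for unknown terms.
    return TWO_LETTER_TO_TESSERACT.get(key, DEFAULT_LANGUAGE)


def build_ocr_command(image_path, language_term):
    """Build a tesseract command line that writes the text to stdout."""
    language = to_tesseract_language(language_term)
    return ["tesseract", image_path, "stdout", "-l", language]


if __name__ == "__main__":
    # Show the command that would be run for a Swedish page image.
    print(build_ocr_command("page.tif", "sv"))
```

Falling back to English for a missing or unrecognised term would keep existing ingests working as they do today, which seems like the least surprising default.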
What kind of user is the feature intended for?
Anyone ingesting content
What inspired the request?
I ingested a Swedish newspaper, only to find that the machine-generated OCR did not recognize any of the accented characters. On investigation, it turns out there is no native way to enable non-English text extraction in Islandora; it can only be done through a special shell command to Tesseract.
What existing behavior do you want changed?
Send the language of the document as a parameter to Tesseract when extracting text.
Any brand new behavior do you want to add to Islandora?
Provide a context for producing this behaviour, and perhaps a configuration to identify the Language field in the Repository Item content type.
Any related open or closed issues to this feature request?
None that I have identified.