
Add test to check encoding of message properties #6332

Closed

Conversation

solth
Member

@solth solth commented Nov 29, 2024

Resolves #6331

This PR adds a test that loads the German and Spanish message property files used in Kitodo.Production and checks whether their contents are encoded as UTF-8. This is done by loading specific messages known to contain Unicode special characters and decoding them as UTF-8. The results are then compared with strings containing those special characters. The test fails if a decoded message does not match the expected string.

This PR also sets the encoding of Kitodo/src/main/resources/messages/password_es.properties and Kitodo/src/main/resources/messages/errors_de.properties to UTF-8 and therefore fixes #6324.

Note: this PR does not yet replace codepoint escapes like \u00FC with the actual special characters like ü in the message files. That should be done in a follow-up pull request.
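The approach described above can be sketched as follows. This is a minimal, hypothetical stand-in for the test the PR adds, not the actual test code; the sample value "löschen" is an invented message, not a real Kitodo message key, and the bytes are built in memory instead of being read from the real property files:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MessageEncodingCheck {

    // Strictly decode raw bytes as UTF-8. Malformed sequences raise an
    // exception instead of being silently replaced by U+FFFD, so a wrongly
    // encoded message file fails loudly.
    static String decodeUtf8(byte[] raw) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(raw))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        // "l\u00F6schen" is "löschen"; escapes keep this source file
        // encoding-independent. In the real test the bytes would come from
        // a file under Kitodo/src/main/resources/messages/.
        String expected = "l\u00F6schen";

        // UTF-8 bytes must round-trip to the expected string.
        byte[] utf8Bytes = expected.getBytes(StandardCharsets.UTF_8);
        if (!expected.equals(decodeUtf8(utf8Bytes))) {
            throw new AssertionError("UTF-8 round trip lost the umlaut");
        }

        // The same text stored as ISO-8859-1 is not valid UTF-8
        // (0xF6 is a lone lead byte) and must fail to decode.
        byte[] latin1Bytes = expected.getBytes(StandardCharsets.ISO_8859_1);
        try {
            decodeUtf8(latin1Bytes);
            throw new AssertionError("expected a CharacterCodingException");
        } catch (CharacterCodingException expectedFailure) {
            System.out.println("OK: wrong encoding was detected");
        }
    }
}
```

Strict decoding via CharsetDecoder with CodingErrorAction.REPORT is the key detail: the convenience constructor new String(bytes, UTF_8) would quietly substitute replacement characters and let a broken file pass.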

@solth solth requested review from stweil and thomaslow and removed request for stweil November 29, 2024 11:48
@thomaslow
Collaborator

@solth Looks good. Another idea would be to check the whole file with a regular expression containing only allowed characters for the respective language?

Member

@stweil stweil left a comment


The changes for the property files look good. I suggest applying them in a separate commit.

But what is the purpose of the new Java test code? Don't we want to catch any new contribution that uses a wrong encoding (!= UTF-8), no matter whether it is a property file or any other non-binary file? It would also be sufficient to run the encoding test only for files that are affected by a push or a pull request.
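A whole-file check along these lines could be sketched as below. This is only an illustration of the idea, not the CI action that was eventually adopted; the temp files merely stand in for the files touched by a push or pull request. Note that pure ASCII files pass too, since ASCII is a subset of UTF-8:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class Utf8FileCheck {

    // Returns true when the file's entire content is well-formed UTF-8.
    static boolean isValidUtf8(Path file) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // In CI the list would be the changed files, e.g. taken from
        // `git diff --name-only`; here two temp files stand in for them.
        Path good = Files.createTempFile("check", ".properties");
        Files.write(good, "l\u00F6schen".getBytes(StandardCharsets.UTF_8));
        Path bad = Files.createTempFile("check", ".properties");
        Files.write(bad, "l\u00F6schen".getBytes(StandardCharsets.ISO_8859_1));

        for (Path file : List.of(good, bad)) {
            System.out.println(file.getFileName() + ": "
                    + (isValidUtf8(file) ? "valid UTF-8" : "NOT valid UTF-8"));
        }
    }
}
```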

@stweil
Member

stweil commented Dec 2, 2024

@henning-gerhardt
Collaborator

Would this workflow which checks all files in pull requests or push operations be better?

You can use file to detect the content encoding of a file, but you must extend the parameters that file is invoked with. Quoting from the manual page of file:

file tests each argument in an attempt to classify it.  There are three sets of tests, performed in this order: filesystem tests, magic tests, and language tests.  The  first test that succeeds causes the file type to be printed.

So you must at least use the parameter -k or --keep-going so that file does not stop at the first result. The sub-results should / must be evaluated too, as another encoding can occur later in the same file. Quoting the man page:

     -k, --keep-going
             Don't stop at the first match, keep going.  Subsequent matches will be have the string ‘\012- ’ prepended.  
             (If you want a newline, see the -r option.)  The magic pattern with the highest strength (see the -l option) 
             comes first.

You should also know how many bytes file reads by default to determine the encoding and file type. Quoting:

     -P, --parameter name=value
             Set various parameter limits.

                   Name         Default    Explanation
                   bytes        1048576    max number of bytes to read from file
                   elf_notes    256        max ELF notes processed
                   elf_phnum    2048       max ELF program sections processed
                   elf_shnum    32768      max ELF sections processed
                   encoding     65536      max number of bytes to scan for encoding evaluation
                   indir        50         recursion limit for indirect magic
                   name         50         use count limit for name/use magic
                   regex        8192       length limit for regex searches

Only the first 65,536 bytes of a file are used for encoding detection. Our files, even just the message resource files, are larger than this limit, so you must raise it so that the whole file is examined and not only the first 65,536 bytes.

@solth
Member Author

solth commented Dec 6, 2024

But what is the purpose of the new Java test code?

The main intention was to provide an automatic way to detect when the encoding of the property files is changed to something that results in broken special characters being displayed in the frontend, which resolves #6331. The purpose was not to detect encoding changes in arbitrary source files; as far as I remember, those never caused otherwise undetected problems in the past. (If source files contain non-ASCII special characters outside of string literals, for example in variable names, compilation will probably fail anyway when the encoding is changed to something incompatible with UTF-8.)

@solth Looks good. Another idea would be to check the whole file with a regular expression containing only allowed characters for the respective language?

This is a good suggestion and would probably be a better alternative to my implementation, since it would cover the whole files and not just sample messages from them. Do you know if there are predefined regular expressions covering all valid characters for specific languages?

Would this workflow which checks all files in pull requests or push operations be better?

Yes, that is probably a better way to detect file encoding changes in any source file, and thus would cover #6324.

@solth solth closed this Dec 6, 2024
@thomaslow
Collaborator

This is a good suggestion and would probably be a better alternative to my implementation, since it would cover the whole files and not just sample messages from them. Do you know if there are predefined regular expressions covering all valid characters for specific languages?

Not really, no. I googled a bit and found LocaleData.getExemplarSet of the ICU4J library, which provides all valid characters of a language. But I have not used this library before. For German, English and Spanish it is probably not that difficult to specify regular expressions manually.
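A hand-written variant of that idea might look like the sketch below. The character allow-list is purely an assumption about what German message files might need (printable ASCII, umlauts, sharp s, typographic quotes and dashes, whitespace); it is not taken from ICU4J or from the actual Kitodo messages:

```java
import java.util.regex.Pattern;

public class AllowedCharacters {

    // Hypothetical allow-list for German message files: printable ASCII
    // (\x20-\x7E), the German umlauts and sharp s, typographic quotes and
    // en dash, and whitespace. Anything else makes the check fail.
    private static final Pattern GERMAN_ALLOWED = Pattern.compile(
            "[\\x20-\\x7E\u00C4\u00D6\u00DC\u00E4\u00F6\u00FC\u00DF\u201E\u201C\u2013\\s]*");

    static boolean isAllowed(String content) {
        return GERMAN_ALLOWED.matcher(content).matches();
    }

    public static void main(String[] args) {
        // "Vorgang l\u00F6schen" is "Vorgang löschen": every character is
        // in the allow-list, so this prints true.
        System.out.println(isAllowed("Vorgang l\u00F6schen"));
        // U+FFFD, the replacement character, typically appears when a file
        // was decoded with the wrong encoding; it is not allowed.
        System.out.println(isAllowed("Vorgang l\uFFFDschen"));
    }
}
```

A practical caveat with this approach: the allow-list has to be maintained whenever a legitimately needed character (for example a new currency sign or a loanword with an accent) shows up in the messages, whereas the strict UTF-8 decoding check never needs updating.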

@stweil
Member

stweil commented Dec 6, 2024

PR #6341 adds a CI action which checks the whole file (all which are affected by a push or pull request).

Development

Successfully merging this pull request may close these issues.

Unit test for character encodings in property files
Source files with wrong encoding