Add test to check encoding of message properties #6332

solth · 2024-11-29T11:48:51Z

Resolves #6331

This PR adds a test that loads the German and Spanish message property files used in Kitodo.Production and checks if they their contents are encoded with UTF-8. This is done by loading specific messages known to contain unicode special characters and decode them with UTF-8. The results are then compared with strings containing those special characters. The tests fail if the decoded message does not match the expected strings.

This PR also sets the encoding of Kitodo/src/main/resources/messages/password_es.properties and Kitodo/src/main/resources/messages/errors_de.properties to UTF-8 and therefore fixes #6324

Note: this PR does not yet replace all the codepoints like \u00FC with the actual special characters like ü in the message files. That should be done in a follow-up pull request.

thomaslow · 2024-12-02T11:54:00Z

@solth Looks good. Another idea would be to check the whole file with a regular expression containing only allowed characters for the respective language?

stweil

The changes for the property files look good. I suggest to apply them in a separate commit.

But what is the purpose of the new Java test code? Don't we want to catch any new contribution which uses a wrong encoding (!= UTF-8), no matter whether it is a property or any other non binary file? It's also sufficient to run the encoding test for files which are affected by a push or a pull request.

stweil · 2024-12-02T18:55:39Z

Would this workflow which checks all files in pull requests or push operations be better?

henning-gerhardt · 2024-12-03T08:52:59Z

Would this workflow which checks all files in pull requests or push operations be better?

You can use file to detect the file content encoding but you must extend the usage of used file parameters. Quoting form the manual page of file:

file tests each argument in an attempt to classify it.  There are three sets of tests, performed in this order: filesystem tests, magic tests, and language tests.  The  first test that succeeds causes the file type to be printed.

So, you must at least use the parameter -k or --keep-going to not to stop on the first result. The sub results should / must be evalated too as after one encoding an other encoding in the same file can occur. Quoting the man page:

     -k, --keep-going
             Don't stop at the first match, keep going.  Subsequent matches will be have the string ‘\012- ’ prepended.  
             (If you want a newline, see the -r option.)  The magic pattern with the highest strength (see the -l option) 
             comes first.

You should even know how many bytes on default are read by file to determinate the encoding and file type. Quoting:

     -P, --parameter name=value
             Set various parameter limits.

                   Name         Default    Explanation
                   bytes        1048576    max number of bytes to read from file
                   elf_notes    256        max ELF notes processed
                   elf_phnum    2048       max ELF program sections processed
                   elf_shnum    32768      max ELF sections processed
                   encoding     65536      max number of bytes to scan for encoding evaluation
                   indir        50         recursion limit for indirect magic
                   name         50         use count limit for name/use magic
                   regex        8192       length limit for regex searches

Only the first 65.536 bytes of a file are used for encoding detecting. Our files - even only the message resource files - are larger than this limit, so you must extend the limits to get the whole file and not only the first 65.536 bytes.

solth · 2024-12-06T12:46:08Z

But what is the purpose of the new Java test code?

The main intention was to provide an automatic way to detect when the encoding of the property files has been changed to something that results in broken special characters being displayed in the frontend and thus would resolve #6331. The purpose was not to detect encoding changes to any source code files, as in my memory that didn't cause any otherwise undetected problems in the past (if the source files contain non-ASCII special characters - outside of String literals, for example in variable names - the whole compilation of the code will probably fail if the encoding is changed to something incompatible with UTF-8).

@solth Looks good. Another idea would be to check the whole file with a regular expression containing only allowed characters for the respective language?

This is a good suggestion and would probably be a better alternative to my implementation, since it would cover the whole files and not just sample messages from them. Do you know if there are predefined regular expressions covering all valid characters for specific languages?

Would this workflow which checks all files in pull requests or push operations be better?

Yes, that is probably a better way to detect file encoding changes in any source file, and thus would cover #6324.

thomaslow · 2024-12-06T13:18:10Z

This is a good suggestion and would probably be a better alternative to my implementation, since it would cover the whole files and not just sample messages from them. Do you know if there are predefined regular expressions covering all valid characters for specific languages?

Not really, no. I googled a bit and found this: LocaleData.getExemplarSet of the ICU4J library, which provides all valid characters of a language. But I have not used this library before. For German, English and Spanish it is probably not that difficult to specify regular expressions manually.

stweil · 2024-12-06T14:30:09Z

PR #6341 adds a CI action which checks the whole file (all which are affected by a push or pull request).

solth requested review from stweil and thomaslow and removed request for stweil November 29, 2024 11:48

stweil reviewed Dec 2, 2024

View reviewed changes

Add test to check encoding of message properties

a7cba1a

solth force-pushed the encoding-test branch from da628a4 to a7cba1a Compare December 3, 2024 08:06

solth closed this Dec 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add test to check encoding of message properties #6332

Add test to check encoding of message properties #6332

solth commented Nov 29, 2024

thomaslow commented Dec 2, 2024

stweil left a comment

stweil commented Dec 2, 2024

henning-gerhardt commented Dec 3, 2024

solth commented Dec 6, 2024

thomaslow commented Dec 6, 2024

stweil commented Dec 6, 2024

Add test to check encoding of message properties #6332

Add test to check encoding of message properties #6332

Conversation

solth commented Nov 29, 2024

thomaslow commented Dec 2, 2024

stweil left a comment

Choose a reason for hiding this comment

stweil commented Dec 2, 2024

henning-gerhardt commented Dec 3, 2024

solth commented Dec 6, 2024

thomaslow commented Dec 6, 2024

stweil commented Dec 6, 2024