Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tolerate additional entries in obsolete messages #79

Open
mathieucaroff opened this issue Jun 5, 2023 · 10 comments
Open

Tolerate additional entries in obsolete messages #79

mathieucaroff opened this issue Jun 5, 2023 · 10 comments

Comments

@mathieucaroff
Copy link

mathieucaroff commented Jun 5, 2023

I have a .po file whose translations are handled with Poedit and I get obsolete entries in my .po files. The parser fails when it encounters them.

Reproduction

en.po

#, fuzzy
#~| msgid "Latest version"
#~ msgid "Latest Version"
#~ msgstr "Latest version"

Parsing the content of en.po with gettext-parser (running gettextParser.po.parse(content)) produces the following error:

node_modules\gettext-parser\lib\poparser.js:224
            throw err;
            ^

SyntaxError: Error parsing PO data: Invalid key name "|" at line 2. This can be caused by an unescaped quote character in a msgid or msgstr value.

Expected behavior

The parser should not crash. It should either ignore the entry, or properly parse it to data.

Misc

I believe this issue was introduced by the following PR: #64 -- released in version 5.0.0. 4.2.0 is the last version before 5.0.0.

Have a good day !

@smhg
Copy link
Owner

smhg commented Jun 6, 2023

Thank you for reporting this! After some initial investigation any additional character after an obsolete marker (#~) seems to lead to this error. This is unrelated to the fuzzy flag.
So currently only basic obsolete messages are supported. Obsolete messages currently cannot contain previous messages (|), references (:),...
It would definitely be good to support this. A PR would be very much appreciated.

@smhg smhg changed the title Tolerate fuzzy entries in obsolete messages Tolerate additional entries in obsolete messages Jun 6, 2023
@smhg
Copy link
Owner

smhg commented Jun 6, 2023

@KapitanOczywisty would you be able to have a look at this?

@KapitanOczywisty
Copy link
Contributor

So currently only basic obsolete messages are supported. Obsolete messages currently cannot contain previous messages (|), references (:),...

@smhg Yes, they can, by using #| and #:. I don't know where #~| came from, but It's not included anywhere in gettext documentation. All comments are connected to their msgid, so if that msgid is obsolete there is no need to mark comments as obsolete. Reversing that: why not #~, fuzzy? You never see #~,, likely because It's popular enough that people seen it correctly used and implemented.

@smhg
Copy link
Owner

smhg commented Jun 8, 2023

@KapitanOczywisty If I understand correctly #| means a previous translation and #~| is an obsolete previous translation?
I can't tell whether that is actually usefull to programs, but moving from non-obsolete to obsolete just adds the ~ so it makes sense that's the result?
And obsolete message parsing currently doesn't look at the 3rd character correctly I think?

@KapitanOczywisty
Copy link
Contributor

If I understand correctly #| means a previous translation and #~| is an obsolete previous translation?

I only found that: https://github.com/winlibs/gettext/blob/24b0fe071cff79acf4a3904591b481328732ddc1/source/gettext-tools/src/po-lex.c#L903-L906

And obsolete message parsing currently doesn't look at the 3rd character correctly I think?

I thinks parsing ignores whitespace and searches for msgid keyword. I can make this work wit #~| though this creates some questions:

  • should #~| be treated as #|?
  • if what should comment be named?
  • should this also be parsed for non-obsolete messages?

@mathieucaroff
Copy link
Author

I believe the simplest, non-surprising behavior would be to treat as a regular comment, if that makes sense with regards to the way gettext-parser currently works.

@smhg
Copy link
Owner

smhg commented Jun 17, 2023

Basically #~| should be treated as if it were #~#|, right?
It should be listed like this one I think. These are already parsed out.

But looking a the code, this does not look straighforward.
First this @KapitanOczywisty: do you remember why you grouped the obsolete state with the none state? Wouldn't it be better to separate those? Because different parts of the checks under those are applicable to different states, no?

@KapitanOczywisty
Copy link
Contributor

do you remember why you grouped the obsolete state with the none state?

At the time I was not aware of #~| so only correct use would be with keywords e.g. #~ msgid. I can fix that, just we need to decide how #~| should be handled. I'd prefer to treat it as #|, which means parse and compilation would yield different results than input, but all other solutions have major problems with compatibility anyway.

@smhg
Copy link
Owner

smhg commented Jun 17, 2023

As an example: if you prepend it to this line, it would appear on this line. Right?
If not, can you give an example when it gets ambiguous?

@KapitanOczywisty
Copy link
Contributor

In my suggestion, this file:

#, fuzzy
#~| msgid "Latest version"
#~ msgid "Latest Version"
#~ msgstr "Latest version"

after parse would become:

{
    // ...
    "obsolete": {
        "": {
            "Latest Version": {
                "msgid": "Latest Version",
                "comments": {
                    "flag": "fuzzy",
                    "previous": "msgid \"Latest version\""
                },
                "msgstr": [
                    "Latest version"
                ]
            }
        }
    }
}

and then after compilation:

#, fuzzy
#| msgid "Latest version"
#~ msgid "Latest Version"
#~ msgstr "Latest version"

reosarevok added a commit to metabrainz/po2json that referenced this issue May 30, 2024
Versions above this break with doubly escaped comments
such as #~| which weblate uses with fuzzy matching.
smhg/gettext-parser#79
reosarevok added a commit to reosarevok/musicbrainz-server that referenced this issue May 30, 2024
Versions above this break with doubly escaped comments
such as #~| which weblate uses with fuzzy matching.
smhg/gettext-parser#79
reosarevok added a commit to reosarevok/musicbrainz-server that referenced this issue May 30, 2024
Versions above this break with doubly escaped comments
such as #~| which weblate uses with fuzzy matching.
smhg/gettext-parser#79
reosarevok added a commit to reosarevok/musicbrainz-server that referenced this issue May 30, 2024
Versions above this break with doubly escaped comments
such as #~| which weblate uses with fuzzy matching.
smhg/gettext-parser#79
reosarevok added a commit to reosarevok/musicbrainz-server that referenced this issue May 30, 2024
Versions above this break with doubly escaped comments
such as #~| which weblate uses with fuzzy matching.
smhg/gettext-parser#79
reosarevok added a commit to reosarevok/musicbrainz-server that referenced this issue Jun 4, 2024
Versions above this break with doubly escaped comments
such as #~| which weblate uses with fuzzy matching.
smhg/gettext-parser#79
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants