Make MarkdownFields translatable #102

Open
jeriox opened this issue Jul 5, 2022 · 10 comments

jeriox commented Jul 5, 2022

Currently, when using wagtail-localize, a MarkdownField cannot be translated in an easy way, as the whole content of the field is put into one translation segment. For a long page with a markdown body, this is not feasible. I'd like to have the MarkdownField split up in several translation segments (like with StreamFields), so I can translate them separately.
I wrote a hacky solution for that some time ago, but it breaks with the current version. I'd be happy if we could find a way to support that properly.

My old code for reference:

import html2text
from django.db.models import TextField
from wagtail_localize.segments import (
    OverridableSegmentValue,
    StringSegmentValue,
    TemplateSegmentValue,
)
from wagtail_localize.segments.extract import quote_path_component
from wagtail_localize.segments.ingest import organise_template_segments
from wagtail_localize.strings import extract_strings, restore_strings

from wagtailmarkdown.utils import render_markdown
from wagtailmarkdown.widgets import MarkdownTextarea


class MarkdownField(TextField):
    def formfield(self, **kwargs):
        defaults = {"widget": MarkdownTextarea}
        defaults.update(kwargs)
        return super().formfield(**defaults)

    def get_translatable_segments(self, value):
        # Render to HTML so the string extraction from wagtail-localize can be reused.
        template, strings = extract_strings(render_markdown(value))

        # Find all unique href values so links can be overridden per locale
        hrefs = set()
        for _string, attrs in strings:
            for tag_attrs in attrs.values():
                if "href" in tag_attrs:
                    hrefs.add(tag_attrs["href"])

        return (
            [TemplateSegmentValue("", "html", template, len(strings))]
            + [StringSegmentValue("", string, attrs=attrs) for string, attrs in strings]
            + [OverridableSegmentValue(quote_path_component(href), href) for href in sorted(hrefs)]
        )

    def restore_translated_segments(self, value, field_segments):
        # Rebuild the translated HTML, then convert it back to markdown.
        # The leading underscore avoids shadowing the built-in format().
        _format, template, strings = organise_template_segments(field_segments)
        return html2text.html2text(restore_strings(template, strings))
Member

zerolab commented Jul 6, 2022

Hey @jeriox,

thank you for sharing this. Had a few requests for making this localize-compatible, so the code snippet is very handy!

Author

jeriox commented Aug 29, 2022

I got it working again with the code above, and we will use that for now. It still feels a bit hacky to me, so we'd be happy if there was a better alternative built in :)

Member

zerolab commented Aug 29, 2022

This would need a bit of thinking, e.g.:

> I'd like to have the MarkdownField split up in several translation segments (like with StreamFields), so I can translate them separately.

Where do you draw the line and split things? Is it at every link? Every paragraph? Every heading? Given that raw HTML is allowed in there too, how should we handle that?

Author

jeriox commented Aug 29, 2022

> This would need a bit of thinking, e.g.:
>
> > I'd like to have the MarkdownField split up in several translation segments (like with StreamFields), so I can translate them separately.
>
> Where do you draw the line and split things? Is it at every link? Every paragraph? Every heading? Given that raw HTML is allowed in there too, how should we handle that?

Currently, my approach works as follows: as a lot of thought has already gone into how to split up StreamFields, I tried to reuse that as much as possible. Therefore, I render the markdown to HTML and use the existing extract_strings() function. This also ensures that links are treated appropriately. For the other direction, using html2text works quite well. I didn't test with raw HTML though. I think that splitting at every paragraph and every heading is a good choice, as it ensures that a segment doesn't need to be re-translated as long as its own text didn't change.

Author

jeriox commented Oct 15, 2024

Hey @zerolab, did you have a chance to look at this any further?

Member

zerolab commented Oct 15, 2024

@jeriox to be honest this completely flew under my radar 🙈

I think your version is better than what we currently have (i.e. nothing). Do you have the capacity to submit a PR? We'd want the logic in get_translatable_segments and restore_translated_segments to live in its own module (say wagtail_localize.py) and be conditionally loaded if localize is installed.
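
A minimal sketch of the "conditionally loaded" idea, assuming the localize-specific methods live in a mixin class. The names `is_installed`, `build_field_class` and `load_mixin` are hypothetical and not part of wagtail-markdown; the sketch only checks whether a package is importable and attaches the mixin when it is.

```python
from importlib.util import find_spec


def is_installed(package: str) -> bool:
    """Return True if a top-level package is importable, without importing it."""
    return find_spec(package) is not None


def build_field_class(base: type, load_mixin, package: str = "wagtail_localize") -> type:
    """Return `base` unchanged, or a subclass that also inherits the mixin.

    `load_mixin` is a zero-argument callable (hypothetical) that imports and
    returns the mixin class. It is only called when `package` is installed,
    so the localize-specific module is never imported otherwise.
    """
    if not is_installed(package):
        return base
    mixin = load_mixin()
    return type(base.__name__, (mixin, base), {})
```

With this shape, projects without wagtail-localize get the plain field, and no import of `wagtail_localize` is attempted.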

Author

jeriox commented Oct 16, 2024

> I think your version is better than what we currently have (i.e. nothing).

While this is true, I'm not sure if it is good enough to include it in the library. We have been using this solution in our project for two years now, and there are several problems:

If those are okay for you, I can open a PR. We'd like to do the splitting on our own instead of relying on converting to HTML, but we haven't had the capacity to do so yet.
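
For illustration, splitting the markdown source directly (rather than going through HTML) could look roughly like the sketch below. This is not the code from the project mentioned above; `split_markdown_blocks` and `join_markdown_blocks` are hypothetical names. It assumes blocks are separated by blank lines, ATX headings (`#` to `######`) form their own block, and fenced code is kept whole. Nested or mismatched fences, setext headings and raw HTML are not handled.

```python
import re

FENCE = re.compile(r"^(```|~~~)")
HEADING = re.compile(r"^#{1,6}\s")


def split_markdown_blocks(text: str) -> list[str]:
    """Split markdown into paragraph, heading and fenced-code blocks."""
    blocks: list[str] = []
    current: list[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the block collected so far, if any.
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in text.splitlines():
        if FENCE.match(line):
            # Opening or closing fence: toggle state, keep the marker line.
            in_fence = not in_fence
            current.append(line)
        elif in_fence:
            # Inside a fence, blank lines belong to the code block.
            current.append(line)
        elif not line.strip():
            flush()
        elif HEADING.match(line):
            flush()
            blocks.append(line)
        else:
            current.append(line)
    flush()
    return blocks


def join_markdown_blocks(blocks: list[str]) -> str:
    """Inverse of split_markdown_blocks for already-normalised input."""
    return "\n\n".join(blocks)
```

Each returned block could then become one translation segment, so editing one paragraph leaves the segments for the other paragraphs untouched.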

Member

zerolab commented Oct 16, 2024

Thank you for the additional context on real-life usage. Absolutely fantastic to know.

What if we make the get_translatable_segments bit pluggable (i.e. you can change it to your own project's method that does what you want it to do)?
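
One way such a pluggable hook could be wired is a dotted-path setting that names the function to use. The sketch below is only an assumption about the shape: the `SEGMENTER` key, `get_segmenter` and `default_segmenter` are hypothetical and do not exist in wagtail-markdown.

```python
from importlib import import_module
from typing import Callable


def import_callable(dotted_path: str) -> Callable:
    """Resolve 'package.module.attr' to the object it names."""
    module_path, _, attr = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"{dotted_path!r} is not a dotted path")
    return getattr(import_module(module_path), attr)


def default_segmenter(value: str) -> list[str]:
    """Fallback behaviour: the whole field is a single segment."""
    return [value]


def get_segmenter(settings: dict) -> Callable[[str], list[str]]:
    # "SEGMENTER" is a hypothetical setting key used only for this sketch.
    path = settings.get("SEGMENTER")
    return import_callable(path) if path else default_segmenter
```

In a Django project, `django.utils.module_loading.import_string` does the same job as `import_callable` here.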

The images question is outside of wagtail-markdown's purview, I'm afraid. We definitely need to solve this more centrally.

Author

jeriox commented Oct 16, 2024

I'm not sure that it needs to be specifically pluggable, as you could just subclass the provided MarkdownField and override get_translatable_segments if you are not happy with it. This is the same approach that we are currently using to implement it in the first place.

So I guess we could just include my current solution as the default, especially if images are out of scope anyway and the issue with duplicate segments maybe gets fixed centrally as well. The other things are just small issues IMO and could simply be mentioned in the docs.

@asennoussi

I'm interested in any updates around this topic.
I also have articles with markdown and the translation input is a text area that doesn't support new lines so it makes it really not handy to translate. (I have to add the
myself but haven't tried it myself either)
