
Add names for 2 letter codes #572

Conversation

tovmasharrison
Contributor

Description

I have added the names for two-letter language codes. Afterward, I modified how the unlisted_languages list is populated and resolved the bug with duplicate articles for 3-letter codes (#567).

Fixes #571

Checklist:

"ie"
],
"names": [
"Interlingue"
Collaborator


"Interlingue", "Occidental"

generate.py Outdated
@@ -213,7 +213,7 @@ def normalize(code):
         base_code = base_language_code(normalized_code)
         if base_code in language['codes']:
             language_name = language.get('names', [None])[0]
-            language_slug = slugify(language_name) if language_name else code
+            language_slug = slugify(language_name) if language_name else base_code
             break
     if api_id not in [ 'alibaba', 'baidu', 'niutrans' ] and len(base_code) == 2 and not language_name:
Collaborator


Can we make this stricter now?

e.g.

if len(base_code) == 2 and not language_name:

@@ -213,7 +213,7 @@ def normalize(code):
         base_code = base_language_code(normalized_code)
         if base_code in language['codes']:
             language_name = language.get('names', [None])[0]
-            language_slug = slugify(language_name) if language_name else code
+            language_slug = slugify(language_name) if language_name else base_code
Collaborator


What does this change mean? For which languages does it change something? e.g. Chinese?

Contributor Author


At the moment, the change doesn't make any difference, since I have also added the names for the 2-letter codes. However, I changed it to base_code because the validation is done with base_code.
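
For illustration, a hypothetical example of what the fallback change would do for an entry without a name (base_language_code here is a stand-in, not the real implementation in generate.py, and the slugify step is omitted):

```python
# Hypothetical stand-in; the real base_language_code in generate.py may differ.
def base_language_code(code):
    # e.g. 'zh-CN' -> 'zh', 'pt_BR' -> 'pt'
    return code.replace('_', '-').split('-')[0].lower()

code = 'zh-CN'
base_code = base_language_code(code)   # 'zh'
language_name = None                   # entry with no name listed

# The old fallback used the full code, the new one uses the base code,
# so an unnamed variant would get the slug 'zh' rather than 'zh-CN'.
old_slug = language_name or code       # 'zh-CN'
new_slug = language_name or base_code  # 'zh'
```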

@@ -6707,272 +6714,317 @@
},
{
"codes": [
"huyu"
"ch"
Collaborator

@bittlingmayer Nov 4, 2023


By the way, we could easily add the 3-letter codes for every language - we have the mapping from 3-letter codes to 2-letter codes in api_languages.json and could reverse it to add all of these.
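
A minimal sketch of that reversal, assuming api_languages.json is a flat {3-letter: 2-letter} mapping and each entry in languages.json (filename assumed) has a "codes" list - both layouts are assumptions, not confirmed by this PR:

```python
import json

# Assumed layout: {"deu": "de", "fra": "fr", ...}
with open('api_languages.json') as f:
    three_to_two = json.load(f)

# Reverse the mapping: 2-letter code -> 3-letter code.
two_to_three = {two: three for three, two in three_to_two.items()}

with open('languages.json') as f:
    languages = json.load(f)

# Add the matching 3-letter code to every entry that only lists the 2-letter one.
for language in languages:
    for code in list(language['codes']):
        three = two_to_three.get(code)
        if three and three not in language['codes']:
            language['codes'].append(three)
```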

@@ -5,10 +5,12 @@ nav_order: 998
 nav_exclude: true
 parent: Languages
 layout: language
-title: <code>kl</code>
-description: Machine translation for <code>kl</code>
+title: Kalaallisut,
Collaborator


The comma is somehow in the .json?

@@ -5,17 +5,19 @@ nav_order: 999
 nav_exclude: true
 parent: Languages
 layout: language
-title: <code>vo</code>
-description: Machine translation for <code>vo</code>
+title: "Volap\xFCk"
Collaborator


It's not like this in the .json file. Maybe something with your Python setup? We should just use Unicode.
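
One likely cause, assuming the front matter is written with PyYAML (an assumption about generate.py): yaml.dump escapes non-ASCII characters by default, and allow_unicode=True keeps them literal.

```python
import yaml  # PyYAML; assumed to be what writes the front matter

data = {'title': 'Volapük'}

print(yaml.dump(data))                      # title: "Volap\xFCk"
print(yaml.dump(data, allow_unicode=True))  # title: Volapük
```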

Collaborator


And our slugify function for the page name should probably remove diacritics - umlauts, accent marks, etc. - so languages/volapuk.md, not languages/volapük.md.
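
A minimal sketch of a diacritic-stripping slugify (the existing slugify in generate.py may look different; this is only illustrative):

```python
import re
import unicodedata

def slugify(name):
    # Decompose accented characters and drop the combining marks,
    # so 'Volapük' becomes 'Volapuk'.
    ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    # Lowercase and collapse anything non-alphanumeric into hyphens.
    return re.sub(r'[^a-z0-9]+', '-', ascii_name.lower()).strip('-')

print(slugify('Volapük'))  # -> 'volapuk'
```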

Contributor Author


> It's not like this in the .json file. Maybe something with your Python setup? We should just use Unicode.

I'll open a separate issue for handling such characters in .md files properly.

Collaborator

@bittlingmayer left a comment


Looks good overall, thanks! Just a few small comments and questions.

@tovmasharrison
Contributor Author

@bittlingmayer jan, all done.

@cefoo
Collaborator

cefoo commented Nov 13, 2023

Hey @tovmasharrison!
Works well on my local copy. Thank you for your work!!

@tovmasharrison
Contributor Author

> Hey @tovmasharrison!
> Works well on my local copy. Thank you for your work!!

Hi @cefoo!

It's great to hear that it's working as expected!

@bittlingmayer
Collaborator

This seems wrong.

[Screenshot 2023-11-17 at 23:06:18]

@tovmasharrison
Contributor Author

> This seems wrong.
>
> [Screenshot 2023-11-17 at 23:06:18]

Fixed @bittlingmayer

@bittlingmayer merged commit d2e2e09 into machinetranslate:master on Dec 5, 2023
Development

Successfully merging this pull request may close these issues.

Feature: Add the names of the unlisted languages that have 2 letter codes.
3 participants