From 749866a1a2e442ab47aae0a0853b9e3377831a40 Mon Sep 17 00:00:00 2001
From: Chris Pyle <118906070+chpy04@users.noreply.github.com>
Date: Tue, 3 Sep 2024 11:43:04 -0400
Subject: [PATCH] CLDR-17566 Converting Updating Codes P1 (#4005)

---
 .../external-version-metadata.md | 29 +++++++
 .../likelysubtags-and-default-content.md | 24 ++++++
 .../updating-codes/update-currency-codes.md | 62 ++++++++++++++
 .../update-language-script-info.md | 41 ++++++++++
 .../language-script-description.md | 23 ++++++
 .../update-languagescriptregion-subtags.md | 82 +++++++++++++++++++
 .../update-time-zone-data-for-zoneparser.md | 35 ++++++++
 7 files changed, 296 insertions(+)
 create mode 100644 docs/site/development/updating-codes/external-version-metadata.md
 create mode 100644 docs/site/development/updating-codes/likelysubtags-and-default-content.md
 create mode 100644 docs/site/development/updating-codes/update-currency-codes.md
 create mode 100644 docs/site/development/updating-codes/update-language-script-info.md
 create mode 100644 docs/site/development/updating-codes/update-language-script-info/language-script-description.md
 create mode 100644 docs/site/development/updating-codes/update-languagescriptregion-subtags.md
 create mode 100644 docs/site/development/updating-codes/update-time-zone-data-for-zoneparser.md

diff --git a/docs/site/development/updating-codes/external-version-metadata.md b/docs/site/development/updating-codes/external-version-metadata.md
new file mode 100644
index 00000000000..b0ab87557d9
--- /dev/null
+++ b/docs/site/development/updating-codes/external-version-metadata.md
@@ -0,0 +1,29 @@
+---
+title: Updating External Version Metadata
+---
+
+# Updating External Version Metadata
+
+## Updating Metadata
+
+[CLDR\-15005](https://unicode-org.atlassian.net/browse/CLDR-15005) is for updating the process for external metadata versions.
The following table is out of date with [common/properties/external\_data\_versions.tsv](https://github.com/unicode-org/cldr/blob/main/common/properties/external_data_versions.tsv)
+
+### TODO: Need to add instructions for updating external metadata
+
+~~The following tells how to get the version info for imported data used in a CLDR release.~~
+
+| Data | File | Version Info | Date |
+|---|---|---|---|
+| UN literacy data | [un_literacy.csv](https://github.com/unicode-org/cldr/blob/master/tools/java/org/unicode/cldr/util/data/external/un_literacy.csv) | Date at top | 2012-08 |
+| World Bank data | [world_bank_data.csv](https://github.com/unicode-org/cldr/blob/master/tools/java/org/unicode/cldr/util/data/external/world_bank_data.csv) | Date at bottom | 2020-12-16 |
+| Factbook data | [factbook_population.txt](https://github.com/unicode-org/cldr/blob/master/tools/java/org/unicode/cldr/util/data/external/factbook_population.txt) | Record when downloaded in TBD | |
+| ISO 639 (language) data | [iso-639-3-version.tab](https://github.com/unicode-org/cldr/blob/master/tools/java/org/unicode/cldr/util/data/iso-639-3-version.tab) | Date in YYYYMMDD format | 2021-02-02 |
+| ISO subdivision codes | iso subdivision codes | Record when downloaded in TBD | |
+| ISO subdivision names | iso subdivision names | Record when downloaded in TBD | |
+| ISO currency data | iso currency data | Record when downloaded in TBD | |
+| Timezone IDs (tzdb) | timezones (tz) | Release date on [IANA time zone DB](https://www.iana.org/time-zones) | 2021-01-24 (2021a) |
+| Top level domains | [tlds-alpha-by-domain.txt](https://github.com/unicode-org/cldr/blob/master/tools/java/org/unicode/cldr/util/data/tlds-alpha-by-domain.txt) | Date at top | 2021-02-17 |
+| Language Groups | TBD | Record when downloaded in TBD | |
+| UN / EU Codes | TBD | Record when downloaded in TBD | |
+
+![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
\ No newline at end of file
diff --git
a/docs/site/development/updating-codes/likelysubtags-and-default-content.md b/docs/site/development/updating-codes/likelysubtags-and-default-content.md
new file mode 100644
index 00000000000..16da0776523
--- /dev/null
+++ b/docs/site/development/updating-codes/likelysubtags-and-default-content.md
@@ -0,0 +1,24 @@
+---
+title: LikelySubtags and Default Content
+---
+
+# LikelySubtags and Default Content
+
+1. First, complete [Update Language/Script/Region Subtags](https://cldr.unicode.org/development/updating-codes/update-languagescriptregion-subtags).
+2. Run GenerateMaximalLocales with VM argument ```-DCLDR_DIR``` set to your cldr directory to generate the likely subtag data **AND** the default content locales.
+ 1. If you are trying to debug, add the VM argument ```-DGenerateMaximalLocalesDebug```
+3. Input data:
+ 1. Data comes from territory/language information in supplemental data.
+ 1. However, it is supplemented by **LANGUAGE\_OVERRIDES** in GenerateMaximalLocales.java
+ 1. If there is no territory/language information in supplemental data for a language, add it to **LANGUAGE\_OVERRIDES**.
+ 2. If the mapping changes when it shouldn't (there are some special cases), add to **LANGUAGE\_OVERRIDES.**
+4. Output:
+ 1. Creates {CLDR\_DIR}/../Generated/cldr/supplemental/likelySubtags.xml and {CLDR\_DIR}/../Generated/cldr/supplemental/supplementalMetadata.xml
+ 2. Diff with {CLDR\_DIR}/common/supplemental/likelySubtags.xml and {CLDR\_DIR}/common/supplemental/supplementalMetadata.xml
+ 3. Be very careful to diff everything and check for errors.
+ 1. Watch especially for backwards incompatible changes; that is, changes rather than just additions.
+ 2. Look at the above to handle that with **LANGUAGE\_OVERRIDES.**
+ 4. Run tests, fix input data, and iterate as necessary.
+ 1. Copy into the svn workspace and commit.
+
+![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
\ No newline at end of file
diff --git a/docs/site/development/updating-codes/update-currency-codes.md b/docs/site/development/updating-codes/update-currency-codes.md
new file mode 100644
index 00000000000..00113ec744e
--- /dev/null
+++ b/docs/site/development/updating-codes/update-currency-codes.md
@@ -0,0 +1,62 @@
+---
+title: Update Currency Codes
+---
+
+# Update Currency Codes
+
+- Go to https://www.six-group.com/en/products-services/financial-information/data-standards.html#scrollTo=currency-codes
+- Take the link for "Current Currency and Funds": ["List one (XML)"](https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/amendments/lists/list_one.xml)
+- Save the page as {cldr}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/dl\_iso\_table\_a1\.xml
+- ```curl 'https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list_one.xml' > tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/dl_iso_table_a1.xml```
+- Take the link for "Historic denominations": "[List three (XML)](https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/amendments/lists/list_three.xml)"
+- Save the page as {cldr}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/dl\_iso\_table\_a3\.xml
+- ```curl 'https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/lists/list_three.xml' > tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/dl_iso_table_a3.xml```
+- **Use git diff to sanity check the two XML files against the old, and check them in.**
+ - **"git diff \-w" is helpful to ignore whitespace.
If there are only whitespace changes, there's no need to check them in.**
+- **Check the** [**ISO amendments**](https://www.six-group.com/en/products-services/financial-information/data-standards.html#scrollTo=amendments) **to get changes that will happen during the current cycle.**
+ - Example: https://www.six-group.com/dam/download/financial-information/data-center/iso-currrency/amendments/dl_currency_iso_amendment_170.pdf
+ - It appears right now like there is no good way to collect all the amendments that are applicable, except to change "170" in the above link by incrementing until error \#404 results. So:
+ - *Review all amendments that are dated after the previous update, and patch the XML files and the* ```supplementalData.xml``` *as below.*
+ - *Record the last number viewed in the URL above.*
+ - *(There is a "download all amendments" link now that has a spreadsheet summary.)*
+ - **Record the version: See** [**Updating External Metadata**](https://cldr.unicode.org/development/updating-codes/external-version-metadata)
+ - If there are no diffs in the two iso tables, and no relevant changes in the amendments, you are done.
+ - Run ```CountItems -Dmethod=generateCurrencyItems``` to generate the new currency list.
+ - If any currency is missing from ISO4217\.txt, the program will throw an exception and will print a list of items at the end that need to be added to the ISO4217\.txt file. Add as described below.
+ - Once the necessary codes are added to ISO4217\.txt, repeat the CountItems \-Dmethod\=generateCurrencyItems until it runs cleanly.
+ - If any country changes the use of a currency, verify that there is a corresponding entry in SupplementalData
+ - Since ISO doesn't publish the exact date change (usually just a month), you may need to do some additional research to see if you can determine the exact date when a new currency becomes active, or when an old currency becomes inactive.
If you can't find the exact date, use the last day of the month ISO publishes for an old currency expiring.
+ - For new stuff, see below.
+ - Adding a currency:
+ - Make sure the new code exists in common/bcp47/currency.xml. The currency code should be in lower case, and make sure the "since" release corresponds to the next release of CLDR that will publish using this data.
+ - In SupplementalData:
+ - If it has unusual rounding or number of digits, add to:
+ - \
+ - \
+ - ...
+ - For each country in which it comes into use, add a line for when it becomes valid
+ - \
+ - \
+ - Add the code to the file java/org/unicode/cldr/util/data/ISO4217\.txt. This is important, since it is used to get the valid codes for the survey tool.
+ - Example:
+ - currency \| TRY \| new Turkish Lira \| TR \| TURKEY \| C
+ - Mark the old code in java/org/unicode/cldr/util/data/ISO4217\.txt as deprecated.
+ - currency \| TRL \| Old Turkish Lira \| TR \| TURKEY \| O
+ - Changing currency.
+ - If the currency goes out of use in a country, then add the last day of use, such as:
+ - \
+ - \
+ - \=\>
+ - \
+ - \
+ - Edit common/main/en.xml to add the new names (or change old ones) based on the descriptions.
+ - If there is a collision between a new and old name, the old one typically changes to the currency name with the date range
+ - "currency\_name (1983\-2003\)".
+ - Check in your changes
+ - common/bcp47/currency.xml
+ - tools/java/org/unicode/cldr/util/data/ISO4217\.txt
+ - common/main/en.xml
+ - common/supplemental/supplementalData.xml
+- ***Note: We no longer maintain the list of currency in supplementalMetadata.xml (***[***\#4298***](http://unicode.org/cldr/trac/ticket/4298)***). The list is currently maintained by bcp47/currency.xml.
We need to move the code used for checking list of ISO currency (and its numeric code mapping) currently in ICU tools repository (http://source.icu-project.org/repos/icu/tools/trunk/currency/).***
+
+![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
\ No newline at end of file
diff --git a/docs/site/development/updating-codes/update-language-script-info.md b/docs/site/development/updating-codes/update-language-script-info.md
new file mode 100644
index 00000000000..fd663ee9a73
--- /dev/null
+++ b/docs/site/development/updating-codes/update-language-script-info.md
@@ -0,0 +1,41 @@
+---
+title: Update Language Script Info
+---
+
+# Update Language Script Info
+
+### Main
+
+1. https://github.com/unicode-org/cldr/tree/main/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data has files with this form:
+ 1. **country\_language\_population.tsv**
+ 2. **language\_script.tsv**
+ 3. For a description of the contents, see [Language Script Guidelines](https://cldr.unicode.org/development/updating-codes/update-language-script-info/language-script-description)
+ 1. Do not edit the above files with a plain text editor; they are tab\-delimited UTF\-8 with many fields and should be imported/edited with a spreadsheet editor. Excel or Google sheets should also work fine.
+2. The world bank, un, and factbook data should be updated as per [Updating Population, GDP, Literacy](https://cldr.unicode.org/development/updating-codes/updating-population-gdp-literacy)
+3. Note that there is an auxiliary file **util/data/external/other\_country\_data.txt**, which contains data that supplements the others. If there are errors below because the country population is less than the language population, then that file may need updating.
+ 1. Run the tool **ConvertLanguageData**.
+ 1. \-DADD\_POP\=**true**; for error messages.
+ 1. If there are any different country names, you'll get an error: edit external/alternate\_country\_names.txt to add them.
+ 2.
Look for failures in the language vs script data, following the line:
+ - Problems in **language\_script.tsv**
+ 3. Look for Territory Language data, following the line:
+ - **Possible Failures ...**
+ - In Basic Data but not Population \> 20%
+ - and the reverse.
+ 4. Look for general problems, following the line:
+ - **Failures in Output.**
+ - It will also warn if a country doesn't have an official or de facto official language.
+ 5. Work until resolved.
+ 2. *The tool updates in place* **{cldrdata}/common/supplemental/supplementalData.xml**
+ 3. Carefully diff
+ 4. Then run QuickCheck to verify that the DTD is in order, and commit.
+
+### Update the supplementalData.xml \
+
+1. For UN M.49 codes, see [Updating UN Codes](https://cldr.unicode.org/development/updating-codes/updating-un-codes)
+2. For the UN, go to https://www.un.org/en/member-states/index.html. Copy the table, and paste into util/data/external/un\_member\_states\_raw.txt. Diff with old. **BROKEN LINK**
+3. For the EU, see instructions on [Updating UN Codes](https://cldr.unicode.org/development/updating-codes/updating-un-codes)
+4. For the EZ, do the same with , into util/data/external/ez\_member\_states\_raw.txt **BROKEN LINK**
+ 1. If there are changes, update \
+
+![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
\ No newline at end of file
diff --git a/docs/site/development/updating-codes/update-language-script-info/language-script-description.md b/docs/site/development/updating-codes/update-language-script-info/language-script-description.md
new file mode 100644
index 00000000000..777ff0abc60
--- /dev/null
+++ b/docs/site/development/updating-codes/update-language-script-info/language-script-description.md
@@ -0,0 +1,23 @@
+---
+title: Language Script Description
+---
+
+# Language Script Description
+
+The language\_script spreadsheet should list all of the language / script combinations that are in common modern use.
The countries are not important, since their function has been overtaken by the country\_language\_population spreadsheet.
+
+1. If the language and script are both modern, and the script is a major way to write the language in some country, then we should see that line marked as **primary**.
+2. Otherwise it should be marked **secondary**.
+
+Every language that is in official use in any country according to country\_language\_population should have at least one primary script in the language\_script spreadsheet.
+
+If a language has multiple primary scripts, then it should not appear without the script tag in the country\_language\_population.tsv. For example, we should not see "az", but rather "az\_Cyrl", "az\_Latn", and so on. For each country where the language is used, we should see figures on the script\-specific values. The values may overlap, that is, we may see az\_Cyrl at 60% and az\_Latn at 55%. However, the combination with the predominantly used script **must** have a larger figure than the others.
+
+This is also reflected in CLDR main: languages with multiple scripts will have that reflected in their structure (e.g. sr\-Cyrl\-RS), with aliases for the language\-region combinations.
+
+Files in https://github.com/unicode-org/cldr/tree/main/tools/cldr-code/src/main/resources/org/unicode/cldr/util/data
+
+1. country\_language\_population.tsv
+2.
language\_script.tsv
+
+![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
\ No newline at end of file
diff --git a/docs/site/development/updating-codes/update-languagescriptregion-subtags.md b/docs/site/development/updating-codes/update-languagescriptregion-subtags.md
new file mode 100644
index 00000000000..51b35bd05e7
--- /dev/null
+++ b/docs/site/development/updating-codes/update-languagescriptregion-subtags.md
@@ -0,0 +1,82 @@
+---
+title: Update Language/Script/Region Subtags
+---
+
+# Update Language/Script/Region Subtags
+
+### Updated 2021\-02\-17 by Yoshito Umaoka
+
+### This updates language codes, script codes, and territory codes.
+
+- First get the latest ISO 639\-3 from https://iso639-3.sil.org/code_tables/download_tables
+ - Download the zip file containing the UTF\-8 tables; it will have a name like iso\-639\-3\_Code\_Tables\_20210202\.zip
+ - Unpack the zip file and update files below with the latest version:
+ - {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/iso\-639\-3\.tab
+ - {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/iso\-639\-3\_Name\_Index.tab
+ - {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/iso\-639\-3\-macrolanguages.tab
+ - {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/iso\-639\-3\_Retirements.tab
+ - Take the **latest** version number of the zip files (e.g.
iso\-639\-3\_Code\_Tables\_**20210202**.zip), and paste into
+ - {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/iso\-639\-3\-version.tab
+- Go to http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
+ - (you can set up a watch for changes in this page with http://www.watchthatpage.com )
+ - Save as {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/language\-subtag\-registry
+- Go to http://data.iana.org/TLD/
+ - Right\-click on [tlds\-alpha\-by\-domain.txt](http://data.iana.org/TLD/tlds-alpha-by-domain.txt) and save as
+ - {CLDR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/[tlds\-alpha\-by\-domain.txt](http://data.iana.org/TLD/tlds-alpha-by-domain.txt)
+- If using Eclipse, refresh the files
+- Diff each with the old copy to check for consistency
+ - Some of the steps below require you to note specific differences.
+- Check if there is a new macrolanguage (marked with M in the second column of the iso\-639\-3\.tab file). (Should automate this, but there typically aren't that many new/changed entries).
+- **Update tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/external/iso\_3166\_status.txt**
+ - Go to https://www.iso.org/obp/ui/#iso:pub:PUB500001:en
+ - Click **Full List of Country Codes**
+ - Run the tool **CompareIso3166\_1Status**
+ - Click on the "Officially Assigned" code type and also the "Other Codes" code type
+ - Compare total counts with tool output: example "*formerly\_used \|\|  22*" coinciding with 22 Formerly Used codes
+ - If something is wrong, you'll have to scroll through the code list and/or dig around for the updates
+- Check if ISO has done something destabilizing with codes: you need to handle it specially.
+- **Record the version: See [Updating External Metadata](https://cldr.unicode.org/development/updating-codes/external-version-metadata)**
+- Do validity checks and regenerate: for details see [Validity](https://cldr.unicode.org/development/updating-codes/update-validity-xml)
+ - You'll have to do this again in [Updating Subdivision Codes](https://cldr.unicode.org/development/updating-codes/updating-subdivision-codes).
+- Edit common/main/en.xml to add any new names, based on the Descriptions in the registry file.
+ - *You only need to add new languages and scripts that we add to supplementalMetadata.*
+ - But you need all territories.
+ - Any new macrolanguages need a language alias.
+ - Diff for sanity check
+- If the code becomes deprecated, then add to supplementalMetadata under \
+ - If there is a single replacement, add it.
+ - Territories can have multiple replacements. Put them in population order.
+- There are a few territories that don't yet have a top level domain (TLD) assigned, such as "BQ" or "SS".
+ - If there are new ones added in tlds\-alpha\-by\-domain.txt for a territory already in CLDR, update {cldrdata}\\tools\\java\\org\\unicode\\cldr\\util\\data\\territory\_codes.txt with the new TLD (usually the same as the country code).
+- For new territories (regions) **// TODO: automate this more**
+ - Add to the territoryContainment in supplementalData.xml
+ - The data for that is at the UN site:
+ - With data from the EU at
+ - Add to territory\_codes.txt
+ - Use the UN mapping above for the 3\-letter and 3\-digit codes.
+ - FIPS is a withdrawn standard as of 2008, so any new territories won't have a FIPS10 code.
+ - Look at tlds\-alpha\-by\-domain.txt to see if the new territory has a TLD assigned yet.
+ - Rerun CountItems above.
+ - Add metazone mappings as needed.
(Usually John \- requires research)
+ - Add the country/lang/population data (Usually Rick \- requires research)
+ - Add the currency data (Usually John \- requires research)
+ - ~~Update util/data/territory\_codes.txt~~
+ - ~~This step will be different once the data is moved into SupplementalData.xml~~
+ - ~~Todo: fix GenerateEnums around Utility.getUTF8Data("territory\_codes.txt");~~
+- Then run GenerateEnums.java, and make sure it completes with no exceptions. Fix any necessary results.
+ - Missing alpha3 for: xx, or "In RFC 4646 but not in CLDR: \[EA, EZ, IC, UN]"
+ - Ignore if it is {EA, EZ, IC, UN}; otherwise it means you needed to do "For new territories" above
+- Collision with: xx
+ - Ignore if it is {{MM, BU, 104}, {TP, TL, 626}, {YU, CS, 891}, {ZR, CD, 180}}
+- Not in World but in CLDR: \[002, 003, 005, 009, 011, 013, 014, 015, 017\... Ignore 3\-digit codes
+ - (should have exception lists in tool for the Ignore's above)
+- Run **ConsoleCheckCLDR \-f en \-z FINAL\_TESTING \-e**
+ - If you missed any codes, you will get the error message: "Unexpected Attribute Value"
+- Run all the unit tests.
+ - If you get a failure in LikelySubtagsTest because of a new region, you can hack around it with something like:
+ - \
+ - \
+ - You may also have to fix the coverageLevels.txt file for an error like:
+ - Error: (TestCoverageLevel.java:604\) Comprehensive \& no exception for path \=\> //ldml/localeDisplayNames/territories/territory\[@type\="202"]
+
+![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
\ No newline at end of file
diff --git a/docs/site/development/updating-codes/update-time-zone-data-for-zoneparser.md b/docs/site/development/updating-codes/update-time-zone-data-for-zoneparser.md
new file mode 100644
index 00000000000..f1e7c6381c2
--- /dev/null
+++ b/docs/site/development/updating-codes/update-time-zone-data-for-zoneparser.md
@@ -0,0 +1,35 @@
+---
+title: Update Time Zone Data for ZoneParser
+---
+
+# Update Time Zone Data for ZoneParser
+
+Note: This is usually done as a part of the full time zone data update process.
+
+1. Download the latest version of the IANA Time Zone Database from https://www.iana.org/time\-zones
+ - There are 3 links available for the latest version. Select the complete distribution tzdb\-\.tar.lz (e.g. tzdb\-2021a.tar.lz).
+ - Extract the entire contents to a work directory.
+ - **Note**: The data\-only distribution contains the minimum set of files you really need. However, you cannot use a convenient make target without the code. The complete distribution package contains the code.
+2. Run make target \- rearguard\_tarballs\_version
+ - This target creates the "rearguard" version of the zoneinfo files under the directory tzdataunknown\-rearguard.dir.
+ - **Note**: If you specify a version (e.g. VERSION\=2021a) when invoking the target, "unknown" will be replaced with the specified version (e.g. tzdata2021a\-rearguard.dir), but it's not important in this instruction.
+ - A standard zoneinfo file may use negative daylight saving time offsets. CLDR code currently cannot handle negative daylight saving time offsets.
The "rearguard" version is designed for tools without negative daylight saving time support.
+3. Copy the files generated by the previous step to {CLDR\_DIR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data
+ - Below is the list of files to include:
+ - africa
+ - antarctica
+ - asia
+ - australasia
+ - backward
+ - etcetera
+ - europe
+ - leapseconds
+ - northamerica
+ - southamerica
+ - zone.tab
+ - **Note**: leapseconds might be removed from the list later.
+4. Edit the file {CLDR\_DIR}/tools/cldr\-code/src/main/resources/org/unicode/cldr/util/data/tzdb\-version.txt
+ - This file contains just one line of text specifying the version of the Time Zone Database, e.g. 2021a.
+5. **Record the version: See** [**Updating External Metadata**](https://cldr.unicode.org/development/updating-codes/external-version-metadata)
+
+![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)
\ No newline at end of file