feat(common/models): add Trie string-encoding + decoding methods 💾 #11088
Conversation
User Test Results: User tests are not required.
// Offsetting by even just 0x0020 avoids control-code chars + avoids VS Code not liking the encoding.
const ENCODED_NUM_BASE = 0x0000;
const SINGLE_CHAR_RANGE = Math.pow(2, 16) - ENCODED_NUM_BASE;
A notable differentiation from the pseudo-spec established in #10336. It does restrict the data ranges slightly, but it makes a big, positive difference in the encoding and how IDEs interpret the resulting encoded Trie data when written to files.
Note: the unit tests established later currently do not adjust for alternate ENCODED_NUM_BASE values. This shouldn't be too tricky to establish for reasonable value selections, though.
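To make the offset concrete, here is a minimal sketch of fixed-width number encoding with a base offset, using the 0x0020 value discussed above. The function names mirror the PR, but the bodies here are illustrative assumptions, not the actual implementation.

```typescript
// Hypothetical sketch; ENCODED_NUM_BASE = 0x0020 skips the control-code range.
const ENCODED_NUM_BASE = 0x0020;
const SINGLE_CHAR_RANGE = Math.pow(2, 16) - ENCODED_NUM_BASE;

// Encodes `num` as `width` UTF-16 code units, most-significant digit first.
function compressNumber(num: number, width: number): string {
  let result = '';
  for (let i = 0; i < width; i++) {
    result = String.fromCharCode((num % SINGLE_CHAR_RANGE) + ENCODED_NUM_BASE) + result;
    num = Math.floor(num / SINGLE_CHAR_RANGE);
  }
  return result;
}

// Inverse of compressNumber over str[start, end).
function decompressNumber(str: string, start: number, end: number): number {
  let num = 0;
  for (let i = start; i < end; i++) {
    num = num * SINGLE_CHAR_RANGE + (str.charCodeAt(i) - ENCODED_NUM_BASE);
  }
  return num;
}
```

With a nonzero base, every emitted code unit is at or above 0x0020, so no control characters ever appear in the encoded string.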
}`;

const compression = {
  // Achieves FAR better compression than JSON.stringify, which \u-escapes most chars.
Unless using ENCODED_NUM_BASE=0x0020 or similar. JSON.stringify likes to \u-escape control characters, which leads to notable string-bloat when the control characters are utilized - they tend to represent values that appear with high frequency in the encoding.
> JSON.stringify(String.fromCharCode(1))
'"\\u0001"'
With ENCODED_NUM_BASE=0 ...
- note that most leaf nodes will have low, single-digit .entries counts, all of which would be represented by control codes and thus subject to \u-escaping. The word lengths will usually be notably less than 32 chars and thus would also be subject to the same effects... leading to most encoded entries having length less than 32 chars.
- also, most near-leaf internal nodes will have but a few legal values leading to child nodes, once again using control codes for their representation.
JSON.stringify use is much prettier and more straightforward with ENCODED_NUM_BASE=0x0020 or above, as this bypasses the control-code range with one exception: 0x007f (DEL).
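The size penalty is easy to verify directly; this small check contrasts a control-range character with one at or above 0x0020:

```typescript
// A control-range character gets \u-escaped: '"\u0001"' is 8 chars.
const low = String.fromCharCode(1);
console.log(JSON.stringify(low).length);  // 8

// A character at or above 0x0020 passes through untouched: '"!"' is 3 chars.
const high = String.fromCharCode(0x21);
console.log(JSON.stringify(high).length); // 3
```

A single escaped control character costs six code units where an unescaped character costs one, which is why frequent control-range values bloat the serialized form so badly.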
…ssion Aims for a UCS-2 encoded string and does not shy away from unpaired surrogates in the encoding.
… code When a leaf node exists at the same Trie location as an internal node, it should be a child of that internal node using SENTINEL_CODE_UNIT (\ufdd0). The fixture was using null/undefined instead!
… for compressed-Trie code
Force-pushed from b40b7c7 to 467efd4 (Compare)
No, I think we just ensure that the data structure is versioned. Then we can change format if we need to by bumping the version.
This would mean that future new 'versions' are assumed 'unreadable' by older versions. A bit of work would avoid that issue, letting old-version-compatible code read the parts of the data structure it recognizes while ignoring the parts it hasn't yet seen. We're already talking about a breaking-change format moving to 18.0, where 17.0 and before won't be able to load models with a Trie-encoding format based on these changes. I would prefer to minimize the risk of causing that again. That said, we can always leave it as a future work item, linking my thoughts in that direction above as a note for implementation before the 18.0 release.
This is premature design. Do you have a specific use case that you are designing for where you can demonstrate that this will be sufficient to support that feature? If not, then reserving the space may not end up being adequate anyway, and we may still need a newer version of Keyman to support the new functionality -- and we've done this premature design for nothing. We had to play these games in the past when upgrading was more difficult. The less we have to do it going forward, the better. Keyman is becoming increasingly evergreen, and reserving space for an unknown future use is not the right thing to do -- either in terms of file formats, or in terms of APIs, or in internal design. Design for use now. Refactor when needed. Have good version management -- and that is sufficient.
Example 1: localized emoji-oriented models

Noting the structure of the package I referenced for #905, it actually includes GitHub- and Slack-style emoji code input. To be clear, that package also supports keywords associated with the emojis - the prefix of any associated keyword will suggest the associated emoji. We'd likely have multiple entries per emoji - one per keyword for the emoji + its full-length standard tag - all of which should also provide the associated emoji... or the English tag for that emoji. Either version - the emoji codepoint or the English tag - is not closely associated with the key or the localized emoji tag - and will likely motivate an extra field. If desired, we should be able to create Crowdin localization files based on the backing data's associated JSON structure and programmatically construct them if using the "English tag" variant. (That "English tag" would serve as the string's localization key.) This would allow progressive localization of emojis as a group effort by each language community, in theory, as a potential approach.

Example 2: Agglutinative models

In agglutinative models, we'll likely want to store affixes in tries for lookup. Granted, it's unclear if we'll want one Trie per affix type or all affixes in the same Trie... but if it's the latter, we may wish to 'tag' the affix's type with its entry. Even if not that, I imagine word roots will have some sort of list of legal associated affixes - and that would likely need to be tagged onto each root's Trie entry.
You say this, but we've taken great pains to maintain backward compatibility for keyboards. Last I checked, we still actively support KMW 1.0 and 2.0 keyboards! Why are models being treated differently?
I guess I'm probably conflating reuse of the same code paths, with extensions for specific versions, with a desire for backward compatibility. Could always have a "new version" later that uses the same functions where they apply, then adds extra sections to work with the new data. What I'm really hoping for with that: to avoid the need for major variations in the encoding patterns as we make more specific versions in the future. Not so much for the 18.0 release to actually use extended data from Trie versions that don't yet exist.
That's backward compatibility, not forward compatibility.
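For reference, the "read the parts you recognize, skip the parts you don't" idea discussed above is commonly realized with length-prefixed (tag/length/value) sections. This is a hypothetical sketch, not the PR's format; the tag and length layout here is an assumption.

```typescript
// Hypothetical forward-compatible section layout: each section carries a tag
// and a payload length, so an older reader can skip tags it doesn't know.
interface Section { tag: number; payload: string; }

function writeSections(sections: Section[]): string {
  return sections
    .map(s => String.fromCharCode(s.tag) + String.fromCharCode(s.payload.length) + s.payload)
    .join('');
}

function readKnownSections(data: string, knownTags: Set<number>): Section[] {
  const out: Section[] = [];
  let i = 0;
  while (i < data.length) {
    const tag = data.charCodeAt(i);
    const len = data.charCodeAt(i + 1);
    if (knownTags.has(tag)) {
      out.push({ tag, payload: data.substring(i + 2, i + 2 + len) });
    } // unknown tags are skipped rather than treated as errors
    i += 2 + len;
  }
  return out;
}
```

An old-version reader constructed this way keeps working when a newer writer appends sections it has never seen, which is the forward-compatibility property under discussion.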
  return compressed;
}

const ENTRY_HEADER_WIDTH = NODE_SIZE_WIDTH + WEIGHT_WIDTH;
What's NODE_SIZE_WIDTH? ENTRYLEN_WIDTH?
Doc-comments have been added.
const weight = decompressNumber(str, baseIndex + NODE_SIZE_WIDTH, baseIndex + NODE_SIZE_WIDTH + WEIGHT_WIDTH);

// Assumes string-subsection size check has passed.
const childCnt = decompressNumber(str, baseIndex + NODE_TYPE_INDEX, baseIndex + NODE_TYPE_INDEX + 1);
Where do we call compressNumber for the count of children?
Line 211 for internal nodes.
const valueCntAndType = values.length;
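The name valueCntAndType suggests a count and a node-type flag sharing one field. One common way to do that is bit-packing within a single UTF-16 code unit; this sketch assumes a high-bit type flag, which is an illustrative guess rather than the PR's actual layout.

```typescript
// Hypothetical bit layout: high bit flags an internal node, remaining 15 bits
// hold the child/value count. The exact layout is an assumption.
const TYPE_FLAG = 0x8000;

function packCountAndType(count: number, isInternal: boolean): number {
  if (count >= TYPE_FLAG) {
    throw new RangeError('count too large for packed field');
  }
  return count | (isInternal ? TYPE_FLAG : 0);
}

function unpackCount(packed: number): number {
  return packed & (TYPE_FLAG - 1); // mask off the type bit
}

function unpackIsInternal(packed: number): boolean {
  return (packed & TYPE_FLAG) !== 0;
}
```

Packing both values into one code unit keeps the per-node header at a single character, at the cost of halving the representable count range.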
    // Requires full Trie integration, which isn't included yet.
    });
  });
});
Do we need a test for decompressing \ufdd0?
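A test along those lines could start from the properties that make U+FDD0 a safe sentinel. This is a hypothetical test sketch, not part of the PR's suite; SENTINEL_CODE_UNIT matches the PR, but the checks are illustrative.

```typescript
// U+FDD0 is a Unicode noncharacter: valid in strings, never assigned to text,
// so it cannot collide with real key data.
const SENTINEL_CODE_UNIT = '\ufdd0';

// It fits in a single UTF-16 code unit and is not a surrogate, so it is safe
// in a UCS-2-oriented encoding.
if (SENTINEL_CODE_UNIT.length !== 1) throw new Error('expected one code unit');
const code = SENTINEL_CODE_UNIT.charCodeAt(0);
if (code >= 0xD800 && code <= 0xDFFF) throw new Error('sentinel must not be a surrogate');

// JSON.stringify leaves it unescaped ('"\ufdd0"' is 3 chars), so it stays
// compact even if the encoded string is serialized to JSON.
if (JSON.stringify(SENTINEL_CODE_UNIT).length !== 3) throw new Error('sentinel was escaped');
```

A full decompression test would additionally assert that a leaf stored under this key is recovered as the child of its internal node, per the fixture fix in this PR.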
Co-authored-by: Eberhard Beilharz <[email protected]>
This PR aims to help address #10336, though it is not sufficient on its own for a full fix.
So far, this PR establishes methods useful to "compress" / encode our lexical model Tries into a notably more compact format and to "decompress" / decode them from that format, one piece at a time.
It does not currently:
So, there's obviously more work that would be needed, but it's a solid start in the right direction.
Possibly worth considering: do we build in some reserved ranges now in case we need future extensibility? See #11088 (comment).
@keymanapp-test-bot skip